Jellyfish: A New Approach to Data Preprocessing Using Local Large Language Models

Suchismita Sahu
4 min read · Oct 24, 2024


In recent years, large language models (LLMs) like OpenAI’s GPT series have transformed various fields, including natural language processing (NLP) and data analytics. Data preprocessing (DP) — a crucial step in data mining pipelines — has received comparatively less attention, a gap that the recently proposed Jellyfish model aims to fill. Jellyfish tackles DP challenges with instruction-tuned, locally deployable LLMs, offering both data security and customization.

Overview of Jellyfish

Jellyfish is designed to address multiple DP tasks, including:

  1. Error Detection (ED) — Identifying inconsistencies in data records.
  2. Data Imputation (DI) — Inferring missing values in datasets.
  3. Schema Matching (SM) — Determining whether two attributes from different datasets represent the same concept.
  4. Entity Matching (EM) — Inferring if two data records refer to the same entity.

These tasks are vital in ensuring data quality and enabling efficient data integration. Jellyfish is developed to overcome limitations seen in traditional models and mainstream LLMs, such as reliance on external APIs and limited customization for domain-specific applications.

Why Use Jellyfish for Data Preprocessing?

The design of Jellyfish provides several key advantages over other LLM-based solutions like GPT-3.5 and GPT-4:

  • Local Deployment and Security: Unlike models that rely on external APIs, Jellyfish can be deployed on local GPUs. This ensures data privacy and prevents potential breaches.
  • Cost Efficiency: Jellyfish operates efficiently on mid-range hardware (7B to 13B parameter models) and avoids the high costs associated with larger commercial models.
  • Customizability: Users can fine-tune the model for specific tasks or domains through prompt engineering, without needing extensive retraining.
  • Interpretability and Reasoning: Jellyfish outperforms many LLMs by providing interpretable reasoning for DP results, making it easier for users to understand why certain decisions are made.

Instruction Tuning and Knowledge Injection

Jellyfish leverages instruction tuning, a process that trains LLMs on task-specific datasets in a supervised fashion. This enables the model to follow human instructions more effectively. The construction of Jellyfish involves two key processes:

  1. Data Configuration and Knowledge Injection: Data preprocessing tasks often rely on specific rules or domain knowledge. Jellyfish allows users to inject this knowledge into the model via prompts, enhancing performance on unseen datasets.
  2. Reasoning Data: To boost the model’s interpretability, Jellyfish uses reasoning data during instruction tuning. This allows the model to not only provide DP solutions but also explain the underlying logic.
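To make knowledge injection concrete, here is a minimal sketch of how domain knowledge might be folded into an entity-matching prompt. The wording, the `build_em_prompt` helper, and the `domain_hint` field are illustrative assumptions for this article, not the actual Jellyfish prompt format:

```python
# Hypothetical sketch: injecting domain knowledge into an entity-matching (EM) prompt.
# The prompt wording and field names below are illustrative assumptions,
# not the real Jellyfish prompt template.

def build_em_prompt(record_a: dict, record_b: dict, domain_hint: str) -> str:
    """Build an EM prompt that carries extra domain knowledge for the model."""
    return (
        "You are a data preprocessing assistant.\n"
        f"Domain knowledge: {domain_hint}\n"
        f"Record A: {record_a}\n"
        f"Record B: {record_b}\n"
        "Question: Do Record A and Record B refer to the same entity? "
        "Answer yes or no, and explain your reasoning."
    )

prompt = build_em_prompt(
    {"title": "iPhone 13 128GB", "brand": "Apple"},
    {"title": "Apple iPhone13 (128 GB)", "brand": ""},
    domain_hint="Product titles may abbreviate brand names or omit spaces.",
)
print(prompt)
```

The point of the sketch is simply that the injected rule travels with every prompt, so the model can apply it even to datasets it never saw during tuning.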

Performance Comparison

Jellyfish was evaluated against other LLMs, including GPT-3.5, GPT-4, and various non-LLM methods. The results highlight the effectiveness of Jellyfish across both seen and unseen datasets:

  • Seen Datasets: Jellyfish-13B outperformed or matched GPT-4 in many cases, particularly in tasks like schema and entity matching.
  • Unseen Datasets: The Jellyfish models demonstrated strong generalization capabilities, delivering competitive performance even without additional retraining.

Applications Beyond Standard DP Tasks

Jellyfish also supports new tasks like:

  • Column Type Annotation (CTA) — Inferring the type of columns in tables without headers.
  • Attribute Value Extraction (AVE) — Extracting attribute values from text descriptions.

These tasks show Jellyfish’s flexibility in adapting to a wide range of scenarios. By employing prompt engineering techniques, users can customize Jellyfish for these tasks with minimal effort.

Challenges and Future Directions

While Jellyfish offers several advantages, some challenges remain:

  1. Computational Limitations: Despite its smaller size compared to GPT models, Jellyfish still requires significant resources for tuning and inference.
  2. Token Limitations: Like many LLMs, Jellyfish can only process a limited number of tokens at a time, which may lead to inconsistencies when dealing with large datasets.
  3. Risk of Hallucination: Although reasoning data improves interpretability, Jellyfish, like other LLMs, may occasionally generate incorrect or nonsensical results.

Future improvements could include integrating retrieval-augmented generation (RAG) techniques to further enhance the model’s ability to access external knowledge.

Key Features of Jellyfish

Jellyfish stands out for its comprehensive selection of algorithms and its ease of use. Here are some of the key features and algorithms provided by the Jellyfish package:

  • Levenshtein Distance: Measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. It’s widely used in spell checkers and DNA sequence analysis.
  • Damerau-Levenshtein Distance: Similar to the Levenshtein distance, but it also considers the transposition of two adjacent characters as a single operation. This is particularly useful for typo correction.
  • Jaro and Jaro-Winkler Distance: These metrics measure the similarity between two strings, with the Jaro-Winkler variant giving more favorable ratings to strings that match from the beginning. This is useful for matching names and titles.
  • Soundex and Metaphone: Phonetic algorithms that convert words to codes based on their sounds in English. These are useful for matching names that sound alike but are spelled differently.
  • Hamming Distance: Counts the positions at which two equal-length strings differ. Useful when comparing binary data or strings that are known to have the same length.

These are the main features; a few more are covered in the package documentation.

Getting Started with Jellyfish

You can easily install Jellyfish via pip:

pip install jellyfish

Once installed, you can start using Jellyfish to compare strings. Here’s a simple example that demonstrates the use of the Levenshtein distance:

import jellyfish
# Compare two strings
string1 = "hello"
string2 = "hallo"
# Calculate the Levenshtein Distance
distance = jellyfish.levenshtein_distance(string1, string2)
print(f"The Levenshtein Distance between '{string1}' and '{string2}' is: {distance}")

This example outputs 1, since a single substitution (“e” → “a”) turns “hello” into “hallo”, illustrating the basic usage of Jellyfish for string comparison.

Practical Applications

Jellyfish can be used in a variety of applications, from data cleaning to natural language processing tasks. Here are a few examples:

  • Data Deduplication: Identifying and merging duplicate records in databases.
  • Typo Correction: Offering suggestions for misspelled words in search queries or text entries.
  • Record Linkage: Matching records across different databases, such as user accounts or bibliographic records.

Conclusion

The Jellyfish package is a powerful and versatile tool for anyone working with text data in Python. With its wide range of string comparison algorithms, it offers solutions for numerous challenges in text processing and data management. Whether you’re a seasoned data scientist or a developer embarking on a new project, Jellyfish provides a robust foundation for any application requiring fuzzy string matching or phonetic comparisons.
