Achieving Data Readiness for Generative AI — A Comprehensive Approach

Suchismita Sahu
8 min read · Oct 25, 2024


It is estimated that data scientists devote approximately 70% of their time to data preparation. Making that process efficient is therefore an absolute necessity.

One concern in this data preparation process is overfitting, where a model becomes overly specialized to the training data and handles new, real-world scenarios poorly. A closely related risk is bias inherited from the data itself: Amazon’s AI recruiting tool, for example, favored male candidates because its training data reflected the gender imbalance of the tech industry. It is a stark reminder that even the most advanced AI models can perpetuate biases if their data is not meticulously groomed and monitored.

In recent years, the scope of data has broadened considerably, with structured and semi-structured data finding their way into Gen AI-driven insights. Unstructured data, however, remains relatively untapped. That underutilization is a lost opportunity: unlocking the insights hidden in unstructured sources could fuel even more transformative Gen AI applications.

Building a Data Dictionary

Creating a data dictionary for Gen AI models involves various sources and stages. The process begins with identifying key data sources. Log files generated within an organization offer valuable insight into system behavior and user interactions. Structured sources such as Online Transaction Processing (OLTP), Enterprise Resource Planning (ERP), and Customer Relationship Management (CRM) platforms serve as the foundation for AI model training. In parallel, unstructured data from sources such as SharePoint sites, Content Management Systems (CMS), and Digital Asset Management (DAM) systems adds context-rich information.

The next crucial step involves transitioning from a data dictionary, which maps out the existing data, to a more comprehensive data catalog. This catalog not only inventories all data sources but also allows for annotations from a business context, making it an indispensable tool for data management. Simultaneously, organizations invest time in feature engineering, a process that transforms raw data into meaningful features essential for AI models. These features are carefully crafted to enhance the predictive power and relevance of the AI system.
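To make this concrete, here is a minimal sketch of what a single catalog entry might look like in Python, combining technical metadata, business-context annotations, and a list of engineered features. The field names and the example CRM table are purely illustrative, not the schema of any specific catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    """One data-catalog record: technical metadata plus business-context annotations."""
    name: str                      # table or document-collection name
    source_system: str             # OLTP, ERP, CRM, CMS, DAM, log pipeline, ...
    data_format: str               # "structured", "semi-structured", "unstructured"
    owner: str                     # accountable data steward
    business_description: str      # plain-language meaning for non-technical users
    engineered_features: List[str] = field(default_factory=list)
    tags: Dict[str, str] = field(default_factory=dict)

# Hypothetical entry for a CRM table feeding a Gen AI model
orders = CatalogEntry(
    name="crm.customer_orders",
    source_system="CRM",
    data_format="structured",
    owner="sales-data-team",
    business_description="One row per confirmed customer order, used for churn features.",
    engineered_features=["days_since_last_order", "order_value_90d_avg"],
    tags={"pii": "no", "refresh": "daily"},
)
```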

Distinct Stages

Once data sources are identified, data moves through distinct stages within the repository. This typically includes a bronze stage for raw data, a silver stage for cleaned data, and a gold stage for final, high-quality data. Data profiling is essential at this point: the structure, quality, and characteristics of the data are analyzed to ensure alignment with AI objectives. Amidst these changes, organizations are increasingly transitioning from traditional data lakes to lakehouse architectures, incorporating new data formats and storage solutions such as Delta Lake. It is important to note that data arrives in streams or batches, making data preparation and consolidation a continuous process.
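As an illustration of the bronze/silver/gold flow, the following PySpark sketch assumes a Spark session configured with Delta Lake support (for example via the delta-spark package); the storage paths, column names, and cleaning rules are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes Delta Lake support is already configured on the session; paths are hypothetical.
spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: land raw log/event data as-is
raw = spark.read.json("s3://datalake/raw/clickstream/")
raw.write.format("delta").mode("append").save("s3://datalake/bronze/clickstream")

# Silver: clean and standardize (drop malformed rows, fix types, deduplicate)
bronze = spark.read.format("delta").load("s3://datalake/bronze/clickstream")
silver = (bronze
          .dropna(subset=["user_id", "event_time"])
          .withColumn("event_time", F.to_timestamp("event_time"))
          .dropDuplicates(["user_id", "event_time", "event_type"]))
silver.write.format("delta").mode("overwrite").save("s3://datalake/silver/clickstream")

# Gold: aggregate into a high-quality, consumption-ready table
gold = silver.groupBy("user_id").agg(F.count("*").alias("events_30d"))
gold.write.format("delta").mode("overwrite").save("s3://datalake/gold/user_activity")
```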

Token Limit and Data Split

One of the critical challenges in data training is the token limit of LLMs. Each model can only process a certain number of tokens at a time. Tokens are the basic units of text, which can be as short as one character or as long as one word.

For instance, some models cap each training example at 4096 tokens.

This token limit has significant implications on the model’s functionality and the cost associated with training and usage. To handle longer inputs or outputs, data must be divided into smaller chunks or sequences that fit within this token limit. This process of categorization and splitting must be carried out thoughtfully to ensure context preservation. If done haphazardly, it can result in fragmented responses or loss of essential information.
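A simple way to respect the token limit while preserving context is to split text into overlapping, token-counted chunks. The sketch below assumes an OpenAI-style tokenizer via the tiktoken library; the chunk size, overlap, and encoding name are illustrative choices, not requirements.

```python
import tiktoken  # assumption: OpenAI-style tokenizer; other model families ship their own

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50,
               encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into token-limited chunks with a small overlap to preserve context."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks

# Usage: keep each chunk comfortably below the model's context limit
sample = "Policyholders must report a claim within 30 days of the incident. " * 200
pieces = chunk_text(sample, max_tokens=500, overlap=50)
```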

The token limit is directly related to the computational resources required for training and inference. Larger models with higher token capacities demand more extensive computational power, which translates into higher training costs. Longer input sequences or conversations that approach the token limit also raise usage costs, since each processed token contributes to the overall cost of using the model.

Manual Data Processing

Gen AI models often require manual data processing to ensure the input data is clean, relevant, and free from noise. This critical step involves recognizing and resolving errors, inconsistencies, or irrelevant information in the dataset.

Data Profiling: Data profiling involves a thorough examination of the dataset, including column analysis, histogram generation, and anomaly detection. For example, when dealing with clickstream traffic data on a website, data stewards may use data profiling techniques to identify anomalies. These anomalies could be attributed to various factors, such as the effects of a marketing campaign or a potential denial-of-service (DoS) attack.
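A lightweight profiling pass might look like the following pandas sketch, which builds a column-level profile and flags unusual spikes in hourly clickstream volume with a simple z-score rule. The file and column names are hypothetical, and real deployments would typically lean on a dedicated profiling tool.

```python
import pandas as pd

# Hypothetical clickstream extract; column names are illustrative.
df = pd.read_csv("clickstream_sample.csv", parse_dates=["event_time"])

# Column-level profile: types, null counts, distinct values
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "distinct": df.nunique(),
})
print(profile)

# Histogram of hourly request volume, plus a simple z-score anomaly flag
hourly = df.set_index("event_time").resample("1h").size()
z = (hourly - hourly.mean()) / hourly.std()
anomalies = hourly[z.abs() > 3]   # spikes worth investigating: campaign traffic or a DoS attack
print(anomalies)
```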

Imputation of Data: Imputation comes into play when there are gaps or missing values in the dataset. A comparable quality-control step applies to image and video data: low-resolution or badly exposed images may need to be discarded or enhanced, for example by analyzing each image’s histogram to flag overexposed frames.
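For the image case, one deliberately simple heuristic is to flag frames whose histogram is dominated by near-white pixels. The thresholds in the sketch below are assumptions that would be tuned per dataset.

```python
import numpy as np
from PIL import Image

def is_overexposed(path: str, bright_threshold: int = 240, max_fraction: float = 0.30) -> bool:
    """Flag an image as overexposed if too many pixels sit near the bright end of the histogram."""
    gray = np.asarray(Image.open(path).convert("L"))     # grayscale pixel values 0-255
    bright_fraction = (gray >= bright_threshold).mean()  # share of near-white pixels
    return bright_fraction > max_fraction

# Usage (paths are hypothetical): discard, re-capture, or enhance flagged images
# flagged = [p for p in image_paths if is_overexposed(p)]
```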

Cleaning Low-Fidelity Data: Low-fidelity data, which may contain inaccuracies, noise, or inconsistencies, requires special attention during manual data processing. Data stewards work to refine this data to improve its quality. For example, in the preparation process, masking sensitive information such as card numbers can be crucial to ensure data privacy and security.
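Masking card numbers, for example, can be as simple as a regular-expression pass over the text before it ever reaches the training corpus. The pattern below is a rough illustration rather than a production-grade detector, which would also apply a Luhn check and handle more formats.

```python
import re

# Very rough card-number pattern (13-19 digits with optional separators).
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def mask_card_numbers(text: str) -> str:
    """Replace all but the last four digits of anything that looks like a card number."""
    def _mask(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]
    return CARD_RE.sub(_mask, text)

print(mask_card_numbers("Customer paid with 4111 1111 1111 1111 on 2024-03-02."))
# -> "Customer paid with ************1111 on 2024-03-02."
```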

Data Labeling: In addition to these critical steps, data labeling is a fundamental component of manual data processing. It involves annotating data points with meaningful labels or tags, enabling AI models to interpret the data effectively. For instance, in image recognition tasks, data labeling may involve marking objects within images to train the model to recognize them accurately.
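A labeled example can be as simple as a structured record attached to each data point. The schema, file names, and categories below are illustrative, not the format of any particular labeling tool.

```python
# One labeled example for an image-recognition task; schema is illustrative only.
annotation = {
    "image": "claims/photo_0412.jpg",
    "labels": [
        {"category": "vehicle",       "bbox": [34, 58, 420, 310]},   # [x, y, width, height] in pixels
        {"category": "license_plate", "bbox": [180, 250, 90, 28]},
    ],
    "annotator": "steward_07",
    "reviewed": True,
}
```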

Guardrails against Irrelevant Data

Keeping irrelevant data out is essential for smooth functioning. Gen AI models can hallucinate, generating information that does not exist in the training data, when they encounter ambiguous input. Guardrails serve as protective mechanisms, helping these models generate contextually appropriate responses. They are especially critical because Gen AI models, like all AI systems, can produce misleading information when faced with contradictory data.

These guardrails encompass several key aspects, including ensuring contextual relevance, fact-checking and verification, content moderation, and plausibility checks. For instance, they ensure that AI responses align with the input context, verify accuracy through fact-checking tools, prevent the generation of inappropriate content through content moderation mechanisms, and assess the plausibility of generated data.

To enforce these guardrails, organizations commonly utilize various tools and techniques such as rule-based filters, content moderation APIs, fact-checking services, and advanced contextual analysis methods. Tools such as Google Perspective API or the Microsoft Content Moderator API can automatically assess and filter content for inappropriate or harmful language. Services such as Snopes or FactCheck.org provide fact-checking data that can be used to verify the accuracy of information generated by AI models.
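A very small rule-based guardrail might look like the sketch below, which combines a blocklist check with a crude topical-relevance test. The patterns and topic list are placeholders; production systems would typically delegate these checks to the moderation and fact-checking services mentioned above.

```python
import re

BLOCKLIST = [r"\bssn\b", r"\bcard number\b"]            # illustrative patterns only
ALLOWED_TOPICS = {"coverage", "claim", "policy", "premium"}

def passes_guardrails(user_query: str, model_answer: str) -> bool:
    """Minimal rule-based guardrail: block sensitive patterns and require topical overlap."""
    text = model_answer.lower()
    if any(re.search(pattern, text) for pattern in BLOCKLIST):
        return False                                    # content-moderation style check
    query_terms = set(user_query.lower().split())
    if not (query_terms & ALLOWED_TOPICS or set(text.split()) & ALLOWED_TOPICS):
        return False                                    # crude contextual-relevance check
    return True

# Answers failing the check would be regenerated, routed to a human reviewer,
# or passed to an external moderation / fact-checking service.
```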

Understanding Security Issues and Mitigating Risks

As enterprises embrace Generative AI, it is also important for them to understand emerging security threats and vulnerabilities. Safeguarding against potential risks calls for a multifaceted strategy spanning three crucial aspects: data at rest, in transit, and during processing. At rest, data should be protected with robust encryption and strict access controls; in transit, secure transmission protocols help prevent tampering by malicious actors; during processing, anomaly detection systems can help identify unusual behavior. Finally, continuous monitoring and auditing are essential to respond promptly to suspicious activity.
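As a minimal illustration of encryption at rest, the sketch below uses symmetric encryption from the Python cryptography library; in practice the key would be held in a key-management service or secrets manager rather than generated next to the data it protects.

```python
from cryptography.fernet import Fernet

# Illustrative only: store the key in a KMS / secrets manager, never alongside the data.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b'{"customer_id": 42, "claim_notes": "rear-end collision, no injuries"}'
encrypted = fernet.encrypt(record)     # what gets written to disk (data at rest)
decrypted = fernet.decrypt(encrypted)  # only possible with access to the key

assert decrypted == record
```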

Ethical Considerations in Security

Beyond technical preparedness, data readiness also involves ethical considerations that are crucial for responsible deployment. Integrating ethical principles into Gen AI models is essential to minimize biases, ensure privacy, prevent intellectual property violations, and create an ethical AI landscape.

  • Mitigating Bias: One of the foremost ethical concerns is the potential for biases in data and algorithms. Bias can lead to discriminatory outcomes, perpetuating societal inequalities. To address this issue, employ diverse and representative datasets to train models, minimizing biases rooted in underrepresentation. Regularly audit and assess AI models for bias and take corrective actions (a minimal audit sketch follows this list). Follow best practices and guidelines, such as the AI Ethics Guidelines by IEEE.
  • Privacy Issues: As AI systems become more sophisticated, the risk of data breaches and unauthorized access looms large. To safeguard privacy, implement robust data anonymization and encryption techniques. Adhere to global data protection regulations. Refer to the Electronic Frontier Foundation’s AI and Machine Learning Privacy guide for in-depth insights into privacy concerns.
  • Intellectual Property Violations: The creation of new content raises concerns about intellectual property rights. To prevent IP violations, clearly define the ownership of AI-generated content in your organization’s policies. Respect copyright laws and intellectual property rights when using such content. Explore the World Intellectual Property Organization’s resources on AI and IP rights for guidance on this complex issue.
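A bias audit does not have to be elaborate to be useful. The sketch below computes selection rates by group and a disparate-impact ratio (using the four-fifths rule as a rough screen) on hypothetical screening results; the data, group labels, and threshold are illustrative only.

```python
import pandas as pd

# Hypothetical audit data: one row per screened candidate with the model's decision.
results = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "M", "F", "M", "F"],
    "selected": [0,    1,   1,   1,   1,   0,   1,   1],
})

# Selection rate per group and the disparate-impact ratio.
rates = results.groupby("gender")["selected"].mean()
impact_ratio = rates.min() / rates.max()
print(rates, f"\ndisparate impact ratio: {impact_ratio:.2f}")
if impact_ratio < 0.8:   # four-fifths rule as a rough screening threshold
    print("Potential adverse impact: investigate training data and features.")
```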

Assessing Data Readiness and the Role of Data Lineage

Determining when a Gen AI model is truly data-ready is not a linear process. It is a multifaceted endeavor that requires a holistic approach. While there is no one-size-fits-all answer, certain guidelines can help enterprises gauge their level of readiness. One of the key aspects to consider in this assessment is data lineage, which traces the journey of data from its source to its usage. Here is a comprehensive approach to evaluating a model’s data readiness:

Comprehensive Evaluation: Begin by assessing the organization and structure of your data. Ensure it is well-organized, denormalized for context preservation, and indexed for efficient retrieval. Dividing data into manageable chunks within the token limits of the model is crucial to prevent processing limitations. Denormalizing data, where applicable, helps maintain context and prevents fragmentation that could hinder the model’s understanding of the complete picture.

Accuracy Assessment: Analyze the accuracy and trustworthiness of your existing data analytics, reporting, and business intelligence dashboards. Identify and rectify any irrelevant or outdated information that might adversely affect model performance. Ensuring the data is accurate and up-to-date is foundational for data readiness.

Cleaning and Preparation: Implement robust data cleaning processes to eliminate errors. Develop effective strategies to identify and handle contradictory data, as it can significantly impact the reliability of AI model outputs. Integrate data cleaning processes seamlessly into your data pipeline to maintain data quality over time.

Continuous Monitoring and Improvement: Establish a systematic approach for consistent monitoring of model performance. Be prepared to adapt and refine your data readiness strategies based on ongoing monitoring and evolving requirements. The dynamic nature of data and AI necessitates continuous improvement to maintain optimal model performance.

Data Readiness In Action

One recent use case that we worked on involved an insurance provider. Given the vast amount of complex documentation involved, including images, tables, and manuals, the insurance provider wanted to empower its customer care team with a copilot that serves customers by providing accurate information on coverages and claims processing. We are helping them improve their data readiness by implementing a number of techniques:

• Semantic enrichment via dictionary — we enriched their contracts with information such as policy summaries and claims histories, which improved the accuracy of the generative AI model when answering questions about customer eligibility for coverage.

• De-normalized and summarized the contracts to make them more accessible to the generative AI model. This involved breaking large, complex documents down into smaller, more manageable chunks. Gen AI models typically limit the number of tokens they can process at a time; with our approach, we could break each contract into chunks of up to 500 tokens.

• Choosing appropriate vector dimensions, index types, and distance metrics for the company’s needs was crucial, as these parameters had a significant impact on application performance by keeping unrelated chunks out of vector lookups (see the retrieval-and-reranking sketch after this list).

• Reranking the data chunks — we found it useful to rerank retrieved chunks with an additional layer of supervised similarity scoring to improve the quality of the responses. Chunks (sentences, paragraphs, or documents) were first reranked using Cohere’s Rerank and then further with our custom-trained DeBERTa, which led to a much lower, acceptable standard deviation in response quality (also illustrated in the sketch after this list).

• Cost savings — data preparation techniques helped surface relevant data chunks more efficiently, in turn reducing the top-k chunks retrieved by one-fourth and the number of LLM calls required to reach the same levels of accuracy and output quality.
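To illustrate how the vector-index choices and reranking in the bullets above fit together, here is a compact retrieve-then-rerank sketch. It uses FAISS over normalized embeddings and off-the-shelf models from the sentence-transformers library as stand-ins for the supervised reranker; the embedding dimension, model names, documents, and query are assumptions, not the configuration deployed for this client.

```python
import numpy as np
import faiss                                   # assumed vector store; managed vector DBs expose similar knobs
from sentence_transformers import SentenceTransformer, CrossEncoder

# Illustrative models, not the client's deployed stack (Cohere Rerank + custom DeBERTa).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

chunks = [
    "Section 4.2: Sudden and accidental water discharge from plumbing is covered...",
    "Section 9.1: Flood damage caused by external surface water is excluded...",
    "Appendix B: Claims must be filed within 30 days of the incident...",
]

# 1) Build the index: normalized embeddings + inner product behave like cosine similarity.
embeddings = embedder.encode(chunks, normalize_embeddings=True).astype("float32")
index = faiss.IndexHNSWFlat(int(embeddings.shape[1]), 32, faiss.METRIC_INNER_PRODUCT)
index.add(embeddings)

# 2) Retrieve a generous top-k of candidate chunks for the query.
query = "Is water damage from a burst pipe covered under the standard policy?"
q = embedder.encode([query], normalize_embeddings=True).astype("float32")
_, ids = index.search(q, 3)
candidates = [chunks[i] for i in ids[0]]

# 3) Rerank candidates with a cross-encoder and keep only the best ones for the prompt,
#    which is what lets the top-k passed to the LLM shrink without losing accuracy.
scores = reranker.predict([(query, c) for c in candidates])
top_chunks = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:2]
```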

Data readiness not only allows enterprises to navigate challenges but also empowers them to seize the opportunities presented by ongoing technological advancements.

Written by Suchismita Sahu

Working as a Technical Product Manager at Jumio corporation, India. Passionate about Technology, Business and System Design.
