Data Lineage with SQL and Gen AI
I’ve been reading for a while about applications of Gen AI in data analytics. There were quite few interesting use cases to be implemented within a company. As data analysts, data engineers, (sometimes) data scientists, people should know how to get data out of a database but the real struggle is when someone wants to retrace the data journey throughout the SQL code. And that was the light bulb switching on. What about a Gen AI powered data lineage for complex SQL statements?
As many of you know for sure, data lineage is a real pain because business users need it to better understand their KPIs and from where they come from but IT/data people are in trouble when it comes to generalizing the data mapping process.
The simple approach
The first thing I wanted to do is simply asking an LLM somthing easy and see what would be the results. So… Let’s give it a try with ChatGPT 3.5!
My first attempt went pretty well, I simply asked to map entities and attributes from a CREATE VIEW statement.
Starting from the following SQL statement, generate a JSON with the following information:
- entities: which are all the tables or views used or created within the SQL statement.
- attributes: which are all the fields belonging to a specific entityCREATE VIEW [Sales].[vProductAndDescription]
WITH SCHEMABINDING
AS
— Let’s also try
SELECT
p.[ProductID]
,p.[Name]
,pm.[Name] AS [ProductModel]
,pd.[Description]
FROM [Sales].[Product] p
INNER JOIN [Sales].[ProductModel] pm
ON p.[ModelID] = pm.[ModelID]
INNER JOIN [Sales].[ProductDescription] pd
ON pmx.[ProductDescID] = pd.[ProductDescID];
The output was already interesting, simple and clear:
{
"entities": {
"tables": ["Sales.Product", "Sales.ProductModel", "Sales.ProductDescription"],
"views": ["Sales.vProductAndDescription"]
},
"attributes": {
"Sales.Product": ["ProductID", "Name", "ModelID"],
"Sales.ProductModel": ["ModelID", "Name", "ProductDescID"],
"Sales.ProductDescription": ["ProductDescID", "Description"],
"Sales.vProductAndDescription": ["ProductID", "Name", "ProductModel", "Description"]
}
}
I tried also opening a new prompt and making the same question, this time the JSON output was slightly different (of course) but the result was still good, even better perhaps:
{
"entities": [
{
"name": "Sales.vProductAndDescription",
"type": "view",
"fields": [
"ProductID",
"Name",
"ProductModel",
"Description"
]
},
{
"name": "Sales.Product",
"type": "table",
"fields": [
"ProductID",
"Name",
"ModelID"
]
},
{
"name": "Sales.ProductModel",
"type": "table",
"fields": [
"ModelID",
"Name"
]
},
{
"name": "Sales.ProductDescription",
"type": "table",
"fields": [
"ProductDescID",
"Description"
]
}
]
}
At this point I decided to go some steps forward and tried to put a little pressure to that AI with this simple statement. Let’s try to add some information (such as the Schema) and test our first moves with data lineage. This time the output was not amusing.
My new question was:
if possibile, keep trace of the schema and save it to a dedicated attribute whithin the entities. try to add also this information:
— relationships: which is the relations between attributes from different entities. an attribute that is used as a query output to create a different entity identifies a relation between entites. keep trace of sourceEntity, targetEntity, sourceAttribute and targetAttribute.
Even though It kept trace of the SQL schema, It ended up misunderstanding my intentions (again, of course). As you can see from the JSON output below, ChatGPT 3.5 considered the relations as every kind of correlation between attributes, that implies that also every JOIN CLAUSE is treated as a relation which is not what we want.
{
"entities": [
{
"name": "Sales.vProductAndDescription",
"schema": "Sales",
"type": "view",
"fields": [
"ProductID",
"Name",
"ProductModel",
"Description"
]
},
{
"name": "Sales.Product",
"schema": "Sales",
"type": "table",
"fields": [
"ProductID",
"Name",
"ModelID"
],
"relationships": [
{
"sourceEntity": "Sales.Product",
"targetEntity": "Sales.vProductAndDescription",
"sourceAttribute": "ProductID",
"targetAttribute": "ProductID"
},
{
"sourceEntity": "Sales.Product",
"targetEntity": "Sales.ProductModel",
"sourceAttribute": "ModelID",
"targetAttribute": "ModelID"
}
]
},
{
"name": "Sales.ProductModel",
"schema": "Sales",
"type": "table",
"fields": [
"ModelID",
"Name"
],
"relationships": [
{
"sourceEntity": "Sales.ProductModel",
"targetEntity": "Sales.vProductAndDescription",
"sourceAttribute": "ModelID",
"targetAttribute": "ModelID"
}
]
},
{
"name": "Sales.ProductDescription",
"schema": "Sales",
"type": "table",
"fields": [
"ProductDescID",
"Description"
],
"relationships": [
{
"sourceEntity": "Sales.ProductDescription",
"targetEntity": "Sales.vProductAndDescription",
"sourceAttribute": "ProductDescID",
"targetAttribute": "ProductDescID"
}
]
}
]
}
But wait a minute… What if we ask directly about data lineage instead of trying to translate the concept?
Forget the relations. Try to add the following information:
-lineage: which is the lineage between attributes in the SQL statement. Try to identify sourceEntity, targetEntity, sourceAttribute and targetAttribute.
This time we got a good one even though it missed one of the attributes in the vProductAndDecription lineage (the name of the Product from the Product table). Nothing that a better prompting could not adjust:
{
"entities": [
{
"name": "Sales.vProductAndDescription",
"schema": "Sales",
"type": "view",
"fields": [
"ProductID",
"Name",
"ProductModel",
"Description"
],
"lineage": [
{
"sourceEntity": "Sales.Product",
"targetEntity": "Sales.vProductAndDescription",
"sourceAttribute": "ProductID",
"targetAttribute": "ProductID"
},
{
"sourceEntity": "Sales.ProductModel",
"targetEntity": "Sales.vProductAndDescription",
"sourceAttribute": "Name",
"targetAttribute": "ProductModel"
},
{
"sourceEntity": "Sales.ProductDescription",
"targetEntity": "Sales.vProductAndDescription",
"sourceAttribute": "Description",
"targetAttribute": "Description"
}
]
},
{
"name": "Sales.Product",
"schema": "Sales",
"type": "table",
"fields": [
"ProductID",
"Name",
"ModelID"
],
"lineage": []
},
{
"name": "Sales.ProductModel",
"schema": "Sales",
"type": "table",
"fields": [
"ModelID",
"Name"
],
"lineage": []
},
{
"name": "Sales.ProductDescription",
"schema": "Sales",
"type": "table",
"fields": [
"ProductDescID",
"Description"
],
"lineage": []
}
]
}
Caveats
I simplified all the tests I made but a can summurize some caveats:
- If you ask the same question in different sessions, you will get quite different answers. If you aim to standardize the process, be careful to lock down the answers you want to receive or, at least, the way you want to get them.
- Sometimes you have to explain every single concept to get your answer and sometimes you have to trust the LLM. This is the funny part of the prompting to me.
- Keep in mind very clearly what’s your final target because, due to this potentially unlimited technology, every while you’ll be tempted to add features and get more untill you’ll lose the main focus. Be focused!