Data Platform

Suchismita Sahu

Let's build a Data Platform.

What is a Data Platform?

It is an ecosystem in the data stack, built by making use of network effects between publishers and consumers, providing an improved developer experience, a sustainable marketplace and a business model, thereby increasing the organization's revenue. (Please refer to the article for these terminologies.)

So, a data platform is not a data storage layer; it is a centralised metadata storage layer where the required data governance, access control and security can be provided and maintained. This can be achieved through a Data Catalog.
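
To make this concrete, here is a minimal sketch, in Python, of what a catalog-style metadata record and access check might look like. The field names and roles are illustrative assumptions, not any specific catalog's schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Illustrative metadata record for one dataset registered in a data catalog."""
    name: str                    # logical dataset name
    owner: str                   # accountable domain or team
    location: str                # where the data physically lives; the catalog stores only metadata
    classification: str          # e.g. "public", "internal", "pii"
    allowed_roles: List[str] = field(default_factory=list)  # coarse-grained access control

def can_read(entry: CatalogEntry, user_roles: List[str]) -> bool:
    """Governance and access control enforced at the metadata layer, not the storage layer."""
    return any(role in entry.allowed_roles for role in user_roles)

orders = CatalogEntry(
    name="orders",
    owner="sales-domain",
    location="s3://lake/sales/orders/",   # hypothetical path
    classification="pii",
    allowed_roles=["sales-analyst", "data-steward"],
)
print(can_read(orders, ["sales-analyst"]))  # True
```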

Objectives

1. Centralized Data Management

  • Create a unified platform that centralizes data from various sources across the organization, which facilitates better data governance, improves data accessibility, and reduces data silos. Mention the number of data sources and the types of data.

2. Scalability

  • Design the platform to scale with the growing volume, variety, and velocity of data, which ensures that the platform can handle increased data loads and support future data initiatives without performance degradation. Mention the data volume, latency and throughput required for data product availability.

3. Data Quality and Consistency

  • Implement mechanisms to ensure the accuracy, completeness, and consistency of data across the platform, which improves decision-making by providing reliable data and reduces the risk of errors in analysis. Mention the percentage of data accuracy and completeness required to build high-quality data products from this data.

4. Real-time Data Processing

  • Enable the platform to process and analyze data in real time or near real time, which supports timely decision-making and allows for immediate insights, crucial for applications like monitoring and alerting. Mention the use cases that need real-time and near-real-time data access (see the streaming sketch after this list).

5. Interoperability

  • Ensure the platform can integrate seamlessly with the various tools, technologies, and systems used within the organization, which provides flexibility in adopting new technologies and integrating with existing systems, enhancing the overall data ecosystem. Prepare the data architecture and mention the different technologies, third-party tools and cloud infrastructure needed to support a robust and scalable data platform.

6. Data Security and Compliance:

  • Implement robust security measures and ensure compliance with relevant regulations and standards, which protects sensitive data from unauthorized access and ensures the platform meets legal and regulatory requirements. Mention the relevant standards, such as HIPAA or GDPR, as per your industry.

7. Self-service Analytics:

  • Empower users across the organization to access, analyze, and visualize data without requiring extensive technical expertise, which increases data-driven decision-making across departments and reduces the burden on IT teams. Mention how many data teams or products exist and their annual growth rate.

8. Cost-efficiency:

  • Optimize the platform's architecture and operations to minimize costs while maximizing performance and capabilities, which ensures that the data platform is sustainable and delivers value within budget constraints. Mention the target cost savings on infrastructure.

9. Support for Advanced Analytics and AI/ML:

  • Provide the necessary infrastructure and tools to support advanced analytics, machine learning, and AI applications, which enables the organization to leverage data for predictive analytics, automation, and other AI-driven initiatives. Mention the different types of data products to be supported and their needs.

10. Data Governance and Compliance:

  • Implement policies, procedures, and technologies that ensure proper data management, usage, and compliance, which maintains data integrity, ensures compliance with regulations, and aligns with corporate governance policies.

11. Enhancing Customer Experience:

  • Use the data platform to gather insights that improve customer interactions and satisfaction, which leads to better customer retention, personalized services, and a stronger competitive edge.

12. Operational Efficiency:

  • Streamline data operations and reduce the time and effort required to manage and analyze data, which increases productivity, reduces operational costs, and speeds up time-to-insight.
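
As a hedged illustration of the real-time objective (item 4 above), the sketch below consumes events from a Kafka topic and raises an alert as each event arrives. The topic name, broker address and threshold are assumptions for the example; any comparable streaming technology could play the same role.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; replace with your environment's values.
consumer = KafkaConsumer(
    "payment-events",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

ALERT_THRESHOLD = 10_000  # illustrative business rule

for message in consumer:
    event = message.value
    # Near-real-time monitoring and alerting: act on each event as it arrives.
    if event.get("amount", 0) > ALERT_THRESHOLD:
        print(f"ALERT: unusually large payment: {event}")
```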

Centralized vs Decentralized vs Domain Driven (Data Mesh) Data Platform

Centralized Data: Consolidation for Efficiency

Centralized data refers to the practice of storing and managing all data in a single, central repository. Here, data is collected from various sources and consolidated into one system, commonly referred to as a data warehouse. Let’s delve into the advantages and challenges associated with this approach.

Advantages

1. Efficient data management:

Centralizing data allows for streamlined data management processes. With a single data repository, businesses can easily organize, update, and maintain data integrity.

2. Improved data analysis:

A central data repository facilitates comprehensive data analysis, enabling businesses to derive meaningful insights and make data-driven decisions more efficiently.

3. Enhanced security:

Centralized data often benefits from robust security measures. Implementing stringent access controls and encryption mechanisms becomes more manageable, reducing the risk of unauthorized access and data breaches.

Challenges of Centralized Data

1. Data silos:

While centralization aims to consolidate data, it can inadvertently lead to the creation of data silos. Different departments or teams within an organization might hoard data, hindering cross-functional collaboration and diminishing the potential for holistic insights.

2. Single point of failure:

Relying solely on a central data repository introduces a single point of failure. If the centralized system encounters issues, such as technical glitches or cyber-attacks, it can significantly disrupt operations and potentially compromise the entire dataset.

3. Privacy concerns:

Centralized data raises privacy concerns, especially when dealing with sensitive user or customer information. Organizations must implement robust privacy protocols to ensure compliance with data protection regulations and maintain the trust of their users.

Decentralized Data: Empowering Autonomy

Decentralized data, on the other hand, promotes the distribution of data across multiple locations or systems. Rather than relying on a single central repository, data is stored in diverse nodes, often interconnected via a network. Let’s explore the advantages and challenges associated with this approach.

Advantages

1. Enhanced data ownership:

Decentralization empowers individuals or departments within an organization to own and manage their data. This autonomy fosters innovation, as it allows teams to tailor their data management practices to their specific needs.

2. Improved scalability:

Decentralized systems are inherently scalable, as data can be distributed across multiple nodes. This flexibility enables businesses to expand their operations without facing the limitations of a centralized infrastructure.

3. Resilience and fault tolerance:

Decentralized data architecture provides resilience against system failures. Even if one node encounters issues, other nodes can continue to function independently, ensuring business continuity and data availability.

Challenges of Decentralized Data

1. Data consistency:

Maintaining data consistency across multiple decentralized nodes can be challenging. Synchronization and version control mechanisms must be in place to ensure that data remains accurate and up-to-date across the network.

2. Complex data integration:

Integrating data from multiple decentralized sources can be complex and time-consuming. Data interoperability and compatibility become critical considerations to ensure seamless data exchange between different nodes.

3. Increased security risks:

With data dispersed across multiple nodes, securing decentralized data becomes more intricate. Each node must be adequately protected to prevent unauthorized access or tampering. Robust encryption, access controls, and authentication mechanisms are essential to mitigate security risks effectively.

Data Mesh proposes a paradigm shift by advocating a domain-oriented, decentralized approach to data management. Instead of relying on a central data team, it distributes data ownership and governance across different domains or business units within an organization.

In a Data Mesh architecture, each domain or business unit becomes responsible for its data products, including data collection, storage, processing, and analysis. This approach promotes autonomy, scalability, and agility by allowing teams closest to the data to make decisions and derive value from it. Data Mesh emphasizes the importance of clear data product ownership, well-defined APIs, and data quality monitoring to ensure the reliability and usability of the data products across the organization.
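
A data product's "well-defined API" is often captured as a contract published by the owning domain. Below is a minimal, hypothetical sketch of such a contract in Python; the fields (owner, output port, freshness SLA, quality checks) are illustrative assumptions rather than a standard schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataProductContract:
    """Illustrative contract a domain team publishes alongside its data product."""
    name: str                       # e.g. "customer-360"
    domain: str                     # owning business domain
    owner: str                      # accountable data product owner
    output_port: str                # where consumers read it: a table, topic or API endpoint
    schema: Dict[str, str]          # column -> type; the interface consumers rely on
    freshness_sla_minutes: int      # how stale the data is allowed to be
    quality_checks: List[str] = field(default_factory=list)

customer_360 = DataProductContract(
    name="customer-360",
    domain="marketing",
    owner="jane.doe@example.com",          # hypothetical owner
    output_port="warehouse.marketing.customer_360",
    schema={"customer_id": "string", "lifetime_value": "decimal"},
    freshness_sla_minutes=60,
    quality_checks=["customer_id is unique", "lifetime_value >= 0"],
)
```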

Data Mesh recognizes the complexity and diversity of data in modern organizations and acknowledges that a centralized or purely decentralized approach may not effectively address these challenges. By embracing the principles of Data Mesh, organizations can foster a culture of data collaboration, where teams work together to build and leverage data products that align with their specific domain expertise.

It is worth noting that implementing a Data Mesh architecture requires careful planning, coordination, and a shift in organizational mindset. However, for organizations seeking a more distributed and flexible approach to data management, exploring the principles and practices of Data Mesh can offer new insights and opportunities.

Key Principles of Data Mesh Architecture

  • Data Mesh is an organizational approach to managing distributed data architecture. It advocates for domain-oriented decentralized data ownership and architecture, treating data as a product and applying principles of product thinking to data management.
  • Key Characteristics:
  • Domain-Oriented Teams: Data mesh aligns with the principles of domain-driven design (DDD), using bounded contexts, ubiquitous language, aggregates, entities, value objects and subdomains to model and organize data around business domains.
  • Federated Data Ownership: Data mesh advocates a distributed, federated data architecture in which datasets and data products are treated as first-class citizens and are discoverable, accessible, interoperable and reusable across domains, teams and organizational boundaries.
  • Self-serve Data Infrastructure: It encourages the creation and standardization of data products, APIs and interfaces that encapsulate data capabilities and services, enabling seamless integration with and consumption of data assets.
  • Data as a Product: Data itself is treated as a product in a marketplace, consumable, for example, by a third-party vendor for ML model training.
  • Use Cases:
  • Decentralized data ownership
  • Cross-functional collaboration
  • Scalable and agile data architecture

Personas

  • Business Users
  • Data Scientists/Analysts/Machine Learning Engineers
  • Security and Compliance Officers

Evaluation Metrics

Data Ownership

Scale out data sharing and value generation from data in step with the organization's growth

  • Increased number of domains that provide analytical data
  • Increased number of domains that consume analytical data
  • Increased peer-to-peer data sharing
  • Data-business truthfulness: increased alignment between development, business and operations

Data as a Product

Increase the efficiency and effectiveness of data sharing within and across the organisation's domains (a sketch of how two of these metrics could be computed follows the list)

  • Increased usage
  • Growth of active users
  • User satisfaction
  • User conversion rate from search & discovery to read & use.
  • Usability
  • Quality & Security
  • Data availability
  • Data risk
  • Change fail ratio
  • User Confidence & Trust
  • Timeliness, completeness, integrity standards compliance
  • Interoperability
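
Several of these metrics reduce to simple ratios. The sketch below shows one hedged way to compute the search-to-use conversion rate and the change fail ratio from event counts; the numbers are placeholders, and how the underlying events are collected depends on your platform.

```python
def conversion_rate(searched_users: int, active_users: int) -> float:
    """Share of users who went from search & discovery to actually reading/using a data product."""
    return active_users / searched_users if searched_users else 0.0

def change_fail_ratio(total_changes: int, failed_changes: int) -> float:
    """Share of data product changes that caused an incident or rollback."""
    return failed_changes / total_changes if total_changes else 0.0

# Placeholder numbers, for illustration only.
print(f"conversion rate: {conversion_rate(searched_users=420, active_users=180):.0%}")
print(f"change fail ratio: {change_fail_ratio(total_changes=50, failed_changes=4):.0%}")
```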

Self-Serve

Increase domain autonomy with lower cognitive load and a lower cost of data ownership

  • Increased domain autonomy with self-serve
  • Coverage of automated tasks
  • Platform users net promoter score
  • Backlog and release dependencies from domain teams to platform teams
  • Increase services coverage
  • Rate of platform product services usage
  • Number of active users in the platform and per platform service
  • Abstract complexity
  • Cost of data product life cycle management
  • Change fail ratio of data products
  • Number of data products using the platform
  • Lead time to build, test, deploy and use data products

Federated Computational Governance

Generate higher-order intelligence securely and consistently, in step with organisational growth

  • Active engagement of domains in global governance operation
  • Domains and data product owners who are active members in global federated governance
  • Rate of new global policies established and adopted by domains.
  • Mesh wide interoperability, reliability and consistency
  • Ratio of data products implementing latest versions of policies
  • Reduced governance friction through automation
  • Lead time to detect and resolve new data policy breaches
  • Number of active users of data products complying with policies

Data Catalog

The Data Governance journey involves three obvious major components: people, processes and technologies. Some companies choose to launch an enterprise program and start with people (e.g. organisational structures, ownership) and processes (e.g. policies, standard operating procedures); others create a small, enthusiastic data management group and start a data democratisation initiative, promoting offensive Data Governance in a practical way through a Data Catalog implementation. Each of these styles has its own challenges, advantages and disadvantages.

Roughly, there are four main categories of Data Catalogs.

  1. Stand-alone solutions offer key and additional data cataloging components within a single tool. Commercial and open source offerings are available and examples include Alation, Atlan, data.world, Zeenea, Amundsen, DataHub.
  2. Platform solutions offer key data cataloguing functions with modules providing additional capabilities like Data Quality, Data Privacy and, in some cases, even MDM. Examples include Ataccama, Collibra, IBM, Informatica, Precisely, Talend.
  3. Cloud native Data Catalogs, which provide key components mostly limited to the cloud service provider's environment. Use cases such as orchestration and ETL processes are the main focus. Examples include AWS Glue, Azure Purview and Google Data Catalog (part of Dataplex).
  4. Tool-specific Data Catalogs (add-ons), which support a specific tool, for example within the area of business intelligence, by providing key components as well as purpose-related additional cataloguing features. A good example is Tableau Catalog.

Looking at the last two categories, Databricks Unity Catalog, which is gaining traction at the speed of light, is an interesting case: it could initially be considered tool-specific, but with all the latest developments it is now closer to the cloud native catalogs, or even the stand-alone ones.
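
As one hedged illustration of the cloud native category, the snippet below walks the AWS Glue Data Catalog with boto3 and prints each registered table and its storage location. It assumes configured AWS credentials and an existing Glue catalog; the region is a placeholder, and pagination is omitted for brevity.

```python
import boto3  # assumes AWS credentials are already configured

glue = boto3.client("glue", region_name="eu-west-1")  # placeholder region

# List every database and the tables (datasets) registered in it.
for database in glue.get_databases()["DatabaseList"]:
    db_name = database["Name"]
    for table in glue.get_tables(DatabaseName=db_name)["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(f"{db_name}.{table['Name']} -> {location}")
```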

Data Catalog maturity levels

This is an indicative way of dividing maturity into levels, and the borders can be blurred. In practice, however, these four main levels have been observed.

L1 — Technical metadata hub. It is a metadata registry for data available in the data platform, with ad-hoc curation based on crowdsourcing enabled by advanced users. It mostly performs metadata ingestion from various data sources, on-prem and cloud, with ad-hoc data modelling, and is used by advanced users (e.g. data analysts) to find data for building advanced analytics applications. Sometimes it can be a good start for enabling data democratisation, especially in agile environments following the "from chaos to structure" implementation approach, which carries certain risks (see below).

L2 — Curated data inventory. It is a curated data registry with foundational governance capabilities, data classification and user collaboration. Metadata can be fetched from various places, including other data catalogs (e.g. cloud native). Integration with communication systems (e.g. Slack) is possible via API and plays a key role in data curation. As data becomes more structured, data development can leverage it for data search and for understanding context. Data Lineage becomes more important and should be provided up to the level of analytics applications.

L3 — Data Governance Platform. It is a catalog integrated with Data Governance processes, where automation of tasks happens and which becomes a single point for data onboarding, assessment and metrics collection. Data Governance brings several new requirements, such as Data Quality, Data Classification and executing workflows. These features can either belong to the catalog itself or be provided by 3rd-party tools via API integration. Since the data is curated and governed, it can be used in business applications consumed by business users.

L4 — Enterprise Data Marketplace. It is a single point of data discovery and access in the enterprise for all categories of data users. The Data Marketplace can be either internal only or span multiple external data consumers and providers, in which case API integration with external systems is required.

Moving from one level to another might require additional capabilities to enable growth and sustainable adoption. Let’s look into core and additional data catalog capabilities and define what is necessary for each level.

Data Catalog capabilities

Data Management capabilities provided by a Data Catalog can be divided into the following major categories, each containing capabilities that might be required at different levels of maturity.

  1. Data Inventory (L1+) allows you to register data sources and to organise and describe data by ingesting and curating business, technical and operational metadata. This capability includes data source connectivity, data sampling, Business Glossary, Data Dictionary, Metadata Management and Data Lineage.
  2. Data Assessment (L1+) evaluates data for fitness for use, which includes data profiling, measuring data risk via classification, PII detection, and tracking data usage to understand how popular datasets are or to perform audits (a small profiling and PII-detection sketch follows this list). Data Quality assessment also falls into this capability, though it is likely either provided by an additional module of a platform-type catalog (e.g. Collibra, Informatica) or sourced from a 3rd-party tool via API integration. Either way, it is critical to have Data Quality information in the Data Catalog to complete the fitness-for-use assessment.
  3. Data Discovery (L1+) enables users to locate the data assets they need via Google-like search, exploration and recommendations. This capability is key to the success of Data Catalog adoption and to sustainable growth of the user community. It is important to highlight that some Data Catalog solutions separate this capability into a Marketplace add-on, which not only combines external and internal datasets but also creates an online-shop experience, offering the option of requesting access via a shopping cart.
  4. Data Governance (L3+) enables data curation activities via defining roles and responsibilities, rules (fullness of asset curation), policies (e.g. data retention or archiving), tasks automation and standardisation via workflows (e.g. change asset metadata or request access to a dataset) and manual or automated tagging including sensitive data definition.
  5. Data Collaboration (L2+) enables communication and metadata crowdsourcing via tagging, rating, reviewing, sharing and messaging. This is a key capability for facilitating data curation. Combined with a reasonable amount of non-invasive governance, it can boost tool adoption and metadata quality.
  6. AI automation and assistance (L2+) facilitates data curation by supporting users and taking over manual tasks, enabling data catalogs to scale. Most of the capabilities potentially can be supported by AI functions to a certain extent, e.g. in the area of data ingestion, data labelling, classification and search.
  7. Adoption tracking and Audit (L3+) allows teams to monitor and measure data catalog performance, analyse user behaviour to track changes, and log user activity to analyse tool adoption progress. Some solutions have embedded, customisable dashboards to make this task a pleasant experience.
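
To make the Data Assessment capability (item 2 above) more tangible, here is a minimal, hedged sketch of column-level profiling with a naive PII flag, using pandas. The email pattern and sample data are illustrative assumptions; a real catalog would rely on far more robust classifiers.

```python
import re
import pandas as pd

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # naive email pattern, illustration only

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column completeness, distinct count and a naive PII (email) flag."""
    rows = []
    for col in df.columns:
        series = df[col]
        values = series.dropna().astype(str)
        rows.append({
            "column": col,
            "completeness": float(series.notna().mean()),  # share of non-null values
            "distinct": int(series.nunique()),
            "looks_like_pii": any(EMAIL_RE.fullmatch(v) for v in values),
        })
    return pd.DataFrame(rows)

# Tiny illustrative dataset.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, None],
    "email": ["a@example.com", "b@example.com", None, "d@example.com"],
})
print(profile(sample))
```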

The maturity indication above is not strict, and some features might be relevant to different levels. What is important to understand is that maturity-level growth means scaling up, with growth of the user community and curation demand, which in turn will require more automation and AI augmentation.

MVP for Data Catalog

As mentioned above, a Data Catalog can be implemented at different stages of a Data Governance program and play various roles. Three approaches have been observed in practice, each with its own advantages and risks.

The iterative governed approach, based on data sources/data domains with planned governance enhancements, starts with an awareness creation plan, prioritised data domains and key roles available from the start. It enables fast and safe business-user onboarding, thus maximising business value.

What to consider:

  • High upfront planning and alignment efforts
  • Minimum viable training should be provided to key roles
  • Data Catalog tool should be carefully selected based on detailed requirements
  • Limited collaboration at the start and more centralised control

When it might not work:

  • An agile end-user community of advanced data professionals might not need an upfront, highly governed data catalog and can do curation via crowdsourcing and organic stewardship efforts
  • Open-source or cloud data catalog with limited capabilities and unfriendly UI

The from-chaos-to-structure approach aims to bring all the metadata in and let users collaborate to curate it while data governance evolves gradually. An agile end-user community of advanced data professionals doesn't need an upfront, highly governed data catalog and can do curation via crowdsourcing and organic stewardship efforts. Bringing all metadata in at once can help reveal duplicate datasets and provide a comprehensive picture of the initial Data Quality state via profiling.

What to consider:

  • Training should be provided to all advanced catalog users
  • Data Catalog tool should be carefully selected based on detailed requirements
  • License/usage costs should be carefully considered, as some data catalog solutions charge based on the number of datasets profiled and the volume of metadata loaded

When it might not work:

  • Open-source or cloud data catalog with limited collaboration, profiling and sharing capabilities
  • Highly regulated data environment with sensitive data
  • Governance-first approach to data management

The mixed approach lets different parts of the catalog follow their own approach, with view permissions applied to restrict access. This fits mixed-skill-level user communities and prioritised data domains. It is possible to start adding business value immediately for part of the domains and to grow other domains organically via crowdsourced curation. Some key roles should be available from the start, while others emerge organically. Advanced users are not limited to highly curated datasets.

What to consider:

  • High user access security set-up effort
  • Minimum viable training should be provided to all catalog users
  • Data Catalog tool should be carefully selected based on detailed requirements (especially security)
  • Highly depends on DG operating model type (centralised vs federated)

When it might not work:

  • Open-source or cloud data catalog with limited security capabilities
  • Centralised DG Operating model with limited representation within data domains

Which approach to take depends on multiple factors, including but not limited to the Data Governance strategy, business goals, company culture, DataOps practices and the user community.

In any approach, most likely the following high-level steps should be taken to enable a successful data catalog implementation and adoption:

  1. Assess your needs and goals to map them to Data Catalog capabilities and create an efficient enablement plan
  2. Review your data processes and tech landscape to define the required integrations and customisations
  3. Review your Data Governance model, or create one, to enable Data Catalog adoption and operational efficiency
  4. Create a thorough implementation plan, including an MVP phase, and ensure smooth execution to streamline value generation

Before starting the MVP, take some time to prepare and think about the following aspects of the future solution:

  • What would be the initial Critical Data Elements, data domains and data sources?
  • Who will be your data domain champions and data stewards? Can these key people allocate time to support the initiative?
  • What level of Data Catalog are you planning to build during the MVP?
  • What are the key Data Catalog capabilities you would like to start with?
