Data Modeling Fundamentals

Data modeling is a mature discipline that remains pivotal in every data initiative. The goal of this article is to provide concise but effective coverage of fundamental data modeling concepts and techniques in a single read.

Suchismita Sahu
13 min read · Oct 11, 2024

Data Model Abstractions

A data model is a representation of the data structures and relationships used in a database or a system. It organizes and structures data in a logical, meaningful way, making it easier to understand and work with. Data models are used to design, implement, and maintain databases, and to support business processes and decision-making.

There are several types of data models, including:

  1. Conceptual Data Model: This type of model provides a high-level view of the data and its relationships. It is used to define the overall structure of the data and to identify the main entities and their attributes.
  2. Logical Data Model: This type of model provides a detailed view of the data and its relationships. It defines the structure of the data in terms of entities, attributes, and relationships, along with the integrity constraints and business rules that apply. The logical data model is the foundation of the physical data model: it captures the data requirements, which are then translated into a physical design for a specific technology.
  3. Physical Data Model: This type of model provides a detailed view of the physical storage of the data, including the database design, data types, and indexes. It is used to implement the logical data model and to optimize the performance of the database. The transformation from logical to physical maps the entities and relationships of the logical model onto the specific structures and constraints of the target database (a minimal sketch of this step follows the list).
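
To make that last step concrete, here is a minimal sketch of translating a small logical model (a Customer strong entity with a one-to-many relationship to Order) into a physical SQLite schema using Python’s sqlite3 module. The entity and column names are illustrative assumptions, not taken from a real system.

```python
import sqlite3

# Physical realization of a simple logical model:
# Customer (strong entity) --1:N--> Order
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- entity identifier
    name        TEXT NOT NULL,         -- attribute
    email       TEXT NOT NULL UNIQUE   -- attribute with an integrity rule
);

-- "order" is a reserved word in SQL, so it is quoted here.
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- the 1:N relationship
    order_date  TEXT NOT NULL DEFAULT CURRENT_DATE                  -- a physical storage choice
);
""")
```

Purely physical concerns (the SQLite data types, the quoted table name, the default value) appear only at this stage; the logical model knows nothing about them.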

Data modeling is an iterative process, and it’s important to refine the model in response to business requirements and feedback from stakeholders. It’s also important to validate the data model and to test it against real data.

Data modeling is a critical step in the design and development of a database or an application, as it ensures that the data is organized and structured in a way that supports business requirements and decision-making.

Logical Data Model Entity Types

In a logical data model, entities are the fundamental building blocks that represent real-world objects or concepts. There are several types of entities that can be used in a logical data model, including:

  1. Strong entities: Entities that have a unique identifier and can exist independently in the real world. For example, a customer or an order.
  2. Weak entities: Entities that cannot be uniquely identified by their own attributes alone and depend on a relationship with a strong entity to exist; their identifier typically combines the strong entity’s key with a partial key of their own. For example, an invoice line item or a purchase order detail.
  3. Associative entities: Entities that represent the relationship between two or more other entities. For example, a customer-order relationship or a product-supplier relationship.

These entity types are used in combination to represent the different objects and relationships in the system.
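
As a hedged sketch (all table and column names here are illustrative), the three entity types map onto SQLite tables like this: the weak entity borrows its parent’s key into a composite primary key, and the associative entity gets a table of its own that can carry attributes of the relationship itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Strong entity: independently identifiable.
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY
);

-- Weak entity: its identity includes the parent's key (composite PK),
-- so a line item cannot exist without its invoice.
CREATE TABLE invoice_line_item (
    invoice_id  INTEGER NOT NULL REFERENCES invoice(invoice_id),
    line_number INTEGER NOT NULL,
    amount      REAL    NOT NULL,
    PRIMARY KEY (invoice_id, line_number)
);

-- Associative entity: represents the product-supplier relationship.
CREATE TABLE product (product_id INTEGER PRIMARY KEY);
CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY);
CREATE TABLE product_supplier (
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    supplier_id INTEGER NOT NULL REFERENCES supplier(supplier_id),
    unit_price  REAL,  -- an attribute of the relationship itself
    PRIMARY KEY (product_id, supplier_id)
);
""")
```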

Logical Data Model Entity Relationship Types

In a logical data model, relationships between entities are used to define how different entities in the data model are related to one another. There are several types of relationships that can be used in a logical data model:

  1. One-to-one: A one-to-one relationship is a relationship where a single instance of one entity is related to a single instance of another entity. For example, a single customer can have one account.
  2. One-to-many: A one-to-many relationship is a relationship where a single instance of one entity is related to multiple instances of another entity. For example, a single customer can have multiple orders.
  3. Many-to-many: A many-to-many relationship is a relationship where multiple instances of one entity are related to multiple instances of another entity. For example, an order can contain many products, and the same product can appear on many orders.
  4. Supertype-subtype: A supertype-subtype relationship is a relationship where a general entity (supertype) is related to specific entities (subtypes). For example, an Employee entity could be a supertype to entities for Full-Time Employees and Part-Time Employees.
  5. Recursive relationship: A recursive relationship is a relationship where an entity is related to itself. For example, a manager-employee relationship, where an employee can also be a manager of other employees.
  6. Association relationship: An association relationship is a general relationship between two entities whose cardinality has not (yet) been made explicit.

These different types of relationships can be used in combination to create a logical data model that accurately represents the relationships between different entities in the data and can be used to support the various business needs of an organization.
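
Two of these relationship types are worth sketching because their physical translation is less obvious: supertype-subtype and recursive. The following is a minimal SQLite sketch with illustrative names (a many-to-many relationship would be resolved with an associative table, as in the entity-type sketch above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Supertype: common attributes live here. The nullable, self-referencing
-- manager_id column models the recursive manager-employee relationship;
-- the top-level manager simply has manager_id = NULL.
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    manager_id  INTEGER REFERENCES employee(employee_id)
);

-- Subtypes: one common translation is a table per subtype that shares
-- the supertype's primary key.
CREATE TABLE full_time_employee (
    employee_id   INTEGER PRIMARY KEY REFERENCES employee(employee_id),
    annual_salary REAL NOT NULL
);

CREATE TABLE part_time_employee (
    employee_id INTEGER PRIMARY KEY REFERENCES employee(employee_id),
    hourly_rate REAL NOT NULL
);
""")
```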

Data Model Keys

In a data model, keys are used to uniquely identify a record within a table and to establish relationships between tables. There are several types of keys that can be used, including:

  1. Primary key: A primary key is a unique identifier for a record within a table. It must be unique for each record and cannot be null. For example, a customer table might have a primary key of “customer_id”, where each customer is assigned a unique ID.
  2. Foreign key: A foreign key is a field in a table that refers to the primary key of another table. It is used to establish a link between two tables and to ensure referential integrity. For example, an order table might have a foreign key of “customer_id” that references the primary key of the customer table.
  3. Alternate key: An alternate key is a unique identifier for a record within a table that can be used as an alternative to the primary key. It is used when there are multiple unique identifiers for a record. For example, a product table might have an alternate key of “product_code” in addition to the primary key “product_id”.
  4. Surrogate key: A surrogate key is a system-generated unique identifier for a record that is not derived from the natural attributes of the data. It is used when there is no suitable natural primary key, or when natural keys are unstable. For example, an Employee table might have a surrogate key “Employee_number”, a unique identifier assigned to each employee.

It’s important to choose the right key for a table, as it will affect how the data is organized and how it can be queried and accessed. Primary keys and foreign keys are used to establish relationships between tables, while alternate keys and surrogate keys are used to uniquely identify records within a table.
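
All of these key types can coexist on a pair of tables. Here is a minimal SQLite sketch (names are illustrative): product_id is a surrogate primary key, product_code is an alternate key enforced with UNIQUE, and order_item carries a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,   -- surrogate key: system-generated, meaning-free
    product_code TEXT NOT NULL UNIQUE,  -- alternate key: unique, but not the PK
    name         TEXT NOT NULL
);

CREATE TABLE order_item (
    order_item_id INTEGER PRIMARY KEY,
    product_id    INTEGER NOT NULL REFERENCES product(product_id)  -- foreign key
);
""")

# The surrogate key keeps joins stable even if the business-facing
# product_code is ever renumbered; the UNIQUE constraint still lets
# queries look products up by code.
conn.execute("INSERT INTO product (product_code, name) VALUES ('SKU-001', 'Widget')")
print(conn.execute("SELECT product_id FROM product WHERE product_code = 'SKU-001'").fetchone())
```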

Logical Data Model Cardinality

In a logical data model, the relationship cardinality represents the number of occurrences of one entity for each occurrence of another entity. There are several types of relationship cardinality, including:

  1. One-to-one (1:1): One occurrence of entity A is related to one and only one occurrence of entity B, and vice versa. For example, a person and a passport: a person can have one passport, and a passport is assigned to exactly one person.
  2. One-to-many (1:N): One occurrence of entity A is related to many occurrences of entity B, but each occurrence of entity B is related to one and only one occurrence of entity A. For example, a customer and an order: a customer can have many orders, but an order belongs to exactly one customer.
  3. Many-to-many (M:N): Many occurrences of entity A are related to many occurrences of entity B, and vice versa. For example, a student and a course: a student can be enrolled in many courses, and a course can have many students.
  4. Zero-to-one (0:1): Each occurrence of entity A is related to at most one occurrence of entity B; the relationship is optional. For example, an employee and a manager: an employee has at most one manager, and some employees (such as the CEO) have none.
  5. Zero-to-many (0:N): Each occurrence of entity A is related to zero or more occurrences of entity B, but each occurrence of entity B is related to exactly one occurrence of entity A. For example, a company and its branches: a company may have no branches or many, but each branch belongs to exactly one company.

It’s important to correctly define the cardinality of the relationships between entities in a logical data model, as this will affect the way the data is organized and how it can be queried and accessed in the physical data model.
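
Cardinality is what the physical schema actually enforces through key and null constraints. A hedged SQLite sketch with illustrative names: a UNIQUE, NOT NULL foreign key yields 1:1, while a plain NOT NULL foreign key yields 1:N (zero-to-many when viewed from the parent side).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);

-- 1:1: the UNIQUE foreign key means each person has at most one
-- passport, and each passport belongs to exactly one person.
CREATE TABLE passport (
    passport_id INTEGER PRIMARY KEY,
    person_id   INTEGER NOT NULL UNIQUE REFERENCES person(person_id)
);

CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);

-- 1:N: NOT NULL forces every order to belong to exactly one customer,
-- while a customer may appear on many orders (0:N from the customer side).
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
""")
```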

Data Modeling Normal Forms

Normalization is a process used in data modeling to organize data into separate tables based on their logical relationships. The normal form of a data model refers to the level of normalization that has been achieved. There are several normal forms that can be applied to a data model, including:

  1. First Normal Form (1NF): This is the most basic level of normalization. It requires that each table has a primary key and that all data is atomic, meaning that each column contains only one value and that there are no repeating groups of data.
  2. Second Normal Form (2NF): In addition to meeting the requirements of 1NF, a table in 2NF must not have any partial dependencies. A partial dependency is when a non-key column depends on only part of a composite primary key.
  3. Third Normal Form (3NF): In addition to meeting the requirements of 2NF, a table in 3NF must not have any transitive dependencies. A transitive dependency is when a non-primary key column depends on another non-primary key column.
  4. Boyce-Codd Normal Form (BCNF): This is a slightly stricter form of 3NF. A table is in BCNF when, for every non-trivial functional dependency, the determinant (the left-hand side) is a candidate key. This closes a gap that 3NF leaves open when a non-key attribute determines part of a candidate key.
  5. Fourth Normal Form (4NF) and Fifth Normal Form (5NF) also exist (4NF addresses multi-valued dependencies, 5NF addresses join dependencies), but they are rarely applied in practice.

The goal of normalization is to minimize data redundancy and to improve the integrity and maintainability of the data model. However, it’s important to balance the benefits of normalization against the performance and complexity of the data model. In some cases, it may be necessary to denormalize the data model to improve performance or to support specific business requirements.
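
To see normalization in action, here is a minimal sketch (illustrative names) of removing a transitive dependency to reach 3NF: in the flat table, customer_city depends on customer_id rather than on the key order_id, so it is duplicated on every order for the same customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Before: a 3NF violation. customer_city depends on customer_id (a non-key
# column), not on the primary key order_id; this transitive dependency
# stores each customer's city redundantly on every one of their orders.
conn.executescript("""
CREATE TABLE order_flat (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL,
    customer_city TEXT    NOT NULL
);
""")

# After: the 3NF decomposition moves the dependent fact to its own table,
# so each fact is stored exactly once.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    city        TEXT NOT NULL
);
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
""")
```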

Data Modeling Approaches

Normalized Data Model: Normalized data models are based on the principles of database normalization, which are designed to eliminate data redundancy and improve data integrity. This approach involves breaking down a data model into multiple, related tables, with each table containing a specific set of data. This allows for the elimination of data redundancy and makes it easier to maintain data integrity. Normalized models are well-suited for environments where data is relatively stable and there is a high degree of consistency in the data.

Pros:

  • It can improve data integrity and consistency
  • It requires less disk space than a denormalized or data vault model
  • It is well-suited for transactional, write-heavy workloads, since each fact is stored in exactly one place

Cons:

  • It can be less flexible and adaptable to change
  • It can be difficult to handle large volumes of data
  • It is not well-suited for integrating data from multiple sources
  • It can be difficult to handle complex relationships between data

Denormalized Data Model: A denormalized data model deliberately relaxes normalization principles: redundant data is stored across tables, or combined into wide tables, to reduce the number of joins. This approach is often used where read performance is the primary concern, such as in data warehousing or reporting scenarios (a small sketch follows the pros and cons below).

Pros:

  • It can improve query performance by reducing the number of joins required
  • It can simplify the data model and make it easier to understand
  • It is well-suited for read-heavy and reporting scenarios

Cons:

  • It can increase data redundancy and complexity
  • It can make it more difficult to maintain data integrity and consistency
  • It can be difficult to handle complex relationships between data
  • It requires more disk space than a normalized model
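
As a small sketch of the trade-off (illustrative names, SQLite via Python), a denormalized reporting table copies customer attributes onto each order row so that common reports need no joins at all; the price is redundancy that must be kept consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE order_report (
    order_id      INTEGER PRIMARY KEY,
    order_date    TEXT NOT NULL,
    order_total   REAL NOT NULL,
    customer_id   INTEGER NOT NULL,
    customer_name TEXT NOT NULL,  -- redundant copy of customer data
    customer_city TEXT NOT NULL   -- redundant copy of customer data
);
""")

# A typical report runs with no join at all:
rows = conn.execute("""
    SELECT customer_city, SUM(order_total)
    FROM order_report
    GROUP BY customer_city
""").fetchall()

# The cost: renaming a customer now means updating every one of their rows.
```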

Data Vault: Data vault modeling is a method that focuses on modeling data in a way that is flexible and adaptable to change. It decomposes data into hubs (one row per business key), links (relationships between hubs), and satellites (descriptive, time-variant attributes attached to hubs or links). This approach is well-suited for environments where data needs to be integrated from multiple sources and where the schema is likely to change over time (a sketch of the pattern follows the pros and cons below).

Pros:

  • It is highly flexible and adaptable to change
  • It can handle large volumes of data
  • It is well-suited for integrating data from multiple sources
  • It can handle complex relationships between data
  • It can support both historical and current data

Cons:

  • It can be complex to implement
  • It requires specialized training to design and maintain
  • It requires more disk space than a normalized model
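
For orientation, here is a hedged sketch of the hub/link/satellite pattern in SQLite. The hash-key, load-date, and record-source columns follow common data vault conventions, but naming and details vary between methodology versions; everything here is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Hub: one row per business key, plus load metadata.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,      -- hash of the business key
    customer_bk   TEXT NOT NULL UNIQUE,  -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_bk      TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Link: a relationship between hubs, also keyed and source-stamped.
CREATE TABLE link_customer_order (
    customer_order_hk TEXT PRIMARY KEY,
    customer_hk       TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    order_hk          TEXT NOT NULL REFERENCES hub_order(order_hk),
    load_date         TEXT NOT NULL,
    record_source     TEXT NOT NULL
);

-- Satellite: descriptive attributes, versioned by load date, so history
-- accumulates as new rows instead of in-place updates.
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")
```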

Dimensional Model: Dimensional modeling is a data modeling technique used in data warehousing and business intelligence. It is characterized by a star or snowflake schema, where a central fact table is connected to multiple dimension tables. The fact table contains the quantitative measures, and the dimension tables contain the descriptive context. Dimensional models are designed to support efficient querying and reporting of data (a star schema sketch follows the pros and cons below).

Pros:

  • It is simple and easy to understand
  • It is optimized for querying and reporting
  • It allows for the efficient aggregation of data

Cons:

  • It can be less flexible and adaptable to change
  • It can be difficult to handle complex relationships between data
  • It may not be well-suited for transactional systems
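
Here is a minimal star schema sketch (illustrative names): the fact table holds measures plus foreign keys into each dimension, and reporting queries aggregate measures sliced by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Dimension tables hold descriptive context.
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,  -- e.g. 20241011
    year     INTEGER NOT NULL,
    month    INTEGER NOT NULL
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    category    TEXT NOT NULL
);

-- The fact table holds quantitative measures plus one FK per dimension.
CREATE TABLE fact_sales (
    date_key    INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    quantity    INTEGER NOT NULL,
    revenue     REAL    NOT NULL
);
""")

# A typical report: slice the revenue measure by dimension attributes.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
""").fetchall()
```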

Indexes

Indexes are used in data modeling to improve the performance of database operations, such as searching and sorting. An index is a separate data structure that stores a mapping of the values in a specific column to the location of the corresponding rows in the table. There are several types of indexes that can be used in data modeling:

  1. Primary Key Index: A primary key index is an index that is automatically created when a primary key is defined for a table. It is used to ensure the uniqueness of the primary key values and to provide a quick way to locate a specific row in the table.
  2. Unique Index: A unique index is created on one or more non-primary-key columns to enforce the uniqueness of the values in those columns.
  3. Clustered Index: A clustered index is an index that determines the physical order of the data in a table. Each table can have only one clustered index, and it is usually created on the primary key.
  4. Non-Clustered Index: A non-clustered index is an index that provides a quick way to locate a specific row in a table, but it does not affect the physical order of the data. Multiple non-clustered indexes can be created on a table.
  5. Composite Indexes: An index that is created on multiple columns is called a composite index.
  6. Covering Indexes: An index that contains all the columns that are required to execute a query is called a covering index.

Indexes can be created on one or more columns in a table, and they can be used to improve the performance of queries that filter or sort data based on the indexed columns. However, it’s important to balance the benefits of indexes against the costs of maintaining them, as they can consume additional storage space and can slow down data modification operations.
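
A short SQLite sketch (illustrative names) shows a composite index doing double duty as a covering index: because the index stores both customer_id and order_date, the query below can be answered from the index alone, which SQLite’s query planner reports explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date  TEXT NOT NULL,
    total       REAL NOT NULL
);
-- Composite index: supports filters on customer_id alone, or on
-- customer_id plus order_date (the leftmost-prefix rule).
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);
""")

# The query touches only indexed columns, so no table lookup is needed;
# SQLite reports this as "USING COVERING INDEX idx_orders_cust_date".
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT order_date FROM orders WHERE customer_id = 42
""").fetchall()
print(plan)
```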

Database Constraints

A database constraint is a rule that is enforced by the database management system to ensure that data stored in the database adheres to certain conditions. These conditions can be used to maintain data integrity and consistency. There are several types of database constraints, including:

  • Primary Key Constraints: A primary key is a unique identifier for each record in a table, and it cannot be null. For example, a primary key constraint on the “id” column of a “customers” table would ensure that each customer has a unique ID and that no ID is left null.
  • Foreign Key Constraints: A foreign key is a column or set of columns in a table that refers to the primary key of another table. This creates a link between two tables, and it is used to maintain referential integrity. For example, a foreign key constraint on the “order_id” column of an “order_items” table would ensure that each order item is associated with a valid order.
  • Check Constraints: A check constraint is used to validate the data in a column based on a Boolean expression. For example, a check constraint on the “age” column of an “employees” table could ensure that the age value is between 18 and 65.
  • Unique Constraints: A unique constraint ensures that the data in a column or set of columns is unique across the whole table. For example, a unique constraint on the “email” column of a “customers” table would ensure that no two customers have the same email address.

Cascading actions are actions that are propagated from a parent table to related child tables. For example, with a cascading delete, when a user deletes a record from the parent table, the related records in the child table are also deleted. If you want to preserve the data in the child table, you can instead use a restricting action (such as ON DELETE RESTRICT) to prevent the parent delete.

For example, say you have two tables, “orders” and “order_items”. The “orders” table has a primary key constraint on the “order_id” column, and the “order_items” table has a foreign key constraint on its “order_id” column that references “orders”. If the foreign key is declared with a restricting action, then when a user tries to delete an order from the “orders” table, the database management system checks the “order_items” table for rows with that “order_id” and rejects the delete if any are found, so the data in the “order_items” table is preserved.
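
The same scenario runs directly in SQLite. This minimal sketch (illustrative names) declares the foreign key with ON DELETE RESTRICT, so deleting a parent order with surviving line items raises an error, and it also shows a check constraint; note that SQLite only enforces foreign keys once the pragma is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is opt-in in SQLite

conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY
);
CREATE TABLE order_items (
    order_item_id INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL
        REFERENCES orders(order_id) ON DELETE RESTRICT,  -- block parent deletes
    quantity INTEGER NOT NULL CHECK (quantity > 0)       -- check constraint
);
""")

conn.execute("INSERT INTO orders (order_id) VALUES (1)")
conn.execute("INSERT INTO order_items VALUES (10, 1, 2)")

try:
    conn.execute("DELETE FROM orders WHERE order_id = 1")
except sqlite3.IntegrityError as e:
    print("delete blocked:", e)  # the child rows are preserved
```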

Data Modeling Guiding Principles

Here are several guiding principles that can be used to manage a data model effectively:

  1. Understand the business requirements: The data model should be designed to support the organization’s specific business requirements. It’s important to understand the data requirements of the different stakeholders and to work closely with them to ensure that the data model meets their needs.
  2. Keep it simple: The data model should be kept as simple as possible, with a clear and easy-to-understand structure. This will make it easier to maintain and update, and will also make it more efficient to query and access the data.
  3. Use standard conventions: The data model should use standard conventions for naming, data types, and relationships, and should be consistent throughout. This will make it easier for others to understand and work with the model.
  4. Use normalization (for logical data models): The data model should be normalized to minimize data redundancy and improve data integrity.
  5. Validate the data model: The data model should be validated against the business requirements, and any issues or inconsistencies should be identified and resolved before the physical data model is implemented.
  6. Keep it flexible: The data model should be designed to be flexible and adaptable so that it can be easily modified to accommodate changes in business requirements or data structures.
  7. Use data modeling tools: Data modeling tools can be used to create, manage, and validate data models; they also help document the model and make it more readable.
  8. Keep it updated: The data model should be reviewed and updated regularly to ensure that it continues to meet the business requirements and to take into account any changes in the data structures or requirements.

Following these principles can help to ensure that the data model is accurate, efficient, and aligned with the needs of the organization, which will make it easier to implement and maintain the physical data model, and ultimately, help to improve the quality and integrity of the data.

Popular Data Modeling Tools

  • ERwin
  • PowerDesigner
  • ER/Studio
  • MagicDraw
  • MySQL Workbench
  • Oracle SQL Developer Data Modeler
  • IBM InfoSphere Data Architect
