
Data Modeling

Explore the essentials of data modeling, from basic concepts and ER diagrams to advanced techniques like normalization and denormalization, in this comprehensive guide designed for beginners and experts alike.

What is Data Modeling?


Data modeling is the process of defining and representing the data a system uses and the relationships within it. The goal of data modeling is to illustrate the types of data used and stored within the system, the relationships among those data types, the ways the data can be grouped and organized, and its formats and attributes.

Importance of Data Modeling

  • Clarifies Data Requirements
  • Improves Database Design
  • Enhances Data Quality
  • Facilitates Data Integration
  • Supports Data Management
  • Optimizes Query Performance
  • Reduces Development Costs
  • Ensures Compliance and Governance
  • Supports Data Analytics

Types of Data Models



1. Conceptual

The conceptual data model provides a high-level overview, focusing on the entities (tables) and their relationships without going into technical detail. It is essentially a blueprint of the overall structure. Usually, Business Analysts, Data Architects, and Stakeholders build this model.

2. Logical

The logical data model adds detail to the conceptual model by defining the attributes (columns) of each entity and the relationships between them, but it does not consider how the data will be physically stored. It is typically modeled by Data Analysts, Data Modelers, and Data Architects.

3. Physical

The physical data model translates the logical model into an actual database implementation, specifying tables, columns, data types, and constraints to reflect exactly how data will be stored, accessed, and managed in the database system. This is usually done by Data Engineers and Database Engineers.
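As a rough sketch of this translation (SQLite is just one possible target; the table and column names follow the examples later in this post), the Student and Course entities could be realized as concrete tables:

```python
import sqlite3

# A tiny physical model: Student and Course realized as SQLite tables
# with concrete data types and constraints.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (
    StudentID   INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL,
    Email       TEXT UNIQUE,
    DateOfBirth TEXT              -- SQLite stores dates as ISO-8601 text
);
CREATE TABLE Course (
    CourseID   INTEGER PRIMARY KEY,
    CourseName TEXT NOT NULL,
    Credits    INTEGER CHECK (Credits > 0)
);
""")
conn.execute("INSERT INTO Student (Name, Email) VALUES (?, ?)",
             ("Alice", "alice@example.com"))
row = conn.execute("SELECT StudentID, Name FROM Student").fetchone()
print(row)  # (1, 'Alice')
```

Note how decisions that the logical model leaves open (integer keys, text dates, uniqueness of Email) are pinned down here.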

Components of Data Models


1. Entities

Entities represent objects or concepts that have data stored about them. You can think of an entity as a table in a database. Each entity is distinct in the real world with an independent existence. Examples of entities in a university database are Student, Course, Instructor, and Department.

2. Attributes

Attributes are the properties or characteristics of an entity. They describe various aspects of the entity and hold data values. For the Student entity mentioned above, attributes might include StudentID, Name, Email, and DateOfBirth. For the Course entity, attributes could be CourseID, CourseName, Credits, and DepartmentID. Think of attributes as the columns of a table in a database.

| Student Attribute |
|-------------------|
| StudentID         |
| Name              |
| Email             |
| DateOfBirth       |

| Course Attribute |
|------------------|
| CourseID         |
| CourseName       |
| Credits          |
| DepartmentID     |

3. Relationships

Relationships describe how entities interact with one another. They define the associations between different entities.

  • One-to-One: Each instance of one entity relates to one instance of another entity. In the tables below, each student has one card.

    | StudentID | Name  | Age |
    |-----------|-------|-----|
    | 1         | Alice | 20  |
    | 2         | Bob   | 22  |

    | CardID | StudentID | IssueDate  |
    |--------|-----------|------------|
    | 101    | 1         | 2023-09-01 |
    | 102    | 2         | 2023-09-02 |

  • One-to-Many: One instance of an entity relates to multiple instances of another entity. In the tables below, one professor offers multiple courses.

    | ProfessorID | Name        | Department       |
    |-------------|-------------|------------------|
    | 1           | Dr. Smith   | Computer Science |
    | 2           | Dr. Johnson | Mathematics      |

    | CourseID | CourseName      | ProfessorID |
    |----------|-----------------|-------------|
    | 1        | Algorithms      | 1           |
    | 2        | Data Structures | 1           |
    | 3        | Calculus        | 2           |

  • Many-to-Many: Multiple instances of one entity relate to multiple instances of another entity. In the tables below, students enroll in multiple courses, and each course has multiple students.

    | StudentID | Name    | Age |
    |-----------|---------|-----|
    | 1         | Alice   | 20  |
    | 2         | Bob     | 22  |
    | 3         | Charlie | 21  |

    | CourseID | CourseName      |
    |----------|-----------------|
    | 1        | Algorithms      |
    | 2        | Data Structures |
    | 3        | Calculus        |

    | EnrollmentID | StudentID | CourseID |
    |--------------|-----------|----------|
    | 1            | 1         | 1        |
    | 2            | 1         | 2        |
    | 3            | 2         | 2        |
    | 4            | 3         | 3        |
    | 5            | 3         | 1        |
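The many-to-many case is resolved through a junction table. As a hedged sketch (SQLite shown for concreteness), the Enrollment table above can be created and queried like this:

```python
import sqlite3

# Many-to-many via a junction (Enrollment) table, mirroring the
# Student/Course tables above.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Student (StudentID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Course  (CourseID  INTEGER PRIMARY KEY, CourseName TEXT);
CREATE TABLE Enrollment (
    EnrollmentID INTEGER PRIMARY KEY,
    StudentID    INTEGER REFERENCES Student(StudentID),
    CourseID     INTEGER REFERENCES Course(CourseID)
);
INSERT INTO Student VALUES (1,'Alice'),(2,'Bob'),(3,'Charlie');
INSERT INTO Course  VALUES (1,'Algorithms'),(2,'Data Structures'),(3,'Calculus');
INSERT INTO Enrollment (StudentID, CourseID)
    VALUES (1,1),(1,2),(2,2),(3,3),(3,1);
""")
# Which courses is Charlie enrolled in?
rows = conn.execute("""
    SELECT c.CourseName
    FROM Enrollment e
    JOIN Course  c ON c.CourseID  = e.CourseID
    JOIN Student s ON s.StudentID = e.StudentID
    WHERE s.Name = 'Charlie'
    ORDER BY c.CourseName
""").fetchall()
print([r[0] for r in rows])  # ['Algorithms', 'Calculus']
```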

4. Constraints

Constraints are rules applied to data to ensure its integrity and consistency within the database.

  • Primary Key: Ensures that each record in a table is unique and not null. In a Student table, the StudentID column is designated as the primary key, ensuring that every student has a unique identifier.

    | StudentID | Name  | Age |
    |-----------|-------|-----|
    | 1         | Alice | 20  |
    | 2         | Bob   | 22  |

  • Foreign Key: Ensures referential integrity by linking a column in one table to the primary key of another. In an Enrollment table, the StudentID column is a foreign key referencing the StudentID column in the Student table above, ensuring that each enrollment record corresponds to an existing student.

    | EnrollmentID | StudentID | CourseID |
    |--------------|-----------|----------|
    | 1            | 1         | 101      |
    | 2            | 2         | 102      |
    | 3            | 1         | 103      |

  • Unique Key: Ensures all values in a column or a set of columns are unique. In a User table, the Email column is a unique key, ensuring that no two users can have the same email address.

    | UserID | Username   | Email            |
    |--------|------------|------------------|
    | 1      | john_doe   | john@example.com |
    | 2      | jane_smith | jane@example.com |
    | 3      | bob_jones  | bob@example.com  |

  • Composite Key: A combination of two or more columns used to create a unique identifier for a record. In an Enrollment table, the combination of StudentID and CourseID columns forms a composite key, ensuring that each student can enroll in a specific course only once.

    | StudentID | CourseID | EnrollmentDate |
    |-----------|----------|----------------|
    | 1         | 101      | 2023-09-01     |
    | 1         | 102      | 2023-09-02     |
    | 2         | 101      | 2023-09-03     |

  • Candidate Key: A column, or set of columns, that can uniquely identify any record without any null values. In a Car table, both the VIN (Vehicle Identification Number) and LicensePlate could serve as candidate keys because either can uniquely identify a car.

    | VIN               | LicensePlate | Model        |
    |-------------------|--------------|--------------|
    | 1HGCM82633A123456 | ABC123       | Honda Accord |
    | 1HGCM82633A123457 | XYZ789       | Toyota Camry |

  • Alternate Key: A candidate key that is not the primary key. In an Employee table, if the EmployeeID is the primary key, the NationalInsuranceNumber can be an alternate key because it also uniquely identifies an employee but is not chosen as the primary key.

    | EmployeeID | NationalInsuranceNumber | Name       |
    |------------|-------------------------|------------|
    | 1          | NI123456A               | John Doe   |
    | 2          | NI789012B               | Jane Smith |

  • Super Key: A set of one or more columns that can uniquely identify a record in a table. In an Order table, the combination of OrderID, CustomerID, and OrderDate can be considered a super key, as the combination of these columns uniquely identifies each order, even though it may include unnecessary columns for uniqueness.

    | OrderID | CustomerID | OrderDate  |
    |---------|------------|------------|
    | 1       | 1001       | 2023-07-01 |
    | 2       | 1002       | 2023-07-02 |

  • Not Null: Ensures that a column cannot have a NULL value. In an Employee table, the LastName column has a NOT NULL constraint, ensuring that every employee record must have a last name.

    | EmployeeID | FirstName | LastName | Age |
    |------------|-----------|----------|-----|
    | 1          | John      | Doe      | 30  |
    | 2          | Jane      | Smith    | 25  |

  • Check: Ensures that all values in a column satisfy a specific condition. In a Product table, the Price column has a CHECK constraint to ensure that the price is always greater than zero.

    | ProductID | ProductName | Price  |
    |-----------|-------------|--------|
    | 1         | Laptop      | 999.99 |
    | 2         | Mouse       | 19.99  |
    | 3         | Keyboard    | -10.00 |

    The last row would violate the CHECK constraint.

  • Default: Assigns a default value to a column when no value is specified. In an Order table, the OrderStatus column has a DEFAULT constraint that sets the default status to ‘Pending’ if no status is provided.

    | OrderID | CustomerID | OrderDate  | OrderStatus |
    |---------|------------|------------|-------------|
    | 1       | 1001       | 2023-07-01 | Pending     |
    | 2       | 1002       | 2023-07-02 | Shipped     |
    | 3       | 1003       | 2023-07-03 |             |

    The last row's OrderStatus would default to 'Pending'.
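A few of these constraints can be exercised directly. The sketch below (SQLite; table and column names are taken from the examples above) shows a CHECK constraint rejecting a bad row and a DEFAULT filling in a missing value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (
    ProductID   INTEGER PRIMARY KEY,        -- primary key
    ProductName TEXT NOT NULL,              -- not null
    Price       REAL CHECK (Price > 0)      -- check
);
CREATE TABLE "Order" (
    OrderID     INTEGER PRIMARY KEY,
    OrderStatus TEXT DEFAULT 'Pending'      -- default
);
""")
# The CHECK constraint rejects a negative price.
try:
    conn.execute("INSERT INTO Product (ProductName, Price) VALUES ('Keyboard', -10.0)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
# The DEFAULT fills in OrderStatus when none is given.
conn.execute('INSERT INTO "Order" (OrderID) VALUES (3)')
status = conn.execute('SELECT OrderStatus FROM "Order" WHERE OrderID = 3').fetchone()[0]
print(violated, status)  # True Pending
```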

5. Indexes


Indexes are used to improve the speed of data retrieval operations on a database table. They work by creating a data structure that allows the database to find data more quickly than if it had to scan the entire table. Here are some examples of different types of indexes and when they are used:

  • Clustered Index: Determines the physical order of data in a table; the rows are stored in the same order as the index. Each table can have only one clustered index. It is used when queries search for ranges or retrieve large amounts of data in sorted order. By default, a primary key constraint creates a clustered index.

  • Non-Clustered Index: Creates a structure separate from the table data, containing a sorted list of references to the table rows; the data itself is not stored in index order. A table can have multiple non-clustered indexes. Used when queries look for exact matches or when multiple columns need to be indexed.

| Feature          | Clustered Index                              | Non-Clustered Index                            |
|------------------|----------------------------------------------|------------------------------------------------|
| Physical Order   | Determines the physical order of data        | Does not affect physical order                 |
| Number per Table | Only one                                     | Multiple                                       |
| Use Cases        | Range queries, large data set retrieval      | Exact match queries, indexing multiple columns |
| Storage          | Data stored in the order of the index        | Separate structure with pointers to data       |
| Example          | Primary key, frequently queried large tables | Search fields, composite indexes               |

  • Single Column Index: The most basic type of index, applied to a single column. Used when queries frequently filter on that column.

    Index on LastName in the Employee table.

    | EmployeeID | FirstName | LastName | Age |
    |------------|-----------|----------|-----|
    | 1          | John      | Doe      | 30  |
    | 2          | Jane      | Smith    | 25  |
    | 3          | Bob       | Johnson  | 40  |

  • Composite Index: An index on multiple columns, used when queries frequently search on a combination of columns.

    Composite index on CustomerID and OrderDate in the Order table.

    | OrderID | CustomerID | OrderDate  | Amount |
    |---------|------------|------------|--------|
    | 1       | 1001       | 2023-07-01 | 250.00 |
    | 2       | 1002       | 2023-07-02 | 150.00 |
    | 3       | 1001       | 2023-07-03 | 300.00 |

  • Full-Text Index: Supports efficient searches on large text fields. Used when queries involve searching for keywords within text columns.

    Full-text index on Content in the Article table.

    | ArticleID | Title            | Content                      |
    |-----------|------------------|------------------------------|
    | 1         | Database Indexes | Indexes are data structures… |
    | 2         | Full-Text Search | Full-text search allows…     |
    | 3         | SQL Performance  | Performance tuning in SQL…   |

  • Spatial Index: Supports efficient searches on spatial data and is used when working with geographic or geometric data.

    Spatial index on Coordinates in the Location table.

    | LocationID | Name         | Coordinates                  |
    |------------|--------------|------------------------------|
    | 1          | Central Park | POINT(40.785091 -73.968285)  |
    | 2          | Golden Gate  | POINT(37.819929 -122.478255) |
    | 3          | Eiffel Tower | POINT(48.858844 2.294351)    |

Single-column, composite, full-text, and spatial indexes can each be either clustered or non-clustered: those terms describe which columns an index covers, while clustered versus non-clustered describes how the indexed data is stored and organized in the database.
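As a small illustration (SQLite; the index and table names are made up for this sketch), EXPLAIN QUERY PLAN shows whether a query uses an index or falls back to a full table scan:

```python
import sqlite3

# SQLite picks an access path automatically; EXPLAIN QUERY PLAN reveals
# whether a query searches via an index or scans the whole table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (
    EmployeeID INTEGER PRIMARY KEY,
    FirstName  TEXT,
    LastName   TEXT,
    Age        INTEGER
);
CREATE INDEX idx_lastname  ON Employee (LastName);        -- single-column
CREATE INDEX idx_first_age ON Employee (FirstName, Age);  -- composite
""")
# The last column of each plan row describes the access path,
# e.g. "SEARCH Employee USING INDEX idx_lastname (LastName=?)".
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Employee WHERE LastName = 'Doe'"
).fetchall()
detail = plan[0][3]
print(detail)
```

The equality filter on LastName matches the single-column index; a filter on Age alone would match neither index (Age is not the leading column of the composite) and force a scan.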

6. Diagrams


Diagrams visually represent the entities, attributes, and relationships within a data model. The notation depends on the modeling technique used. Let's look at two common techniques.

  • Entity Relationship (ER) Diagrams: Illustrate entities, their attributes, and the relationships between them; this is the default technique for modeling and designing relational databases. Below is an example.

```mermaid
---
title: ER Diagram
---
erDiagram
    STUDENT {
        int StudentID PK "Primary Key, NOT NULL"
        string Name "NOT NULL"
        int Age "CHECK (Age > 0)"
    }

    COURSE {
        int CourseID PK "Primary Key, NOT NULL"
        string CourseName "NOT NULL"
        int Credits "CHECK (Credits > 0)"
    }

    ENROLLMENT {
        int EnrollmentID PK "Primary Key, NOT NULL"
        int StudentID FK "Foreign Key, NOT NULL"
        int CourseID FK "Foreign Key, NOT NULL"
        date EnrollmentDate "NOT NULL"
    }

    STUDENT ||--o{ ENROLLMENT : enrolls
    COURSE ||--o{ ENROLLMENT : includes
```

  • Unified Modeling Language (UML) Diagrams: Describe the structure of a system by showing its classes, attributes, methods, and the relationships among objects. Often used in object-oriented design. Below is an example.

```mermaid
---
title: UML Diagram
---
classDiagram
    class Student {
        -int StudentID
        -string Name
        -int Age
        +enroll(courseID)
    }

    class Course {
        -int CourseID
        -string CourseName
        -int Credits
        +assign(studentID)
    }

    class Enrollment {
        -int EnrollmentID
        -int StudentID
        -int CourseID
        -date EnrollmentDate
    }

    Student "1" -- "0..*" Enrollment : enrolls
    Course "1" -- "0..*" Enrollment : includes
```

Normalization


Normalization aims to reduce redundancy and improve data integrity. It involves dividing large tables into smaller, related tables and defining relationships between them. The goal is to ensure that each piece of data is stored only once, which minimizes redundancy and avoids anomalies.

This generally improves write performance by reducing data duplication and ensuring that updates, deletes, and inserts are handled consistently and accurately. It also eases maintenance and increases flexibility. Read performance, however, can suffer if many joins are required to assemble data from the normalized tables.

1. First Normal Form (1NF)

Ensure each column contains atomic (indivisible) values and each column contains values of a single type.

  • Original Table

    | StudentID | Name  | Courses       |
    |-----------|-------|---------------|
    | 1         | Alice | Math, Science |
    | 2         | Bob   | Math, English |

  • Normalized Table (1NF)

    | StudentID | Name  | Course  |
    |-----------|-------|---------|
    | 1         | Alice | Math    |
    | 1         | Alice | Science |
    | 2         | Bob   | Math    |
    | 2         | Bob   | English |
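The 1NF step above amounts to splitting the multi-valued Courses column into one atomic row per value, e.g.:

```python
# A minimal sketch of the 1NF transformation: one row per
# (student, course) pair instead of a comma-separated list.
raw = [
    (1, "Alice", "Math, Science"),
    (2, "Bob",   "Math, English"),
]
normalized = [
    (student_id, name, course.strip())
    for student_id, name, courses in raw
    for course in courses.split(",")
]
print(normalized)
# [(1, 'Alice', 'Math'), (1, 'Alice', 'Science'), (2, 'Bob', 'Math'), (2, 'Bob', 'English')]
```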

2. Second Normal Form (2NF)

Meet all the requirements of 1NF and ensure that all non-key attributes are fully dependent on the primary key.

  • Original Table (1NF)

    | StudentID | Name  | Course  | Instructor  |
    |-----------|-------|---------|-------------|
    | 1         | Alice | Math    | Dr. Smith   |
    | 1         | Alice | Science | Dr. Johnson |
    | 2         | Bob   | Math    | Dr. Smith   |
    | 2         | Bob   | English | Dr. Clark   |

  • Normalized Tables (2NF)

    | StudentID | Name  |
    |-----------|-------|
    | 1         | Alice |
    | 2         | Bob   |

    | Course  | Instructor  |
    |---------|-------------|
    | Math    | Dr. Smith   |
    | Science | Dr. Johnson |
    | English | Dr. Clark   |

    | StudentID | Course  |
    |-----------|---------|
    | 1         | Math    |
    | 1         | Science |
    | 2         | Math    |
    | 2         | English |

3. Third Normal Form (3NF)

Meet all the requirements of 2NF and ensure that there are no transitive dependencies, i.e. every non-key attribute depends only on the primary key.

  • Original Table (2NF)

    | CourseID | CourseName | InstructorID | InstructorName |
    |----------|------------|--------------|----------------|
    | 1        | Math       | 101          | Dr. Smith      |
    | 2        | Science    | 102          | Dr. Johnson    |
    | 3        | English    | 103          | Dr. Clark      |

  • Normalized Tables (3NF)

    | CourseID | CourseName |
    |----------|------------|
    | 1        | Math       |
    | 2        | Science    |
    | 3        | English    |

    | InstructorID | InstructorName |
    |--------------|----------------|
    | 101          | Dr. Smith      |
    | 102          | Dr. Johnson    |
    | 103          | Dr. Clark      |

    | CourseID | InstructorID |
    |----------|--------------|
    | 1        | 101          |
    | 2        | 102          |
    | 3        | 103          |
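As a minimal sketch, the 3NF decomposition above can be expressed as three projections of the 2NF rows, with duplicates removed:

```python
# Project the 2NF rows into three relations and de-duplicate,
# removing the transitive CourseID -> InstructorID -> InstructorName
# dependency.
rows = [
    (1, "Math",    101, "Dr. Smith"),
    (2, "Science", 102, "Dr. Johnson"),
    (3, "English", 103, "Dr. Clark"),
]
courses     = sorted({(cid, cname) for cid, cname, _, _ in rows})
instructors = sorted({(iid, iname) for _, _, iid, iname in rows})
teaches     = sorted({(cid, iid) for cid, _, iid, _ in rows})
print(courses)  # [(1, 'Math'), (2, 'Science'), (3, 'English')]
```

Renaming an instructor now touches exactly one row of the instructors relation, instead of every course they teach.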

Advanced Normalization


4. Boyce-Codd Normal Form (BCNF)

BCNF eliminates redundancy and certain update anomalies that can still exist in 3NF by requiring that the left-hand side of every non-trivial functional dependency be a superkey.

Example to Illustrate BCNF

Consider a scenario with a university database where we have a table that records the subjects taught by instructors in different departments.

  • Original Table (Not in BCNF)

    | Instructor | Subject    | Department |
    |------------|------------|------------|
    | Dr. Smith  | Databases  | CS         |
    | Dr. Clark  | Databases  | IT         |
    | Dr. Smith  | Networking | CS         |
    | Dr. Brown  | Networking | IT         |

  • Normalization Process to BCNF

    To achieve BCNF, we decompose the table into two tables:

    | Instructor | Subject    |
    |------------|------------|
    | Dr. Smith  | Databases  |
    | Dr. Clark  | Databases  |
    | Dr. Smith  | Networking |
    | Dr. Brown  | Networking |

    | Subject    | Department |
    |------------|------------|
    | Databases  | CS         |
    | Databases  | IT         |
    | Networking | CS         |
    | Networking | IT         |

Now, both tables are in BCNF because the left-hand side of every non-trivial functional dependency is a superkey.

5. Fourth Normal Form (4NF)

The table must be in Boyce-Codd Normal Form (BCNF) and it should not have any multi-valued dependencies.

Example to Illustrate 4NF

Consider a university database where students can have multiple phone numbers and multiple email addresses.

  • Original Table (Not in 4NF)

    | StudentID | PhoneNumber | EmailAddress      |
    |-----------|-------------|-------------------|
    | 1         | 1234567890  | alice@example.com |
    | 1         | 0987654321  | alice@uni.edu     |
    | 2         | 5555555555  | bob@example.com   |
    | 2         | 4444444444  | bob@uni.edu       |

  • Normalization Process to 4NF

    To achieve 4NF, we decompose the table into two tables:

    | StudentID | PhoneNumber |
    |-----------|-------------|
    | 1         | 1234567890  |
    | 1         | 0987654321  |
    | 2         | 5555555555  |
    | 2         | 4444444444  |

    | StudentID | EmailAddress      |
    |-----------|-------------------|
    | 1         | alice@example.com |
    | 1         | alice@uni.edu     |
    | 2         | bob@example.com   |
    | 2         | bob@uni.edu       |

    Both tables are now in 4NF because they do not have any multi-valued dependencies.

6. Fifth Normal Form (5NF)

The table must be in Fourth Normal Form (4NF) and there should be no join dependencies that are not implied by the candidate keys.

Example to Illustrate 5NF

Consider a project management database where projects can involve multiple suppliers and parts.

  • Original Table (Not in 5NF)

    | ProjectID | SupplierID | PartID |
    |-----------|------------|--------|
    | 1         | 101        | 5001   |
    | 1         | 101        | 5002   |
    | 1         | 102        | 5001   |
    | 2         | 103        | 5003   |

  • Normalization Process to 5NF

    To achieve 5NF, we decompose the table into three tables:

    | ProjectID | SupplierID |
    |-----------|------------|
    | 1         | 101        |
    | 1         | 102        |
    | 2         | 103        |

    | ProjectID | PartID |
    |-----------|--------|
    | 1         | 5001   |
    | 1         | 5002   |
    | 2         | 5003   |

    | SupplierID | PartID |
    |------------|--------|
    | 101        | 5001   |
    | 101        | 5002   |
    | 102        | 5001   |
    | 103        | 5003   |

    These tables are now in 5NF because they eliminate redundancy and allow the original table to be reconstructed through joins.

The decision to normalize beyond 3NF often depends on the specific business requirements and the nature of the data. Achieving BCNF, 4NF, and 5NF can lead to a highly normalized database, which may increase the complexity of queries and decrease performance due to the need for multiple joins. While it’s less common to encounter BCNF, 4NF, and 5NF in everyday work compared to 3NF, understanding these higher normal forms is crucial for certain complex database designs.

Denormalization


Denormalization is the process of combining normalized tables to improve read performance and simplify database queries. While normalization aims to reduce redundancy and ensure data integrity, denormalization introduces some redundancy intentionally to reduce the number of joins needed to retrieve data, thereby speeding up read operations.

Advantages of Denormalization:

  • Faster Reads
  • Simplified Queries
  • Optimized Performance

Disadvantages of Denormalization:

  • Increased Redundancy
  • Risk of Data Inconsistency
  • More Complex Updates

Example to Illustrate Denormalization

  • Normalized Tables

    | CustomerID | Name  | Email             |
    |------------|-------|-------------------|
    | 1          | Alice | alice@example.com |
    | 2          | Bob   | bob@example.com   |

    | OrderID | CustomerID | OrderDate  |
    |---------|------------|------------|
    | 101     | 1          | 2023-07-01 |
    | 102     | 2          | 2023-07-02 |

    | OrderDetailID | OrderID | ProductID | Quantity |
    |---------------|---------|-----------|----------|
    | 1001          | 101     | 501       | 2        |
    | 1002          | 101     | 502       | 1        |
    | 1003          | 102     | 503       | 5        |

    | ProductID | ProductName | Price |
    |-----------|-------------|-------|
    | 501       | Laptop      | 1000  |
    | 502       | Mouse       | 20    |
    | 503       | Keyboard    | 50    |

  • Denormalized Table

    | OrderID | CustomerID | CustomerName | Email             | OrderDate  | ProductID | ProductName | Quantity | Price |
    |---------|------------|--------------|-------------------|------------|-----------|-------------|----------|-------|
    | 101     | 1          | Alice        | alice@example.com | 2023-07-01 | 501       | Laptop      | 2        | 1000  |
    | 101     | 1          | Alice        | alice@example.com | 2023-07-01 | 502       | Mouse       | 1        | 20    |
    | 102     | 2          | Bob          | bob@example.com   | 2023-07-02 | 503       | Keyboard    | 5        | 50    |
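The denormalized rows are exactly what a join over the normalized tables produces. A sketch in SQLite (using the data above):

```python
import sqlite3

# Build the normalized tables, then materialize the denormalized view
# with a single query: redundancy traded for read speed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);
CREATE TABLE "Order" (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT);
CREATE TABLE OrderDetail (OrderDetailID INTEGER PRIMARY KEY,
                          OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, ProductName TEXT, Price REAL);
INSERT INTO Customer VALUES (1,'Alice','alice@example.com'),(2,'Bob','bob@example.com');
INSERT INTO "Order" VALUES (101,1,'2023-07-01'),(102,2,'2023-07-02');
INSERT INTO OrderDetail VALUES (1001,101,501,2),(1002,101,502,1),(1003,102,503,5);
INSERT INTO Product VALUES (501,'Laptop',1000),(502,'Mouse',20),(503,'Keyboard',50);
""")
denorm = conn.execute("""
    SELECT o.OrderID, c.Name, p.ProductName, d.Quantity, p.Price
    FROM "Order" o
    JOIN Customer    c ON c.CustomerID = o.CustomerID
    JOIN OrderDetail d ON d.OrderID    = o.OrderID
    JOIN Product     p ON p.ProductID  = d.ProductID
    ORDER BY o.OrderID, p.ProductID
""").fetchall()
for row in denorm:
    print(row)  # e.g. (101, 'Alice', 'Laptop', 2, 1000.0)
```

Storing the query's result as a table removes the three joins from every read, at the cost of repeating customer data on every order line.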

Dimensional Modeling

Dimensional modeling is a design technique used in data warehousing and business intelligence systems to structure data for querying and reporting. It organizes data into a schema that is optimized for retrieval and analysis rather than for transaction processing. The primary elements of dimensional modeling are facts and dimensions.

  • Facts: Central tables that contain quantitative data for analysis.
  • Dimensions: Tables that contain descriptive attributes related to the facts.

Advantages of Dimensional Modeling

  • Optimized for Queries
  • Simplifies Reporting
  • Scalability

1. Star Schema

The simplest form of dimensional modeling. It consists of a central fact table surrounded by dimension tables. Each dimension table is directly linked to the fact table.


```mermaid
---
title: Star Schema
---
erDiagram
    SALES {
        int SalesID
        int DateID
        int ProductID
        int StoreID
        int QuantitySold
        int TotalRevenue
    }

    DATE {
        int DateID
        date Date
        string Month
        string Quarter
        int Year
    }

    PRODUCT {
        int ProductID
        string ProductName
        string Category
        float Price
    }

    STORE {
        int StoreID
        string StoreName
        string Location
    }

    DATE    ||--o{ SALES : ""
    PRODUCT ||--o{ SALES : ""
    STORE   ||--o{ SALES : ""
```
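A typical star schema query joins the fact table to one or two dimensions and aggregates. A minimal SQLite sketch of the schema above (with illustrative data, and only the Product and Store dimensions shown):

```python
import sqlite3

# Star schema: a Sales fact table plus small dimension tables,
# then a roll-up of revenue by product category.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                      Category TEXT, Price REAL);
CREATE TABLE Store   (StoreID INTEGER PRIMARY KEY, StoreName TEXT, Location TEXT);
CREATE TABLE Sales   (SalesID INTEGER PRIMARY KEY, ProductID INTEGER,
                      StoreID INTEGER, QuantitySold INTEGER, TotalRevenue REAL);
INSERT INTO Product VALUES (1,'Laptop','Electronics',1000),
                           (2,'Mouse','Electronics',20),
                           (3,'Desk','Furniture',200);
INSERT INTO Store VALUES (1,'Downtown','NYC');
INSERT INTO Sales VALUES (1,1,1,2,2000),(2,2,1,5,100),(3,3,1,1,200);
""")
rollup = conn.execute("""
    SELECT p.Category, SUM(s.TotalRevenue)
    FROM Sales s JOIN Product p ON p.ProductID = s.ProductID
    GROUP BY p.Category ORDER BY p.Category
""").fetchall()
print(rollup)  # [('Electronics', 2100.0), ('Furniture', 200.0)]
```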

2. Snowflake Schema

A more normalized version of the star schema where dimension tables are further broken down into sub-dimension tables.


```mermaid
---
title: Snowflake Schema
---
erDiagram
    SALES {
        int SalesID
        int DateID
        int ProductID
        int StoreID
        int QuantitySold
        int TotalRevenue
    }

    DATE {
        int DateID
        date Date
        string Month
        string Quarter
        int Year
    }

    PRODUCT {
        int ProductID
        string ProductName
        int CategoryID
        float Price
    }

    CATEGORY {
        int CategoryID
        string CategoryName
    }

    STORE {
        int StoreID
        string StoreName
        string Location
    }

    DATE     ||--o{ SALES : ""
    PRODUCT  ||--o{ SALES : ""
    STORE    ||--o{ SALES : ""
    CATEGORY ||--o{ PRODUCT : ""
```


Data Modeling Tools

  1. Erwin Data Modeler - Comprehensive tool for data modeling, supports forward and reverse engineering, integrates with various databases.

  2. Microsoft Visio - Versatile diagramming tool that supports database modeling through templates and add-ons, integrates with other Microsoft Office tools.

  3. Oracle SQL Developer Data Modeler - Free tool from Oracle, supports multi-dimensional modeling, integrates with Oracle database products.

This post is licensed under CC BY 4.0 by the author.