Mastering Database Normalization: Best Practices and Techniques

Database normalization is a technique used to organize data in a relational database. It involves breaking down a large table into smaller tables and defining relationships between them. The normalization goal is to eliminate redundant data and ensure that each piece of information is stored in only one place.

Relational databases are the backbone of many software systems. They allow us to store, manage, and retrieve data in an organized and efficient way. However, as the size and complexity of our data grow, so do the challenges of maintaining its integrity and consistency. This is where database normalization comes in. 

Database Normalization
Database Normalization

Normalizing a database can improve data integrity and reduce data anomalies. 

In this article, we’ll discuss what normalization is, the various normal forms in SQL, the problems that can arise without normalization, and how normalization improves the performance of relational databases.

What Is Normalization in Database?

Database normalization is the process of organizing data in a database to minimize redundancy and dependency.
The goal is to ensure that each piece of data is stored in one place and in one format. This improves data integrity, eliminates data duplication, and simplifies data management.

Normalization is achieved by breaking down a larger table into smaller tables and establishing relationships between them. Relationships are defined between tables using foreign keys.
This process is called normalization because it brings the data to a “normal” form that is easier to work with and maintain.

Normalization is also essential for properly functioning database operations, including queries and joins.

List of Normal Forms in SQL

There are several normal forms in SQL, each with its own set of rules. The most common normal forms are:

  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Boyce and Codd Normal Form (BCNF)
  • Fourth Normal Form (4NF)

Problems Without Using Database Normalization

A database not being normalized correctly can lead to data anomalies. Three types of anomalies can occur insertion, updation, and deletion.

01. Insertion Anomaly

An Insertion anomaly occurs when you can’t add data to a table without adding unrelated data. For example, imagine you have a table that contains customer orders. If you don’t normalize the data, you might include the customer’s address in the same table. It means you can’t add an order for a new customer without their address, even if you don’t have that information yet.

02. Updation Anomaly

An Updation anomaly occurs when you have to update the same data in multiple places. For example, imagine you have a table that contains customer orders. If you don’t normalize the data, you might include the customer’s name and address in the same table. If a customer moves, you must update their address in every order they’ve placed.

03. Deletion Anomaly

A deletion anomaly occurs when you delete data from a table and unintentionally delete other data as well.

Specifically, deletion anomaly occurs when a row of data is deleted from a table containing the information required by other tables in the database. This can result in the loss of information that is necessary for the other tables to function properly.

For example, consider a database containing Orders and Customers tables. Each order is associated with a customer, and the Customers table contains information about each customer. If a customer places an order and deletes their account, a deletion anomaly may occur if their customer information is deleted along with their account. 

This would result in the loss of important information about the customer that is required for the Orders table to function properly.

It is important to design databases with data integrity in mind to prevent deletion anomalies, including establishing relationships between tables and enforcing referential integrity constraints. 

This helps ensure that deleting data from one table does not result in the unintended loss of information from other tables in the database.

Example: Database Normalization

Before Normalization, the data is stored in a single table with redundant information. After Normalization, the data is separated into multiple tables to eliminate redundancy and improve data consistency.

Let’s consider a table that stores information about customers and their orders. Here’s what the table could look like before normalization:

CustomerIDCustomerNameCustomerAddressOrderIDOrderDateOrderAmount
1Robert Smith123 Main St.12023-01-01$100.00
2Maria Doe456 Elm St.22023-01-02$200.00
1Robert Smith123 Main St.32023-01-03$50.00
3Bob Johnson789 Oak St.42023-01-04$75.00

As you can see, there is some redundancy in the table. For example, Robert Smith’s information is repeated twice, leading to data inconsistencies and making it more challenging to update the information in the table.

Now, let’s normalize this table. First, we can create a separate table for customers and their information:

Here is an example of how the data in the tables might look after normalization:

Table 1: Customers

CustomerIDCustomerNameCustomerAddress
1Robert Smith123 Main St.
2Maria Doe456 Elm St.
3Bob Johnson789 Oak St.

And then, we can create a separate table for orders and their information:

Table 2: Orders

OrderIDOrderDateOrderAmountCustomerID
12023-01-01$100.001
22023-01-01$200.002
32023-01-01$50.001
42023-01-01$75.003

In the above example, the Customers table contains unique customer information, while the Orders table contains information specific to each order, including the customer who placed the order.

Thus, Normalization eliminates redundancy and improves data consistency by separating the data into multiple tables and establishing relationships between them.

First Normal Form (1NF)

First Normal Form (1NF) is a fundamental concept in database normalization that helps eliminate data redundancy and ensure data integrity. The primary goal of 1NF is to eliminate repeating groups and ensure that each field in a table contains a single atomic value.

A table that conforms to 1NF must meet the following requirements:

  • Atomic values: Every cell in a table must contain a single, atomic value.
  • Unique column names: Each column in a table must have a unique name.
  • Same data type: Entries in a column must be of the same data type.
  • Ordered rows: Each row in a table must be uniquely identifiable, such as through a primary key.
  • The order in which data is stored does not matter.

Example: First Normal Form (1NF)

To convert the Customers table to 1NF, we need to ensure that each attribute has a single value, and there are no repeating groups of attributes. Here is an example of how we can achieve 1NF for the Customers table:

Original Customers table:

CustomerIDCustomerNameCustomerAddress
1Robert Smith123 Main St.
2Maria Doe456 Elm St.
3Bob Johnson789 Oak St.

1NF Customers table:

CustomerIDFirstNameLastNameStreetNumberStreetNameCityStateZipCode
1RobertSmith123Main St.AnytownCA12345
2MariaDoe456Elm St.AnytownCA12345
3BobJohnson789Oak St.AnytownCA12345

Second Normal Form (2NF)

The Second Normal Form (2NF) is achieved when a table is in 1NF, and all the non-key attributes fully depend on the primary key. This means that every non-key attribute must depend on the whole primary key and not just a part of it. If there are any partial dependencies in a table, it is not in 2NF.

  • A table must first meet all requirements of the First Normal Form (1NF) to satisfy the Second Normal Form,  
  • Any data subsets that are common to multiple rows in a table should be separated and placed into different tables. 
  • These new tables should establish relationships with their parent tables by using foreign keys.

To explain this further, let’s use the example tables from the previous question:

CustomerIDFirstNameLastNameStreetNumberStreetNameCityStateZipCode
1RobertSmith123Main St.AnytownCA12345
2MariaDoe456Elm St.AnytownCA12345
3BobJohnson789Oak St.AnytownCA12345
OrderIDOrderDateOrderAmountCustomerID
12023-01-01$100.001
22023-01-01$200.002
32023-01-01$50.001
42023-01-01$75.003

Explanation:

  • The table is already in 1NF, with no repeating groups or columns.
  • The primary key for the Orders table is the OrderID column, and CustomerID is a foreign key that references the Customer table’s primary key.
  • In the Customer table, the columns are directly related to the primary key CustomerID.
  • In the Orders table, the columns are directly related to the primary key OrderID, not the foreign key CustomerID.

Here, we have eliminated the partial dependency by creating two tables and establishing a relationship between them using the CustomerID column as a foreign key in the orders table.

By splitting the table, we have eliminated the data redundancy that occurs when data is repeated in a table. This ensures that each table contains only unique data, making it easy to update, maintain and query the database.

In summary, the Second Normal Form (2NF) ensures that all non-key attributes in a table depend on the entire primary key, not just part of it. By eliminating partial dependencies, we can reduce data redundancy and improve data integrity in our database.

Third Normal Form (3NF)

The Third Normal Form (3NF) requires that a table must be in Second Normal Form (2NF) and should not have any transitive dependencies. Transitive dependency means when a non-key column is determined by another non-key column, which is determined by the primary key. This can cause data redundancy and inconsistency.

In simpler terms, a table is in 3NF if it has been normalized to the point where each non-key column is only dependent on the primary key of the table.

The rules for Third Normal Form (3NF) are as follows:

  1. The table must already be in 2NF.
  2. Every non-key column must be directly dependent on the primary key.
  3. There should be no transitive dependency between non-key columns.

Example: Third Normal Form (3NF)

Here, We will understand transitive dependency using our existing tables.

Let’s take the example of the Customers table we have been using:

CustomerIDCustomerNameCustomerAddress
1Robert Smith123 Main St.
2Maria Doe456 Elm St.
3Bob Johnson789 Oak St.

Suppose we add a new column CustomerState to the table to track the state where the customer lives:

CustomerIDCustomerNameCustomerAddressCustomerState
1Robert Smith123 Main St.New York
2Maria Doe456 Elm St.California
3Bob Johnson789 Oak St.New York

Here, we have introduced a transitive dependency between CustomerState and CustomerID through CustomerAddress. That is, CustomerState is determined by CustomerAddress, which is determined by CustomerID.

This can cause data redundancy and inconsistency because if the CustomerAddress for a customer changes, then we need to update the CustomerState for that customer as well. This can lead to inconsistencies if we forget to update one of the fields.

To remove this transitive dependency, we can split the Customer table into two separate tables, one for the customers and one for their addresses:

So finally, we will now have a total of three tables:

  • Customers
  • Addresses
  • Orders

Table 1: Customers

CustomerIDCustomerName
1Robert Smith
2Maria Doe
3Bob Johnson

Table 2: Addresses

CustomerIDCustomerAddressCustomerState
1123 Main St.New York
2456 Elm St.California
3789 Oak St.New York

Here, the CustomerID column is the primary key for the Customers table and a foreign key in the Addresses table. This way, we can avoid transitive dependency and store the data more efficiently and consistently.

Table 3: Orders

OrderIDOrderDateOrderAmountCustomerID
12023-01-01$100.001
22023-01-01$200.002
32023-01-01$50.001
42023-01-01$75.003

In conclusion, the Third Normal Form (3NF) helps to remove the transitive dependencies and makes the table more consistent and reliable. It is important to note that normalization does not guarantee performance optimization but helps maintain data consistency, which can lead to better performance.

Boyce and Codd Normal Form (BCNF or 3.5NF)

Boyce and Codd Normal Form (BCNF) is a higher level of database normalization, which eliminates more complex types of data anomalies that cannot be resolved by a third normal form (3NF). BCNF is named after R. Boyce and E.F. Codd, who proposed this normal Form in the 1970s.

Boyce-Codd Normal Form (BCNF) is a higher level of database normalization beyond 3NF. It is based on the functional dependencies within a table and ensures that all attributes depend only on the candidate keys.

The rules for BCNF are as follows:

  • Meet all the requirements of the third normal form (3NF).
  • Every determinant in the table must be a candidate key.
  • No non-prime attribute should be functionally dependent on any proper subset of a candidate key.

In simpler terms, BCNF states that for a table to be in BCNF:

  • Every non-trivial functional dependency should be a dependency on a superkey.
  • Every candidate key in the table should have a non-trivial functional dependency on all the attributes in the table.

Candidate key

A candidate key refers to one or a combination of columns in a database table that can be used as a unique identifier. Having more than one candidate key in a single table is possible, and each can be considered a primary key.

primary key

The primary key is the unique identifier for a table, and it’s used to enforce data integrity by ensuring that each record in the table is uniquely identified. The primary key must be unique, not null, and can’t be changed or updated once set. Only one candidate key can be designated as the primary key.

To illustrate the concept of BCNF, consider the following table:

Here’s how the OrderDetails table may look like before normalizing it to BCNF:

Table: OrderDetails

OrderIDProductIDProductNameProductDescriptionSupplierIDSupplierNameSupplierCountryQuantityPrice
11001iPhone 12Smartphone1AppleUSA2$2400
11002AirPods ProWireless Earphones1AppleUSA1$249
21003Galaxy S21Smartphone2SamsungSouth Korea1$799
21004Galaxy BudsWireless Earbuds2SamsungSouth Korea2$199
31001iPhone 12Smartphone1AppleUSA1$1200
31005MacBook ProLaptop1AppleUSA1$1500

After normalizing OrderDetails to BCNF, It’s split into following tables:

Table: Products

ProductIDProductNameProductDescription
1001iPhone 12Smartphone
1002AirPods ProWireless Earphones
1003Galaxy S21Smartphone
1004Galaxy BudsWireless Earbuds
1005MacBook Pro 2021Laptop

Primary Key: ProductID

Table: Suppliers

SupplierIDSupplierNameSupplierCountry
1AppleUSA
2SamsungSouth Korea

Primary Key: SupplierID

Table: OrderDetails

OrderIDProductIDQuantityPrice
110012$1200
110021$249
210031$799
210042$199
310011$600
310051$1500
  • Candidate key: (OrderID, ProductID)

Table: ProductSupplier

ProductIDSupplierID
10011
10021
10032
10042
10051

Candidate Key: (ProductID, SupplierID)

Table: OrderSupplier

OrderIDSupplierID
11
22
31

Candidate Key: (OrderID, SupplierID)

Fourth Normal Form (4NF)

The Fourth Normal Form (4NF) is a level of database normalization that reduces data redundancy and improves data integrity. It deals with multi-valued dependencies, which occur when a table contains two or more independent multi-valued facts about an entity.

In simpler terms, 4NF helps to break down complex information into smaller, more manageable pieces, ensuring that each piece is only stored in one place.

The rules of the Fourth Normal Form (4NF) are as follows:

  • The table must already be in Third Normal Form (3NF)
  • There must be no non-trivial multi-valued dependencies between the candidate keys of the table

Example: 4NF

To illustrate this, let’s consider the following example. Suppose we have a table called “Books” with columns for “Book ID”, “Title”, “Author”, and “Genres”. In this case, we have a multi-valued dependency because a single book can have multiple genres, and each genre can be associated with multiple books.

To apply 4NF, we need to create two tables:

  • Genres and
  • Books

The “Genres” table contains a list of book genres, each identified by a unique “GenreID” column. The “Books” table contains information about individual books, including a primary key, “BookID”, the book title and author name, and a foreign key “GenreID” that references the “Genres” table.

After 4NF, the “Books” table would look like this:

BookIDBookTitleAuthorNameGenreID
1The Catcher in the RyeJ.D. Salinger1
21984George Orwell3
3The Da Vinci CodeDan Brown4
4Pride and PrejudiceJane Austen5
5The Great GatsbyF. Scott Fitzgerald1
6The HobbitJ.R.R. Tolkien3
7The Hunger GamesSuzanne Collins3
8The Girl with the Dragon TattooStieg Larsson4

After applying 4NF, the “Genres” table would look like this:

GenreIDGenreName
1Fiction
2Non-Fiction
3Science Fiction
4Mystery
5Romance

How does data normalization improve the performance of relational databases?

Data normalization improves the performance of relational databases in the following ways:

  1. Reduces data redundancy: By eliminating data redundancy, normalization reduces the amount of data stored in the database, which reduces the storage requirements and can lead to faster queries.
  2. Improves data consistency: Normalization helps ensure that data is consistent by eliminating update anomalies, which can cause inconsistencies when data is updated in one place but not in others.
  3. Reduces the number of null values: By organizing data into separate tables, normalization reduces the number of null values in the database, improving query performance.
  4. Simplifies queries: Normalization simplifies queries by reducing the need to join multiple tables together to obtain data.
  5. Enables better indexing: Normalized tables can be indexed more effectively, improving query performance.
  6. Facilitates database maintenance: Normalization makes it easier to maintain the database by reducing the need for a complex update and delete operations.

FAQs:

Q: What is Database Normalization? 

Database normalization is the process of organizing a database to reduce redundancy and dependency among tables. It aims to improve the integrity and efficiency of a database by reducing data duplication.

Q: Why is database normalization necessary? 

Database Normalization is important because it helps to eliminate data redundancy and dependency in a database. Doing so ensures that a database is consistent and that data is stored in a way that is easy to access and maintain.

Q: What are the different normal forms in database normalization?

There are several Normal Forms in database normalization, including the
First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), and Fourth Normal Form (4NF).

Q: What is the purpose of each normal form? 

The purpose of each normal form is to eliminate a specific type of data redundancy or dependency.
For example, 1NF eliminates repeating groups, 2NF eliminates partial dependencies, 3NF eliminates transitive dependencies, BCNF eliminates certain types of functional dependencies, and 4NF eliminates multi-valued dependencies.

Q: How does normalization improve database performance?

Normalization improves database performance by reducing the amount of data duplication and inconsistency, making it easier to access and maintain data. This results in faster queries, reduced storage space, and improved database integrity.

Q: Is normalization always necessary?

Normalization is not always necessary, and it depends on the specific requirements and constraints of the database.
In some cases, it may be acceptable to have redundant or dependent data if it improves performance or meets other business requirements.

References: WikiPedia-Database Normalization

You might want to read this too:

Shekh Ali
4 1 vote
Article Rating
Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments