Denormalization?

Previously, I normalized my tables to Sixth Normal Form (6NF). Now, I want to consider denormalizing, if and when it’s appropriate.

Denormalizing is different than an un-normalized schema, which is never, ever, recommended.

Strategies for denormalization appear in data-warehousing designs, specifically in OLAP star schemas. However, a snowflake schema is often a viable (and normalized) alternative in data-warehouse environments.

For all the complexity in normal form definitions [I neglected to mention Domain-Key Normal Form (DKNF) which would bring us to eight separate definitions], in practice normalizing a schema is relatively simple. Splitting a table that can be split (without data loss) is normalizing, preserving or merging is denormalizing.

Storage

The nature of relational database engines rarely justifies denormalizing, and often the consequences of denormalizing can incur explosive storage growth.

For example, consider if we add 10-million users to our previous schema, two-thousand domains, ten-thousand roles, and 200-thousand partner websites. You’d see something like this:

count(foobar_users) :: 10,000,000
count(foobar_domain_map) :: 2,000 < x < 20-billion
count(foobar_roles) :: 10,000 < x < 20,000,000
...

The mapping tables present potentially large storage requirements (although very reasonable on even modest hardware). Let’s consider denormalized examples:

-- 1NF
count(foobar_users) :: 10,000,000 < x < 40,000,000,000,000,000,000

-- 2NF
count(foobar_users) :: 10,000,000
count(foobar_partners) :: 200,000 < x < 4,000,000,000,000,000
count(foobar_role_map) :: 10,000,000 < x < 200,000,000,000,000

That’s 200-trillion and 4-quadrillion in the 2NF, and 40-quintillion in the 1NF upper bound. The reason is that these tables implicitly contain many-to-many mappings which have combinatorial growth. With terabyte storage you’d still need a few million servers in the 1NF case. The culprits are combinations of users, domains, roles, and partners — normalizing all of these didn’t happen till 4NF and 5NF.

When to Denormalize

I denormalize when there’s a justifiable reason not to split and I’m absolutely certain it won’t impact storage growth.

Third-party software (e.g, for OLAP cube analysis) may require a denormalized star scheme as an integration point. It is recommended to provide a denormalized view that abstracts the normalized schema, but not to denormalize the schema itself (e.g., a star-schema view of a snowflake schema).

For a practical example of denormalizing, let’s consider the previous schema. I purposely left-out any auto-increment id’s. If we were to add these in (and maintain 6NF) there would be a new table for each id — which would nearly double the number of tables in the schema. Let’s add only user_id, domain_id, and a login_id.

CREATE TABLE foobar_users (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(20) NOT NULL
);

CREATE TABLE foobar_domains (
    domain_id INT AUTO_INCREMENT PRIMARY KEY,
    domain VARCHAR(12) NOT NULL
);

CREATE TABLE foobar_logins (
    login_id INT AUTO_INCREMENT PRIMARY KEY,
    login_name VARCHAR(20) NOT NULL
);

CREATE TABLE foobar_user_status (
    user_id INT NOT NULL PRIMARY KEY,
    status VARCHAR(12) NOT NULL
);

CREATE TABLE foobar_dept_map (
    user_id INT NOT NULL PRIMARY KEY,
    dept VARCHAR(20) NOT NULL
);

CREATE TABLE foobar_roles (
    domain_id INT NOT NULL,
    role VARCHAR(12) NOT NULL,
    PRIMARY KEY(domain_id, role)
);

CREATE TABLE foobar_role_map (
    user_id INT NOT NULL,
    role VARCHAR(12) NOT NULL,
    PRIMARY KEY(user_id, role)
);

CREATE TABLE foobar_login_map (
    user_id INT NOT NULL PRIMARY KEY,
    domain_id INT NOT NULL,
    login_id INT NOT NULL
);

CREATE TABLE foobar_login_pw (
    login_id INT NOT NULL PRIMARY KEY,
    login_pw VARCHAR(32) NOT NULL
);

CREATE TABLE foobar_partners (
    partner VARCHAR(20) NOT NULL PRIMARY KEY,
    website VARCHAR(255) NOT NULL,
    registration VARCHAR(12) NOT NULL
);

CREATE TABLE foobar_domain_map (
    user_id INT NOT NULL,
    domain_id INT NOT NULL,
    PRIMARY KEY(user_id, domain_id)
);

CREATE TABLE foobar_partner_map (
    user_id INT NOT NULL,
    partner VARCHAR(20) NOT NULL,
    PRIMARY KEY(user_id, partner)
);

I want to keep this manageable, so I will collapse the simple tables, and preserve the new auto-increment id’s. Done correctly, I can denormalize without impacting any of the functional requirements that were solved by normalizing:

-- added user_id and collapsed status and dept back into table
CREATE TABLE foobar_users (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(20) NOT NULL,
    status VARCHAR(12) NOT NULL, 
    dept VARCHAR(20) NOT NULL
);

-- added domain_id
CREATE TABLE foobar_domains (
    domain_id INT AUTO_INCREMENT PRIMARY KEY,
    domain VARCHAR(20) NOT NULL
);

CREATE TABLE foobar_roles (
    domain_id INT NOT NULL,
    role VARCHAR(12) NOT NULL,
    PRIMARY KEY(domain_id, role)
);

-- added login_id and login_name, and collapsed with login_pw
CREATE TABLE foobar_logins (
    login_id INT AUTO_INCREMENT PRIMARY KEY,
    login_name VARCHAR(20) NOT NULL,
    login_pw VARCHAR(32) NOT NULL
);

CREATE TABLE foobar_partners (
    partner VARCHAR(20) NOT NULL PRIMARY KEY,
    website VARCHAR(255) NOT NULL,
    registration VARCHAR(12) NOT NULL
);

CREATE TABLE foobar_role_map (
    user_id INT NOT NULL,
    role VARCHAR(12) NOT NULL,
    PRIMARY KEY(user_id, role)
);

CREATE TABLE foobar_domain_map (
    user_id INT NOT NULL,
    domain_id INT NOT NULL,
    login_id INT NOT NULL,
    PRIMARY KEY(user_id, domain_id)
);

CREATE TABLE foobar_partner_map (
    user_id INT NOT NULL,
    partner VARCHAR(20) NOT NULL,
    PRIMARY KEY(user_id, partner)
);

This is about as denormalized as I would go in production code. In fact, each table appears to be in DKNF, and most are in 6NF. Importantly, I’ve normalized all mappings to avoid explosive storage requirements and denormalized the non-mapping tables into appropriate domains.