Lesson 06: Normalization (1NF through BCNF)

Lesson 06: Normalization (1NF through BCNF)

Previous: 05_Functional_Dependencies.md | Next: 07_Advanced_Normalization.md


Topic: Database Theory Lesson: 6 of 16 Prerequisites: Functional dependencies, attribute closure, minimal cover (Lesson 05) Objective: Understand normalization from 1NF through BCNF, master decomposition algorithms, verify lossless-join and dependency-preservation properties, and apply normalization to real-world schemas

1. Introduction

Normalization is the process of organizing a relational database schema to reduce redundancy and eliminate certain types of data anomalies. Introduced by Edgar F. Codd in 1970, it provides a systematic, theory-driven approach to schema design.

1.1 The Problem: Poor Schema Design

Consider this single-relation design for a university:

UniversityCourse(
    student_id, student_name, student_addr,
    course_id, course_title, dept_name, dept_building,
    instructor_id, instructor_name,
    grade, semester
)

This "universal relation" stores everything in one table. While it works for simple queries, it suffers from serious problems.

1.2 Data Anomalies

Update Anomaly

If the Computer Science department moves to a new building, we must update dept_building in every row where dept_name = 'Computer Science'. If we miss even one row, the data becomes inconsistent.

Before:
| course_id | dept_name | dept_building |
|-----------|-----------|---------------|
| CS101     | CS        | Watson Hall   |
| CS201     | CS        | Watson Hall   |    โ† must update ALL rows
| CS301     | CS        | Watson Hall   |

If we update only the first row:
| CS101     | CS        | Taylor Hall   |    โ† updated
| CS201     | CS        | Watson Hall   |    โ† inconsistent!
| CS301     | CS        | Watson Hall   |    โ† inconsistent!

Insertion Anomaly

We cannot record a new department (e.g., its name and building) unless a student enrolls in one of its courses, because student_id is part of the primary key.

Deletion Anomaly

If the last student enrolled in a course drops it, we lose not just the enrollment data but also the course title, instructor assignment, and department information.

1.3 Root Cause: Redundancy from FD Violations

All three anomalies stem from the same root cause: attributes that depend on only part of the key, or on non-key attributes, are stored redundantly. Normalization eliminates this redundancy through systematic decomposition guided by functional dependencies.

1.4 Goals of Normalization

  1. Eliminate redundancy: Each fact is stored exactly once
  2. Prevent anomalies: Updates, insertions, and deletions are clean
  3. Preserve information: No data is lost during decomposition (lossless join)
  4. Preserve constraints: FDs are still enforceable (dependency preservation)

2. First Normal Form (1NF)

2.1 Definition

Definition: A relation is in First Normal Form (1NF) if: 1. All attributes contain only atomic (indivisible) values 2. There are no repeating groups or arrays 3. Each row is uniquely identifiable (has a primary key)

1NF is the baseline requirement for being a valid relation in the relational model.

2.2 Violations and Fixes

Violation 1: Non-atomic values

| student_id | name       | phone_numbers          |
|------------|------------|------------------------|
| 101        | Alice      | 555-1234, 555-5678     |    โ† multi-valued!
| 102        | Bob        | 555-9999               |

Fix: Create a separate row for each phone number, or a separate table:

Student(student_id, name)
StudentPhone(student_id, phone_number)

| student_id | phone_number |
|------------|--------------|
| 101        | 555-1234     |
| 101        | 555-5678     |
| 102        | 555-9999     |

Violation 2: Repeating groups

| order_id | item1  | qty1 | item2  | qty2 | item3  | qty3 |
|----------|--------|------|--------|------|--------|------|
| 1001     | Pen    | 5    | Paper  | 10   | NULL   | NULL |

Fix: Normalize into two tables:

Order(order_id, order_date, customer_id)
OrderItem(order_id, item_name, quantity)

2.3 1NF and the Relational Model

In strict relational theory, a relation is by definition in 1NF โ€” the relational model doesn't allow non-atomic domains. However, in practice, many systems allow arrays (PostgreSQL int[]), JSON columns, or comma-separated values. While sometimes useful for performance, these violate the spirit of 1NF and make FD-based reasoning difficult.


3. Second Normal Form (2NF)

3.1 Partial Dependency

Definition: A partial dependency exists when a non-prime attribute (an attribute not part of any candidate key) is functionally dependent on a proper subset of a candidate key.

In other words, some attribute depends on only part of the key, not the whole key.

3.2 Definition

Definition: A relation is in Second Normal Form (2NF) if: 1. It is in 1NF, and 2. Every non-prime attribute is fully functionally dependent on every candidate key (no partial dependencies)

Note: 2NF is only relevant when a candidate key is composite (has more than one attribute). If all candidate keys are single attributes, the relation is automatically in 2NF.

3.3 Example

Consider:

StudentCourse(student_id, course_id, student_name, grade)

Candidate key: {student_id, course_id}

FDs: - {student_id, course_id} โ†’ grade (full dependency) - student_id โ†’ student_name (partial dependency! student_name depends on only part of the key)

This violates 2NF.

Decomposition to achieve 2NF:

Student(student_id, student_name)
    Key: {student_id}
    FD: student_id โ†’ student_name

Enrollment(student_id, course_id, grade)
    Key: {student_id, course_id}
    FD: {student_id, course_id} โ†’ grade

3.4 Formal Test for 2NF

For each candidate key K and each non-prime attribute A: 1. Check if there exists a proper subset X โŠ‚ K such that X โ†’ A 2. If any such partial dependency exists, the relation is not in 2NF


4. Third Normal Form (3NF)

4.1 Transitive Dependency

Definition: A transitive dependency exists when a non-prime attribute A depends on another non-prime attribute B, which in turn depends on a candidate key K:

K โ†’ B โ†’ A, where B is not a superkey and A is not part of any candidate key.

4.2 Definition

Definition: A relation schema R is in Third Normal Form (3NF) if for every non-trivial functional dependency X โ†’ A where A is a single attribute: 1. X is a superkey of R, OR 2. A is a prime attribute (member of some candidate key)

Equivalently: no non-prime attribute is transitively dependent on any candidate key.

4.3 Example

Consider:

Employee(emp_id, dept_id, dept_name, dept_location)

Candidate key: {emp_id}

FDs: - emp_id โ†’ dept_id, dept_name, dept_location - dept_id โ†’ dept_name, dept_location

The dependency emp_id โ†’ dept_name is transitive through dept_id: - emp_id โ†’ dept_id โ†’ dept_name

This violates 3NF because dept_name depends on dept_id (a non-superkey) and dept_name is not a prime attribute.

Decomposition to achieve 3NF:

Employee(emp_id, dept_id)
    Key: {emp_id}
    FD: emp_id โ†’ dept_id

Department(dept_id, dept_name, dept_location)
    Key: {dept_id}
    FD: dept_id โ†’ dept_name, dept_location

4.4 The Importance of "Prime Attribute" Exception

The condition "A is a prime attribute" in 3NF is what distinguishes it from BCNF. This exception allows certain FDs where the dependent is part of a candidate key, even if the determinant is not a superkey. This exception is what makes 3NF decomposition always dependency-preserving (unlike BCNF).


5. Boyce-Codd Normal Form (BCNF)

5.1 Definition

Definition: A relation schema R is in Boyce-Codd Normal Form (BCNF) if for every non-trivial functional dependency X โ†’ Y:

X is a superkey of R.

BCNF is strictly stronger than 3NF. It eliminates the "prime attribute" exception: every determinant must be a superkey, period.

5.2 Relationship: 3NF vs BCNF

BCNF โŠ‚ 3NF โŠ‚ 2NF โŠ‚ 1NF

Every BCNF relation is in 3NF.
Every 3NF relation is in 2NF.
Every 2NF relation is in 1NF.

But not vice versa.

5.3 Example: 3NF but Not BCNF

Consider:

TeachingAssignment(student_id, course, instructor)

Business rules: - Each instructor teaches exactly one course: instructor โ†’ course - Each student has exactly one instructor per course: {student_id, course} โ†’ instructor - A student can have only one instructor for a given course

Candidate keys: {student_id, course} and {student_id, instructor}

FDs: - {student_id, course} โ†’ instructor - {student_id, instructor} โ†’ course - instructor โ†’ course

Check 3NF for instructor โ†’ course: - instructor is not a superkey โœ— - course is a prime attribute (part of candidate key {student_id, course}) โœ“

So the relation is in 3NF (condition 2 of the 3NF definition is satisfied).

Check BCNF for instructor โ†’ course: - instructor is not a superkey โœ—

So the relation is not in BCNF.

5.4 When 3NF and BCNF Differ

3NF and BCNF differ only when: 1. There are multiple overlapping candidate keys, AND 2. A non-superkey attribute determines part of a candidate key

In practice, this situation is relatively rare. Most relations that are in 3NF are also in BCNF.


6. Decomposition Properties

When we decompose a relation to achieve a higher normal form, we must ensure the decomposition is "good." Two critical properties define what "good" means.

6.1 Lossless-Join Property

Definition: A decomposition of R into Rโ‚, Rโ‚‚, ..., Rโ‚™ has the lossless-join property if for every legal instance r of R:

r = ฯ€_{Rโ‚}(r) โ‹ˆ ฯ€_{Rโ‚‚}(r) โ‹ˆ ... โ‹ˆ ฯ€_{Rโ‚™}(r)

In other words, we can reconstruct the original relation by naturally joining the decomposed relations. No information is lost.

Binary Decomposition Test

For a decomposition of R into Rโ‚ and Rโ‚‚:

Theorem: The decomposition is lossless-join if and only if:

(Rโ‚ โˆฉ Rโ‚‚) โ†’ (Rโ‚ - Rโ‚‚) โˆˆ Fโบ, or (Rโ‚ โˆฉ Rโ‚‚) โ†’ (Rโ‚‚ - Rโ‚) โˆˆ Fโบ

The common attributes must functionally determine all of one side. Equivalently, the common attributes must be a superkey of at least one of the decomposed relations.

Example

Decompose Employee(emp_id, dept_id, dept_name) into: - Rโ‚(emp_id, dept_id) and Rโ‚‚(dept_id, dept_name)

Common attributes: {dept_id} Rโ‚‚ - Rโ‚ = {dept_name}

Is dept_id โ†’ dept_name in Fโบ? Yes!

So this decomposition is lossless-join. โœ“

Why Lossless-Join Matters

Without the lossless-join property, joining the decomposed tables produces spurious tuples โ€” rows that didn't exist in the original relation:

Original:
| A | B | C |
|---|---|---|
| 1 | 2 | 3 |
| 4 | 2 | 5 |

Decompose into R1(A,B) and R2(B,C):
R1:          R2:
| A | B |    | B | C |
|---|---|    |---|---|
| 1 | 2 |    | 2 | 3 |
| 4 | 2 |    | 2 | 5 |

R1 โ‹ˆ R2:
| A | B | C |
|---|---|---|
| 1 | 2 | 3 |    โ† original โœ“
| 1 | 2 | 5 |    โ† SPURIOUS! โœ—
| 4 | 2 | 3 |    โ† SPURIOUS! โœ—
| 4 | 2 | 5 |    โ† original โœ“

Here, B does not determine A or C, so the decomposition is lossy (not lossless).

6.2 Dependency Preservation

Definition: A decomposition of R into Rโ‚, Rโ‚‚, ..., Rโ‚™ is dependency-preserving if:

(Fโ‚ โˆช Fโ‚‚ โˆช ... โˆช Fโ‚™)โบ = Fโบ

where Fแตข is the set of FDs in Fโบ that involve only attributes of Rแตข.

In simpler terms: every FD in the original F can be verified by checking constraints within individual decomposed relations, without needing to join tables.

Why Dependency Preservation Matters

If a decomposition is not dependency-preserving, some FDs can only be enforced by joining multiple tables โ€” this is expensive and often impractical. Without dependency preservation, the DBMS cannot efficiently maintain data integrity.

Example

R(A, B, C) with F = { A โ†’ B, B โ†’ C }

Decompose into Rโ‚(A, C) and Rโ‚‚(B, C).

FDs on Rโ‚: Can we enforce A โ†’ B? No โ€” B is not in Rโ‚. FDs on Rโ‚‚: Can we enforce A โ†’ B? No โ€” A is not in Rโ‚‚.

The FD A โ†’ B is not preserved. To check it, we'd need to join Rโ‚ and Rโ‚‚.

A better decomposition: Rโ‚(A, B) and Rโ‚‚(B, C) โ€” this preserves both A โ†’ B and B โ†’ C.

6.3 Can We Always Have Both?

Normal Form Lossless Join Dependency Preservation
3NF Always achievable โœ“ Always achievable โœ“
BCNF Always achievable โœ“ Not always โœ—

This is the key tradeoff: BCNF is a stricter normal form, but achieving it may sacrifice dependency preservation. 3NF guarantees both properties.


7. The Lossless-Join Test Algorithm (For n-ary Decomposition)

For decompositions into more than two relations, we use a tabular algorithm.

7.1 Algorithm (Chase Test)

ALGORITHM: LosslessJoinTest(R, F, {Rโ‚, Rโ‚‚, ..., Rโ‚™})
INPUT:  R = {Aโ‚, Aโ‚‚, ..., Aโ‚˜} (m attributes)
        F = set of FDs
        {Rโ‚, Rโ‚‚, ..., Rโ‚™} = decomposition
OUTPUT: TRUE if lossless-join, FALSE otherwise

Step 1: Create an n ร— m matrix.
        Row i corresponds to Rแตข, column j to attribute Aโฑผ.

Step 2: Initialize the matrix:
        If Aโฑผ โˆˆ Rแตข, set entry[i][j] = aโฑผ (distinguished symbol)
        Otherwise, set entry[i][j] = bแตขโฑผ (subscripted symbol)

Step 3: REPEAT
            FOR EACH FD (X โ†’ Y) IN F DO
                Find all rows that agree on all columns in X
                For those rows, make their Y-columns equal:
                    If any row has aโฑผ for some column in Y, set all matching rows to aโฑผ
                    Otherwise, pick one bแตขโฑผ and set all matching rows to that value
            END FOR
        UNTIL no change occurs

Step 4: If any row has all distinguished symbols (aโ‚, aโ‚‚, ..., aโ‚˜), RETURN TRUE.
        Otherwise, RETURN FALSE.

7.2 Worked Example

R(A, B, C, D, E), F = { A โ†’ C, B โ†’ C, C โ†’ D, DE โ†’ C, CE โ†’ A }

Decomposition: Rโ‚(A, D), Rโ‚‚(A, B), Rโ‚ƒ(B, E), Rโ‚„(C, D, E), Rโ‚…(A, E)

Step 1-2: Initial matrix

A B C D E
Rโ‚ aโ‚ bโ‚โ‚‚ bโ‚โ‚ƒ aโ‚„ bโ‚โ‚…
Rโ‚‚ aโ‚ aโ‚‚ bโ‚‚โ‚ƒ bโ‚‚โ‚„ bโ‚‚โ‚…
Rโ‚ƒ bโ‚ƒโ‚ aโ‚‚ bโ‚ƒโ‚ƒ bโ‚ƒโ‚„ aโ‚…
Rโ‚„ bโ‚„โ‚ bโ‚„โ‚‚ aโ‚ƒ aโ‚„ aโ‚…
Rโ‚… aโ‚ bโ‚…โ‚‚ bโ‚…โ‚ƒ bโ‚…โ‚„ aโ‚…

Step 3: Apply FDs iteratively

Apply A โ†’ C: Rows with same A value. - Rows Rโ‚, Rโ‚‚, Rโ‚… have A = aโ‚. Make their C columns equal. - Rโ‚: bโ‚โ‚ƒ, Rโ‚‚: bโ‚‚โ‚ƒ, Rโ‚…: bโ‚…โ‚ƒ. No distinguished symbol. Pick bโ‚โ‚ƒ for all. - Rโ‚: bโ‚โ‚ƒ, Rโ‚‚: bโ‚โ‚ƒ, Rโ‚…: bโ‚โ‚ƒ

Apply B โ†’ C: Rows Rโ‚‚, Rโ‚ƒ have B = aโ‚‚. - Rโ‚‚: bโ‚โ‚ƒ, Rโ‚ƒ: bโ‚ƒโ‚ƒ. Pick bโ‚โ‚ƒ. - Rโ‚ƒ: bโ‚โ‚ƒ

Apply C โ†’ D: Rows with same C value. - Rows Rโ‚, Rโ‚‚, Rโ‚ƒ, Rโ‚… all have C = bโ‚โ‚ƒ. Rโ‚„ has C = aโ‚ƒ (different). - Rโ‚: aโ‚„, Rโ‚‚: bโ‚‚โ‚„, Rโ‚ƒ: bโ‚ƒโ‚„, Rโ‚…: bโ‚…โ‚„. Rโ‚ has distinguished aโ‚„. Set all to aโ‚„. - Rโ‚‚: aโ‚„, Rโ‚ƒ: aโ‚„, Rโ‚…: aโ‚„

Apply DE โ†’ C: Rows with same D and E values. - Rโ‚ƒ: D=aโ‚„, E=aโ‚…; Rโ‚„: D=aโ‚„, E=aโ‚…; Rโ‚…: D=aโ‚„, E=aโ‚…. - These rows agree on DE. Make C equal: Rโ‚ƒ: bโ‚โ‚ƒ, Rโ‚„: aโ‚ƒ, Rโ‚…: bโ‚โ‚ƒ. Rโ‚„ has aโ‚ƒ. Set all to aโ‚ƒ. - Rโ‚ƒ: aโ‚ƒ, Rโ‚…: aโ‚ƒ.

Apply CE โ†’ A: Rows with same C and E. - Rโ‚ƒ: C=aโ‚ƒ, E=aโ‚…; Rโ‚„: C=aโ‚ƒ, E=aโ‚…; Rโ‚…: C=aโ‚ƒ, E=aโ‚…. - Rโ‚… has A=aโ‚ (distinguished). Set Rโ‚ƒ: aโ‚, Rโ‚„: aโ‚.

Now re-apply A โ†’ C: Rows Rโ‚,Rโ‚‚,Rโ‚ƒ,Rโ‚„,Rโ‚… have A=aโ‚. - C values: Rโ‚=bโ‚โ‚ƒ, Rโ‚‚=bโ‚โ‚ƒ, Rโ‚ƒ=aโ‚ƒ, Rโ‚„=aโ‚ƒ, Rโ‚…=aโ‚ƒ. Has aโ‚ƒ. Set all to aโ‚ƒ. - Rโ‚=aโ‚ƒ, Rโ‚‚=aโ‚ƒ.

Re-apply C โ†’ D on all rows (now all C=aโ‚ƒ): already all D=aโ‚„. No change.

Check row Rโ‚…: A=aโ‚, B=bโ‚…โ‚‚, C=aโ‚ƒ, D=aโ‚„, E=aโ‚…. Still has bโ‚…โ‚‚ for B.

Apply B โ†’ C: Rโ‚‚ has B=aโ‚‚, Rโ‚ƒ has B=aโ‚‚. Both already C=aโ‚ƒ. No change.

Check: Row Rโ‚ƒ = (aโ‚, aโ‚‚, aโ‚ƒ, aโ‚„, aโ‚…) โ€” all distinguished symbols!

Result: The decomposition is lossless-join. โœ“


8. 3NF Synthesis Algorithm

The 3NF synthesis algorithm produces a decomposition that is both lossless-join and dependency-preserving.

8.1 Algorithm

ALGORITHM: 3NF_Synthesis(R, F)
INPUT:  R = relation schema, F = set of FDs
OUTPUT: Decomposition {Rโ‚, Rโ‚‚, ..., Rโ‚™} in 3NF with lossless join
        and dependency preservation

Step 1: Compute the minimal cover F_min of F.

Step 2: For each FD X โ†’ A in F_min:
            Create a relation schema Rแตข = X โˆช {A}
        Group FDs with the same LHS:
            If X โ†’ Aโ‚, X โ†’ Aโ‚‚, ..., X โ†’ Aโ‚– all have the same X,
            create one schema Rแตข = X โˆช {Aโ‚, Aโ‚‚, ..., Aโ‚–}

Step 3: If none of the schemas contains a candidate key of R,
        add a schema Rโ‚– = any candidate key of R.

Step 4: Remove any schema Rแตข that is a subset of another schema Rโฑผ.

RETURN {Rโ‚, Rโ‚‚, ..., Rโ‚™}

8.2 Why Each Step Matters

  • Step 1 (minimal cover): Ensures no redundant FDs generate extra tables
  • Step 2 (one table per LHS group): Directly preserves each FD
  • Step 3 (add key if needed): Guarantees lossless-join property
  • Step 4 (remove subsets): Eliminates redundant tables

8.3 Worked Example

R(A, B, C, D, E, H) with:

F = { A โ†’ BC,  E โ†’ HA,  B โ†’ D }

Step 1: Compute minimal cover

Decompose RHS:

F = { A โ†’ B, A โ†’ C, E โ†’ H, E โ†’ A, B โ†’ D }

Check extraneous LHS: All LHS are single attributes. Nothing to simplify.

Check redundant FDs: - A โ†’ B: Remove. {A}โบ under rest = {A, C} โˆช ... = {A, C}. B โˆ‰ {A}โบ. Keep. - A โ†’ C: Remove. {A}โบ under rest = {A, B, D} (via Aโ†’B, Bโ†’D). C โˆ‰ {A}โบ. Keep. - E โ†’ H: Remove. {E}โบ under rest = {E, A, B, C, D} (Eโ†’A, Aโ†’B, Aโ†’C, Bโ†’D). H โˆ‰ {E}โบ. Keep. - E โ†’ A: Remove. {E}โบ under rest = {E, H}. A โˆ‰ {E}โบ. Keep. - B โ†’ D: Remove. {B}โบ under rest = {B}. D โˆ‰ {B}โบ. Keep.

Minimal cover:

F_min = { A โ†’ B, A โ†’ C, E โ†’ H, E โ†’ A, B โ†’ D }

Step 2: Group by LHS and create schemas

LHS FDs Schema
A A โ†’ B, A โ†’ C Rโ‚(A, B, C)
E E โ†’ H, E โ†’ A Rโ‚‚(E, H, A)
B B โ†’ D Rโ‚ƒ(B, D)

Step 3: Check for candidate key

Compute {E}โบ = {E, H, A, B, C, D} = all attributes. So {E} is a candidate key.

Is E in any schema? Rโ‚‚ contains E. Rโ‚‚'s key is E. โœ“ (A candidate key is present.)

Step 4: Remove subset schemas

None is a subset of another.

Final decomposition:

Rโ‚(A, B, C)    โ€” key: {A}
Rโ‚‚(E, H, A)    โ€” key: {E}
Rโ‚ƒ(B, D)       โ€” key: {B}

All in 3NF โœ“, lossless-join โœ“, dependency-preserving โœ“.


9. BCNF Decomposition Algorithm

9.1 Algorithm

ALGORITHM: BCNF_Decomposition(R, F)
INPUT:  R = relation schema, F = set of FDs
OUTPUT: Decomposition into BCNF relations with lossless join

result โ† {R}

WHILE there exists Rแตข in result that is not in BCNF DO
    Find an FD X โ†’ Y that violates BCNF in Rแตข
    (i.e., X is not a superkey of Rแตข, and Y โŠ„ X)

    Compute Xโบ with respect to F

    Replace Rแตข with:
        Rโ‚ = Xโบ โˆฉ attributes(Rแตข)    (everything X determines within Rแตข)
        Rโ‚‚ = X โˆช (attributes(Rแตข) - Xโบ)   (X plus what it doesn't determine)
END WHILE

RETURN result

The key decomposition step splits Rแตข into: - Rโ‚: all attributes that X determines (within Rแตข) โ€” X is a key of Rโ‚ - Rโ‚‚: X plus the remaining attributes โ€” X is a foreign key back to Rโ‚

This guarantees lossless join (Rโ‚ โˆฉ Rโ‚‚ = X, and X โ†’ Rโ‚).

9.2 Worked Example

R(A, B, C, D) with F = { AB โ†’ C, C โ†’ B, AB โ†’ D }

Step 1: Is R in BCNF?

Candidate key: {A, B} ({A,B}โบ = {A,B,C,D}). Also {A, C} is a candidate key ({A,C}โบ: Cโ†’B gives {A,B,C}, ABโ†’D gives {A,B,C,D}).

Check FDs: - AB โ†’ C: {A,B} is a superkey. โœ“ - C โ†’ B: {C}โบ = {C, B}. C is not a superkey. BCNF violation! - AB โ†’ D: {A,B} is a superkey. โœ“

Step 2: Decompose on C โ†’ B

Compute {C}โบ โˆฉ {A,B,C,D} = {B, C} โˆฉ {A,B,C,D} = {B, C}

  • Rโ‚ = {B, C} with FD: C โ†’ B (key: C)
  • Rโ‚‚ = {C} โˆช ({A,B,C,D} - {B,C}) = {A, C, D}

Project FDs onto Rโ‚‚(A, C, D): - AB โ†’ C becomes irrelevant (B โˆ‰ Rโ‚‚) - C โ†’ B becomes irrelevant (B โˆ‰ Rโ‚‚) - AB โ†’ D becomes... need to check: under projected FDs, {A,C}โบ on Rโ‚‚. Cโ†’B is not in Rโ‚‚. But from original: AC โ†’ D? {A,C}โบ = {A,C,B,D} (Cโ†’B, ABโ†’D). So AC โ†’ D holds. Key of Rโ‚‚ is {A, C}. - Check: is AC a superkey of Rโ‚‚? {A,C}โบ restricted to Rโ‚‚ = {A,C,D}. Yes. โœ“

Step 3: Check Rโ‚ and Rโ‚‚

  • Rโ‚(B, C), FD: C โ†’ B. Key: {C}. C is a superkey. BCNF โœ“
  • Rโ‚‚(A, C, D), FD: AC โ†’ D. Key: {A, C}. AC is a superkey. BCNF โœ“

Final BCNF decomposition:

Rโ‚(B, C)      โ€” key: {C}
Rโ‚‚(A, C, D)   โ€” key: {A, C}

Check dependency preservation: - AB โ†’ C: cannot be checked from Rโ‚ or Rโ‚‚ alone (A and B are never in the same table). Not preserved! - C โ†’ B: in Rโ‚. โœ“ - AB โ†’ D: not directly preserved, but AC โ†’ D is in Rโ‚‚.

This illustrates the BCNF tradeoff: we achieved BCNF but lost the FD AB โ†’ C.

9.3 BCNF vs 3NF: The Tradeoff

Property 3NF Synthesis BCNF Decomposition
Lossless Join โœ“ Always โœ“ Always
Dependency Preservation โœ“ Always โœ— Not always
Redundancy Elimination Good (minimal) Best (none from FDs)
When to prefer When dependency preservation is critical When minimal redundancy is critical

Practical guidance: - Start with BCNF decomposition - If dependency preservation fails, fall back to 3NF - In practice, most schemas achieve BCNF without losing dependencies


10. Complete Worked Example: Unnormalized to BCNF

10.1 Scenario

A company tracks project assignments:

ProjectAssignment(
    emp_id, emp_name, emp_phone,
    dept_id, dept_name, dept_budget,
    proj_id, proj_name, proj_budget,
    hours_worked, role
)

Business rules (FDs):

F = {
    emp_id โ†’ emp_name, emp_phone, dept_id,
    dept_id โ†’ dept_name, dept_budget,
    proj_id โ†’ proj_name, proj_budget,
    {emp_id, proj_id} โ†’ hours_worked, role
}

Candidate key: {emp_id, proj_id}

10.2 Check 1NF

All values are atomic (single values, no arrays). โœ“ In 1NF.

10.3 Check 2NF

Non-prime attributes: emp_name, emp_phone, dept_id, dept_name, dept_budget, proj_name, proj_budget, hours_worked, role.

Partial dependencies (attribute depends on proper subset of key {emp_id, proj_id}): - emp_id โ†’ emp_name, emp_phone, dept_id (partial: depends on emp_id alone) - proj_id โ†’ proj_name, proj_budget (partial: depends on proj_id alone)

Not in 2NF. Decompose:

Employee(emp_id, emp_name, emp_phone, dept_id)
    FDs: emp_id โ†’ emp_name, emp_phone, dept_id

Project(proj_id, proj_name, proj_budget)
    FDs: proj_id โ†’ proj_name, proj_budget

Assignment(emp_id, proj_id, hours_worked, role)
    FDs: {emp_id, proj_id} โ†’ hours_worked, role

Now in 2NF. โœ“

10.4 Check 3NF

Employee(emp_id, emp_name, emp_phone, dept_id)

Key: {emp_id}

Are there transitive dependencies? - emp_id โ†’ dept_id (direct) โœ“ - But where are dept_name, dept_budget? They were removed โ€” but wait, we also have dept_id โ†’ dept_name, dept_budget from the original FDs. Since dept_name and dept_budget are not in this relation anymore, no transitive dependency exists within this relation.

Actually, let's reconsider. The original FD dept_id โ†’ dept_name, dept_budget means Employee should not contain dept_name or dept_budget. And it doesn't โ€” they went with the decomposition in step 2. But we need a Department table:

Employee(emp_id, emp_name, emp_phone, dept_id)

Is there any transitive dependency among the remaining attributes? emp_id โ†’ emp_name, emp_phone, dept_id. All are direct dependencies from the key. No non-prime attribute determines another non-prime attribute within this table (emp_phone doesn't determine dept_id, etc.).

In 3NF โœ“.

But the FD dept_id โ†’ dept_name, dept_budget from the original schema is "orphaned." We need a Department table:

Department(dept_id, dept_name, dept_budget)
    Key: {dept_id}

All relations now in 3NF:

Employee(emp_id, emp_name, emp_phone, dept_id)    โ€” key: {emp_id}
Department(dept_id, dept_name, dept_budget)         โ€” key: {dept_id}
Project(proj_id, proj_name, proj_budget)           โ€” key: {proj_id}
Assignment(emp_id, proj_id, hours_worked, role)     โ€” key: {emp_id, proj_id}

10.5 Check BCNF

For each relation, check: is every determinant a superkey?

  • Employee: emp_id โ†’ (emp_name, emp_phone, dept_id). emp_id is the key. โœ“
  • Department: dept_id โ†’ (dept_name, dept_budget). dept_id is the key. โœ“
  • Project: proj_id โ†’ (proj_name, proj_budget). proj_id is the key. โœ“
  • Assignment: {emp_id, proj_id} โ†’ (hours_worked, role). {emp_id, proj_id} is the key. โœ“

All in BCNF! โœ“

10.6 Summary of Decomposition

ORIGINAL (unnormalized):
    ProjectAssignment(emp_id, emp_name, emp_phone, dept_id, dept_name,
                      dept_budget, proj_id, proj_name, proj_budget,
                      hours_worked, role)

FINAL (BCNF):
    Employee(emp_id, emp_name, emp_phone, dept_id)
    Department(dept_id, dept_name, dept_budget)
    Project(proj_id, proj_name, proj_budget)
    Assignment(emp_id, proj_id, hours_worked, role)

Anomalies eliminated: - Update: Changing a department name requires updating only one row in Department - Insert: Can add a department without any employees - Delete: Removing all assignments for a project doesn't lose project information


11. Normalization in SQL

11.1 Implementing the Normalized Schema

CREATE TABLE Department (
    dept_id     INT PRIMARY KEY,
    dept_name   VARCHAR(100) NOT NULL,
    dept_budget DECIMAL(12, 2) NOT NULL
);

CREATE TABLE Employee (
    emp_id    INT PRIMARY KEY,
    emp_name  VARCHAR(100) NOT NULL,
    emp_phone VARCHAR(20),
    dept_id   INT NOT NULL,
    FOREIGN KEY (dept_id) REFERENCES Department(dept_id)
);

CREATE TABLE Project (
    proj_id     INT PRIMARY KEY,
    proj_name   VARCHAR(100) NOT NULL,
    proj_budget DECIMAL(12, 2) NOT NULL
);

CREATE TABLE Assignment (
    emp_id       INT NOT NULL,
    proj_id      INT NOT NULL,
    hours_worked DECIMAL(6, 2) NOT NULL DEFAULT 0,
    role         VARCHAR(50) NOT NULL,
    PRIMARY KEY (emp_id, proj_id),
    FOREIGN KEY (emp_id) REFERENCES Employee(emp_id),
    FOREIGN KEY (proj_id) REFERENCES Project(proj_id)
);

11.2 Verifying Normalization via Queries

To check for potential normalization issues in an existing database:

-- Check for potential 2NF violations: partial dependencies
-- If a non-key column has duplicate values correlated with part of a composite key
SELECT emp_id, COUNT(DISTINCT emp_name) AS name_count
FROM project_assignment_denormalized
GROUP BY emp_id
HAVING COUNT(DISTINCT emp_name) > 1;
-- If this returns rows, emp_name is inconsistently stored (update anomaly)

-- Check for potential 3NF violations: transitive dependencies
-- Columns that move together might indicate a missing entity
SELECT dept_id, COUNT(DISTINCT dept_name) AS names
FROM employee_denormalized
GROUP BY dept_id
HAVING COUNT(DISTINCT dept_name) > 1;
-- If this returns rows, dept_name is inconsistent for a given dept_id

12. Summary of Normal Forms

Normal Form Condition Eliminates
1NF Atomic values, no repeating groups Non-relational structures
2NF 1NF + no partial dependencies Redundancy from partial key dependencies
3NF 2NF + no transitive dependencies Redundancy from transitive dependencies
BCNF Every determinant is a superkey All FD-based redundancy

Decision Flowchart

Start: Relation R with FDs F

Is R in 1NF?
โ”œโ”€โ”€ No โ†’ Remove non-atomic values and repeating groups
โ””โ”€โ”€ Yes โ†“

Is R in 2NF?
โ”œโ”€โ”€ No โ†’ Remove partial dependencies (separate out attributes
โ”‚         that depend on part of the key)
โ””โ”€โ”€ Yes โ†“

Is R in 3NF?
โ”œโ”€โ”€ No โ†’ Remove transitive dependencies (separate out attributes
โ”‚         that depend on non-key attributes)
โ””โ”€โ”€ Yes โ†“

Is R in BCNF?
โ”œโ”€โ”€ No โ†’ Check: is dependency preservation acceptable to lose?
โ”‚   โ”œโ”€โ”€ Yes โ†’ Decompose using BCNF algorithm
โ”‚   โ””โ”€โ”€ No โ†’ Stay at 3NF
โ””โ”€โ”€ Yes โ†’ Done!

13. Exercises

Exercise 1: Identifying Normal Forms

For each relation, identify the highest normal form (1NF, 2NF, 3NF, or BCNF):

a) R(A, B, C, D), Key: {A, B}, FDs: A โ†’ C, AB โ†’ D

b) R(A, B, C), Key: {A}, FDs: A โ†’ B, B โ†’ C

c) R(A, B, C, D), Key: {A}, FDs: A โ†’ BCD

d) R(A, B, C), Keys: {A, B} and {A, C}, FDs: AB โ†’ C, AC โ†’ B, B โ†’ C, C โ†’ B

Solution **a)** A โ†’ C is a partial dependency (C depends on part of key {A,B}). **1NF** (not 2NF). **b)** A โ†’ B (direct from key, OK), B โ†’ C (transitive: A โ†’ B โ†’ C). Not 3NF. But no partial dependency (single-attribute key), so 2NF. **2NF** (not 3NF). **c)** Only FD is from the key. Every determinant (A) is a superkey. **BCNF**. **d)** B โ†’ C: B is not a superkey, but C is a prime attribute (part of key {A,C}). So 3NF holds. B is not a superkey, so BCNF fails. **3NF** (not BCNF).

Exercise 2: 3NF Synthesis

Apply the 3NF synthesis algorithm to:

R(A, B, C, D, E) with F = { A โ†’ B, BC โ†’ D, D โ†’ E, E โ†’ C }

Solution **Step 1: Minimal cover** Decompose RHS: already single attributes. Check extraneous LHS in BC โ†’ D: - Remove B: {C}โบ = {C}. D โˆ‰ {C}โบ. Keep B. - Remove C: {B}โบ = {B}. D โˆ‰ {B}โบ. Keep C. Check redundant FDs: - A โ†’ B: {A}โบ without Aโ†’B = {A}. B โˆ‰ {A}โบ. Keep. - BC โ†’ D: {B,C}โบ without BCโ†’D = {B,C}. D โˆ‰ {B,C}โบ. Keep. - D โ†’ E: {D}โบ without Dโ†’E = {D}. E โˆ‰ {D}โบ. Keep. - E โ†’ C: {E}โบ without Eโ†’C = {E}. C โˆ‰ {E}โบ. Keep. F_min = { A โ†’ B, BC โ†’ D, D โ†’ E, E โ†’ C } **Step 2: Create schemas (group by LHS)** - Rโ‚(A, B) from A โ†’ B - Rโ‚‚(B, C, D) from BC โ†’ D - Rโ‚ƒ(D, E) from D โ†’ E - Rโ‚„(E, C) from E โ†’ C **Step 3: Check for candidate key** {A}โบ = {A, B}. Not all. {A, C}โบ = {A, B, C, D, E}. All! Candidate key: {A, C}. {A, E}โบ = {A, B, C, D, E} (Eโ†’C, Aโ†’B, BCโ†’D, Dโ†’E). All! Candidate key: {A, E}. {A, D}โบ = {A, B, D, E, C}. All! Candidate key: {A, D}. None of Rโ‚-Rโ‚„ contains {A,C}, {A,E}, or {A,D} entirely. - Rโ‚ = {A,B}: no - Rโ‚‚ = {B,C,D}: no A - Rโ‚ƒ = {D,E}: no A - Rโ‚„ = {E,C}: no A Add Rโ‚… = {A, C} (or {A, D} or {A, E}). **Step 4: Remove subsets** Rโ‚„(E, C) โІ Rโ‚‚(B, C, D)? No (E not in Rโ‚‚). No subsets to remove. **Final decomposition:**
Rโ‚(A, B)       โ€” key: {A}
Rโ‚‚(B, C, D)    โ€” key: {B, C}
Rโ‚ƒ(D, E)       โ€” key: {D}
Rโ‚„(E, C)       โ€” key: {E}
Rโ‚…(A, C)       โ€” key: {A, C} (candidate key of R)
All in 3NF โœ“, lossless-join โœ“, dependency-preserving โœ“.

Exercise 3: BCNF Decomposition

Decompose into BCNF:

R(A, B, C, D) with F = { AB โ†’ C, C โ†’ A, C โ†’ D }

Solution Candidate keys: {A,B} and {B,C} (verify: {A,B}โบ = {A,B,C,D}, {B,C}โบ = {A,B,C,D}). Check BCNF: - AB โ†’ C: {A,B} is a superkey. โœ“ - C โ†’ A: {C}โบ = {A,C,D}. C is not a superkey. **BCNF violation!** - C โ†’ D: Same issue. **BCNF violation!** Decompose on C โ†’ A (or C โ†’ AD): - {C}โบ โˆฉ {A,B,C,D} = {A,C,D} - Rโ‚ = (A, C, D) with key {C} - Rโ‚‚ = {C} โˆช ({A,B,C,D} - {A,C,D}) = (B, C) with key {B,C} Check Rโ‚(A, C, D): - C โ†’ A: C is a key of Rโ‚. โœ“ - C โ†’ D: C is a key of Rโ‚. โœ“ - BCNF โœ“ Check Rโ‚‚(B, C): - No non-trivial FDs with determinant that's not a superkey. - BCNF โœ“ **BCNF decomposition: Rโ‚(A, C, D), Rโ‚‚(B, C)** Dependency preservation: AB โ†’ C requires joining Rโ‚ and Rโ‚‚. **Not preserved.**

Exercise 4: Lossless-Join Verification

Verify whether the following decomposition has the lossless-join property:

R(A, B, C, D) with F = { A โ†’ B, B โ†’ C }

Decomposition: Rโ‚(A, B), Rโ‚‚(A, C), Rโ‚ƒ(B, D)

Solution Use the chase test: Initial matrix: | | A | B | C | D | |----|----|-----|-----|-----| | Rโ‚ | aโ‚ | aโ‚‚ | bโ‚โ‚ƒ | bโ‚โ‚„ | | Rโ‚‚ | aโ‚ | bโ‚‚โ‚‚ | aโ‚ƒ | bโ‚‚โ‚„ | | Rโ‚ƒ | bโ‚ƒโ‚| aโ‚‚ | bโ‚ƒโ‚ƒ | aโ‚„ | Apply A โ†’ B: Rโ‚ and Rโ‚‚ agree on A (= aโ‚). - Rโ‚.B = aโ‚‚, Rโ‚‚.B = bโ‚‚โ‚‚. Rโ‚ has distinguished. Set Rโ‚‚.B = aโ‚‚. | | A | B | C | D | |----|----|----|-----|-----| | Rโ‚ | aโ‚ | aโ‚‚ | bโ‚โ‚ƒ | bโ‚โ‚„ | | Rโ‚‚ | aโ‚ | aโ‚‚ | aโ‚ƒ | bโ‚‚โ‚„ | | Rโ‚ƒ | bโ‚ƒโ‚| aโ‚‚ | bโ‚ƒโ‚ƒ | aโ‚„ | Apply B โ†’ C: Rโ‚, Rโ‚‚, Rโ‚ƒ agree on B (= aโ‚‚). - C values: bโ‚โ‚ƒ, aโ‚ƒ, bโ‚ƒโ‚ƒ. Has aโ‚ƒ. Set all to aโ‚ƒ. | | A | B | C | D | |----|----|----|----| ----| | Rโ‚ | aโ‚ | aโ‚‚ | aโ‚ƒ | bโ‚โ‚„ | | Rโ‚‚ | aโ‚ | aโ‚‚ | aโ‚ƒ | bโ‚‚โ‚„ | | Rโ‚ƒ | bโ‚ƒโ‚| aโ‚‚ | aโ‚ƒ | aโ‚„ | No more changes from further iterations. Check rows: No row has all distinguished symbols. Row Rโ‚ has bโ‚โ‚„, Row Rโ‚‚ has bโ‚‚โ‚„, Row Rโ‚ƒ has bโ‚ƒโ‚. **The decomposition is NOT lossless-join.** โœ— The problem: Rโ‚ƒ(B, D) shares only B with the others, and B is not a key of any relation containing D's determining attributes. A correct decomposition: Rโ‚(A, B), Rโ‚‚(B, C, D) โ€” this is lossless since B โ†’ C holds and {B} is a key of Rโ‚‚ restricted to {B,C}. Actually, wait: B โ†’ D is not given. The FDs are only A โ†’ B and B โ†’ C. So D has no determining FD. Let's reconsider: {A}โบ = {A,B,C}. The key must include D somehow: key = {A, D}. Better decomposition: Rโ‚(A, B, C) and Rโ‚‚(A, D). Common = {A}. A โ†’ {B,C}. {A} is a key of Rโ‚. Lossless โœ“.

Exercise 5: Full Normalization

Normalize the following to 3NF using the synthesis algorithm:

Library(isbn, title, author_id, author_name, publisher_id,
        publisher_name, publisher_city, branch_id, branch_name, copies)

FDs:

isbn โ†’ title, author_id, publisher_id
author_id โ†’ author_name
publisher_id โ†’ publisher_name, publisher_city
{isbn, branch_id} โ†’ copies
branch_id โ†’ branch_name
Solution **Step 1: Minimal cover** Decompose RHS:
isbn โ†’ title, isbn โ†’ author_id, isbn โ†’ publisher_id,
author_id โ†’ author_name,
publisher_id โ†’ publisher_name, publisher_id โ†’ publisher_city,
{isbn, branch_id} โ†’ copies,
branch_id โ†’ branch_name
Already minimal (all single attributes on RHS, no extraneous LHS, no redundant FDs). **Step 2: Group by LHS** - Rโ‚(isbn, title, author_id, publisher_id) โ€” from isbn โ†’ title, author_id, publisher_id - Rโ‚‚(author_id, author_name) โ€” from author_id โ†’ author_name - Rโ‚ƒ(publisher_id, publisher_name, publisher_city) โ€” from publisher_id โ†’ publisher_name, publisher_city - Rโ‚„(isbn, branch_id, copies) โ€” from {isbn, branch_id} โ†’ copies - Rโ‚…(branch_id, branch_name) โ€” from branch_id โ†’ branch_name **Step 3: Candidate key = {isbn, branch_id}** Rโ‚„ contains {isbn, branch_id}. โœ“ **Step 4: No subsets to remove.** **Final 3NF decomposition:**
Book(isbn, title, author_id, publisher_id)         โ€” key: {isbn}
Author(author_id, author_name)                      โ€” key: {author_id}
Publisher(publisher_id, publisher_name, publisher_city) โ€” key: {publisher_id}
BranchStock(isbn, branch_id, copies)                โ€” key: {isbn, branch_id}
Branch(branch_id, branch_name)                      โ€” key: {branch_id}
This is also in BCNF since every determinant is a single-attribute key (or composite key in BranchStock).

Exercise 6: Anomaly Identification

Given the following relation and sample data, identify specific update, insertion, and deletion anomalies:

CourseSection(course_id, section, semester, instructor, building, room)

FDs: {course_id, section, semester} โ†’ instructor, building, room
     building, room โ†’ capacity   (assume capacity is also an attribute)
| course_id | section | semester | instructor | building | room | capacity |
|-----------|---------|----------|------------|----------|------|----------|
| CS101     | 1       | Fall24   | Dr. Smith  | Watson   | 101  | 50       |
| CS101     | 2       | Fall24   | Dr. Jones  | Watson   | 101  | 50       |
| CS201     | 1       | Fall24   | Dr. Smith  | Watson   | 201  | 30       |
| CS201     | 1       | Spr25    | Dr. Smith  | Taylor   | 105  | 40       |
Solution **Update anomaly**: If the capacity of Watson 101 changes (e.g., renovation adds seats), we must update multiple rows (rows 1 and 2). If we update only row 1, rows 1 and 2 become inconsistent. **Insertion anomaly**: We cannot record that Taylor 302 has capacity 60 unless there is a course section scheduled in that room. There's no way to store room information independently. **Deletion anomaly**: If CS201 Section 1 Spring 2025 is cancelled (delete row 4), we lose the information that Taylor 105 has capacity 40 (assuming no other section uses that room). **Root cause**: The transitive dependency {course_id, section, semester} โ†’ {building, room} โ†’ capacity creates redundancy. **Fix**: Decompose into:
CourseSection(course_id, section, semester, instructor, building, room)
Room(building, room, capacity)

14. Summary

Concept Key Idea
1NF Atomic values only โ€” foundation of relational model
2NF No partial dependencies โ€” every non-prime attribute depends on the full key
3NF No transitive dependencies โ€” non-prime attributes depend only on keys
BCNF Every determinant is a superkey โ€” strictest FD-based form
Lossless Join Natural join recovers original data โ€” mandatory
Dependency Preservation All FDs checkable without joins โ€” desirable but sometimes sacrificed for BCNF
3NF Synthesis Guarantees both lossless join and dependency preservation
BCNF Decomposition Guarantees lossless join; may lose dependency preservation

Normalization through BCNF handles all redundancy caused by functional dependencies. However, there are other types of dependencies โ€” multivalued dependencies and join dependencies โ€” that require higher normal forms. We explore these in the next lesson.


Previous: 05_Functional_Dependencies.md | Next: 07_Advanced_Normalization.md

to navigate between lessons