Lesson 06: Normalization (1NF through BCNF)
Lesson 06: Normalization (1NF through BCNF)¶
Previous: 05_Functional_Dependencies.md | Next: 07_Advanced_Normalization.md
Topic: Database Theory Lesson: 6 of 16 Prerequisites: Functional dependencies, attribute closure, minimal cover (Lesson 05) Objective: Understand normalization from 1NF through BCNF, master decomposition algorithms, verify lossless-join and dependency-preservation properties, and apply normalization to real-world schemas
1. Introduction¶
Normalization is the process of organizing a relational database schema to reduce redundancy and eliminate certain types of data anomalies. Introduced by Edgar F. Codd in 1970, it provides a systematic, theory-driven approach to schema design.
1.1 The Problem: Poor Schema Design¶
Consider this single-relation design for a university:
UniversityCourse(
student_id, student_name, student_addr,
course_id, course_title, dept_name, dept_building,
instructor_id, instructor_name,
grade, semester
)
This "universal relation" stores everything in one table. While it works for simple queries, it suffers from serious problems.
1.2 Data Anomalies¶
Update Anomaly¶
If the Computer Science department moves to a new building, we must update dept_building in every row where dept_name = 'Computer Science'. If we miss even one row, the data becomes inconsistent.
Before:
| course_id | dept_name | dept_building |
|-----------|-----------|---------------|
| CS101 | CS | Watson Hall |
| CS201 | CS | Watson Hall | โ must update ALL rows
| CS301 | CS | Watson Hall |
If we update only the first row:
| CS101 | CS | Taylor Hall | โ updated
| CS201 | CS | Watson Hall | โ inconsistent!
| CS301 | CS | Watson Hall | โ inconsistent!
Insertion Anomaly¶
We cannot record a new department (e.g., its name and building) unless a student enrolls in one of its courses, because student_id is part of the primary key.
Deletion Anomaly¶
If the last student enrolled in a course drops it, we lose not just the enrollment data but also the course title, instructor assignment, and department information.
1.3 Root Cause: Redundancy from FD Violations¶
All three anomalies stem from the same root cause: attributes that depend on only part of the key, or on non-key attributes, are stored redundantly. Normalization eliminates this redundancy through systematic decomposition guided by functional dependencies.
1.4 Goals of Normalization¶
- Eliminate redundancy: Each fact is stored exactly once
- Prevent anomalies: Updates, insertions, and deletions are clean
- Preserve information: No data is lost during decomposition (lossless join)
- Preserve constraints: FDs are still enforceable (dependency preservation)
2. First Normal Form (1NF)¶
2.1 Definition¶
Definition: A relation is in First Normal Form (1NF) if: 1. All attributes contain only atomic (indivisible) values 2. There are no repeating groups or arrays 3. Each row is uniquely identifiable (has a primary key)
1NF is the baseline requirement for being a valid relation in the relational model.
2.2 Violations and Fixes¶
Violation 1: Non-atomic values
| student_id | name | phone_numbers |
|------------|------------|------------------------|
| 101 | Alice | 555-1234, 555-5678 | โ multi-valued!
| 102 | Bob | 555-9999 |
Fix: Create a separate row for each phone number, or a separate table:
Student(student_id, name)
StudentPhone(student_id, phone_number)
| student_id | phone_number |
|------------|--------------|
| 101 | 555-1234 |
| 101 | 555-5678 |
| 102 | 555-9999 |
Violation 2: Repeating groups
| order_id | item1 | qty1 | item2 | qty2 | item3 | qty3 |
|----------|--------|------|--------|------|--------|------|
| 1001 | Pen | 5 | Paper | 10 | NULL | NULL |
Fix: Normalize into two tables:
Order(order_id, order_date, customer_id)
OrderItem(order_id, item_name, quantity)
2.3 1NF and the Relational Model¶
In strict relational theory, a relation is by definition in 1NF โ the relational model doesn't allow non-atomic domains. However, in practice, many systems allow arrays (PostgreSQL int[]), JSON columns, or comma-separated values. While sometimes useful for performance, these violate the spirit of 1NF and make FD-based reasoning difficult.
3. Second Normal Form (2NF)¶
3.1 Partial Dependency¶
Definition: A partial dependency exists when a non-prime attribute (an attribute not part of any candidate key) is functionally dependent on a proper subset of a candidate key.
In other words, some attribute depends on only part of the key, not the whole key.
3.2 Definition¶
Definition: A relation is in Second Normal Form (2NF) if: 1. It is in 1NF, and 2. Every non-prime attribute is fully functionally dependent on every candidate key (no partial dependencies)
Note: 2NF is only relevant when a candidate key is composite (has more than one attribute). If all candidate keys are single attributes, the relation is automatically in 2NF.
3.3 Example¶
Consider:
StudentCourse(student_id, course_id, student_name, grade)
Candidate key: {student_id, course_id}
FDs: - {student_id, course_id} โ grade (full dependency) - student_id โ student_name (partial dependency! student_name depends on only part of the key)
This violates 2NF.
Decomposition to achieve 2NF:
Student(student_id, student_name)
Key: {student_id}
FD: student_id โ student_name
Enrollment(student_id, course_id, grade)
Key: {student_id, course_id}
FD: {student_id, course_id} โ grade
3.4 Formal Test for 2NF¶
For each candidate key K and each non-prime attribute A: 1. Check if there exists a proper subset X โ K such that X โ A 2. If any such partial dependency exists, the relation is not in 2NF
4. Third Normal Form (3NF)¶
4.1 Transitive Dependency¶
Definition: A transitive dependency exists when a non-prime attribute A depends on another non-prime attribute B, which in turn depends on a candidate key K:
K โ B โ A, where B is not a superkey and A is not part of any candidate key.
4.2 Definition¶
Definition: A relation schema R is in Third Normal Form (3NF) if for every non-trivial functional dependency X โ A where A is a single attribute: 1. X is a superkey of R, OR 2. A is a prime attribute (member of some candidate key)
Equivalently: no non-prime attribute is transitively dependent on any candidate key.
4.3 Example¶
Consider:
Employee(emp_id, dept_id, dept_name, dept_location)
Candidate key: {emp_id}
FDs: - emp_id โ dept_id, dept_name, dept_location - dept_id โ dept_name, dept_location
The dependency emp_id โ dept_name is transitive through dept_id: - emp_id โ dept_id โ dept_name
This violates 3NF because dept_name depends on dept_id (a non-superkey) and dept_name is not a prime attribute.
Decomposition to achieve 3NF:
Employee(emp_id, dept_id)
Key: {emp_id}
FD: emp_id โ dept_id
Department(dept_id, dept_name, dept_location)
Key: {dept_id}
FD: dept_id โ dept_name, dept_location
4.4 The Importance of "Prime Attribute" Exception¶
The condition "A is a prime attribute" in 3NF is what distinguishes it from BCNF. This exception allows certain FDs where the dependent is part of a candidate key, even if the determinant is not a superkey. This exception is what makes 3NF decomposition always dependency-preserving (unlike BCNF).
5. Boyce-Codd Normal Form (BCNF)¶
5.1 Definition¶
Definition: A relation schema R is in Boyce-Codd Normal Form (BCNF) if for every non-trivial functional dependency X โ Y:
X is a superkey of R.
BCNF is strictly stronger than 3NF. It eliminates the "prime attribute" exception: every determinant must be a superkey, period.
5.2 Relationship: 3NF vs BCNF¶
BCNF โ 3NF โ 2NF โ 1NF
Every BCNF relation is in 3NF.
Every 3NF relation is in 2NF.
Every 2NF relation is in 1NF.
But not vice versa.
5.3 Example: 3NF but Not BCNF¶
Consider:
TeachingAssignment(student_id, course, instructor)
Business rules: - Each instructor teaches exactly one course: instructor โ course - Each student has exactly one instructor per course: {student_id, course} โ instructor - A student can have only one instructor for a given course
Candidate keys: {student_id, course} and {student_id, instructor}
FDs: - {student_id, course} โ instructor - {student_id, instructor} โ course - instructor โ course
Check 3NF for instructor โ course: - instructor is not a superkey โ - course is a prime attribute (part of candidate key {student_id, course}) โ
So the relation is in 3NF (condition 2 of the 3NF definition is satisfied).
Check BCNF for instructor โ course: - instructor is not a superkey โ
So the relation is not in BCNF.
5.4 When 3NF and BCNF Differ¶
3NF and BCNF differ only when: 1. There are multiple overlapping candidate keys, AND 2. A non-superkey attribute determines part of a candidate key
In practice, this situation is relatively rare. Most relations that are in 3NF are also in BCNF.
6. Decomposition Properties¶
When we decompose a relation to achieve a higher normal form, we must ensure the decomposition is "good." Two critical properties define what "good" means.
6.1 Lossless-Join Property¶
Definition: A decomposition of R into Rโ, Rโ, ..., Rโ has the lossless-join property if for every legal instance r of R:
r = ฯ_{Rโ}(r) โ ฯ_{Rโ}(r) โ ... โ ฯ_{Rโ}(r)
In other words, we can reconstruct the original relation by naturally joining the decomposed relations. No information is lost.
Binary Decomposition Test¶
For a decomposition of R into Rโ and Rโ:
Theorem: The decomposition is lossless-join if and only if:
(Rโ โฉ Rโ) โ (Rโ - Rโ) โ Fโบ, or (Rโ โฉ Rโ) โ (Rโ - Rโ) โ Fโบ
The common attributes must functionally determine all of one side. Equivalently, the common attributes must be a superkey of at least one of the decomposed relations.
Example¶
Decompose Employee(emp_id, dept_id, dept_name) into: - Rโ(emp_id, dept_id) and Rโ(dept_id, dept_name)
Common attributes: {dept_id} Rโ - Rโ = {dept_name}
Is dept_id โ dept_name in Fโบ? Yes!
So this decomposition is lossless-join. โ
Why Lossless-Join Matters¶
Without the lossless-join property, joining the decomposed tables produces spurious tuples โ rows that didn't exist in the original relation:
Original:
| A | B | C |
|---|---|---|
| 1 | 2 | 3 |
| 4 | 2 | 5 |
Decompose into R1(A,B) and R2(B,C):
R1: R2:
| A | B | | B | C |
|---|---| |---|---|
| 1 | 2 | | 2 | 3 |
| 4 | 2 | | 2 | 5 |
R1 โ R2:
| A | B | C |
|---|---|---|
| 1 | 2 | 3 | โ original โ
| 1 | 2 | 5 | โ SPURIOUS! โ
| 4 | 2 | 3 | โ SPURIOUS! โ
| 4 | 2 | 5 | โ original โ
Here, B does not determine A or C, so the decomposition is lossy (not lossless).
6.2 Dependency Preservation¶
Definition: A decomposition of R into Rโ, Rโ, ..., Rโ is dependency-preserving if:
(Fโ โช Fโ โช ... โช Fโ)โบ = Fโบ
where Fแตข is the set of FDs in Fโบ that involve only attributes of Rแตข.
In simpler terms: every FD in the original F can be verified by checking constraints within individual decomposed relations, without needing to join tables.
Why Dependency Preservation Matters¶
If a decomposition is not dependency-preserving, some FDs can only be enforced by joining multiple tables โ this is expensive and often impractical. Without dependency preservation, the DBMS cannot efficiently maintain data integrity.
Example¶
R(A, B, C) with F = { A โ B, B โ C }
Decompose into Rโ(A, C) and Rโ(B, C).
FDs on Rโ: Can we enforce A โ B? No โ B is not in Rโ. FDs on Rโ: Can we enforce A โ B? No โ A is not in Rโ.
The FD A โ B is not preserved. To check it, we'd need to join Rโ and Rโ.
A better decomposition: Rโ(A, B) and Rโ(B, C) โ this preserves both A โ B and B โ C.
6.3 Can We Always Have Both?¶
| Normal Form | Lossless Join | Dependency Preservation |
|---|---|---|
| 3NF | Always achievable โ | Always achievable โ |
| BCNF | Always achievable โ | Not always โ |
This is the key tradeoff: BCNF is a stricter normal form, but achieving it may sacrifice dependency preservation. 3NF guarantees both properties.
7. The Lossless-Join Test Algorithm (For n-ary Decomposition)¶
For decompositions into more than two relations, we use a tabular algorithm.
7.1 Algorithm (Chase Test)¶
ALGORITHM: LosslessJoinTest(R, F, {Rโ, Rโ, ..., Rโ})
INPUT: R = {Aโ, Aโ, ..., Aโ} (m attributes)
F = set of FDs
{Rโ, Rโ, ..., Rโ} = decomposition
OUTPUT: TRUE if lossless-join, FALSE otherwise
Step 1: Create an n ร m matrix.
Row i corresponds to Rแตข, column j to attribute Aโฑผ.
Step 2: Initialize the matrix:
If Aโฑผ โ Rแตข, set entry[i][j] = aโฑผ (distinguished symbol)
Otherwise, set entry[i][j] = bแตขโฑผ (subscripted symbol)
Step 3: REPEAT
FOR EACH FD (X โ Y) IN F DO
Find all rows that agree on all columns in X
For those rows, make their Y-columns equal:
If any row has aโฑผ for some column in Y, set all matching rows to aโฑผ
Otherwise, pick one bแตขโฑผ and set all matching rows to that value
END FOR
UNTIL no change occurs
Step 4: If any row has all distinguished symbols (aโ, aโ, ..., aโ), RETURN TRUE.
Otherwise, RETURN FALSE.
7.2 Worked Example¶
R(A, B, C, D, E), F = { A โ C, B โ C, C โ D, DE โ C, CE โ A }
Decomposition: Rโ(A, D), Rโ(A, B), Rโ(B, E), Rโ(C, D, E), Rโ (A, E)
Step 1-2: Initial matrix
| A | B | C | D | E | |
|---|---|---|---|---|---|
| Rโ | aโ | bโโ | bโโ | aโ | bโโ |
| Rโ | aโ | aโ | bโโ | bโโ | bโโ |
| Rโ | bโโ | aโ | bโโ | bโโ | aโ |
| Rโ | bโโ | bโโ | aโ | aโ | aโ |
| Rโ | aโ | bโ โ | bโ โ | bโ โ | aโ |
Step 3: Apply FDs iteratively
Apply A โ C: Rows with same A value. - Rows Rโ, Rโ, Rโ have A = aโ. Make their C columns equal. - Rโ: bโโ, Rโ: bโโ, Rโ : bโ โ. No distinguished symbol. Pick bโโ for all. - Rโ: bโโ, Rโ: bโโ, Rโ : bโโ
Apply B โ C: Rows Rโ, Rโ have B = aโ. - Rโ: bโโ, Rโ: bโโ. Pick bโโ. - Rโ: bโโ
Apply C โ D: Rows with same C value. - Rows Rโ, Rโ, Rโ, Rโ all have C = bโโ. Rโ has C = aโ (different). - Rโ: aโ, Rโ: bโโ, Rโ: bโโ, Rโ : bโ โ. Rโ has distinguished aโ. Set all to aโ. - Rโ: aโ, Rโ: aโ, Rโ : aโ
Apply DE โ C: Rows with same D and E values. - Rโ: D=aโ, E=aโ ; Rโ: D=aโ, E=aโ ; Rโ : D=aโ, E=aโ . - These rows agree on DE. Make C equal: Rโ: bโโ, Rโ: aโ, Rโ : bโโ. Rโ has aโ. Set all to aโ. - Rโ: aโ, Rโ : aโ.
Apply CE โ A: Rows with same C and E. - Rโ: C=aโ, E=aโ ; Rโ: C=aโ, E=aโ ; Rโ : C=aโ, E=aโ . - Rโ has A=aโ (distinguished). Set Rโ: aโ, Rโ: aโ.
Now re-apply A โ C: Rows Rโ,Rโ,Rโ,Rโ,Rโ have A=aโ. - C values: Rโ=bโโ, Rโ=bโโ, Rโ=aโ, Rโ=aโ, Rโ =aโ. Has aโ. Set all to aโ. - Rโ=aโ, Rโ=aโ.
Re-apply C โ D on all rows (now all C=aโ): already all D=aโ. No change.
Check row Rโ : A=aโ, B=bโ โ, C=aโ, D=aโ, E=aโ . Still has bโ โ for B.
Apply B โ C: Rโ has B=aโ, Rโ has B=aโ. Both already C=aโ. No change.
Check: Row Rโ = (aโ, aโ, aโ, aโ, aโ ) โ all distinguished symbols!
Result: The decomposition is lossless-join. โ
8. 3NF Synthesis Algorithm¶
The 3NF synthesis algorithm produces a decomposition that is both lossless-join and dependency-preserving.
8.1 Algorithm¶
ALGORITHM: 3NF_Synthesis(R, F)
INPUT: R = relation schema, F = set of FDs
OUTPUT: Decomposition {Rโ, Rโ, ..., Rโ} in 3NF with lossless join
and dependency preservation
Step 1: Compute the minimal cover F_min of F.
Step 2: For each FD X โ A in F_min:
Create a relation schema Rแตข = X โช {A}
Group FDs with the same LHS:
If X โ Aโ, X โ Aโ, ..., X โ Aโ all have the same X,
create one schema Rแตข = X โช {Aโ, Aโ, ..., Aโ}
Step 3: If none of the schemas contains a candidate key of R,
add a schema Rโ = any candidate key of R.
Step 4: Remove any schema Rแตข that is a subset of another schema Rโฑผ.
RETURN {Rโ, Rโ, ..., Rโ}
8.2 Why Each Step Matters¶
- Step 1 (minimal cover): Ensures no redundant FDs generate extra tables
- Step 2 (one table per LHS group): Directly preserves each FD
- Step 3 (add key if needed): Guarantees lossless-join property
- Step 4 (remove subsets): Eliminates redundant tables
8.3 Worked Example¶
R(A, B, C, D, E, H) with:
F = { A โ BC, E โ HA, B โ D }
Step 1: Compute minimal cover
Decompose RHS:
F = { A โ B, A โ C, E โ H, E โ A, B โ D }
Check extraneous LHS: All LHS are single attributes. Nothing to simplify.
Check redundant FDs: - A โ B: Remove. {A}โบ under rest = {A, C} โช ... = {A, C}. B โ {A}โบ. Keep. - A โ C: Remove. {A}โบ under rest = {A, B, D} (via AโB, BโD). C โ {A}โบ. Keep. - E โ H: Remove. {E}โบ under rest = {E, A, B, C, D} (EโA, AโB, AโC, BโD). H โ {E}โบ. Keep. - E โ A: Remove. {E}โบ under rest = {E, H}. A โ {E}โบ. Keep. - B โ D: Remove. {B}โบ under rest = {B}. D โ {B}โบ. Keep.
Minimal cover:
F_min = { A โ B, A โ C, E โ H, E โ A, B โ D }
Step 2: Group by LHS and create schemas
| LHS | FDs | Schema |
|---|---|---|
| A | A โ B, A โ C | Rโ(A, B, C) |
| E | E โ H, E โ A | Rโ(E, H, A) |
| B | B โ D | Rโ(B, D) |
Step 3: Check for candidate key
Compute {E}โบ = {E, H, A, B, C, D} = all attributes. So {E} is a candidate key.
Is E in any schema? Rโ contains E. Rโ's key is E. โ (A candidate key is present.)
Step 4: Remove subset schemas
None is a subset of another.
Final decomposition:
Rโ(A, B, C) โ key: {A}
Rโ(E, H, A) โ key: {E}
Rโ(B, D) โ key: {B}
All in 3NF โ, lossless-join โ, dependency-preserving โ.
9. BCNF Decomposition Algorithm¶
9.1 Algorithm¶
ALGORITHM: BCNF_Decomposition(R, F)
INPUT: R = relation schema, F = set of FDs
OUTPUT: Decomposition into BCNF relations with lossless join
result โ {R}
WHILE there exists Rแตข in result that is not in BCNF DO
Find an FD X โ Y that violates BCNF in Rแตข
(i.e., X is not a superkey of Rแตข, and Y โ X)
Compute Xโบ with respect to F
Replace Rแตข with:
Rโ = Xโบ โฉ attributes(Rแตข) (everything X determines within Rแตข)
Rโ = X โช (attributes(Rแตข) - Xโบ) (X plus what it doesn't determine)
END WHILE
RETURN result
The key decomposition step splits Rแตข into: - Rโ: all attributes that X determines (within Rแตข) โ X is a key of Rโ - Rโ: X plus the remaining attributes โ X is a foreign key back to Rโ
This guarantees lossless join (Rโ โฉ Rโ = X, and X โ Rโ).
9.2 Worked Example¶
R(A, B, C, D) with F = { AB โ C, C โ B, AB โ D }
Step 1: Is R in BCNF?
Candidate key: {A, B} ({A,B}โบ = {A,B,C,D}). Also {A, C} is a candidate key ({A,C}โบ: CโB gives {A,B,C}, ABโD gives {A,B,C,D}).
Check FDs: - AB โ C: {A,B} is a superkey. โ - C โ B: {C}โบ = {C, B}. C is not a superkey. BCNF violation! - AB โ D: {A,B} is a superkey. โ
Step 2: Decompose on C โ B
Compute {C}โบ โฉ {A,B,C,D} = {B, C} โฉ {A,B,C,D} = {B, C}
- Rโ = {B, C} with FD: C โ B (key: C)
- Rโ = {C} โช ({A,B,C,D} - {B,C}) = {A, C, D}
Project FDs onto Rโ(A, C, D): - AB โ C becomes irrelevant (B โ Rโ) - C โ B becomes irrelevant (B โ Rโ) - AB โ D becomes... need to check: under projected FDs, {A,C}โบ on Rโ. CโB is not in Rโ. But from original: AC โ D? {A,C}โบ = {A,C,B,D} (CโB, ABโD). So AC โ D holds. Key of Rโ is {A, C}. - Check: is AC a superkey of Rโ? {A,C}โบ restricted to Rโ = {A,C,D}. Yes. โ
Step 3: Check Rโ and Rโ
- Rโ(B, C), FD: C โ B. Key: {C}. C is a superkey. BCNF โ
- Rโ(A, C, D), FD: AC โ D. Key: {A, C}. AC is a superkey. BCNF โ
Final BCNF decomposition:
Rโ(B, C) โ key: {C}
Rโ(A, C, D) โ key: {A, C}
Check dependency preservation: - AB โ C: cannot be checked from Rโ or Rโ alone (A and B are never in the same table). Not preserved! - C โ B: in Rโ. โ - AB โ D: not directly preserved, but AC โ D is in Rโ.
This illustrates the BCNF tradeoff: we achieved BCNF but lost the FD AB โ C.
9.3 BCNF vs 3NF: The Tradeoff¶
| Property | 3NF Synthesis | BCNF Decomposition |
|---|---|---|
| Lossless Join | โ Always | โ Always |
| Dependency Preservation | โ Always | โ Not always |
| Redundancy Elimination | Good (minimal) | Best (none from FDs) |
| When to prefer | When dependency preservation is critical | When minimal redundancy is critical |
Practical guidance: - Start with BCNF decomposition - If dependency preservation fails, fall back to 3NF - In practice, most schemas achieve BCNF without losing dependencies
10. Complete Worked Example: Unnormalized to BCNF¶
10.1 Scenario¶
A company tracks project assignments:
ProjectAssignment(
emp_id, emp_name, emp_phone,
dept_id, dept_name, dept_budget,
proj_id, proj_name, proj_budget,
hours_worked, role
)
Business rules (FDs):
F = {
emp_id โ emp_name, emp_phone, dept_id,
dept_id โ dept_name, dept_budget,
proj_id โ proj_name, proj_budget,
{emp_id, proj_id} โ hours_worked, role
}
Candidate key: {emp_id, proj_id}
10.2 Check 1NF¶
All values are atomic (single values, no arrays). โ In 1NF.
10.3 Check 2NF¶
Non-prime attributes: emp_name, emp_phone, dept_id, dept_name, dept_budget, proj_name, proj_budget, hours_worked, role.
Partial dependencies (attribute depends on proper subset of key {emp_id, proj_id}): - emp_id โ emp_name, emp_phone, dept_id (partial: depends on emp_id alone) - proj_id โ proj_name, proj_budget (partial: depends on proj_id alone)
Not in 2NF. Decompose:
Employee(emp_id, emp_name, emp_phone, dept_id)
FDs: emp_id โ emp_name, emp_phone, dept_id
Project(proj_id, proj_name, proj_budget)
FDs: proj_id โ proj_name, proj_budget
Assignment(emp_id, proj_id, hours_worked, role)
FDs: {emp_id, proj_id} โ hours_worked, role
Now in 2NF. โ
10.4 Check 3NF¶
Employee(emp_id, emp_name, emp_phone, dept_id)
Key: {emp_id}
Are there transitive dependencies? - emp_id โ dept_id (direct) โ - But where are dept_name, dept_budget? They were removed โ but wait, we also have dept_id โ dept_name, dept_budget from the original FDs. Since dept_name and dept_budget are not in this relation anymore, no transitive dependency exists within this relation.
Actually, let's reconsider. The original FD dept_id โ dept_name, dept_budget means Employee should not contain dept_name or dept_budget. And it doesn't โ they went with the decomposition in step 2. But we need a Department table:
Employee(emp_id, emp_name, emp_phone, dept_id)
Is there any transitive dependency among the remaining attributes? emp_id โ emp_name, emp_phone, dept_id. All are direct dependencies from the key. No non-prime attribute determines another non-prime attribute within this table (emp_phone doesn't determine dept_id, etc.).
In 3NF โ.
But the FD dept_id โ dept_name, dept_budget from the original schema is "orphaned." We need a Department table:
Department(dept_id, dept_name, dept_budget)
Key: {dept_id}
All relations now in 3NF:
Employee(emp_id, emp_name, emp_phone, dept_id) โ key: {emp_id}
Department(dept_id, dept_name, dept_budget) โ key: {dept_id}
Project(proj_id, proj_name, proj_budget) โ key: {proj_id}
Assignment(emp_id, proj_id, hours_worked, role) โ key: {emp_id, proj_id}
10.5 Check BCNF¶
For each relation, check: is every determinant a superkey?
- Employee: emp_id โ (emp_name, emp_phone, dept_id). emp_id is the key. โ
- Department: dept_id โ (dept_name, dept_budget). dept_id is the key. โ
- Project: proj_id โ (proj_name, proj_budget). proj_id is the key. โ
- Assignment: {emp_id, proj_id} โ (hours_worked, role). {emp_id, proj_id} is the key. โ
All in BCNF! โ
10.6 Summary of Decomposition¶
ORIGINAL (unnormalized):
ProjectAssignment(emp_id, emp_name, emp_phone, dept_id, dept_name,
dept_budget, proj_id, proj_name, proj_budget,
hours_worked, role)
FINAL (BCNF):
Employee(emp_id, emp_name, emp_phone, dept_id)
Department(dept_id, dept_name, dept_budget)
Project(proj_id, proj_name, proj_budget)
Assignment(emp_id, proj_id, hours_worked, role)
Anomalies eliminated: - Update: Changing a department name requires updating only one row in Department - Insert: Can add a department without any employees - Delete: Removing all assignments for a project doesn't lose project information
11. Normalization in SQL¶
11.1 Implementing the Normalized Schema¶
CREATE TABLE Department (
dept_id INT PRIMARY KEY,
dept_name VARCHAR(100) NOT NULL,
dept_budget DECIMAL(12, 2) NOT NULL
);
CREATE TABLE Employee (
emp_id INT PRIMARY KEY,
emp_name VARCHAR(100) NOT NULL,
emp_phone VARCHAR(20),
dept_id INT NOT NULL,
FOREIGN KEY (dept_id) REFERENCES Department(dept_id)
);
CREATE TABLE Project (
proj_id INT PRIMARY KEY,
proj_name VARCHAR(100) NOT NULL,
proj_budget DECIMAL(12, 2) NOT NULL
);
CREATE TABLE Assignment (
emp_id INT NOT NULL,
proj_id INT NOT NULL,
hours_worked DECIMAL(6, 2) NOT NULL DEFAULT 0,
role VARCHAR(50) NOT NULL,
PRIMARY KEY (emp_id, proj_id),
FOREIGN KEY (emp_id) REFERENCES Employee(emp_id),
FOREIGN KEY (proj_id) REFERENCES Project(proj_id)
);
11.2 Verifying Normalization via Queries¶
To check for potential normalization issues in an existing database:
-- Check for potential 2NF violations: partial dependencies
-- If a non-key column has duplicate values correlated with part of a composite key
SELECT emp_id, COUNT(DISTINCT emp_name) AS name_count
FROM project_assignment_denormalized
GROUP BY emp_id
HAVING COUNT(DISTINCT emp_name) > 1;
-- If this returns rows, emp_name is inconsistently stored (update anomaly)
-- Check for potential 3NF violations: transitive dependencies
-- Columns that move together might indicate a missing entity
SELECT dept_id, COUNT(DISTINCT dept_name) AS names
FROM employee_denormalized
GROUP BY dept_id
HAVING COUNT(DISTINCT dept_name) > 1;
-- If this returns rows, dept_name is inconsistent for a given dept_id
12. Summary of Normal Forms¶
| Normal Form | Condition | Eliminates |
|---|---|---|
| 1NF | Atomic values, no repeating groups | Non-relational structures |
| 2NF | 1NF + no partial dependencies | Redundancy from partial key dependencies |
| 3NF | 2NF + no transitive dependencies | Redundancy from transitive dependencies |
| BCNF | Every determinant is a superkey | All FD-based redundancy |
Decision Flowchart¶
Start: Relation R with FDs F
Is R in 1NF?
โโโ No โ Remove non-atomic values and repeating groups
โโโ Yes โ
Is R in 2NF?
โโโ No โ Remove partial dependencies (separate out attributes
โ that depend on part of the key)
โโโ Yes โ
Is R in 3NF?
โโโ No โ Remove transitive dependencies (separate out attributes
โ that depend on non-key attributes)
โโโ Yes โ
Is R in BCNF?
โโโ No โ Check: is dependency preservation acceptable to lose?
โ โโโ Yes โ Decompose using BCNF algorithm
โ โโโ No โ Stay at 3NF
โโโ Yes โ Done!
13. Exercises¶
Exercise 1: Identifying Normal Forms¶
For each relation, identify the highest normal form (1NF, 2NF, 3NF, or BCNF):
a) R(A, B, C, D), Key: {A, B}, FDs: A โ C, AB โ D
b) R(A, B, C), Key: {A}, FDs: A โ B, B โ C
c) R(A, B, C, D), Key: {A}, FDs: A โ BCD
d) R(A, B, C), Keys: {A, B} and {A, C}, FDs: AB โ C, AC โ B, B โ C, C โ B
Solution
**a)** A โ C is a partial dependency (C depends on part of key {A,B}). **1NF** (not 2NF). **b)** A โ B (direct from key, OK), B โ C (transitive: A โ B โ C). Not 3NF. But no partial dependency (single-attribute key), so 2NF. **2NF** (not 3NF). **c)** Only FD is from the key. Every determinant (A) is a superkey. **BCNF**. **d)** B โ C: B is not a superkey, but C is a prime attribute (part of key {A,C}). So 3NF holds. B is not a superkey, so BCNF fails. **3NF** (not BCNF).Exercise 2: 3NF Synthesis¶
Apply the 3NF synthesis algorithm to:
R(A, B, C, D, E) with F = { A โ B, BC โ D, D โ E, E โ C }
Solution
**Step 1: Minimal cover** Decompose RHS: already single attributes. Check extraneous LHS in BC โ D: - Remove B: {C}โบ = {C}. D โ {C}โบ. Keep B. - Remove C: {B}โบ = {B}. D โ {B}โบ. Keep C. Check redundant FDs: - A โ B: {A}โบ without AโB = {A}. B โ {A}โบ. Keep. - BC โ D: {B,C}โบ without BCโD = {B,C}. D โ {B,C}โบ. Keep. - D โ E: {D}โบ without DโE = {D}. E โ {D}โบ. Keep. - E โ C: {E}โบ without EโC = {E}. C โ {E}โบ. Keep. F_min = { A โ B, BC โ D, D โ E, E โ C } **Step 2: Create schemas (group by LHS)** - Rโ(A, B) from A โ B - Rโ(B, C, D) from BC โ D - Rโ(D, E) from D โ E - Rโ(E, C) from E โ C **Step 3: Check for candidate key** {A}โบ = {A, B}. Not all. {A, C}โบ = {A, B, C, D, E}. All! Candidate key: {A, C}. {A, E}โบ = {A, B, C, D, E} (EโC, AโB, BCโD, DโE). All! Candidate key: {A, E}. {A, D}โบ = {A, B, D, E, C}. All! Candidate key: {A, D}. None of Rโ-Rโ contains {A,C}, {A,E}, or {A,D} entirely. - Rโ = {A,B}: no - Rโ = {B,C,D}: no A - Rโ = {D,E}: no A - Rโ = {E,C}: no A Add Rโ = {A, C} (or {A, D} or {A, E}). **Step 4: Remove subsets** Rโ(E, C) โ Rโ(B, C, D)? No (E not in Rโ). No subsets to remove. **Final decomposition:**Rโ(A, B) โ key: {A}
Rโ(B, C, D) โ key: {B, C}
Rโ(D, E) โ key: {D}
Rโ(E, C) โ key: {E}
Rโ
(A, C) โ key: {A, C} (candidate key of R)
Exercise 3: BCNF Decomposition¶
Decompose into BCNF:
R(A, B, C, D) with F = { AB โ C, C โ A, C โ D }
Solution
Candidate keys: {A,B} and {B,C} (verify: {A,B}โบ = {A,B,C,D}, {B,C}โบ = {A,B,C,D}). Check BCNF: - AB โ C: {A,B} is a superkey. โ - C โ A: {C}โบ = {A,C,D}. C is not a superkey. **BCNF violation!** - C โ D: Same issue. **BCNF violation!** Decompose on C โ A (or C โ AD): - {C}โบ โฉ {A,B,C,D} = {A,C,D} - Rโ = (A, C, D) with key {C} - Rโ = {C} โช ({A,B,C,D} - {A,C,D}) = (B, C) with key {B,C} Check Rโ(A, C, D): - C โ A: C is a key of Rโ. โ - C โ D: C is a key of Rโ. โ - BCNF โ Check Rโ(B, C): - No non-trivial FDs with determinant that's not a superkey. - BCNF โ **BCNF decomposition: Rโ(A, C, D), Rโ(B, C)** Dependency preservation: AB โ C requires joining Rโ and Rโ. **Not preserved.**Exercise 4: Lossless-Join Verification¶
Verify whether the following decomposition has the lossless-join property:
R(A, B, C, D) with F = { A โ B, B โ C }
Decomposition: Rโ(A, B), Rโ(A, C), Rโ(B, D)
Solution
Use the chase test: Initial matrix: | | A | B | C | D | |----|----|-----|-----|-----| | Rโ | aโ | aโ | bโโ | bโโ | | Rโ | aโ | bโโ | aโ | bโโ | | Rโ | bโโ| aโ | bโโ | aโ | Apply A โ B: Rโ and Rโ agree on A (= aโ). - Rโ.B = aโ, Rโ.B = bโโ. Rโ has distinguished. Set Rโ.B = aโ. | | A | B | C | D | |----|----|----|-----|-----| | Rโ | aโ | aโ | bโโ | bโโ | | Rโ | aโ | aโ | aโ | bโโ | | Rโ | bโโ| aโ | bโโ | aโ | Apply B โ C: Rโ, Rโ, Rโ agree on B (= aโ). - C values: bโโ, aโ, bโโ. Has aโ. Set all to aโ. | | A | B | C | D | |----|----|----|----| ----| | Rโ | aโ | aโ | aโ | bโโ | | Rโ | aโ | aโ | aโ | bโโ | | Rโ | bโโ| aโ | aโ | aโ | No more changes from further iterations. Check rows: No row has all distinguished symbols. Row Rโ has bโโ, Row Rโ has bโโ, Row Rโ has bโโ. **The decomposition is NOT lossless-join.** โ The problem: Rโ(B, D) shares only B with the others, and B is not a key of any relation containing D's determining attributes. A correct decomposition: Rโ(A, B), Rโ(B, C, D) โ this is lossless since B โ C holds and {B} is a key of Rโ restricted to {B,C}. Actually, wait: B โ D is not given. The FDs are only A โ B and B โ C. So D has no determining FD. Let's reconsider: {A}โบ = {A,B,C}. The key must include D somehow: key = {A, D}. Better decomposition: Rโ(A, B, C) and Rโ(A, D). Common = {A}. A โ {B,C}. {A} is a key of Rโ. Lossless โ.Exercise 5: Full Normalization¶
Normalize the following to 3NF using the synthesis algorithm:
Library(isbn, title, author_id, author_name, publisher_id,
publisher_name, publisher_city, branch_id, branch_name, copies)
FDs:
isbn โ title, author_id, publisher_id
author_id โ author_name
publisher_id โ publisher_name, publisher_city
{isbn, branch_id} โ copies
branch_id โ branch_name
Solution
**Step 1: Minimal cover** Decompose RHS:isbn โ title, isbn โ author_id, isbn โ publisher_id,
author_id โ author_name,
publisher_id โ publisher_name, publisher_id โ publisher_city,
{isbn, branch_id} โ copies,
branch_id โ branch_name
Book(isbn, title, author_id, publisher_id) โ key: {isbn}
Author(author_id, author_name) โ key: {author_id}
Publisher(publisher_id, publisher_name, publisher_city) โ key: {publisher_id}
BranchStock(isbn, branch_id, copies) โ key: {isbn, branch_id}
Branch(branch_id, branch_name) โ key: {branch_id}
Exercise 6: Anomaly Identification¶
Given the following relation and sample data, identify specific update, insertion, and deletion anomalies:
CourseSection(course_id, section, semester, instructor, building, room)
FDs: {course_id, section, semester} โ instructor, building, room
building, room โ capacity (assume capacity is also an attribute)
| course_id | section | semester | instructor | building | room | capacity |
|-----------|---------|----------|------------|----------|------|----------|
| CS101 | 1 | Fall24 | Dr. Smith | Watson | 101 | 50 |
| CS101 | 2 | Fall24 | Dr. Jones | Watson | 101 | 50 |
| CS201 | 1 | Fall24 | Dr. Smith | Watson | 201 | 30 |
| CS201 | 1 | Spr25 | Dr. Smith | Taylor | 105 | 40 |
Solution
**Update anomaly**: If the capacity of Watson 101 changes (e.g., renovation adds seats), we must update multiple rows (rows 1 and 2). If we update only row 1, rows 1 and 2 become inconsistent. **Insertion anomaly**: We cannot record that Taylor 302 has capacity 60 unless there is a course section scheduled in that room. There's no way to store room information independently. **Deletion anomaly**: If CS201 Section 1 Spring 2025 is cancelled (delete row 4), we lose the information that Taylor 105 has capacity 40 (assuming no other section uses that room). **Root cause**: The transitive dependency {course_id, section, semester} โ {building, room} โ capacity creates redundancy. **Fix**: Decompose into:CourseSection(course_id, section, semester, instructor, building, room)
Room(building, room, capacity)
14. Summary¶
| Concept | Key Idea |
|---|---|
| 1NF | Atomic values only โ foundation of relational model |
| 2NF | No partial dependencies โ every non-prime attribute depends on the full key |
| 3NF | No transitive dependencies โ non-prime attributes depend only on keys |
| BCNF | Every determinant is a superkey โ strictest FD-based form |
| Lossless Join | Natural join recovers original data โ mandatory |
| Dependency Preservation | All FDs checkable without joins โ desirable but sometimes sacrificed for BCNF |
| 3NF Synthesis | Guarantees both lossless join and dependency preservation |
| BCNF Decomposition | Guarantees lossless join; may lose dependency preservation |
Normalization through BCNF handles all redundancy caused by functional dependencies. However, there are other types of dependencies โ multivalued dependencies and join dependencies โ that require higher normal forms. We explore these in the next lesson.
Previous: 05_Functional_Dependencies.md | Next: 07_Advanced_Normalization.md