Using MySQL, Normalisation
Workshop Requirements
You should have experience of using databases and be familiar with retrieving data from multiple tables using SQL joins. It is also advised that you have undertaken parts 1 -5 of this MySQL series so that you are familiar with the examples used so far.
Introduction
In the previous workshops in the MySQL series, I've hinted at elements of database theory (such as normalisation and foreign keys) without fully explaining these concepts and why people use them when building databases. This workshop aims to fill this gap, but from previous experience in trying to teach this material, I know that a lot of people when learning about normalisation have problems relating the theory to practice. This workshop aims to bridge this gap by providing numerous examples that illustrate both why we need to normalise at each step and also what effect is has on the data. It is not intended to be an indepth theoretical discussion of database theory, however that said we cannot normalise by studying the data alone and thus we will need to touch on a few other database concepts to fully understand what is going on.
Unusually for these workshops I'm going to give figure numbers to the examples used and store them in external files. This is due to the size of some of the tables, and it may be worth printing these figures before continuing.
- Figure One - Unnormalised Table - HTML | PDF
- Figure Two - Problematic First Normal Form - HTML | PDF
- Figure Three - Correct First Normal Form - HTML | PDF
- Figure Four - Correct Second Normal Form - HTML | PDF
- Figure Five - Partial Third Normal Form - HTML | PDF
- Figure Six - Third Normal Form - HTML | PDF
What is normalisation?
Normalisation is a process whereby the tables in a database are optimised to remove the potential for redundancy. Two main problems may arise if this is not done:
- Repeated data makes a database bigger.
- Multiple instances of the same values make maintaining the data more difficult and can create anomalies.
I'm not going to develop the idea of specific anomalies much here, but to briefly illustrate an update anomaly imagine the problem of updating a customer address. If there are multiple instances of that customer in a database, any query designed to update the address will have to update them all. That may seem easy - as we've seen in previous workshops, SQL is quite capable of updating information based on field values. But if one of those field values (say an instance of customer name) has been input incorrectly, then the query would fail to update the address in that row.
Normalisation aims to remove this redundancy by applying rules in a series of stages, splitting tables and creating relationships between unique identifiers (keys), to ensure that the database table structure is efficient, but data can still be accurately manipulated. It can also be used in conjunction with other database modeling techniques, such as Entity-Relationship diagrams, but for simple databases normalisation can sometimes be enough.
Before we begin looking at the stages of normalisation it is worth having a brief look at some of the relational database concepts that inform the process.
Relational Databases
Relational databases, such as MySQL, are so called as they rely on a series of relationships to connect data stored in different tables. For example in Workshop 5 we looked at joining tables based on a common ID number. This ID number was stored as the uniquely valued field (Primary Key) in one table (for example in the artists table), but was also present in another table (the cds table) acting as a reference (Foreign Key) that allowed data to be retrieved from both tables when the values matched. For ease of understanding we used unique numbers, but a key can have any value provided it conforms to a few rules.
Terminology
When were we were looking at the physical database in the other workshops we used certain terms such as TABLE, COLUMN, ROW and FIELD. Discussion of database theory uses different terms:
Relation/ Entity - These are the same as a table.
Attributes - These are similar to columns in that they describe the type of data stored.
Domain - This is values within the same attribute (a collection of fields that exist in the same column).
Tuple - This is a record similar to a row.
As I'm trying to tie the theoretical discussion to the practical database experience that the first Virtual Workshops provide, I will use both sets of terms in this workshops for clarity and to reinforce their meanings.
Different Keys
The Primary Key is frequently used to identify a specific row (although other keys may exist) and all the other values are dependent on that value to be meaningfully identified. A primary key is usually one attribute (column) in a relation (table), but can be a combination of attributes (Composite Key). If we consider the following attributes some cds may have.
Catalogue No. | Artist | Title |
The obvious Primary Key for a CDs table would be the Catalogue No. as each cd has already been given a unique identifier. But consider if we knew for a FACT that each artist could only release one CD per year (perhaps something a cold war communist five year plan would envisage ;-).
CDs | |||
Catalogue No. | Artist | Year | Title |
By adding the Year attribute we could also use the combination of Artist and Year to identify the row, as together they form a Composite Key. Thus we have two potential or Candidate Keys. We would probably still choose the Catalogue No. as it is simpler to use one column. Consider a Radio station wishing to record how many times a day they played a CD. The combination of Catalogue No. and the date would be sufficient to identify that record in a new Play History table.
Play History | ||
Catalogue No. | Date | Times Played |
In this situation the Catalogue No. number is a Primary Key in the CDs table but only part of the Composite Key in the Play History table as the date is required to form the whole key. An alternative could have been to create a new unique single-attribute Primary Key for the table. As we still need to know the catalogue number in order to work out how many times a DJ has played Britney, Christina or Pink (BBC Radio One is just rubbish isn't it? ;-). The catalogue number would become the Foreign Key in the Play History table, i.e. a Foreign Key is a attribute (column) in a table that refers to a Primary Key in another table.
Play History | |||
HistoryID | Catalogue No. | Date | Times Played |
During the Normalisation process we will look at the rules that determine how tables (and thus Keys) are formed.
NULLs
NULLs are used in databases to signify that no value exists, but are not really values themselves. Instead they are just an indictor that a particular field has no value. Some argue this is better that leaving a blank value or having a default value. I personally see little value in a blank field, but default values (such as inserting the current date if no date is offered) can be useful. The interesting point about NULLs is that no Primary Key can contain a NULL and thus it's useful when comparing Candidate Keys to see if one could potentially contain a NULL, thus disqualifying that key.
Functional Dependency
This is another important concept and describes the relationship that columns have to the Primary Key. If the value of a column can be determined by knowing the value of the Primary Key and no other, then that column is said to be functionally dependent. Or more simply, if we know the value of Primary Key we can find out the value of any other dependent column. This dependency is often written as follows (where A is the Primary Key and B is the dependent column):
A → B
So for the CDs table above we could express the dependency of the title column as either:
Catalogue No → Title or
(Artist, Year) → Title
This is important as we will see later as having a column that is NOT wholly dependent on the Primary Key causes redundancy. Having introduced some of the terminology of Normalisation we can begin to look at the stages.
Describing relations (tables)
When describing tables there is a way of expressing their structure that we will follow here. This includes the name of the relation, the attributes within the relation and which attributes are key. The format of this is:
RELATION (attribute one, attribute two, attribute three etc)
NOTE: The key attribute is underlined to signify its status and the relation is always UPPERCASE.
For example consider how we might express the CDs table above:
CDS (Catalogue No, Artist, Year, Title)
Stages of Normalisation
There are several stages of normalisation that a database structure can be subject to, each with rules that constrain the database further and each creating what is called a Normal Form. These are, in order:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce Codd Normal Form (BCNF)
- Fourth Normal Form (4NF)
- Fifth Normal Form (5FN)
As the database goes through each stage of normalisation it is said to be in the normal form of that stage, i.e. my database is 2NF and I need it to be 3NF. We are not going to take a detailed look at all of these Normal Forms as BCNF, 4NF and 5NF are probably overkill for small to medium databases, with the first 3 normal forms usually being sufficient. There is also one other stage - that of the Un-normalised Normal Form (UNF) which is the starting point for Normalisation. To begin our examination we must first create this Un-normalised data.
Our Scenario
In order to look at database normalisation we are going to use a fictional CD library (quaint in the post-napster internet age I know). Monitoring who has which CD is done by logging the CDs in and out of a loans logbook that contains the following information (with sample data):
Date Borrowed | Due Back | Borrower | Artist | Title |
1st Dec | 8th Dec | David Findlay | Britney Spears Pink Christina Aguilera |
Britney Cant Take Me Home Stripped |
8th Dec | 15th Dec | Iain Brown | Darius Christina Aguilera Will Young |
Dive In Stripped From Now On |
Loans Log
Information is also stored about each borrower in a separate register:
Borrower | Address | Telephone No. | |
David Findlay | Davies Lodge 23 Driven St Edinburgh EH5 APE |
0131 555 5579 | david@keithjbrown.co.uk |
Iain Brown | 54 Home St, Edinburgh, EH23 8TF |
0131 555 5580 | iain@keithjbrown.co.uk |
Borrower Log
Finally there is a ledger that stores information about the CDs:
Artist | Title | Year | Label | No. of Tracks | Copy |
Britney Spears | Britney | 2001 | Jive | 14 | 1 |
Pink | Cant Take Me Home | 2002 | LaFace | 15 | 3 |
Christina Aguilera | Stripped | 2002 | RCA | 20 | 1 |
Darius | Dive In | 2002 | Mercury | 13 | 2 |
Will Young | From Now On | 2002 | RCA | 13 | 2 |
CDs Log
We can see that by keeping the information in different places there has been some common sense organisation to avoid repeatedly entering the same data. This is quite common, but doesn't really help up as we need to use the normalisation rules to determine our structure. Thus to begin with we'll merge the fields (attributes) into a one-table structure along with the same sample data (figure one).
We have also created a column called 'Transaction No' which we use as a primary key to uniquely identify each lending transaction. We do this as there is no other simple method of identifying the row, as the obvious Composite Key is pretty complex (date borrowed, name, artist, title ), and thus assigning a unique 'Transaction No' is sensible at this time.
Note: As I mentioned above, a modeling tool like Entity Relationship diagramming is good to use before normalisation. This would be essential on larger database projects as it would be pretty impossible to create one big flat database to begin normalising. It is also worth noting that although we have included sample data, we are normalising the database structure and not the data itself. It can be easier however to spot potential problems when using example data.
So to begin the normalisation process, we use the Un-normalised data and apply our first rules.
First Normal Form (1NF)
For a relation (table) to conform be 1NF, it must obey the following rules:
To ensure that all attributes (columns) are atomic - what this means is that any single field should only have a value for ONE thing. For example in a 'cars' database a field in a car may have a value 'Blue Ford Focus'. This field is not atomic as it contains more than one piece of information (colour, manufacturer and model). To demostrate the problem with this, try to imagine constructing a query to select all Ford cars. This would be difficult as the value 'Ford' is trapped in the middle of the field. We must therefore split this column into three (colour, manufacturer, model).
There must be no repeating groups. This is perhaps most easily understood as there cannot be repeating columns (that which contain the same type of information). For example a cars field may contain a list of Ford cars (Focus, Mondeo, Puma). As we know we must also make this field atomic, but splitting the column into three (cars1, cars2, cars3) to hold these values isn't an option this time as that would create three domains that contain similar data. In practice this would make querying the databases difficult as we would have to create a statement to match potential values in 3 columns (or four, or five or a hundred if there was a list of a hundred). Splitting the field so that each value was in a separate row would require the other values in the rows to be repeated. The solution is to change the structure of the database by extracting the repeating groups into a new relation.
1NF in practice
Looking at the Un-normalised data we can see that there are non-atomic attributes in the following columns:
- Borrower
- Address
- The elements that make up the CD information (Artist, Title, Year, Label, No_of_Tracks and Copy)
All of these attributes will have to be normalised so the domain contains only atomic values. To begin, let's look at the Borrower attribute. If you look at a section of the table in figure one, the borrower contains a forename and a surname expressed as a single borrower. Thus we should split this attribute in two. The following excerpt from the table illustrates how this would be implemented:
TransactionNo. | Forename | Surname | Address | TeleNo |
1 | David | Findlay | Davies Lodge 23 Driven St Edinburgh EH5 APE |
0131 555 5579 |
2 | Iain | Brown | 54 Home St Edinburgh EH23 8TF |
0131 555 5580 |
Similarly the address should be broken down into different attributes. Anyone that has filled out an application form will be familiar with this structure.
- Address Line One
- Address Line Two (for the purposes of this workshop we'll stop at two address lines - in reality you may need more)
- City
- Postcode
Once more we can see the effect that this has on an excerpt of the table:
TransactionNo. | Forename | Surname | Address_One | Address_Two | City | Postcode |
1 | David | Findlay | Davies Lodge | 23 Driven St | Edinburgh | EH5 APE |
2 | Iain | Brown | 54 Home St | Edinburgh | EH23 8TF |
The Repeating Group
If we have a look at the remaining non-atomic attributes we can see that they are similar informational attributes about a CD. In order to make these attributes atomic we cannot repeat the trick of creating additional columns as each repeated column would hold values of a similar type. If we take the artist field for example:
TransactionNo. | (Other Attributes) | Artist_One | Artist_Two | Artist_Three |
1 | ... | Britney Spears | Pink | Christina Aguilera |
2 | ... | Darius | Christina Aguilera | Will Young |
If we split each original field into 3 we can see the value 'Christina Aguilera' appears twice, once in Artist_Two and once in Artist_Three. If we were searching to see who had borrowed CDs by Christina we'd have to create a query to search ALL the artist fields. Also if someone borrowed a 4th CD as part of a transaction we would have to create a fourth Artist column and so on. The other reason that repeating groups are not allowed to be split into columns is that we MUST know the number of columns required in order to create the database.
The alternative, for someone who knew nothing about normalisation, would be to just create extra rows and copy the values in non-repeating attributes - that way we could make sure that each field contained atomic values as seen in figure two.
This is a trap that we could fall into by looking at only the data. We have not actually changed the structure of the database and thus the problem of repeating groups still exists. This can be seen by the fact that the Primary Key is no longer unique. The correct solution is to remove the repeating groups (Artist, Title, Year, Label, No_of_Tracks and Copy) into a new 'Borrowed CDs' relation .
TransactionNo | Artist | Title | Year | Label | No_of_Tracks | Copy |
1 | Britney Spears | Britney | 2001 | Jive | 14 | 1 |
1 | Pink | Cant Take Me Home | 2002 | LaFace | 15 | 3 |
1 | Christina Aguilera | Stripped | 2002 | RCA | 20 | 1 |
2 | Darius | Dive In | 2002 | Mercury | 13 | 2 |
2 | Will Young | From Now On | 2002 | RCA | 13 | 2 |
2 | Christina Aguilera | Stripped | 2002 | RCA | 20 | 1 |
Because we need to be able to join this data back with that left in the original table we also need to include the TransactionNo in this Borrowed CDs relation which will act (along with a qualifying value) as Key for this relation. The qualifying value, allowing us to identify each CD in the library, combined with the transaction number will be able to tell us the specific CD borrowed per transaction. Thus we next need to define the qualifying value which will be the least number of attributes required to uniquely identify the CD. This could be (Artist, Title) if we were sure that this was a unique indentifer for a CD, but we already can see that there are different copies of the same CD. We could try adding Copy to the Key (Artist, Title, Copy), but in practice having a single unique value for each CD is preferable. Thus we will also create a cdID attribute in this relation and use that to represent the uniqueness of a CD.
After achieving First Normal Form our database has two relations as can be seen in figure three or can be expressed as follows:
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo, cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
Second Normal Form (2NF)
The second normal form has one rule that must be followed:
All non-key attributes must be dependant ON THE WHOLE KEY and not just one attribute of the key. This obviously only applies to Composite Keys and means that attributes in the relation that only contain information (and have no role in the structure of the database) must be functionally dependent on all parts of the Key. Another way of thinking about this is that if there are some attributes that are not dependent on all parts of the Key this means that they are in the wrong relation.
For example: Imagine we have a cars_bought table where each row (record) contains information about the car that a customer bought, which has a Primary Key of (CustomerID, Model). Information about the car (such as engine size, year etc) could be derived from just one part of the Key (Model) and thus isn't dependent on the whole key. Whereas if we wanted to find out the colour of the car the customer bought then we could search for the customer and this would give us the colour - assuming he has only bought one car. BUT what if the customer has ordered two cars, of different models and different colours. In order to find the colour of a specific car we would need to know the CustomerID AND the model number. Thus colour is dependent on the WHOLE Primary Key and not just a part of it. What this tells us is that colour is in the correct relation as it is specific to the customer and the model, whereas information about the car, which is dependent upon the Model alone, should be in a different table.
2NF in practice
When removing the partial dependencies from our database we can only look at tables that have a Composite Key and not a sole attribute as the Primary Key. As the loans Relation has a Primary Key of 'TransactionNo' this table doesn't require to be changed in this Normal Form. The second relation (Borrowed CDs) DOES have a Composite Key so we must examine the attributes of this table to see if they are all dependent on the whole Key.
We can see that ALL the CD information attributes are dependent on the cdID attribute rather than the whole Key. This is because they aren't really about the CDs borrowed and more about just the CDs bought and thus belong on their own relation - CDs Bought.
cdID | Artist | Title | Year | Label | No_of_Tracks | Copy |
1 | Britney Spears | Britney | 2001 | Jive | 14 | 1 |
2 | Pink | Cant Take Me Home | 2002 | LaFace | 15 | 3 |
3 | Christina Aguilera | Stripped | 2002 | RCA | 20 | 1 |
4 | Darius | Dive In | 2002 | Mercury | 13 | 2 |
5 | Will Young | From Now On | 2002 | RCA | 13 | 2 |
CDs Bought
And that means that we can remove all the rest of CD attributes from the Borrowed CDs table.
TransactionNo | cdID |
1 | 1 |
1 | 2 |
1 | 3 |
2 | 4 |
2 | 5 |
2 | 3 |
Borrowed CDs
The database in the second normal form will now have the relations as shown in figure four and is expressed as:
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS(TransactionNo, cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
CDS BOUGHT(cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
Third Normal Form (3NF)
To achieve Third Normal Form no attribute must be dependent on a non-key attribute. This means that every informational attribute must be DIRECTLY dependent on the Primary Key and not on another column. If we again look at an imagined cars database, a customer table contains information such as address, city, postcode and also a column called shipping cost. The value of shipping cost changes in relation to where the car should be delivered, and thus is not directly dependent on the customer, but the city. Thus we would need to create another separate relation to hold the information about cities and shipping costs.
3NF in practice
As is good practice we should examine every table in turn to see if the non-key attributes are indeed directly dependent on the Primary Key.
If we look at the loans relation we can see that the columns AddressOne, AddressTwo, City, Postcode, and E-mail are not dependent on the TransactionNo, but rather a Composite Key of Forename and Surname. Thus they should be removed from this table into a new relation Borrowers. As there is a possibility that there could be more that one David Findlay or Iain Brown, this Composite Key isn't very effective so once more we will replace it with a unique numeric reference (BorrowerID). This will become the new Primary Key of Borrowers and a foreign Key in the loans relation.
The effect that this has on the tables can be seen in figure five.
The Borrowed CDs relation only has two attributes, which together make up the Primary Key and thus is not required to be normalised. The final table CDs Bought can however be nornailised further. Year, Label and No of Tracks are dependent on a Composite Key of Artist + Title. Copy is dependent on the cdID i.e. that which relates a particular physical CD that has been purchased, rather than album details such as artist or label. Thus we can create another relation CD Releases with a releaseID that identifies an album of which there may be millions of copies.
releaseID | Artist | Title | Year | Label | No_of_Tracks |
1 | Britney Spears | Britney | 2001 | Jive | 14 |
2 | Pink | Cant Take Me Home | 2002 | LaFace | 15 |
3 | Christina Aguilera | Stripped | 2002 | RCA | 20 |
4 | Darius | Dive In | 2002 | Mercury | 13 |
5 | Will Young | From Now On | 2002 | RCA | 13 |
CD Releases Relation
We of course have to replace these dependent attributes in the CDs Bought relation with the releaseID.
cdID | releaseID | Copy |
1 | 1 | 1 |
2 | 2 | 3 |
3 | 3 | 1 |
4 | 4 | 2 |
5 | 5 | 2 |
CDs Bought Relation
To further illustrate the difference between the CDs and the releases consider that we have bought two new CDs that will be available to borrow. One is new to our database (Robbie Williams - Escapology) whereas the Christina Aguilera - Stripped release has been so popular we decided to buy another copy.
Adding the Robbie album would mean adding data to two relations, CDs Bought and CD Releases, like so.
releaseID | Artist | Title | Year | Label | No_of_Tracks |
6 | Robbie Williams | Escapology | 2002 | Chrysalis |
14 |
CD Releases Relation
cdID | releaseID | Copy |
6 | 6 | 12/12/2002 |
CDs Bought Relation
Whereas we would only need to add information about the Christina Aguilera album to the CDs Bought relation, as its release information has already been entered into the CD Releases relation.
cdID | releaseID | Copy |
7 | 3 | 12/12/2002 |
CDs Bought Relation
Thus after three levels of normalisation we have a table structure as shown in figure six and is expressed as so:
LOANS (TransactionNo, BorrowerID, Date_Borrowed, Due_Back)
BORROWER (BorrowerID, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail)
BORROWED CDS (TransactionNo, cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
CDS BOUGHT (cdID, releaseID, Copy)
CD RELEASES (releaseID, Artist, Title, Year, Label, No_of_Tracks)
Other Normal Forms
As mentioned above there are more Normal Forms that we could use to normalise our database further. We are not going to do this, because in practice the anomalies that they are designed to remove are rare. We will however have a brief look at BCNF so that there is at least some understanding of that norm should you come across it. If you are confused or struggling with the above, it would perhaps be worth skipping past this section until you are more confident about your understanding of the first three Normal Forms.
Boyce-Codd Normal Form (BCNF)
This often referred to as a strong 3NF and states that each determinant must be a Candidate Key. In a relation there may be more than just the Primary Key from which we can derive the values of other attributes, i.e. from another attribute or combination of attributes. These are known as Candidate Keys and in 3NF it is possible to have only some of the attributes functionally dependent on them. In BCNF all attributes must depend on all Candidate Keys.
Variations
While the rules of Normalisation stay the same the practical implementation of them differs. The method I've used above is the one that I think is easiest to understand for the beginner, but there are a couple of others worth mentioning in case you see them discussed elsewhere. It doesn't really matter which you use as the derived data structure is the same.
Composite Key in the First Relation
In the above example we had 1NF of :
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS(TransactionNo, cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
We are reminded that there was a Composite Key in the second relation (Borrowed CDs). We could have just as easily created the Composite Key in the first relation, where we remove (Artist, Title, Year, Label, No_of_Tracks, Copy) to a new relation called CDs Bought, create the Primary Key cdID for that relation and create a new attribute in the Loans relation that is a Foreign Key and the qualifying value for the TransactionNo. This would have duplicated the data in the non-repeating groups, which doesn't matter as we're normalising the database structure not the actual data, and that repetition is removed by the later Normal Forms. This would leave a 1NF of:
LOANS (cdID, TransactionNo, Forename, Surname, AddressOne,
AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
This means that in to achieve 2NF we no longer look at the second relation, but instead the first relation to determine if the attributes are functionally dependent on the whole Composite Key. All the attributes are functionally dependent on the TransactionNo. not the whole Key, so we remove them to a new relation called Transaction Info. 2NF is thus expressed:
LOANS (cdID, TransactionNo)
TRANSACTION INFO (TransactionNo, Forename, Surname, AddressOne, AddressTwo,
City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
We can see that this is the same data structure as the original method at 2NF, but we've just chosen different names due to the order in which a relation was first established.
1NF with only one relation
This is another method that is quite popular. If we recall the rules for 1NF they are:
- To ensure that all attributes (columns) contain only atomic values
- There must be no repeating groups
This says nothing about not repeating data or having to create new relations to resolve the problems. If we look at figure two, we can see that the data in the non-repeating attributes has indeed been repeated. The problem we had with this is that there was no longer a Primary Key. We can however make a Composite Key by using the same cdID as we did above to represent the uniqueness, except this time in just one relation. So 1NF would be:
LOANS (cdID, TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back, Artist, Title, Year, Label, No_of_Tracks, Copy)
The consequence of this is that to achieve 2NF would be a longer stage. Initially we can see (similar to previous examples) that the attributes (Artist, Title, Year, Label, No_of_Tracks, Copy) are dependant on only the cdID part of the key and thus can be removed to a new relation:
LOANS (cdID, TransactionNo, Forename, Surname, AddressOne,
AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
This creates a structure identical to that of the 'Composite Key in the First Relation' example in 1NF, where the loan information attributes are only dependent on the Transaction No, and thus we have to split the relation again:
LOANS (cdID, TransactionNo)
TRANSACTION INFO (TransactionNo, Forename, Surname, AddressOne, AddressTwo,
City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
The database structure is now 2NF.
Without A Joining Table and cdID
This produces the only truly normalised database schema in this workshop - and I'll explain why in a moment. But first what is a 'Joining Table'? This is a table that is an entirely Composite Key of attributes which are Foreign Keys that contain no informational data and thus its function is to join the Primary Keys of other tables together.
For example: in each of the database structures we have produced so far there has been a relation with two attributes (cdID, TransactionNo) so that if we look up a Transaction No we can retrieve the cds by the corresponding cdIDs. In strict normalisation this table should not exist in this form, it has come into existence because we created an artificial key (cdID) to represent the Composite Key (Artist, Title, Copy). This is something that is done quite frequently in practice (why I've done so above), but is theoretically wrong. To explain let's normalise the structure again.
Because we cannot replace the Composite Key elements with cdID in 1NF, we are left with a structure like so:
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo, Artist, Title, Copy,
Year, Label, No_of_Tracks)
As we can see, the Composite Key for the Borrowed CDs relation is now a lot more complicated. When we achieve 2NF we would need to create a structure like so:
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo, Artist, Title, Copy)
CDS BOUGHT (Artist, Title, Copy, Year, Label, No_of_Tracks)
This has meant that any join we perform using SQL becomes much more complicated, as we are having to match more fields. Consider the following sample MySQL Inner Join to retrieve all the information about a bought cd.
With cdID:
SELECT * FROM cds_bought, cds_released WHERE (cds_bought.cdID = cds_released.cdID);
Without cdID:
SELECT * FROM cds_bought, cds_released WHERE (cds_bought.Artist = cds_released.Artist) AND (cds_bought.Title = cds_released.Title) AND (cds_bought.Copy = cds_released.Copy);
As we observe, joining tables on three fields is more complicated than on one field.
NOTE: The names of tables and columns are slightly different as the physical implementation of any database is best done without spaces in the names, whereas in the theoretical design stage spaces aid readability.
Thus to summarise, we have replaced the 'true' Composite Key with an artificial Primary Key, in order to make things easier in the physical implementation, BUT THERE IS NO RULE IN NORMALISATION THAT SAYS YOU SHOULD.
Conclusion
This has hopefully been a pretty straightforward guide to a complicated (for the novice anyway) subject. I first got into this whole web design game from a media standpoint rather than a technical computer standpoint. When I started using databases on the web, I had to become familiar with concepts that are second nature to those from a formal technical background, which was tricky at first due to the target audience of the literature. Thus I hope that this VW can be of some use to those coming to normalisation from other backgrounds as well as those in IT seeking a little clarification.