Saturday, July 12, 2008

RELATIONAL DATA MODEL

Data represented in the form of two-dimensional tables or relations
Each column of the table is an attribute of the object set represented
Each row of the table (called a tuple) is an instance of the object set
The order of rows and columns is immaterial
No duplicate rows and no duplicate column names
Example: STUDENT (Student-Id, Name, SSN, Major)


Codd's Rules for RDBMS

In 1985, Edgar Codd published a set of 13 rules which he defined as an evaluation scheme for a product which claimed to be a Relational DBMS.

Although these rules were later extended – and they now number in the 100s – they still serve as a foundation for a definition of a Relational DBMS.

The foundation rule

This is Codd's Rule Zero which is to foundation for the other rules.

The rules states that:

any system which claims to be a relation database management system must be able to manage databases entirely through its relational capabilities
and means that the RDBMS must support:

A data definition language

A data manipulation language

A data integrity language

A data control language

A Data Control Language is a computer language for controlling access to data in a database. Examples of DCL commands are GRANT and REVOKE.

all of which must work on several records at a time, that is, on a relation.

Rule 1 : The information Rule.

"All information in a relational data base is represented explicitly at the logical level and in exactly one way - by values in tables."

Everything within the database exists in tables and is accessed via table access routines.

Information rule

All information in a RDB must be represented explicitly at the logical level in just one way, that is, by values in tables.


Rule 2 : Guaranteed access Rule.

"Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name."

To access any data-item you specify which column within which table it exists, there is no reading of characters 10 to 20 of a 255 byte string.

Guaranteed access rule


Each item of data in a RDB must be guaranteed to be logically accessible by using a combination of table name, primary key name, and primary key value.

Rule 3: Systematic treatment of null values.

"Null values (distinct from the empty character string or a string of blank characters and distinct from zero or any other number) are supported in fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type."

If data does not exist or does not apply then a value of NULL is applied, this is understood by the RDBMS as meaning non-applicable data.

Systematic nulls rule

Null values are distinct from an empty character string, a string of blank characters, a zero or any other number, and must be supported by representing missing and/or inapplicable information in a systematic way.

Rule 4 : Dynamic on-line catalog based on the relational model.

"The data base description is represented at the logical level in the same way as-ordinary data, so that authorized users can apply the same relational language to its interrogation as they apply to the regular data."

The Data Dictionary is held within the RDBMS, thus there is no-need for off-line volumes to tell you the structure of the database.

Dynamic catalogue rule

The description of the DB must be represented at the logical level just like ordinary data. Authorised users must be able to use the data manipulation language to interrogate the DB in the same way that they interrogate ordinary data.

Rule 5 : Comprehensive data sub-language Rule.

"A relational system may support several languages and various modes of terminal use (for example, the fill-in-the-blanks mode). However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and that is comprehensive in supporting all the following items

Data Definition
View Definition
Data Manipulation (Interactive and by program).
Integrity Constraints
Authorization.

Every RDBMS should provide a language to allow the user to query the contents of the RDBMS and also manipulate the contents of the RDBMS.

Comprehensive data sub-language rule


A RDB must support at least one language which is capable of supporting:


Data definition

View definition

Data manipulation

Interactively

By program in a conventional host language

Integrity constraints

Authorisation

Transaction boundaries

Rule 6 : .View updating Rule

"All views that are theoretically updatable are also updatable by the system."

Not only can the user modify data, but so can the RDBMS when the user is not logged-in.

View updating rule

All views of the data which are theoretically updatable must be updatable in practice by the DBMS.

Rule 7 : High-level insert, update and delete.

"The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update and deletion of data."

The user should be able to modify several tables by modifying the view to which they act as base tables.

High-level language rule

The capability of handling a relation as a single operation applies to:

The retrieval of data
and also:

The insertion of data

The updating of data

The deletion of data
and these must be possible:

Interactively

By program in a conventional host language

Rule 8 : Physical data independence.

"Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods."

The user should not be aware of where or upon which media data-files are stored

Physical data independence rule

Application systems and terminal activities must be logically unimpaired whenever any changes are made to the way in which the physical data is stored, the storage structures which are used, or the access methods.

Rule 9 : Logical data independence.

"Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit un-impairment are made to the base tables."

User programs and the user should not be aware of any changes to the structure of the tables (such as the addition of extra columns).

Logical data independence rule

Application systems and terminal activities must be logically unimpaired whenever any changes are made to the way in which the logical data is organised, such as when a table is dispersed to several tables to optimised storage or performance.

Rule 10 : Integrity independence.

"Integrity constraints specific to a particular relational data base must be definable in the relational data sub-language and storable in the catalog, not in the application programs."

If a column only accepts certain values, then it is the RDBMS which enforces these constraints and not the user program, this means that an invalid value can never be entered into this column, whilst if the constraints were enforced via programs there is always a chance that a buggy program might allow incorrect values into the system.

Integrity independence rule

Integrity constraints must be definable in the RDB sub-language and stored in the system catalogue and not within individual application programs.

Rule 11 : Distribution independence.

"A relational DBMS has distribution independence."

The RDBMS may spread across more than one system and across several networks, however to the end-user the tables should appear no different to those that are local.

Distribution independence rule

Application systems and terminal activities must be logically unimpaired whenever the data is redistributed amongst several locations on a data communications network.

Rule 12 : Non-subversion Rule.

"If a relational system has a low-level (single-record-at-a-time) language, that low level cannot be used to subvert or bypass the integrity Rules and constraints expressed in the higher level relational language (multiple-records-at-a-time)."

Non-subversion rule


If the DB has any means of handling a single record at a time, that low-level of working must not be able to subvert or avoid the integrity rules which are expressed in a higher-level language which handles multiple records at a time

The RDBMS should prevent users from accessing the data without going through the Oracle data-read functions.
In Rule 5 Codd stated that an RDBMS required a Query Language, however Codd does not explicitly state that SQL should be the query tool, just that there should be a tool, and many of the initial products had their own tools, Oracle had UFI (User Friendly Interface), Ingres had QUEL (QUery Execution Language) and the never released DB1 had a language called sequel, the acronym SQL is often pronounced such as it was sequel that provided the core functionality to SQL.
Even when the vendors eventually all started offering SQL the flavours were/are all radically different and contained wildly varying syntax. This situation was somewhat resolved in the late 80's when ANSI brought out their first definition of the SQL syntax.
This has since been upgraded to version 2 and now all vendors offer a standard core SQL, however ANSI SQL is somewhat limited and thus all RDBMS providers offer extensions to SQL which may differ from vendor to vendor.


Keys in a Relation

A key is a minimal set of attributes that uniquely identifies each row
A composite key consists of more than one attribute
If there is more than one key for a relation, they are all called candidate keys
The candidate key that means most to the user is usually selected as the primary key
Key attributes cannot be null
Ex: GRADE (Student-Id, Course-Id, Grade_Letter)

Functional Dependencies

A functional dependency is a relationship between attributes
Examples:
Student ID determines Major:
StuID => Major
StuID => (Name, Major)
(StuID, Course) => Grade
Attribute(s) to the left of arrow called determinant(s)

Anomalies in a Relation

A table that meets minimum definition of a relation may not have an effective structure
An anomaly is a weakness in the way a relation is set up
Consider the following table:
Id Name Major Course-Id Course-Desc
112 Boyd IS MGS404 Database Management
112 Boyd IS MGQ302 Operations Management
236 Smith MK MGS405 Systems Analysis.
This table has all three anomalies: insert, update, and delete

Insert, Update and Delete Anomalies

Insert anomaly caused when a new course needs to be inserted that is not registered by a student
Update anomaly caused when Major is updated in one row only for Id 112.
Delete anomaly caused when Id 236 removed from the table; we lose MGS405 course description
Anomalies are avoided by splitting the offending relation into multiple relations (decomposition)

Normalization

Normalization Rules provide the mechanism for the decomposition of rules to avoid anomalies. The various forms of Normalization from First to Fourth convert complex data structures into simple, stable data structures
Normalization Forms higher than Fourth are not useful enough and not considered in this course
We will also study briefly domain/key normal form (DK/NF) which avoids having to look for anomalies and higher normal forms

First Normal Form (1NF)

A relation is said to be in First Normal Form if there is no attribute value occurring as a set of values are repeating groups. Example:
STUDENT (Student-Id, Name, Major, Course1-Id, Course1-Name, Course2-Id, Course2-Name...)
To put the above table in 1NF, extend the course data downward into multiple rows
STUDENT (Student-Id, Name, Major, Course-Id, Course-Name, Grade)

Second Normal Form (2NF)

A relation is said to be in 2NF if no non-key attribute is functionally dependent on just part of the key
Relations with single attribute keys (i.e. not composite keys) are already in 2NF
Decomposition method is used to convert to 2NF
STUDENT (Student-Id, Student-Name, Major)
COURSE (Course-Id, Course-Name, Instructor, Office)
REGISTRATION (Student-Id, Course-Id, Grade)

Third Normal Form (3NF)

A transitive dependency occurs when a non-key attribute is dependent on one or more other non-key attributes: A => B, A => C, and B => C; where A is the key, B and C are non-key attributes
A relation is in 3NF if it is already in 2NF and there are no transitive dependencies
In the table COURSE (Course-Id, Course-Name, Instructor, Office), Instructor => Office is transitive dependency; change to:
COURSE (Course-Id, Course-Name, Instructor)
INSTRUCTOR (Instructor-Name, Office)

Boyce-Codd Form (3NF) – Examples

A more restricted version of 3NF (known as Boyce-Codd Normal Form) requires that the determinant of every functional dependency in a relation be a key - for every FD: X => Y, X is a key
Consider the following relation:
STU-MAJ-ADV (Student-Id, Major, Advisor)
Advisor => Major, but Advisor is not a key
Boyce-Codd Normal Form for above:
STU-ADV (Student-Id, Advisor)
ADV-MAJ (Advisor, Major)

Fourth Normal Form (4NF)

A multi-valued dependency exists when there are at least three attributes A, B, and C in a relation and for each value of A there is a well-defined set of values for B, and a well-defined set of values for C, but the set of values of B is independent of set C
A relation is in 4NF if it is already in 3NF and has no multi-valued dependencies
Every possible combination of the two multi-valued attributes have to be stored in the database thus leading to redundancy and consequent anomalies

Fourth Normal Form (4NF) – Example

Course-Id Instructor Textbook
MGS404 Clay Hansen
MGS404 Clay Kroenke
MGS404 Drake Hansen
MGS404 Drake Kroenke
By placing the multi-valued attributes in tables by themselves, we can convert the above to 4NF
Change to:
COURSE-INST (Course-Id, Instructor)
COURSE-TEXT (Course-Id, Textbook)

Domain/Key Normal Form (DK/NF)

A relation is in DK/NF if every constraint on the relation is a logical consequence of the definition of keys and domains
Constraint could be an edit rule, a functional dependency, or a multivalued dependency
Key is a unique identifier of a tuple / record / row
Domain is a physical and semantic description of an attribute’s allowed values
Unfortunately, no known algorithm to convert a relation to DK/NF

Domain/Key Normal Form (DK/NF) – Example

Consider the relation:
PROFESSOR (FID, FName, Class, SID, SName)
Constraints: FID => Fname; FID =>=> Class SID => Sname; SID => FID
FID must start with 1; SID must not start with 1
Domain Definitions: FID in CDDD where C = 1
SID in CDDD where C not = 1
Relation and Key Definitions:
FACULTY (FID, FName)
PREPARATION (FID, Class)
STUDENT (SID, SName, FID)

One-to-One Attribute Relationships

If A => B and B => A, then A and B have one-to-one attribute relationship
These must occur together in at least one relation R
Either A or B must be the key of R
An attribute C can be added to R if either A => C or B => C; if not, C cannot be added
A and B must not occur together in relations other than R because that would be redundant
Example: FACULTY (FID, FName) assuming FName is unique. Salary attribute can be added

Many-to-One Attribute Relationships

If A => B but B not => A, then A and B have many-to-one attribute relationship
A and B can occur in a relation R but A must be the key of R
An attribute C can be added to R if either A => C; if not, C cannot be added
Example: SID => Advisor-FID can be in relation:
STUDENT (SID, Advisor-FID)
Student-Name can be added to the above relation but not Advisor-Name.

Many-to-Many Attribute Relationships

If A not => B and B not => A, then A and B have many-to-many attribute relationship
A and B can occur together in a relation R, but the key must be the combination (A,B)
An attribute C can be added to R if either (A,B) => C; if not, C cannot be added
Example: PREPARATION (FID, Class)
Times-Taught can be added to this relation because it is functionally dependent on both (FID, Class)

De-Normalization

Normalization is achieved at the expense of database performance
De-Normalization is done to avoid processing multiple tables for often-used queries
Example: CUSTOMER (CustNum, CustName, City, State, Zip)
where Zip => (City, State) and violates 3NF
Normal Form would be
CUSTOMER (CustNum, CustName, ZIP)
ZIPCODES (Zip, City, State)

No comments: