About two years ago, my brother (and coauthor) and I were planning our trip to Europe. We used the Internet almost exclusively for our trip planning. On a particular day I visited the American Airlines Web site to see if they had any package deals. I was quite impressed by the overall content and aesthetics of their pages. I clicked on the "airfares" hyperlink to see what they could offer us. I entered in our anticipated departure date and city as well as our arrival city, and then entered the return information. I pressed the Submit button and within a few seconds I had a pretty good idea of how much we were going to be spending on airfare--ouch!
In addition, I realized I could review my frequent flyer mileage from one of the pages (that also hurt) after entering in my password, of course. Little did I realize how much time I had spent on their Web site. In fact, I talked to a few of my friends, non-computer professionals, and they too were impressed with the Web site. My friends were most impressed that they could enter in a couple pieces of information and, in return, could see the results of their request. I showed them other sites that did similar things. My friends were enthralled. I was dumbfounded. Why? Because as a database consultant, developer, and programmer, I have developed much more complicated databases than those my friends were seeing on the Web. Interestingly, when I would proudly show my friends the handiwork that my brother and I had done for a client, all I would get is a courteous "That's nice." My friends would then tell me to turn off my computer so they could have some real fun. With my database developer ego deflated, I would close the database, turn off my computer, and leave.
As I reflected on these experiences, I came to appreciate that deep down people do get excited about databases, so long as they are easy to use and produce the results they want. The most likely way the average consumer will be exposed to a database, unbeknownst to them, is on the Internet. Therefore, today's Internet professionals, who may not have touched a database before, now have to develop them for their companies' intranets and Internet sites. After all, business on the Internet has become highly competitive. Visually pleasing Web pages will draw consumer into your Internet site, databases giving consumers what they request will keep them there, and quick responses will make them want to return.
If you look at the latest versions of desktop database programs such as Microsoft
Access, Microsoft Visual FoxPro, Borland Paradox, and Lotus Approach, you will find
that each of these uses HTML conversion wizards to convert forms and reports to HTML
format. On the high end of the database spectrum, Microsoft SQL Server, Oracle, and
others incorporate HTML conversion wizards as well. Clearly, modern intranet and
Internet professionals have to learn database programming and development.
TIP: Databases represent the lifeblood of companies. A poorly designed database can easily cause financial havoc, among other things, to an otherwise stable company. Therefore the importance of proper database development practices cannot be stressed enough: no amount of programming code, Web page elegance, or ActiveX controls can salvage the damage that can come as a result of a poorly designed database.
The purpose of a database is to store information for quick and easy retrieval. We are seeing a marriage made in heaven--a medium designed for the exchange of information (the Internet) is being merged with a tool to store and retrieve information (databases).
Understanding relational database design is essential to employing this cutting-edge technology that will revolutionize how information is served to the public.
The origin of relational database design has been repeated in every book that talks about databases, so this section will very briefly describe the origin of the relational model.
In the mid-70s, E.F. Codd, who was working for IBM at the time, wrote A Relational Model of Data for Large Shared Data Banks. This document revolutionized the database industry. It provided a model for database organization that was a major contrast to the other theories of the time. It is on this model that the majority of databases are based.
The Relational Database Model, unlike previous database models, only deals with how the data is presented. This model does not deal with how the data is actually stored. The data is to be presented as tables, discussed next. Each column is considered a field. Each row is considered a record. The only redundant data that results from this model are columns that are common to two or more tables. These columns provide for the joining of two tables. It is important to note that the term relational was used to state that one group of related information was placed in one table.
If you have already installed Visual InterDev or any other piece of commercial software, you understand the concepts of what tables are and how they work. During the installation process, you were required to enter the serial number or CD-key that identifies your copy of Visual InterDev as unique. Somewhere in Microsoft's database, possibly in the Products table, is the serial number 1000 and the product description--Visual InterDev. Therefore, in a relational database, a table is a collection of unique instances of similar data. Now let us go a little deeper into how information is stored in tables.
A table is a set of columns and rows. Columns describe the attributes
or characteristics of the primary key (discussed later). In the above example
of a Products table, one column might be called CDKey_ID; another might be called
Description. Each row of values, divided into columns, will always contain at least
one unique value. These rows are called records. A column is a vertical collection
of fields; a row is a horizontal collection of fields. Figure 30.1 shows an example
of designing a table. Figure 30.2 shows the result of a designed table with sample
data.
Figure 30.1
Designing a table in MS SQL Server 6.5.
Figure 30.2
The completed table with sample data.
A field is a single unit of information. Normally, you will refer to a specific column in a row as the CDKey_ID field or the Description field. At the field level the user interacts with the table by inputting data. Once you have input data into a field, it is then evaluated to see whether it is of the correct data type. If it is not the correct data type, an error occurs and the database prevents the user from leaving the field until a value of the correct data type is input.
What type of data will be stored in a table column? When you create a new
column, database programs such as Microsoft SQL Server, Oracle, and DB2 will require
you to choose the data type from binary, datetime, decimal, image, or varchar, just
to name a few. This information tells the database how it is to interpret the data
in the column. To see an example of selecting a data type for a column, please refer
back to Figure 30.1.
CAUTION:
It is very important that the correct data type is chosen. A common mistake is to store the zip code as a type of number. It is true that numbers make up a zip code, but it is not a true number. When you implement Zip+4, this becomes obvious. Now you are storing a "-" in the field. If the zip code were a number, you would be subtracting a four-digit number from a five-digit number, and your package or letter would not arrive at the correct address. The U.S. Postal Service has improved considerably over the past few years, however. The more accurate data type is a "char."
TIP: The whole idea around choosing the correct data type is to choose the type that is closest to the description of the data. For example, you could store a dollar amount as a number, but money is a closer description. You should choose number as the type of data only if you will make calculations on the data.
Primary keys are often called the "row identifiers," because they uniquely identify rows. Without a primary key, relational database design is all but lost.
Each table has only one primary key. You should note that the primary key could be made of multiple fields. This is known as a composite key. A simple key is a primary key comprising a single field.
Choosing a primary key is a difficult decision. Of all the fields that will be entered into a record, you should have at least one candidate field from which to choose a primary key. As its name suggests, a candidate field is a field that is a candidate to be a primary key. Another issue in the decision of choosing a primary key is that primary keys cannot have null values.
Let's consider this in the context of a customer table. If you look at Figure
30.3 you will see several customers. Any Social Security number corresponding to
the one shown is purely coincidental.
Figure 30.3
Choosing a primary key requires effort.
The following fields in table Figure 30.3 above are candidate fields: CustomerID,
Customer Name, Phone, and Social Security Number. There are advantages and disadvantages
of choosing any one field as a simple primary key. Choosing a composite key is not
required because we have at least one candidate field. Let us review more fully the
purpose of a primary key.
The primary key marks a particular record as different from all the others. The primary key references this record in this table with another record in another table. What makes a field a good candidate field? If the field is unique, unambiguous, data-less, and unchanging it is a good candidate field.
The CustomerID field is the field that is most often chosen by developers. This method guarantees that the data is unique. The problem is that one customer could have two CustomerID's. For example, it is possible to have two records that are the same person, as seen in Figure 30.3. If you have two David Hicks in the database it could be the same person, it may not be. You simply do not know. I have personally seen a customer database for a large software company where it was common to see 12 different CustomerID's for one customer. If you are at all interested in keeping a history this is not a good solution.
The Customer Name is a good candidate field. A major problem with it is that if you look in your phone book, you can find many John Smiths, for example. Another problem, although lesser, deals with the possibility of a name change. This is more common with your female customers, because of marriage.
The Phone field is also a good candidate field. It is a better candidate than a Customer Name in that a phone number can belong to only one household. But there can be problems--I have had my telephone number for a few years now, but I will be changing it within the next few months. Another problem is that if John orders something and his wife Jane orders something later, your records will not reflect an actual customer, but a household. An added problem is that of new area codes. Think of how many area codes there are in the Los Angeles area. It used to be a lot fewer.
The Social Security Number is a phenomenal candidate field if it is applicable to your business. A Social Security Number can refer to only one person. The chance of a person having it changed is virtually nil. There are difficulties, though, in using the Social Security Number as the primary key. The main difficulty is in having your customers reveal their Social Security Numbers. Unless you are in the medical or financial field, people are reluctant to give out their Social Security Numbers.
A foreign key is a primary key from another table. Therefore, the values of the foreign key column match that of a primary key column of another table. For example, in an invoice table, you would store an invoice number and a customer number. The customer number would be a foreign key, because this value would be the primary key of a customer table. This will have more meaning in the next sections.
Indexes are used to increase the speed of information retrieval from a database, at the expense of hard drive space. Indexes are based upon the values of a particular column. The database stores a list of values in a column, and the location of those values in a particular table. By doing this, the database can search an index for a value, instead of the entire table. In addition to the increased disk space required for an index, insertion, updating, and deletion of data are slowed. These procedures are slowed because the database must also maintain the location of the values. When choosing to use an index, the following things must be considered:
When you deal with higher-end database applications, such as Microsoft SQL Server, Oracle, and DB2, you are given a choice of index types, clustered and non-clustered. Let us discuss these index types further.
A clustered index actually stores the data in a logical order. For this reason, only one clustered index is allowed per table. In general, these indexes are the fastest. However, the data takes up 20 percent more space in the database than if the data did not have a clustered index. Desktop databases such as Microsoft Access and Borland Paradox use the primary key as a clustered index.
Clustered indexes are best for queries returning the following:
In addition, they work well for keyed fields, primary keys being the best.
A non-clustered index stores pointers to the actual data. The index tells the database every location of a particular value. Unlike the clustered index, more than one non-clustered index is allowed. This type of index uses less disk space, but is faster than scanning the entire table. Desktop databases such as Microsoft Access and Borland Paradox use non-clustered indexes for non-primary keyed fields.
Even though databases allow many non-clustered indexes, you should have less than four indexes per table.
Non-clustered indexes are best for the following:
In this section, we will discuss the single most important part of database design--data normalization. Data normalization is the breaking up of data into tables of related data, keeping repeats and redundancy to minimal levels. You will see from the first two rules of data normalization (discussed later), that the terms "repeats" and "redundancy" are not redundant in themselves.
Normalization provides the following benefits:
CAUTION: There is no substitute for proper data normalization. Without it, your database will prove to be unwieldy, buggy, fragile, liable to crash, and miserable to maintain. In addition, lack of data normalization has resulted in inaccurate data reporting and data errors. Cute buttons, music, and animation on your company's Web site cannot help your database. There is no need to panic; we are simply expressing the importance of following good database design.
Be Aggressive when Separating Data into Tables
All it takes is for one item suddenly to become two to ruin your whole day. Case in point: I was working on a construction defect project. A law firm wanted a database of document summaries. They wanted to store document IDs, the author, the issues, the recipients, and later the storage location. Now, from reading the previous sentence you would assume that you should separate the issues and the recipients from the main document data. Yes, that is correct. Yet there is more to consider. It turns out that a one-page letter really can have more than one author. Who knew? What is more, the company would make copies of the documents and put them in separate locations so certain documents would stay together. Don't learn this lesson the hard way; be aggressive in separating data. You are probably wondering what I did to solve the problem. I made a few queries to separate the data and I was back in business--after some significant reprogramming.
The following steps provide an easy way to follow the 12 rules of data normalization. Each of the steps corresponds to the widely accepted rules of data normalization.
Step 1--Eliminate repeating data. When you are dealing with invoices, you may
have ship-to data, product data, and the total invoice amount, as well as other data.
Figure 30.4 shows a table that is in desperate need of normalization.
Figure 30.4
This table really needs to be normalized.
Notice that there is repeating data, and it is confusing. Is the freight for each
item, or is that the total freight? This table is a nightmare. This table is perfect
for our demonstration. The following fields are being repeated: OrderID, CustomerID,
OrderDate, ShipName, and FreightCharge. Something needs to be separated. From this
one table, we will create another table. We will have two invoice tables: Invoice
Header, and Invoice Footer. We will store the following fields in Invoice Header:
OrderID, CustomerID, OrderDate, ShipName, and FreightCharge. In Invoice Footer we
will store OrderID, ProductID, ProductDescription, ProductSupplier, Quantity, UnitPrice,
and LineTotal. Figure 30.5 shows the two described tables. You will notice that the
Invoice Footer still has repeating data in the form of an OrderID. This is completely
unavoidable because it is a detail table.
Figure 30.5
Order table after Step 1 has been applied.
Step 2-öþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþ
Step 3--Eliminate columns that do not describe the key. It is now time to review
our keys. Let's look at our current tables in Figure 30.7 so we can identify the
key fields.
Figure 30.7
Identifying primary keys.
You will remember that the key field is the unique identifier for a record. Both
Invoice Header and Products are easy for selecting key fields. For the Invoice Header,
the key field is the OrderID. For the Products, it is the ProductID. In the Invoice
Footer, there is no one field that is unique for all the records (yes, I did that
on purpose). Because of this, we will have to select two fields for a composite key.
We will select OrderID and ProductID. Actually, the keys should be selected before
you get to this point. If you look at the Invoice Footer, you will notice a ProductSupplier
field. This has absolutely nothing to do with the Invoice. It does relate to the
Product, though, so let's move it to that table, as shown in Figure 30.8.
Figure 30.8
Product with Supplier data.
Step 4--Eliminate multiple relationships. Let's say that Red Descent wanted to get
a market share of Seagrit's Micro Disk. They both sell the same exact product just
different prices. It would be a violation of Step 1 to have two "201" products
in the Products table. Therefore, we create a Product-Supplier table. This allows
for many suppliers for one product. This is illustrated in Figure 30.9.
Figure 30.9
Product and Product-Suppliers tables.
Step 5--Eliminate related multiple relationships. This particular rule is rarely
needed and is usually provided for in Step 4. The example we used for Rule 4 would
apply to this rule. This is because the product relates to the supplier. The example
in Step 4 can be taken a step further by storing the manufacturer as well. Then if
some regulation is passed which states that, for instance, if a supplier supplies
one product from a manufacturer, it must supply all products from a manufacturer,
you are ready. In such a case, Step 5 would make inserting data faster.
Why would you de-normalize a normalized database? Did you notice that the Customer table does not follow Step 3? This is because the city and staöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþnormalize, you create redundant data. If you update data a lot, your database will have to maintain two copies of the data. This will cause performance degradation. So de-normalize only when you know you will have performance improvements. In short, normalized data is standard and provides for the fastest inserts, updates, and deletes. De-normalized data, if de-normalized properly, provides for the fastest retrieval.
Data de-normalization provides for easier creation of ad hoc reports by end users.
Because the data is duplicated, complex queries on which to base a report are not
required.
CAUTION:
De-normalizing a database is a major undertaking. It is recommended that de-normalization be done by experienced database personnel only.
TIP: One easy approach to de-normalization is to create a summary table for values that are not likely to be updated. For example, you can create a table to store aggregates of weekly sales figures. The summary data can be computed and inserted once all the appropriate values have been entered.
Over-normalization is a technique to increase database performance. It is easier to employ than de-normalization. Over-normalization involves splitting tables. Horizontal partitioning and vertical partitioning are the two techniques of over-normalization. In general, over-normalization is much easier to maintain than de-normalization. Furthermore, of the two over-normalization techniques, vertical partitioning is the easier. This is because horizontal partitioning requires a decision on which table to insert, update, delete, or retrieve which fields from. For example, you might retrieve name and address information from one table and other information from another table. On the other hand, when vertical partitioning is used it is much easier to make a decision on which table to insert a particular record into. For example, you only have to decide which of two tables to deal with all information on a customer. Let us now look at the two over-normalization techniques.
Horizontal partitioning involves splitting a table based on record. This reduces
the number of rows per table. Implementation is based on the table you choose. Most
tables have a type of control number as a primary key. By having a certain range
of numbers in one table and the remaining numbers in a second table, a horizontal
partition has been made. For example, in an Invoice History containing one million
records, you could place invoice numbers 1 to 500,000 in one Invoice History table,
and records 500,001 to 1,000,000+ in a second Invoice History table. This number
could be adjusted to account for a certain date, so that a report comparing last
year's sales with this year's sales could be performed faster. Figure 30.10 shows
an example of horizontal partitioning.
Figure 30.10
Customer IDs numbered 100 to 250 have been moved left in the original Customer table.
Customer IDs numbered 251 to 750 have been moved to a new Customer table.
Vertical partitioning involves splitting a table based on column. This reduces
the number of columns per table. Implementation is based on the table you choose.
A decision must be made as to which columns are being used more than others are.
From there, you can create one table of data that is being used more often, and another
to store the data that is used less often. For example, you could store a property
name, number, type, and reference code in one table, and its address, statistics,
and other less used information in a separate table. Figure 31.11 shows an example
of vertical partitioning.
Figure 30.11
This is an example of how security can be implemented by splitting confidential information
from public information.
TIP: Vertical partitioning can also be used to help implement security. For example, you could split a single table into one confidential information table and one nonconfidential information table. This can protect information that you would not want to divulge by allowing most users to see the nonconfidential information, and the users with higher security to access the confidential data. Moreover, you can allow most users to only deal with a customer's reference number, and give those users with higher access the ability to view the more descriptive data such as name and address. See Figure 30.11 for an example.
Once you have set the proper amount of normalization, it is time to set up referential integrity. Referential integrity is what keeps your data together. Let's say your Invoice Footer table has Invoice Number 118 but your Invoice Header table does not. You have an orphaned piece of data. I prefer to use the term illegitimate, because it makes it sound as bad as it is. This illegitimate data is completely useless to you because you do not know to whom the product was sent. You do not know when it was ordered. Referential integrity prevents this travesty. Note: this is acceptable if it involves lookup data. For example, if you have an employee table and an employee leaves, you could remove that employee from the lookup table so his or her name would not be used in future records. However, it is a travesty for things like Invoice Numbers, Customer Numbers, and other critical data that must be linked together.
Referential integrity uses the primary key and the foreign key of different tables.
Referential integrity requires data that is in a foreign key to be present in a primary
key of a different table. Let's look at Figure 30.12. You will notice that at the
center of all other tables is tblProperty. The prefix "tbl" tells anyone
who looks at this that it is a table. Note: Figure 30.12 is a graphical representation
of relationships between tables. The graphical relationship design was done in Microsoft
Access. High-end databases such as Microsoft SQL Server and Oracle do not have the
benefit of graphical relationship design. Their relationships are set programmatically
or at table design. In Microsoft SQL Server they are called constraints.
Figure 30.12
Despite its appearance, this relationship is the cleanest for the number of tables
shown.
tblProperty has lines coming out of it to nine different tables. There are lines
connecting it to tblPropertyIns, tblPropertyWarranty, tblPropertyGoals, tblPropertyMaintain,
tblPropertyOps, tblPropertyOwner, tblPropertyMan, and tblPropertyPersProp. You'll
notice that at each end of the line is either a 1 or a [yen].
In this chapter you learned database design fundamentals using the relational model. You now know how tables work and the important roles the primary key plays. You also came to appreciate how indexes work to speed searches. You learned how to normalize a database and why de-normalization and over-normalization may be a better solution when retrieval speed is of the utmost importance. You also learned not only how ODBC works but also how ODBC retrieves data.