Special Edition Using
Microsoft Visual InterDev

Chapter 30

Database Programming Basics

About two years ago, my brother (and coauthor) and I were planning our trip to Europe. We used the Internet almost exclusively for our trip planning. On a particular day I visited the American Airlines Web site to see if they had any package deals. I was quite impressed by the overall content and aesthetics of their pages. I clicked on the "airfares" hyperlink to see what they could offer us. I entered in our anticipated departure date and city as well as our arrival city, and then entered the return information. I pressed the Submit button and within a few seconds I had a pretty good idea of how much we were going to be spending on airfare--ouch!

In addition, I realized I could review my frequent flyer mileage from one of the pages (that also hurt) after entering in my password, of course. Little did I realize how much time I had spent on their Web site. In fact, I talked to a few of my friends, non-computer professionals, and they too were impressed with the Web site. My friends were most impressed that they could enter in a couple pieces of information and, in return, could see the results of their request. I showed them other sites that did similar things. My friends were enthralled. I was dumbfounded. Why? Because as a database consultant, developer, and programmer, I have developed much more complicated databases than those my friends were seeing on the Web. Interestingly, when I would proudly show my friends the handiwork that my brother and I had done for a client, all I would get is a courteous "That's nice." My friends would then tell me to turn off my computer so they could have some real fun. With my database developer ego deflated, I would close the database, turn off my computer, and leave.

As I reflected on these experiences, I came to appreciate that deep down people do get excited about databases, so long as they are easy to use and produce the results they want. The most likely way the average consumer will be exposed to a database, unbeknownst to them, is on the Internet. Therefore, today's Internet professionals, who may not have touched a database before, now have to develop them for their companies' intranets and Internet sites. After all, business on the Internet has become highly competitive. Visually pleasing Web pages will draw consumer into your Internet site, databases giving consumers what they request will keep them there, and quick responses will make them want to return.

If you look at the latest versions of desktop database programs such as Microsoft Access, Microsoft Visual FoxPro, Borland Paradox, and Lotus Approach, you will find that each of these uses HTML conversion wizards to convert forms and reports to HTML format. On the high end of the database spectrum, Microsoft SQL Server, Oracle, and others incorporate HTML conversion wizards as well. Clearly, modern intranet and Internet professionals have to learn database programming and development.


TIP: Databases represent the lifeblood of companies. A poorly designed database can easily cause financial havoc, among other things, to an otherwise stable company. Therefore the importance of proper database development practices cannot be stressed enough: no amount of programming code, Web page elegance, or ActiveX controls can salvage the damage that can come as a result of a poorly designed database.


What Is a Database?

The purpose of a database is to store information for quick and easy retrieval. We are seeing a marriage made in heaven--a medium designed for the exchange of information (the Internet) is being merged with a tool to store and retrieve information (databases).

Understanding relational database design is essential to employing this cutting-edge technology that will revolutionize how information is served to the public.

The origin of relational database design has been repeated in every book that talks about databases, so this section will very briefly describe the origin of the relational model.

In the mid-70s, E.F. Codd, who was working for IBM at the time, wrote A Relational Model of Data for Large Shared Data Banks. This document revolutionized the database industry. It provided a model for database organization that was a major contrast to the other theories of the time. It is on this model that the majority of databases are based.

The Relational Database Model, unlike previous database models, only deals with how the data is presented. This model does not deal with how the data is actually stored. The data is to be presented as tables, discussed next. Each column is considered a field. Each row is considered a record. The only redundant data that results from this model are columns that are common to two or more tables. These columns provide for the joining of two tables. It is important to note that the term relational was used to state that one group of related information was placed in one table.

Tables

If you have already installed Visual InterDev or any other piece of commercial software, you understand the concepts of what tables are and how they work. During the installation process, you were required to enter the serial number or CD-key that identifies your copy of Visual InterDev as unique. Somewhere in Microsoft's database, possibly in the Products table, is the serial number 1000 and the product description--Visual InterDev. Therefore, in a relational database, a table is a collection of unique instances of similar data. Now let us go a little deeper into how information is stored in tables.

A table is a set of columns and rows. Columns describe the attributes or characteristics of the primary key (discussed later). In the above example of a Products table, one column might be called CDKey_ID; another might be called Description. Each row of values, divided into columns, will always contain at least one unique value. These rows are called records. A column is a vertical collection of fields; a row is a horizontal collection of fields. Figure 30.1 shows an example of designing a table. Figure 30.2 shows the result of a designed table with sample data.

Figure 30.1
Designing a table in MS SQL Server 6.5.


Figure 30.2
The completed table with sample data.

Fields

A field is a single unit of information. Normally, you will refer to a specific column in a row as the CDKey_ID field or the Description field. At the field level the user interacts with the table by inputting data. Once you have input data into a field, it is then evaluated to see whether it is of the correct data type. If it is not the correct data type, an error occurs and the database prevents the user from leaving the field until a value of the correct data type is input.

What type of data will be stored in a table column? When you create a new column, database programs such as Microsoft SQL Server, Oracle, and DB2 will require you to choose the data type from binary, datetime, decimal, image, or varchar, just to name a few. This information tells the database how it is to interpret the data in the column. To see an example of selecting a data type for a column, please refer back to Figure 30.1.


CAUTION:
It is very important that the correct data type is chosen. A common mistake is to store the zip code as a type of number. It is true that numbers make up a zip code, but it is not a true number. When you implement Zip+4, this becomes obvious. Now you are storing a "-" in the field. If the zip code were a number, you would be subtracting a four-digit number from a five-digit number, and your package or letter would not arrive at the correct address. The U.S. Postal Service has improved considerably over the past few years, however. The more accurate data type is a "char."


TIP: The whole idea around choosing the correct data type is to choose the type that is closest to the description of the data. For example, you could store a dollar amount as a number, but money is a closer description. You should choose number as the type of data only if you will make calculations on the data.


Primary Keys

Primary keys are often called the "row identifiers," because they uniquely identify rows. Without a primary key, relational database design is all but lost.

Each table has only one primary key. You should note that the primary key could be made of multiple fields. This is known as a composite key. A simple key is a primary key comprising a single field.

Choosing a primary key is a difficult decision. Of all the fields that will be entered into a record, you should have at least one candidate field from which to choose a primary key. As its name suggests, a candidate field is a field that is a candidate to be a primary key. Another issue in the decision of choosing a primary key is that primary keys cannot have null values.

Let's consider this in the context of a customer table. If you look at Figure 30.3 you will see several customers. Any Social Security number corresponding to the one shown is purely coincidental.

Figure 30.3
Choosing a primary key requires effort.


The following fields in table Figure 30.3 above are candidate fields: CustomerID, Customer Name, Phone, and Social Security Number. There are advantages and disadvantages of choosing any one field as a simple primary key. Choosing a composite key is not required because we have at least one candidate field. Let us review more fully the purpose of a primary key.

The primary key marks a particular record as different from all the others. The primary key references this record in this table with another record in another table. What makes a field a good candidate field? If the field is unique, unambiguous, data-less, and unchanging it is a good candidate field.

The CustomerID field is the field that is most often chosen by developers. This method guarantees that the data is unique. The problem is that one customer could have two CustomerID's. For example, it is possible to have two records that are the same person, as seen in Figure 30.3. If you have two David Hicks in the database it could be the same person, it may not be. You simply do not know. I have personally seen a customer database for a large software company where it was common to see 12 different CustomerID's for one customer. If you are at all interested in keeping a history this is not a good solution.

The Customer Name is a good candidate field. A major problem with it is that if you look in your phone book, you can find many John Smiths, for example. Another problem, although lesser, deals with the possibility of a name change. This is more common with your female customers, because of marriage.

The Phone field is also a good candidate field. It is a better candidate than a Customer Name in that a phone number can belong to only one household. But there can be problems--I have had my telephone number for a few years now, but I will be changing it within the next few months. Another problem is that if John orders something and his wife Jane orders something later, your records will not reflect an actual customer, but a household. An added problem is that of new area codes. Think of how many area codes there are in the Los Angeles area. It used to be a lot fewer.

The Social Security Number is a phenomenal candidate field if it is applicable to your business. A Social Security Number can refer to only one person. The chance of a person having it changed is virtually nil. There are difficulties, though, in using the Social Security Number as the primary key. The main difficulty is in having your customers reveal their Social Security Numbers. Unless you are in the medical or financial field, people are reluctant to give out their Social Security Numbers.

Foreign Keys

A foreign key is a primary key from another table. Therefore, the values of the foreign key column match that of a primary key column of another table. For example, in an invoice table, you would store an invoice number and a customer number. The customer number would be a foreign key, because this value would be the primary key of a customer table. This will have more meaning in the next sections.

Indexes

Indexes are used to increase the speed of information retrieval from a database, at the expense of hard drive space. Indexes are based upon the values of a particular column. The database stores a list of values in a column, and the location of those values in a particular table. By doing this, the database can search an index for a value, instead of the entire table. In addition to the increased disk space required for an index, insertion, updating, and deletion of data are slowed. These procedures are slowed because the database must also maintain the location of the values. When choosing to use an index, the following things must be considered:

When you deal with higher-end database applications, such as Microsoft SQL Server, Oracle, and DB2, you are given a choice of index types, clustered and non-clustered. Let us discuss these index types further.

Clustered Indexes

A clustered index actually stores the data in a logical order. For this reason, only one clustered index is allowed per table. In general, these indexes are the fastest. However, the data takes up 20 percent more space in the database than if the data did not have a clustered index. Desktop databases such as Microsoft Access and Borland Paradox use the primary key as a clustered index.

Clustered indexes are best for queries returning the following:

In addition, they work well for keyed fields, primary keys being the best.

Non-Clustered Indexes

A non-clustered index stores pointers to the actual data. The index tells the database every location of a particular value. Unlike the clustered index, more than one non-clustered index is allowed. This type of index uses less disk space, but is faster than scanning the entire table. Desktop databases such as Microsoft Access and Borland Paradox use non-clustered indexes for non-primary keyed fields.

Even though databases allow many non-clustered indexes, you should have less than four indexes per table.

Non-clustered indexes are best for the following:

Normalized Database Development

In this section, we will discuss the single most important part of database design--data normalization. Data normalization is the breaking up of data into tables of related data, keeping repeats and redundancy to minimal levels. You will see from the first two rules of data normalization (discussed later), that the terms "repeats" and "redundancy" are not redundant in themselves.

Normalization provides the following benefits:


CAUTION: There is no substitute for proper data normalization. Without it, your database will prove to be unwieldy, buggy, fragile, liable to crash, and miserable to maintain. In addition, lack of data normalization has resulted in inaccurate data reporting and data errors. Cute buttons, music, and animation on your company's Web site cannot help your database. There is no need to panic; we are simply expressing the importance of following good database design.




There are 12 rules for the Relational Data Model and 333 subsets. We will provide suggestions and give you plenty of examples of how to follow these rules, simplified into 5 steps. Before normalizing your database, we have provided the following steps to make the normalization process faster:
  1. Have a list of all the fields you will be storing in this database. It would be a good idea to leave room for more fields. It is common to get requests near design completion for more data to be stored in the database. Be sure to confirm that all the fields your company wants in the database are present before any data is imported or input.

  2. Find your static data. Static data is data that will rarely change. If you call it a type, status, or priority, it is static data. You will have varying degrees of staticity. Staticity is the condition of being static or unchanging. For example, the probability of the United States adding anymore states is low, and so a table of states has a high staticity rating. Make a table for these highly static items. For example, have a table that stores the status of a project; a table storing types of insurance; and table storing the names and abbreviations of the States of the Union. These are called lookup tables.

  3. Group your data together in related sets. For example, in an order entry database, you would have product data in one table, customer data in another, and invoice data in yet another.

  4. Go through and find everything that will have multiple items. For example, one invoice will have many items, but only one total. Separate out this data. You should be aggressive in finding this data.


Be Aggressive when Separating Data into Tables
All it takes is for one item suddenly to become two to ruin your whole day. Case in point: I was working on a construction defect project. A law firm wanted a database of document summaries. They wanted to store document IDs, the author, the issues, the recipients, and later the storage location. Now, from reading the previous sentence you would assume that you should separate the issues and the recipients from the main document data. Yes, that is correct. Yet there is more to consider. It turns out that a one-page letter really can have more than one author. Who knew? What is more, the company would make copies of the documents and put them in separate locations so certain documents would stay together. Don't learn this lesson the hard way; be aggressive in separating data. You are probably wondering what I did to solve the problem. I made a few queries to separate the data and I was back in business--after some significant reprogramming.




After you perform the previous steps, you can go through the database normalization rules. The aforementioned steps should eliminate 90 percent of the work required to implement the 12 rules of data normalization.

Normalizing Data

The following steps provide an easy way to follow the 12 rules of data normalization. Each of the steps corresponds to the widely accepted rules of data normalization.

Step 1--Eliminate repeating data. When you are dealing with invoices, you may have ship-to data, product data, and the total invoice amount, as well as other data. Figure 30.4 shows a table that is in desperate need of normalization.

Figure 30.4
This table really needs to be normalized.


Notice that there is repeating data, and it is confusing. Is the freight for each item, or is that the total freight? This table is a nightmare. This table is perfect for our demonstration. The following fields are being repeated: OrderID, CustomerID, OrderDate, ShipName, and FreightCharge. Something needs to be separated. From this one table, we will create another table. We will have two invoice tables: Invoice Header, and Invoice Footer. We will store the following fields in Invoice Header: OrderID, CustomerID, OrderDate, ShipName, and FreightCharge. In Invoice Footer we will store OrderID, ProductID, ProductDescription, ProductSupplier, Quantity, UnitPrice, and LineTotal. Figure 30.5 shows the two described tables. You will notice that the Invoice Footer still has repeating data in the form of an OrderID. This is completely unavoidable because it is a detail table.

Figure 30.5
Order table after Step 1 has been applied.


Step 2-öþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþ

Step 3--Eliminate columns that do not describe the key. It is now time to review our keys. Let's look at our current tables in Figure 30.7 so we can identify the key fields.

Figure 30.7
Identifying primary keys.

You will remember that the key field is the unique identifier for a record. Both Invoice Header and Products are easy for selecting key fields. For the Invoice Header, the key field is the OrderID. For the Products, it is the ProductID. In the Invoice Footer, there is no one field that is unique for all the records (yes, I did that on purpose). Because of this, we will have to select two fields for a composite key. We will select OrderID and ProductID. Actually, the keys should be selected before you get to this point. If you look at the Invoice Footer, you will notice a ProductSupplier field. This has absolutely nothing to do with the Invoice. It does relate to the Product, though, so let's move it to that table, as shown in Figure 30.8.

Figure 30.8
Product with Supplier data.

Step 4--Eliminate multiple relationships. Let's say that Red Descent wanted to get a market share of Seagrit's Micro Disk. They both sell the same exact product just different prices. It would be a violation of Step 1 to have two "201" products in the Products table. Therefore, we create a Product-Supplier table. This allows for many suppliers for one product. This is illustrated in Figure 30.9.

Figure 30.9
Product and Product-Suppliers tables.


Step 5--Eliminate related multiple relationships. This particular rule is rarely needed and is usually provided for in Step 4. The example we used for Rule 4 would apply to this rule. This is because the product relates to the supplier. The example in Step 4 can be taken a step further by storing the manufacturer as well. Then if some regulation is passed which states that, for instance, if a supplier supplies one product from a manufacturer, it must supply all products from a manufacturer, you are ready. In such a case, Step 5 would make inserting data faster.

De-Normalized Database Development

Why would you de-normalize a normalized database? Did you notice that the Customer table does not follow Step 3? This is because the city and staöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþöþnormalize, you create redundant data. If you update data a lot, your database will have to maintain two copies of the data. This will cause performance degradation. So de-normalize only when you know you will have performance improvements. In short, normalized data is standard and provides for the fastest inserts, updates, and deletes. De-normalized data, if de-normalized properly, provides for the fastest retrieval.

Data de-normalization provides for easier creation of ad hoc reports by end users. Because the data is duplicated, complex queries on which to base a report are not required.


CAUTION:
De-normalizing a database is a major undertaking. It is recommended that de-normalization be done by experienced database personnel only.


TIP: One easy approach to de-normalization is to create a summary table for values that are not likely to be updated. For example, you can create a table to store aggregates of weekly sales figures. The summary data can be computed and inserted once all the appropriate values have been entered.


Over-Normalization

Over-normalization is a technique to increase database performance. It is easier to employ than de-normalization. Over-normalization involves splitting tables. Horizontal partitioning and vertical partitioning are the two techniques of over-normalization. In general, over-normalization is much easier to maintain than de-normalization. Furthermore, of the two over-normalization techniques, vertical partitioning is the easier. This is because horizontal partitioning requires a decision on which table to insert, update, delete, or retrieve which fields from. For example, you might retrieve name and address information from one table and other information from another table. On the other hand, when vertical partitioning is used it is much easier to make a decision on which table to insert a particular record into. For example, you only have to decide which of two tables to deal with all information on a customer. Let us now look at the two over-normalization techniques.

Horizontal Partitioning

Horizontal partitioning involves splitting a table based on record. This reduces the number of rows per table. Implementation is based on the table you choose. Most tables have a type of control number as a primary key. By having a certain range of numbers in one table and the remaining numbers in a second table, a horizontal partition has been made. For example, in an Invoice History containing one million records, you could place invoice numbers 1 to 500,000 in one Invoice History table, and records 500,001 to 1,000,000+ in a second Invoice History table. This number could be adjusted to account for a certain date, so that a report comparing last year's sales with this year's sales could be performed faster. Figure 30.10 shows an example of horizontal partitioning.

Figure 30.10
Customer IDs numbered 100 to 250 have been moved left in the original Customer table. Customer IDs numbered 251 to 750 have been moved to a new Customer table.

Vertical Partitioning

Vertical partitioning involves splitting a table based on column. This reduces the number of columns per table. Implementation is based on the table you choose. A decision must be made as to which columns are being used more than others are. From there, you can create one table of data that is being used more often, and another to store the data that is used less often. For example, you could store a property name, number, type, and reference code in one table, and its address, statistics, and other less used information in a separate table. Figure 31.11 shows an example of vertical partitioning.

Figure 30.11
This is an example of how security can be implemented by splitting confidential information from public information.



TIP: Vertical partitioning can also be used to help implement security. For example, you could split a single table into one confidential information table and one nonconfidential information table. This can protect information that you would not want to divulge by allowing most users to see the nonconfidential information, and the users with higher security to access the confidential data. Moreover, you can allow most users to only deal with a customer's reference number, and give those users with higher access the ability to view the more descriptive data such as name and address. See Figure 30.11 for an example.


Referential Integrity

Once you have set the proper amount of normalization, it is time to set up referential integrity. Referential integrity is what keeps your data together. Let's say your Invoice Footer table has Invoice Number 118 but your Invoice Header table does not. You have an orphaned piece of data. I prefer to use the term illegitimate, because it makes it sound as bad as it is. This illegitimate data is completely useless to you because you do not know to whom the product was sent. You do not know when it was ordered. Referential integrity prevents this travesty. Note: this is acceptable if it involves lookup data. For example, if you have an employee table and an employee leaves, you could remove that employee from the lookup table so his or her name would not be used in future records. However, it is a travesty for things like Invoice Numbers, Customer Numbers, and other critical data that must be linked together.

Referential integrity uses the primary key and the foreign key of different tables. Referential integrity requires data that is in a foreign key to be present in a primary key of a different table. Let's look at Figure 30.12. You will notice that at the center of all other tables is tblProperty. The prefix "tbl" tells anyone who looks at this that it is a table. Note: Figure 30.12 is a graphical representation of relationships between tables. The graphical relationship design was done in Microsoft Access. High-end databases such as Microsoft SQL Server and Oracle do not have the benefit of graphical relationship design. Their relationships are set programmatically or at table design. In Microsoft SQL Server they are called constraints.

Figure 30.12
Despite its appearance, this relationship is the cleanest for the number of tables shown.


tblProperty has lines coming out of it to nine different tables. There are lines connecting it to tblPropertyIns, tblPropertyWarranty, tblPropertyGoals, tblPropertyMaintain, tblPropertyOps, tblPropertyOwner, tblPropertyMan, and tblPropertyPersProp. You'll notice that at each end of the line is either a 1 or a [yen].

From Here…

In this chapter you learned database design fundamentals using the relational model. You now know how tables work and the important roles the primary key plays. You also came to appreciate how indexes work to speed searches. You learned how to normalize a database and why de-normalization and over-normalization may be a better solution when retrieval speed is of the utmost importance. You also learned not only how ODBC works but also how ODBC retrieves data.