Robert W. Weeks firstname.lastname@example.org, http://members.cox.net/bob.weeks
When designing a business information system, it's important to start with a good database design. A good design makes it possible-even easy-to answer the questions the business needs answered, and to generate the reports the business needs. A poor design makes these tasks difficult or impossible. Often, poor database designs-along with the data already entered and the programs, forms, and reports already written-are discarded in favor of a better design, resulting in lost time and money.
The goal of entity-relationship design is two-fold: to understand our data, and to design a database that properly represents this data. In the beginning, however, don't think of tables and databases. Instead, we concentrate on entities and attributes, and the relationships between them.
To use entity-relationship design, you need a good understanding of the business situation involved. You need to be able to identify the "things" that the business needs to track. You must also be able to identify the relationships between these things. In other words, you need to know about the business situation. Generally, you'll obtain this information through your own knowledge of the business, through interviews with the users and managers, by examining the reports and forms the business already uses, and through other techniques. It's likely that during the model-building process you'll learn additional things about the business and clarify the meanings of the data the business uses.
As an example of entity-relationship design, consider the case of the Pine Valley Video Tape Club, which is seeking to implement its records on a computer. The club has members (customers) and a collection of tapes available for rental. The purpose of the system is to track membership, tape inventory, and tape rental activity.
For each member, the club wants to keep track of these data items:
In real life, the club would need members' city, state, zip code, telephone number, credit card number, and perhaps other information.
For each tape, the club wants to keep track of these items:
When a member rents a tape, the club needs to keep track of these items:
Members must return rented tapes in four days, and each member has a limit of five tapes rented out at any time.
Here are some sample questions that the club wants to be able to answer:
Of course, other questions will arise as the business begins to use the database.
Let's examine this situation using the first rule for entity-relationship design:
Start with entities. These are the things that are important enough that the organization has given them names or identification numbers.
Entities are the "things" that a business keeps track of (although entities can also be events or actions). Examples of entities that might be found in businesses include customers (identified by customer number), suppliers (supplier number), members (membership number), students at WSU (social security number), checking or savings account (account number), parts (part number), repair orders (work order number), purchase orders (PO number), and invoices (invoice number). In our example, two entities that immediately stand out are the club members and the tapes that the club owns. When naming entities, it's traditional to use singular nouns such as Member and Tape for names.
Entities have attributes, which are individual facts or characteristics about an entity. For example, the specifications for the video tape club system indicate that each member has an identification number, name, address, date of membership, and the video tape rented along with the rental date. These pieces of information are attributes for the Member entity. Attributes are not usually broken down into smaller components; they are the atomic unit of data in a table.
An entity is an abstraction that stands for all the occurrences or instances of what the entity is tracking. An instance of the entity Member is a person, while an instance of the entity Tape represents one of the tapes the club owns.
When identifying entities, remember the second rule for good design:
An entity should have a significance of its own. Even if we ignore everything else in the database, the entity should still be important on its own.
In our example, a member (an instance of the Member entity) is important even if the member has yet to rent a tape. For example, a customer might enter the store and apply for club membership, but might not choose a tape to rent that day. We would still want to keep track of this member so that we have the information on hand when the member returns and does rent a tape. Similarly, a tape is important to track even if it hasn't been rented.
After choosing entities, draw them using the notation illustrated.
A third rule is:
Make sure that the entity has a primary key.
The Member entity represents all the people who are members of the video tape club. There must be a way to distinguish one member from another. The primary key serves this purpose.
The primary key is an attribute (or combination of two or more attributes) that is guaranteed to be unique for each occurrence or instance of an entity. In other words, we need to be able to distinguish each member or each tape in some way. For the Member entity, we might be tempted to make the Name attribute the primary key. The video club, however, would not have to have very many members before there would be two people with the same name, two Mike Smiths, for example. Therefore, we assign a unique identification number (call it MemberID) to each member. Some sort of system will generate unique MemberIDs as we add new members.
For the Tape entity, a candidate for the role of primary key is the Title attribute. All copies of a movie, however, have the same title, so the Title attribute by itself can't serve as a primary key. Create an attribute called TapeID and let it serve as the primary key. As we receive tapes from our suppliers, we will generate unique TapeIDs for each tape.
After deciding which attributes will serve as primary keys, underline them in the diagram as follows:
At this point, apply some tests to your design. One test that your design must eventually pass is:
There should be no multi-valued attributes. Make sure that each attribute of an entity is such that each instance of the entity can have at most one value for that attribute-not a list of values.
A multi-valued attribute is an attribute where for each instance of an entity, there could be more than one value. Ask these questions: For each instance of entity Member could there be more than one member number? More than one member name? More than one address? More than one membership date? More than one video tape rented? The answer to all questions except the last is no. Each member could have more than one tape rented at any time (club rules allow up to five). The TapeID, DateRented, DateDue, DateReturned, and FeePaid attributes, therefore, could be a list of values-a multi-valued attribute situation that we need to avoid. The solution is to model the rental of tapes through a relationship. We say that the multi-valued attribute implies a relationship with another entity.
In this situation we want to create a relationship called Rental which links members with the tapes they have rented. In more explicit terms, the relationship Rental connects instances of the Member entity with instances of the Tape entity. Instances of the Rental relationship represent times when a certain member has rented a certain tape. Each time the store rents a tape, we create another instance of the Rental relationship.
Since the relationship modeled by the relationship between Member and Tape tracks the tapes that members have rented, we can now remove the TapeID, DateRented, DateDue, DateReturned, and FeePaid attributes from the Member entity.
When creating a relationship, show it as a diamond connected with lines to the two entities that it relates. Include as attributes of the relationship facts about instances of the relationship. In our example, an instance of the Rental entity represents a video tape rental, that fact that a certain member rented a certain tape on a certain day, paid a certain price, and returned in on a certain day. Attributes that we need to keep track of include DateRented, DateDue, DateReturned, and FeePaid (the dollar amount charged for this rental). These attributes belong to the Rental relationship because they arise only when a member rents a tape.
Don't include other attributes such as the primary keys of the entities that the relationship connects. This seems like a reasonable thing to do, because when we rent a tape we need to know which tape was rented (TapeID) and which member rented it (MemberID). These facts, however, are automatically captured because the Rental relationship connects one instance of Member with one instance of Tape.
A relationship will have a cardinality (sometimes called functionality). The cardinality gives us additional information about the relationship between two entities. You can determine the cardinality by answering two questions: First, given a single instance of entity A, how many instances of entity B could it be related to? In our example, given a particular member, how many tapes could the member rent? The answer is an unlimited number of tapes, or many. You may remember that club rules limit members to five tapes checked out at a time, but we're considering the situation over a long period of time.
Now, consider the reverse question: Given a particular tape, how many members could it be rented by? Again, the answer is many. (Of course, a particular tape can be in the hands of just a single member at a time. Again, consider the situation over a long time frame.)
Because the answer to both questions is many, we say the cardinality is many-to-many or that we have a many-to-many relationship. On the entity-relationship diagram, write the letter M next to each entity. We can also write this cardinality as M:M.
Many-to-many relationships are not always the case. Consider for a moment the situation where club members might purchase tapes. The entities and relationships are very similar to the rental example, except that a tape can be purchased just a single time. A purchase of a tape might be represented by an relationship called Purchase instead of Rental. In this case, a member might purchase many tapes (an instance of Member is related to many instances of Tape). A single tape, however, can be sold just once (an instance of tape can be related to just a single instance of Member). We say we have a one-to-many relationship (one member to many tapes). On the entity-relationship diagram, place the number 1 next to the Member entity, and the letter M next to the Tape entity.
When specifying a one-to-many relationship, make sure that you are clear as to which side is the "one" side and which is the "many" side. From Member to Tape, the cardinality is one-to-many, written as 1:M. From Tape to Member, the cardinality is many-to-one or M:1.
After making sure there are no multi-valued attributes, we're ready to consider another rule of entity-relationship modeling:
Redundancy is one of the worst enemies of effective database design. Redundancy costs in several ways: If there is redundant data, then someone had to enter the same fact more than once, when once is sufficient. If someone made a typing mistake when entering one of the instances of a redundant fact, you may not know which version of the fact is correct. It is difficult to update redundant data, as you must find and change all occurrences of a fact you need to update. If you fail to update all occurrences you will have inconsistencies, which means that the same question returns two or more different answers. For example, if a bank stores a customer's address in two places and updates only one when the customer moves, the bank is storing two addresses for the customer. Which is correct? Finally, customers may be asked to supply data such as their address more than once.
Often, you'll easily see redundancy when you look at the entity-relationship diagrams. Other times, it is more difficult to see. One way to find redundancy is to consider labels that we might (manually) create for labeling the instances of an entity. Would we have to type the same data repeatedly when creating these labels? Suppose we create labels for the video tapes. On these labels, we'd type the TapeID, Title, Rating, and RentalCharge attributes. For a popular movie that the club owns many copies of, we'd have to type the Title, Rating, and RentalCharge attributes values many times, even though these attributes are the same for each copy of the tape.
Another way of looking at this is to consider how many times do we need to record that the title Forrest Gump is associated with (or is dependent on) TapeID 32435? We need to record this relationship just once. If the club, however, has more than one copy of Forrest Gump, this relationship or dependency will appear redundantly.
To remove redundancy from the design, create a new entity to hold the redundant attributes, and create a relationship between the original entity and the new one. For the tape club, we'll create a new entity called Catalog that has the attributes CatalogID, Title, Rating, and RentalCharge. This entity is the club's catalog of tapes available for rental. If a member wants to see a listing of all the tapes available for rental, this is the list to look at. It's like the (old-fashioned) card catalog in a library.
There is a relationship between Tape and Catalog called Describes, meaning that an occurrence of Catalog describes an occurrence of Tape, or that an occurrence of Tape is described by an occurrence of Catalog. The cardinality from Catalog to Tape is one-to-many, as one instance of Catalog could be related to many instances of Tape (one occurrence of the tape catalog could describe many copies of a movie). From Tape to Catalog the cardinality is many-to-one, as each occurrence of Tape is related to just one instance of Catalog (each video tape, no matter how many copies the club has, is described by a single occurrence of Catalog).
Precise definitions are important. If someone asks how many movies does our store have, where do we look to find the answer? Do we count instances of Tape or instances of Catalog? What exactly is a movie? An instance of Tape represents a videocassette that the store owns. An instance of Catalog represents a movie that has been produced. The store may have 100 copies of Forrest Gump available for rental. That translates to 100 instances of the Tape entity, but only one instance in Catalog.
So far, our entity-relationship design doesn't track the tape supplier information, which consists of the purchase date of the tape, the purchase price, the supplier's name, and the supplier's address. One question we need to answer is where to include this information. There are two candidate entities for holding this information, Tape and Catalog. The way to resolve this question is to ask what does the club actually purchase from the suppliers? Is it a description of a tape in a catalog, or is it a videocassette on the shelf? The answer, of course, is that the club buys video cassettes from the suppliers, so the supplier information goes in the Tape entity.
Another way to resolve this is to recognize that it's possible the club could purchase some copies of a movie from one supplier, and more copies from a different supplier.
Having decided that supplier information belongs in the Tape entity, we might add it, creating a revised entity like this:
We should examine the revised entity, considering the rules for entity design discussed earlier. When we do, we find that this entity contains redundant information. For example, if the club purchases five copies of a tape from a single supplier, someone will need to enter the supplier information five times-once for each instance of the tape. The club really wants to store this information just once.
The way to remove redundancy is to create a new entity to hold just a single instance of the redundant information. Here we create a new entity called Supplier to hold supplier information. We link the Tape and Supplier entities through a relationship called Purchase, which represents instances when the club has purchased tapes. The cardinality from Supplier to Tape is many-to-one, as each supplier can supply many tapes, but any particular tape comes from just one supplier.
The attributes of the Supplier entity include the suppliers' name and address. Remember, each entity should include a primary key-an attribute or combination of attributes that will be unique. The supplier name, strictly speaking, doesn't satisfy this requirement, as it is possible that two suppliers could have the same name. By including an SupplierID attribute, we can make it the primary key and ensure that there will be no duplication. Since the relationship takes care of tracking the supplier name and address, we can remove them from the Tape entity.
What about the Purchase Price and Purchase Date attributes? It is acceptable to leave these attributes as part of the Tape entity. These attributes, however, are more directly associated with the purchase of a tape. Since we have an entity specifically designed for holding information about tape purchases, it makes sense to make these attributes part of the Purchase entity.
Consider the Catalog entity, which has attributes for TapeID, Title, Rating, and RentalCharge, which at this point contains a dollar figure representing the rental fee for this tape. This situation is fine, but it does make it difficult to maintain or change the rental rates. For example, suppose the club wants to increase the rental charge for newly-released movies on weekends, or it may want to reduce the rental charge for movies that aren't so popular. With the present design, someone has to identify the newly-released or not-so-popular movies and change the rates. With a catalog of thousands of titles, this process could take a lot of time, and requires someone with the knowledge to correctly classify each title.
A solution to this problem is to modify the Catalog entity. Create a new entity called Charge which contains the attributes ChargeCode (which serves as the primary key) and Charge, which holds the rental fee. Create a relationship called IsCharged between Catalog and Charge. Remove the RentalCharge attribute from the Catalog entity, as this information is now captured by the IsCharged relationship. The cardinality from Charge to Catalog is one-to-many, as each instance of Charge will be related to many instances in Catalog (many movies will have the same charge code).
The benefit of this new entity and relationship is that once tapes are classified into categories such as new releases, popular favorites, not-so-popular oldies, and so forth, the club can adjust the rates for these categories by working with just a small number of instances of the Charge entity.
One of the goals of entity-relationship design is to create relations or tables. These are what a relational database management system works with. We commonly show relations as tables of rows and columns. Columns correspond to attributes, and rows represent instances of entities.
Converting an entity-relationship diagram to a series of relations or tables involves carefully following a series of steps.
Step One: Represent each entity as a table. Use the entity name as the table name, use attribute names as column names, and identify the attribute (or combination of attributes) that is the primary key by underlining.
Applying this step to our entity-relationship diagram, we create these relations:
Member(MemberID, Name, Address, DateJoined)
(This notation means there is a table named Member with columns as shown in parenthesis. Underlined column names represent the primary key.)
Supplier(SupplierID, SupplierName, SupplierAddress)
Catalog(CatalogID, Title, Rating)
Step Two: Represent each relationship as a relation or table. Use the name of the relationship as the name of the table. Include as columns of the table both the explicit and implicit attributes of the relationship.
By explicit attributes, we mean those that we included on the entity-relationship diagram. Not all relationships have explicit attributes. For example, the relationship Rental has explicit attributes of DateRented, DateDue, DateReturned, and FeePaid. The relationship also has implicit attributes which link instances of Rental to instances of the two entities it relates. These implicit attributes are the primary keys of the two related entities.
For the Rental relationship, these attributes are MemberID from the Member entity, and TapeID from the Tape entity.
For the Purchase relationship, the implicit attributes are TapeID from the Tape entity, and SupplierID from the Supplier entity.
For the Describes relationship, the implicit attributes are TapeID from Tape, and CatalogID from Catalog. This relationship doesn't have any explicit attributes.
For the IsCharged relationship between the Catalog and Charge entities, the implicit attributes are CatalogID from the Catalog entity and ChargeCode from the Charge entity.
By including both the implicit and explicit attributes, we create these tables:
Rental(MemberID, TapeID, DateRented, DateDue, DateReturned, FeePaid)
Purchase(TapeID, SupplierID, PurchasePrice, PurchaseDate)
Note that for the Rental table, we had to include the DateRented in the combination of columns making up the primary key. This is because a member who really likes a movie could rent the same copy of the same movie twice. (We're assuming the member wouldn't rent it twice on the same day! Is this assumption valid? If not, how could we come up with a primary key?)
For the Describes table, which has no explicit attributes, we include the primary keys of the two related entities. To form the primary key, follow this rule: If the relationship is one-to-many, use the primary key of the "many" entity as the primary key of the table. In this case, Tape is the "many" entity, so the primary key is the TapeID attribute from the Tape entity.
The Purchase relationship illustrates this rule again. There is a one-to-many relationship from Tape to Supplier, with Tape on the "many" side. Therefore, its primary key is the primary key of the Purchase entity. We must include the SupplierID attribute from the Supplier entity because of the rule that the primary keys are always included in these tables.
Step Three: Clean up or simplify the tables by seeking to combine tables following these rules. First, are there tables created in step two that have the same primary key as tables created in step one? Remember that a primary key represents a real thing-a tape, a member, a purchase. If there are two tables with the same primary key, we'd be storing information about the same thing in two different places, which we don't want to do.
In this example, the Purchase table has the same primary key as the Tape table (the TapeID attribute). We can combine these by moving SupplierID , PurchasePrice, and PurchaseDate from the Purchase relation to the Tape table. After this move, we can eliminate the Purchase table.
The Describes table has the same primary key as the Tape table, so move its non-primary key attributes (CatalogID) to the Tape table. This action eliminates the Describes table.
The IsCharged table has the same primary key as the Catalog table that it is related to. Move its non-key attribute (ChargeCode) to the Catalog table. This eliminates the IsCharged table.
After this elimination phase, we have these tables:
Member(MemberID, Name, Address, DateJoined)
Tape(TapeID, SupplierID, PurchaseDate, PurchasePrice, CatalogID)
Supplier(SupplierID, SupplierName, SupplierAddress)
Catalog(CatalogID, Title, Rating, ChargeCode)
Rental(MemberID, TapeID, DateRented, DateDue, DateReturned, RentalFee)
This is the final database design. At this point we could use turn to our DBMS software-Microsoft Access, Paradox, dBASE IV, Microsoft SQL Server, or whatever-and issue commands to create the database. If our DBMS supports the concept of a database (Paradox and dBASEIV don't), we start by creating a database. We might call it PineValley, for example.
After creating the database, we create tables. In this example, we have six tables to create: Member, Tape, Supplier, Catalog, Charge, and Rental. For each table, we create its columns.
When creating columns, we usually have to tell the DBMS the type of data the column will hold. Common column types include the following:
Text or alphanumeric: Can hold numeric digits, alphabetic letters, and punctuation. These columns often have a length associated with them. For example, a column to hold a member's name might be 25 characters long. For each member, then, the DBMS reserves 25 characters to hold the name. If we reserve too little space, we can't hold a member's entire name. If we reserve too much space, we waste it.
Numeric: Holds a numeric value. We can perform arithmetic and comparisons on values stored in numeric columns.
Date: Holds dates. We can perform date arithmetic and comparisons on dates stored in date columns.
Time: Holds times. Some DBMSs combine date and time values into a DateTime column.
BLOB: Binary Large Object columns, which hold binary information such as images or sounds. These columns can hold large amounts of data.
Many DBMSs support primary keys, meaning that as we create columns, we identify to the DBMS which column (or combination of columns) is the primary key. The DBMS then refuses to accept a new row if its primary key value duplicates a primary key value already found in the table.
After creating the database in the DBMS, we are ready to load it with data. We might type in the data from the keyboard, convert it from some other source, or an application might generate the data.