Simple Database Design

In my Computer Science II course I regularly go over the following problem as a "tutorial" to teach basic database design. I like this activity in particular because it provides numerous opportunities to talk about general design principles and "what-ifs".

Goal: Design a database to support a course roster system. The database design should be able to model students, courses, and their relation (ability of students to enroll in courses) to each other. The system will also need to email students about updates in enrollment, so be sure your model is able to incorporate this functionality.

In general, there are a few basic steps to designing a database:

  1. Identify which tables you need to model the problem; generally there will be one table per entity
  2. For each table, identify the fields (columns) that define that entity
  3. Identify the relationships between tables (one-to-many, many-to-many)

Repeat these steps as necessary.

For our database, clearly we're going to need tables for both students and courses. Let's start with students. First a few points on naming conventions:

  • Use UpperCamelCasing for table names
  • Use lowerCamelCasing for column names (other conventions exist, but consistency is important)
  • Use singular form for each table (Student instead of Students). English isn't very consistent with its pluralization rules (sometimes add "s", other times add "es", "y" to "i", sometimes entirely different words--index vs indices). You can avoid these problems by consistently using singular forms.

A good design principle is to ensure that each table has a primary key (PK) to uniquely identify records in the table. Some good design principles:

  • A common naming convention is to use tableName + Id. This eliminates any guesswork when trying to remember the primary key column. Further it disambiguates key names when you use them in queries later on (a common "bad" practice is to name them all "id" which can easily get confusing).
  • It is best to not allow null values for PKs (since at most one record could have a null value, its not very useful)
  • It is best to use integers--avoid varchars (as it is less efficient to do comparisons and character encoding/case sensitivity issues abound) and floats (floating point arithmetic issues)
  • Databases are good at key management--let them do their job and make your PK field auto incremented

With these principles in mind, here is our first stab at creating a Student table:

Let's expand on a few of our design choices here:

  • We have made the name fields (more than) large enough to accommodate most names. Further, we have decided that both the first name and last name are required (not null) but middle names are optional (some cultures do not have middle names)
  • We have explicitly defined the engine (InnoDB) and character encoding (latin1_general_cs, alternatively utf8mb4 for full unicode support). Depending on your database setup, its a good idea to explicitly specify these
  • We have decided to include an NUID (Nebraska University ID, an 8-digit with possible leading zeros used in the NU system) as a key (index) and made it unique (one NUID per student). Let's discuss this in detail.

In the real world there are many unique identifiers (Social Security Numbers, ISBNs, etc.) that are supposed to uniquely identify entities. A naive approach may have considered using the NUID as a primary key in our database. This is bad for several reasons. First, an NUID is a string (leading zeros are allowed) which should be avoided. More importantly though, we have no control over the NUID. A common scenario involve the following: registration and records for the university generates an NUID for a new student. Suppose that student then enrolls in our system; thus the NUID is being used all over the place in various tables (as foreign keys). Registration and records then calls up and informs us that the NUID was issued in error: the student already had another NUID issued previously. To make the appropriate change means we have to touch every record in every table that used the ID as part of the database design. Many other situations exist (SSNs are susceptible to fraud; data entry errors, etc.). By using a design as above we have isolated this external key to one column in one table. It remains accessible, but it doesn't impact the design of the rest of our database; this is sometimes called a surrogate key. In general, if your database doesn't have control over it, don't let it have control of your database.

Recall that our system needs to support multiple emails for students. Some naive solutions:

  • A single varchar field with a comma-delimited list of email addresses--this violates normalization and is essentially no different than using a flat file with all the problems that come with it.
  • Several fields (PrimaryEmail, SecondaryEmail, TertiaryEmail)--this means that any student could only have up to 3 emails which is not a true one-to-many relation. Further, some records may have less than three (even zero!) records meaning that we've "wasted" the columns.

Clearly the solution is to establish another table for email records and add a foreign key to reference a record in the student table.

Note the convention: the foreign key has the same name as the key it references. Furthermore, its type (int) must match the key type it references (yet another reason to prefer integers for primary keys). Furthermore, order is important here: since the Email table references the Student table, the Student table must be created first.

A course table is straightforward:

It is common to use aliases for various varchar fields (TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT), but should be avoided for portability (not all databases support these aliases and/or they could have different meanings).

We still need to model the actual enrollment of students into courses. Clearly this is a many-to-many relationship and we should thus create a join table to bring them together. Obviously we need to foreign keys referencing a student and a course. However, we also want to model when the student enrolled in the course. For this we can include a semester column; but what type should it be? We could make it a varchar to support representations like "Fall 2014" or "Spring 2015". However, such lexicographic representations are not well-ordered (when sorted alphanumerically, "Fall 2015" would come before "Spring 2014" even though they are out of order temporally). It is common to instead establish an encoding that uses integers (which are well-ordered). The drawback to this is that the encoding is not necessarily readable to end-users and should/must be converted appropriately. We'll go with the latter here.

There is still a problem though. There is nothing that prevents us from inserting multiple records for the same student, same course, in the same semester. Though a student could take the same course multiple times (say if they failed the first time), they can't take the same course in the same semester. This should be considered bad data and we should define a constraint to prevent it. Specifically we could add a uniqueness constraint on the combination of all three of these fields:

There are lots of other things we could add and model (offerings, instructors, etc.) and many more rules and constraints we could define.