The Bad, Bad, Bad, Bad Question

I have taught a second semester course on discrete mathematics for several years. Early on, I would ask the following standard question on assignments when we got to asymptotic analysis.

Show that

1^k + 2^k + \cdots + n^k \in O(n^{k+1})

One of the prerequisites for this course is Calc I, and so we cover big-O, big-theta, and big-omega with proof techniques based on the definitions (showing inequalities) as well as the limit method. The problem is that students would immediately try to apply the limit method (since it is simpler and easier to use than the definition) to this question and produce a proof something like the following.

Let f(n) = 1^k + 2^k + \cdots + n^k and g(n) = n^{k+1}. Then we have that

\begin{align*}\lim_{n\rightarrow\infty} \frac{f(n)}{g(n)} = &  \lim_{n\rightarrow\infty} \frac{1^k + 2^k + 3^k + \cdots + n^k}{n^{k+1}} \\ = & \lim_{n\rightarrow\infty} \left[ \frac{1^k }{n^{k+1}} + \frac{2^k}{n^{k+1}} + \frac{3^k}{n^{k+1}} + \cdots + \frac{n^k}{n^{k+1}} \right]\\ = & \lim_{n\rightarrow\infty} \frac{1^k }{n^{k+1}} + \lim_{n\rightarrow\infty}\frac{2^k}{n^{k+1}} + \cdots + \lim_{n\rightarrow\infty} \frac{n^k}{n^{k+1}}\\ = & \, 0 + 0 + 0 + \cdots + 0 \\ = & \, 0  \end{align*}

Therefore, f(n) \in O(g(n)).

Which of course is completely wrong. First, the result is incorrect. The limit method actually shows that f(n) \in o(g(n)) (that is, little-o), which implies big-O but precludes big-theta. However, indeed f(n) \in \Theta(g(n)). The error in the supposed proof is an invalid calculus operation. Though the limit of a sum is equal to the sum of its limits, in this case the number of terms is not constant. The number of terms grows as n \rightarrow \infty so we cannot simply split the limits.

After a few semesters of continually getting these wrong answers, I changed the question. I gave it in two parts: the first part I gave them the typically incorrect proof and told them it was incorrect and asked them to identify why the proof was wrong. The second part asked them to prove the original inclusion using the definition-based proof. Most students were able to do the second part, but still far too many didn't understand why the limit proof was wrong.

Yet again, I changed the question. The first part changed so that I asked them to work through a limit proof for k = 1 in which case the summation simplifies to

\frac{n(n+1)}{2}


which is the well-known Gauss's formula and which is something that we go over in class several times when analyzing standard algorithms.

The intention was that when they go through the proof with the closed form using Gauss's formula, they'll get that f(n) \in \Theta(g(n)) which is incongruous with the original proof which shows that it is little-o.

Despite even this tweak, most students still do not get this question correct or struggle greatly with it. Some ignore the fact that we go over Gauss's formula multiple times; some lack the calculus rigor to see the problem, some are still struggling with trying to grasp the concept of asymptotic analysis.

I had always thought that presenting this question provided value in that it gave students an opportunity to avoid a common mistake and at the same get a deeper understanding by examining WHY the proof is wrong. I saw it as a sort of companion to the "bad" proof typically presented to students that 1 = 0 (by hiding an invalid division by zero operation). However, I've slowly come to think that the question is far more trouble than it is worth. The only thing it succeeds in doing is getting every student to show up for office hours where I end up explaining the entire thing dozens of times. I could achieve the same thing by presenting it during lecture. Or, I could just get rid of the question altogether.

Gomer Questions

Developing good assignment and test questions can be one of the most difficult, challenging, and fun things to do when teaching a class. Most of my exams are open-book, open-note, and even open-computer. I don't like the type of questions that reward rote memorization or plug-and-chug type of work. I tend to prefer the type of question that has a very simple answer but you can only arrive at it by completely understanding the topic or thinking about it for a bit.

One particular type of question that I like is what I refer to as "Gomer" questions. In many areas of Computer Science, scenarios are often motivated by communication or interaction between Alice and Bob (A and B) sometimes with a malicious actor, Oscar. Gomer, however, is not malicious just incompetent if not well-meaning. I use "Gomer" as a reference to Gomer Pyle, the simpleminded auto mechanic played by Jim Nabors. Gomer has become a cultural stand-in for a simpleminded person and avoids the potential for offending students by choosing a more common name (nevertheless, the Oscars of the world will continue to be maligned as evil adversaries).

Gomer questions usually have the premise that Gomer has found a purported solution to a problem or a change or optimization to an existing algorithm or process, but he is inevitably wrong. The question then asks the student to consider Gomer's solution and find and explain the flaw. Gomer questions are similar to mathematical questions that ask the student to find a counterexample to some statement to demonstrate that the statement is false or flawed.

Here are two examples that I've used in some of my courses (solutions are left as an exercise).

From a Data Structures & Algorithms course:

Gomer has studied Strassen's matrix multiplication and thinks he has come up with a brilliant new breakthrough. He claims to have come up with a way to split an n x n matrix into 9 parts each 1/3 the original size and which only needs 6 recursive matrix multiplication operations. Is Gomer's improvement likely to be correct?

From a Cryptography course:

Gomer has written a secure, anonymous twitter app using RSA. His application uses a 1024-bit public RSA key. It works by taking a user's intended tweet (limited to 140 bytes), treating it as a single integer (taking the byte representation) and encrypting it using the 1024-bit RSA key. The message is then securely transmitted to Gomer's application server, decrypted, and posted to an anonymous twitter account. After the application goes live, it works great for the majority of posts, but sometimes decrypted messages are garbled. Identify the flaw in Gomer's application.

Gomer questions are great because they force a student to think about a problem indirectly by working through why a proposed solution doesn't work. Rather than forcing a student to come up with an original solution, it encourages them to break down an existing one, hopefully giving some insight into the nature of the problem.

Gomer questions are also good because it doesn't leave a student wondering if they are right or wrong about a solution. They are given a solution that they know is wrong and simply have to find the flaw. In fact, many Gomer questions can be developed by adapting wrong answers from students in previous semesters. This is nice because if a wrong answer tends to be common, you can prevent a large number of students from losing a lot of points while still getting the essential concept through to them.

Gomer questions are a nice departure from the usual type of questions because they make you think not about what a problem is, but instead, about what it isn't.

Simple Database Design

In my Computer Science II course I regularly go over the following problem as a "tutorial" to teach basic database design. I like this activity in particular because it provides numerous opportunities to talk about general design principles and "what-ifs".

Goal: Design a database to support a course roster system. The database design should be able to model students, courses, and their relation (ability of students to enroll in courses) to each other. The system will also need to email students about updates in enrollment, so be sure your model is able to incorporate this functionality.

In general, there are a few basic steps to designing a database:

  1. Identify which tables you need to model the problem; generally there will be one table per entity
  2. For each table, identify the fields (columns) that define that entity
  3. Identify the relationships between tables (one-to-many, many-to-many)

Repeat these steps as necessary.

For our database, clearly we're going to need tables for both students and courses. Let's start with students. First a few points on naming conventions:

  • Use UpperCamelCasing for table names
  • Use lowerCamelCasing for column names (other conventions exist, but consistency is important)
  • Use singular form for each table (Student instead of Students). English isn't very consistent with its pluralization rules (sometimes add "s", other times add "es", "y" to "i", sometimes entirely different words--index vs indices). You can avoid these problems by consistently using singular forms.

A good design principle is to ensure that each table has a primary key (PK) to uniquely identify records in the table. Some good design principles:

  • A common naming convention is to use tableName + Id. This eliminates any guesswork when trying to remember the primary key column. Further it disambiguates key names when you use them in queries later on (a common "bad" practice is to name them all "id" which can easily get confusing).
  • It is best to not allow null values for PKs (since at most one record could have a null value, its not very useful)
  • It is best to use integers--avoid varchars (as it is less efficient to do comparisons and character encoding/case sensitivity issues abound) and floats (floating point arithmetic issues)
  • Databases are good at key management--let them do their job and make your PK field auto incremented

With these principles in mind, here is our first stab at creating a Student table:

Let's expand on a few of our design choices here:

  • We have made the name fields (more than) large enough to accommodate most names. Further, we have decided that both the first name and last name are required (not null) but middle names are optional (some cultures do not have middle names)
  • We have explicitly defined the engine (InnoDB) and character encoding (latin1_general_cs, alternatively utf8mb4 for full unicode support). Depending on your database setup, its a good idea to explicitly specify these
  • We have decided to include an NUID (Nebraska University ID, an 8-digit with possible leading zeros used in the NU system) as a key (index) and made it unique (one NUID per student). Let's discuss this in detail.

In the real world there are many unique identifiers (Social Security Numbers, ISBNs, etc.) that are supposed to uniquely identify entities. A naive approach may have considered using the NUID as a primary key in our database. This is bad for several reasons. First, an NUID is a string (leading zeros are allowed) which should be avoided. More importantly though, we have no control over the NUID. A common scenario involve the following: registration and records for the university generates an NUID for a new student. Suppose that student then enrolls in our system; thus the NUID is being used all over the place in various tables (as foreign keys). Registration and records then calls up and informs us that the NUID was issued in error: the student already had another NUID issued previously. To make the appropriate change means we have to touch every record in every table that used the ID as part of the database design. Many other situations exist (SSNs are susceptible to fraud; data entry errors, etc.). By using a design as above we have isolated this external key to one column in one table. It remains accessible, but it doesn't impact the design of the rest of our database; this is sometimes called a surrogate key. In general, if your database doesn't have control over it, don't let it have control of your database.

Recall that our system needs to support multiple emails for students. Some naive solutions:

  • A single varchar field with a comma-delimited list of email addresses--this violates normalization and is essentially no different than using a flat file with all the problems that come with it.
  • Several fields (PrimaryEmail, SecondaryEmail, TertiaryEmail)--this means that any student could only have up to 3 emails which is not a true one-to-many relation. Further, some records may have less than three (even zero!) records meaning that we've "wasted" the columns.

Clearly the solution is to establish another table for email records and add a foreign key to reference a record in the student table.

Note the convention: the foreign key has the same name as the key it references. Furthermore, its type (int) must match the key type it references (yet another reason to prefer integers for primary keys). Furthermore, order is important here: since the Email table references the Student table, the Student table must be created first.

A course table is straightforward:

It is common to use aliases for various varchar fields (TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT), but should be avoided for portability (not all databases support these aliases and/or they could have different meanings).

We still need to model the actual enrollment of students into courses. Clearly this is a many-to-many relationship and we should thus create a join table to bring them together. Obviously we need to foreign keys referencing a student and a course. However, we also want to model when the student enrolled in the course. For this we can include a semester column; but what type should it be? We could make it a varchar to support representations like "Fall 2014" or "Spring 2015". However, such lexicographic representations are not well-ordered (when sorted alphanumerically, "Fall 2015" would come before "Spring 2014" even though they are out of order temporally). It is common to instead establish an encoding that uses integers (which are well-ordered). The drawback to this is that the encoding is not necessarily readable to end-users and should/must be converted appropriately. We'll go with the latter here.

There is still a problem though. There is nothing that prevents us from inserting multiple records for the same student, same course, in the same semester. Though a student could take the same course multiple times (say if they failed the first time), they can't take the same course in the same semester. This should be considered bad data and we should define a constraint to prevent it. Specifically we could add a uniqueness constraint on the combination of all three of these fields:

There are lots of other things we could add and model (offerings, instructors, etc.) and many more rules and constraints we could define.

Artificial Sorting and Closures in PHP

Sorting is a common, necessary task in programming. As everyone should know, the best way to sort is to use a built-in sorting function in whatever language or framework you're using. Rarely do you want to (nor should you) roll your own sorting algorithm/function. Usually sorting functions are used in conjunction with a comparator function to specify an ordering for the sorting routine.

To review: a comparator function is a function that takes two (generic) objects a and b and is responsible for comparing them. The function is expected to return an integer: a negative integer if the objects are in-order ( a < b ), zero if they are equal, and a positive integer if they are out of order ( a > b). This allows the sorting function to sort without having to worry about what it is sorting.

As a simple example, consider sorting an array of integers in non-increasing order (the reverse order from the default ordering of the sort() function). One could this as follows.

Now suppose we want to sort an array of objects whose ordering is more complex. We don't have too much freedom in how to design the comparator function: it must take two (mixed) arguments and return an integer. Any deviation from this design may have undefined consequences. However, suppose that the ordering relies on some other pieces of data in order to do the comparison. As an example, consider the problem of sorting an array of Student objects based on the students year (Freshman, Sophomore, Junior, Senior). Suppose that we define the ordering by setting up an array:

The ordering of these elements is "unnatural" (a "natural" ordering would order the elements in lexicographic ordering: Freshman, Junior, Senior, Sophomore). But its the ordering that we would expect if we sorted an array of students by their class year. Of course we could always define a comparator with lots of checks as follows.

As you can see, the logic gets pretty complex if we're going to hardcode everything. Moreover, suppose that we expand the list of possible values for class year to allow "Graduate" students, "Pre-Med" students, "Pre-Enrollment" students, etc. Each possibility leads to a complete redesign of the comparator function, increasing its complexity, increasing the potential for bugs, etc. Its not maintainable at all.

Instead, what we want is to use the array defined above in our comparator function. We could instead do the following.

This is much cleaner and much easier to extend: all we need to do to add values or change the ordering is modify the $classYear array. Simple, right?

The problem is that the cmpStudentByYear() function doesn't know anything about the $classYear array. Its not a local variable (indeed, we wouldn't want to recreate it every time we call the function), nor is it a parameter (passing it would mean we violate the comparator function signature). Any attempt to access $classYear will likely result in a warning and a null result or even a fatal error.

Note: there is another problem here if the student's class year is invalid and not in the $classYear array. In that case, array_search() returns the bizarre choice of false (a boolean) rather than (say) an invalid index (such as -1).

One solution would be to refer to $classYear as a global variable.

Adding the first line places the array within the scope of the comparator. This is not ideal as, in general, globals are evil.

Instead, let's define a closure. A closure is a function that has its own "enclosure" (that is, its very own scope). A function always knows about its parameters and local variables. However, a function sometimes needs to know about other, non-local variables (also called "free" variables). We can create a closure to "close" the function and these variables into one package that is (mostly) protected from outside modifications (and protects the non-local variables from the function itself).

Sometimes we need to do this (as in the example we've looked at). Basically, the benefit is that a function may have access to variables that are not in the same scope as the scope in which the function is invoked. In our particular example, the $classYear is accessed by the comparator function, but the comparator function is invoked within the usort() function, which has a completely different scope!

In PHP (starting with version 5.3.0), closures can be achieved using the keyword use. However, closures seem to only be definable with respect to anonymous functions (indeed, the official PHP documentation seems to conflate closures and anonymous functions, http://php.net/manual/en/functions.anonymous.php).

This example creates an anonymous function and stores a reference to it in the variable $myComparator. Moreover, the use keyword "closes" the function with a completely new copy of $classYear that is only accessible by the function itself (this is called "early binding": the copy is created upon definition of the function). The function's copy of $classYear is protected from outside changes and any changes to its copy have no effect on the original $classYear. Neat.

Now we can finally call usort() and pass our closure comparator.