Hash tables: the student database examples, part 2

The advantage of arrays

One of the main advantages of arrays is that they provide single-step access to their components providing we have the index of the component we wish to access. If we have an array a, and we know the object we wish to access is referenced from array cell a[i], no further work is needed to access it. Compare this with lists where it is necessary to trace down from the head to each of the elements until the desired element is found, or trees where such tracing down links is still required though on average a much fewer number of links need to be followed, particularly if the tree is kept balanced.

If we had a collection of data with the data items having a numeric key, and an array of data item references of a size equal to the highest value the numeric key could possibly be, we could use the array to make a very efficient table in terms of time. We could access any item from its key in one step. However, it may be inefficient in terms of space. Suppose we have a collection of 100 items, each of which has a key which falls in the range 1 to 1,000,000. Then we could use an array of size 1,000,000 to store the items, but that array would be 10,000 times the size it needs to be, and 999,900 of its cells would be wasted.

One solution wouild be to have, say an array of length 1000, and only to use the last three digits of the key to get the cell into which a given object is put when it is entered into the table. Then only 900 cells would be wasted. But what happens if there is one item with key, say, 334,572 and another with key 908,572. In this case, if both items had to be put into the array, both would be put into the cell indexed by 572. This is known as a collision.

Hash functions

Forgetting about the problem of collisions for the moment, our technique of storing a collection of data in an array, then having a way of getting from a key of a data item to an index to the array is known as "hashing", with the structure as a whole termed a hash table. The function which takes a key and returns an array index is known as a hash function. The hash function we used above was simply to take the key and to return the key value modular 1000, so:

int hash(int key)
{
 return key%1000;
}

A good hash function spreads the data around the array evenly, thus avoiding having large numbers of collisions. The function suggested above may not be a good one. Suppose, for example, we were dealing with a student database, and the key we decided to use was a student identity number generated by the university registry. Suppose the university used various factors to compose an identity number, but one of those was to make the last three digits of the identity number the digits from the UCAS code the student registered on (preceded by 0 for UCAS codes consisting of two letters then two numbers) . Then, if we were building a department database, we would find most cells unused, but large numbers of collisions, since students in a department tend to share a small range of UCAS codes. If we want to build a general database system where we are not sure exactly what sort of keys will be used, it would be best to keep clear of hash functions which run the possibility that under some circumstances there would be large numbers of collisions. If the number of cells in the data were a large prime number, for example, collisions caused by regular patterns amongst the number might be avoided. Suppose the hash function were:

int hash(int key)
{
 return key%997;
}

Then a student with identity number 385500 would result in a hash value 658, while a student with identity number 471500 would result in a hash value 916, whereas previously both would have had hash value 500.

However, in our examples we have used strings rather than integers as keys. We therefore need a hash function which takes a string and returns an integer that is within the range of the size of the array used in the hash table. One possibility is to treat the letters as if they were digits of a number in base 32 (chosen rather than 26 because each letter then maps directly onto a set of bits in the binary representation). So 'a' represents 0, 'b' represents 1, and so on up to 'z' representing 25. We shall ignore case differences, and treat any non-letter characters in the string as if they were not there. So "cat" converts to the number 2×32²+0×32+19 (i.e. 2067), while "X-ray" converts to the number 23×32³+16×32²+0×32+24 (i.e. 770072). This would then have to be taken to the modulus of the size of the array. An alternative is to use Horner's rule, where the above is calculated as ((23×32+16)×32+0)×32+24, and apply the modulo operator after each paranthesized expression. Here is the code to do that, where SIZE is the size of the array:

static int hash(String key)
{
 int val=0;
 key = key.toLowerCase();
 for(int i=0; i<key.length(); i++)
    if(Character.isLetter(key.charAt(i)))
       val=(val*32+key.charAt(i)-'a')%SIZE;
 return val;
}

Note the use of the built-in static method isLetter in Java's library class Character. This class serves as a wrapper class for the type char as well as holding a collection of useful methods for dealing with characters. Similarly, toLowerCase is a non-static method in class String. Also, here characters are added and subtracted as if they were integers, which can be done as they are integers, representing the characters using the 16-bit Unicode convention. If you put a character into an arithmetic expression, it will be treated as its Unicode equivalent (which will also be its number in the ASCII representation for all characters you are likely to use).

Here is our first attempt at an implementation of the interface Table as given previously:

import java.io.*;

class Table5 implements Table
{
// Table class, data stored in hash table
// Does not handle collisions, except to print warning!

 private KeyObject[] array;
 final static int SIZE=41;

 public Table5()
 {
  array = new KeyObject[SIZE];
 }

 public boolean delete(String key)
 {
  int hash = hash(key);
  if(array[hash]==null)
     return false;
  else
     {
      array[hash]=null;
      return true;
     }
 }

 public void add(KeyObject obj)
 {
  int hash = hash(obj.getKey());
  if(array[hash]!=null)
     System.out.println("COLLISION with "+obj.getKey()+
                        " and "+array[hash].getKey());
  array[hash]=obj;
 }

 public KeyObject retrieve(String key)
 {
  return array[hash(key)];
 }

 public void print(PrintStream out)
 {
  for(int i=0; i<SIZE; i++)
     if(array[i]!=null)
        out.println(array[i]);
 }

 private static int hash(String key)
 {
  int val=0;
  key = key.toLowerCase();
  for(int i=0; i<key.length(); i++)
     if(Character.isLetter(key.charAt(i)))
        val=(val*32+key.charAt(i)-'a')%SIZE;
  return val;
 }
}

Here there has been no attempt to deal with collisions. Rather, if when some new item is added to the table the hash function maps to an array cell that already holds an item, the old item is replaced by the new one, and hence lost. To demonstrate this happens, a message is printed. The size of the array has been set to an unrealistically small value so that collisions can be easily demonstrated. For example, the surnames "Wilson", "Peters" and "Henry" all map to the same cell using the hash function and size of array given. An array cell set to null indicates no data is stored there. Initially, when the table is newly created and empty, all array cells are set to null. This does not have to be done explicitly, since when a new array object is created whose component type is not a primitive type, all its cells are set automatically to null. So the constructor:

 public Table5()
 {
  array = new KeyObject[SIZE];
 }

does this. An array cell is set to null when the item whose key maps to the index of the cell using the hash function is removed from the table.

Resolving collisions using linear probing

Obviously we cannot in any realistic database system have a situation where adding one item to the database can cause another unrelated item to be written over and in effect removed. There are various ways of dealing with collisions. A simple way is to put the data item in the next available (i.e. set to null) cell in the array. So, start with the index given by the hash function on the key of the item. If the cell that gives is not set to null increase the index by 1 (or to 0 if it becomes equal to the size of the array), until a cell set to null is found. To retrieve a data item given its key, look at the item in the cell indexed by the hash function applied to the key. If it is null, the item is not in the database, if the item in that position has an identical key, we have found the item required and return a reference to it. Otherwise, we increase the index by 1 (again, going to 0 if the end of the array is reached) until we have either found at item with an identical key to that being searched for, or a cell set to null. This is known as linear probing.

There is one problem with what has been described above. Suppose, as in our example previously, the names "Wilson", "Peters" and "Henry", used as keys, all map to the same number k using the hash function. Now suppose we put in first the record for Wilson, that will go in cell array[k]. Next we put in the record for Peters, we look at cell a[k], but that is occupied, so we put it in cell a[k+1] instead. Finally we put in the record for Henry, but as we find both a[k] and a[k+1] occupied, we put it in cell a[k+2]. Note that if cell a[k+2] were already occupied by the record for someone whose name maps onto k+2 using the hash function, then the record for Henry would have to go into array[k+3] or even further depending on what further data was stored.

Now, after the above, suppose we delete the record for Peters. Then suppose we want to retrieve the record for Henry. We would first look at cell array[k] but find that occupied by the record for Wilson. We would then look at cell array[k+1] and finding taht set to null after the deletion of the record for Peters, conclude that there is no record for Henry.

To avoid this, if a record is deleted, we set the cell where it was not to null, which now indicates a cell that was never used, but to a special value indicating a cell from which something has been deleted. Then, if we are trying to retrieve a record given a key, we look first at the cell its key maps to using the hash function, and then at further cells, moving on to the next either if the cell is occupied with a record of some other name, or it is set to the special "deleted" value. So in the above scenario, when we look at cell array[k+1] we find it set to the special "deleted" value adn so go on to look at cell array[k+2].

Here is some code which implements this:

import java.io.*;

class Table6 implements Table
{
// Table class, data stored in hash table
// Handles collisions with linear probing

 private KeyObject[] array;
 final static int SIZE=41;

 public Table6()
 {
  array = new KeyObject[SIZE];
 }

 public boolean delete(String key)
 {
  int i = hash(key);
  while(array[i]!=null&&!array[i].equalKey(key))
     i=(i+1)%SIZE;
  if(array[i]==null)
     return false;
  else
     {
      array[i] = new Deleted();
      return true;
     }
 }

 public void add(KeyObject obj)
 {
  int i = hash(obj.getKey());
  while(array[i]!=null && !(array[i] instanceof Deleted))
     i=(i+1)%SIZE;
  array[i]=obj;
 }

 public KeyObject retrieve(String key)
 {
  int i = hash(key);
  while(array[i]!=null&&!array[i].equalKey(key))
     i=(i+1)%SIZE;
  return array[i];
 }

 public void print(PrintStream out)
 {
  for(int i=0; i<SIZE; i++)
     if(array[i]!=null&&!(array[i] instanceof Deleted))
        out.println(array[i]);
 }

 private static int hash(String key)
 {
  int val=0;
  key = key.toLowerCase();
  for(int i=0; i<key.length(); i++)
     if(Character.isLetter(key.charAt(i)))
        val=(val*32+key.charAt(i)-'a')%SIZE;
  return val;
 }

 private static class Deleted extends KeyObject
 {
  public String getKey()
  {
   return "";
  }
 }
}

We have an array cell set to null to indicate it has never been used, but how are we to represent the "special value" indicating "used for something now deleted"? We cannot use null, but as this is an array of KeyObjects, we must use a value which is a KeyObject. We resolve this by having a special subclass of KeyObject called Deleted which is only used internally in the code for tables, so is given as a nested class. We can test whether an array cell has been set to this special value by using instanceof.

Separate Chaining

Another way of deling with collisions, called separate chaining is to have each cell in the array refer not to a data item, but to a linked list of data items. If we want to find an item given a key, we go to the cell indexed by applying the hash function to the key, then move down the linked list of items there, until either null is reached (in which case the item is not in the database) or the item is found. A good hash function will ensure collisions are rare, and so the linked lists in a hash table which uses separate chaining will mostly be of length one (or zero, i.e. null for a cell where no data added maps to), occasionally two and only very rarely more than that. Manipulation of linked lists to cut out list cells, as we considered before, is still needed to implement deletion from the table. Simply because a condition is rare doesn't mean we shouldn't program for it. When writing our code we should assume the linked lists could be of any length. In this case it means the code would cope even if, for example, the table were used to store much more data than originally envisaged. In the case of our linear probing implementation, the table could never store more data items than the size of the array, and an attempt to do so would cause program execution to hang as it cycles forever looking for a non-existent free cell in the array.

Here is the code for tables implemented using separate chaining:

import java.io.*;

class Table7 implements Table
{
// Table class, data stored in hash table
// Handles collisions with separate chaining

 private Cell[] array;
 final static int SIZE=41;

 public Table7()
 {
  array = new Cell[SIZE];
 }

 public boolean delete(String key)
 {
  int hash = hash(key);
  if(array[hash]==null)
     return false;
  else if(array[hash].first.equalKey(key))
     {
      array[hash]=array[hash].next;
      return true;
     }
  else
     {
      Cell ptr;
      for(ptr=array[hash];
          ptr!=null&&!(ptr.next.first.equalKey(key));
          ptr=ptr.next);
      if(ptr==null)
         return false;
      else
         {
          ptr.next = ptr.next.next;
          return true;
         }
     }
     
 }

 public void add(KeyObject obj)
 {
  int hash = hash(obj.getKey());
  array[hash] = new Cell(obj,array[hash]);
 }

 public KeyObject retrieve(String key)
 {
  int hash = hash(key);
  if(array[hash]==null)
     return null;
  else
     {
      Cell ptr;
      for(ptr=array[hash]; 
          ptr!=null&&!ptr.first.equalKey(key);
          ptr=ptr.next);
      if(ptr==null)
         return null;
      else
         return ptr.first;
     }
 }

 public void print(PrintStream out)
 {
  for(int i=0; i<SIZE; i++)
     for(Cell ptr=array[i]; ptr!=null; ptr=ptr.next)
        out.println(ptr.first);
 }

 private static int hash(String key)
 {
  int val=0;
  key = key.toLowerCase();
  for(int i=0; i<key.length(); i++)
     if(Character.isLetter(key.charAt(i)))
        val=(val*32+key.charAt(i)-'a')%SIZE;
  return val;
 }

 private static class Cell
 {
  KeyObject first;
  Cell next;

  Cell(KeyObject f, Cell n)
  {
   first=f;
   next=n;
  } 
 }
}

Matthew Huntbach

Last modified: 19th October 2001