a
,
and we know the object we wish to access is referenced from array
cell a[i]
, no further work is needed to access it.
Compare this with lists where it is necessary to trace down
from the head to each of the elements until the desired element is
found, or trees where such tracing down links is still required
though on average a much fewer number of links need to be followed,
particularly if the tree is kept balanced.
If we had a collection of data with the data items having a numeric key, and an array of data item references of a size equal to the highest value the numeric key could possibly be, we could use the array to make a very efficient table in terms of time. We could access any item from its key in one step. However, it may be inefficient in terms of space. Suppose we have a collection of 100 items, each of which has a key which falls in the range 1 to 1,000,000. Then we could use an array of size 1,000,000 to store the items, but that array would be 10,000 times the size it needs to be, and 999,900 of its cells would be wasted.
One solution wouild be to have, say an array of length 1000, and only to use the last three digits of the key to get the cell into which a given object is put when it is entered into the table. Then only 900 cells would be wasted. But what happens if there is one item with key, say, 334,572 and another with key 908,572. In this case, if both items had to be put into the array, both would be put into the cell indexed by 572. This is known as a collision.
int hash(int key) { return key%1000; }
A good hash function spreads the data around the array evenly, thus avoiding having large numbers of collisions. The function suggested above may not be a good one. Suppose, for example, we were dealing with a student database, and the key we decided to use was a student identity number generated by the university registry. Suppose the university used various factors to compose an identity number, but one of those was to make the last three digits of the identity number the digits from the UCAS code the student registered on (preceded by 0 for UCAS codes consisting of two letters then two numbers) . Then, if we were building a department database, we would find most cells unused, but large numbers of collisions, since students in a department tend to share a small range of UCAS codes. If we want to build a general database system where we are not sure exactly what sort of keys will be used, it would be best to keep clear of hash functions which run the possibility that under some circumstances there would be large numbers of collisions. If the number of cells in the data were a large prime number, for example, collisions caused by regular patterns amongst the number might be avoided. Suppose the hash function were:
int hash(int key) { return key%997; }Then a student with identity number 385500 would result in a hash value 658, while a student with identity number 471500 would result in a hash value 916, whereas previously both would have had hash value 500.
However, in our examples we have used strings rather than integers as
keys. We therefore need a hash function which takes a string and
returns an integer that is within the range of the size of the array
used in the hash table. One possibility is to treat the letters as
if they were digits of a number in base 32 (chosen rather than 26 because
each letter then maps directly onto a set of bits in the binary
representation). So 'a'
represents 0, 'b'
represents 1, and so on up to 'z'
representing 25. We
shall ignore case differences, and treat any non-letter characters in
the string as if they were not there. So "cat"
converts
to the number 2×322+0×32+19 (i.e. 2067), while
"X-ray" converts to the number 23×323+16×322+0×32+24
(i.e. 770072).
This would then have to be taken to the modulus of the size of the array.
An alternative is to use Horner's rule, where the above is calculated
as ((23×32+16)×32+0)×32+24, and apply the modulo operator
after each paranthesized expression. Here is the code to do that, where
SIZE
is the size of the array:
static int hash(String key) { int val=0; key = key.toLowerCase(); for(int i=0; i<key.length(); i++) if(Character.isLetter(key.charAt(i))) val=(val*32+key.charAt(i)-'a')%SIZE; return val; }Note the use of the built-in static method
isLetter
in
Java's library class
Character
.
This class serves as a wrapper class for the type char
as well as holding a collection of useful methods for dealing with
characters. Similarly, toLowerCase
is a non-static method
in class
String
.
Also, here characters are added and subtracted as if they were integers,
which can be done as they are integers, representing the characters
using the 16-bit Unicode convention. If you put a character into an
arithmetic expression, it will be treated as its Unicode equivalent
(which will also be its number in the
ASCII
representation for all characters you are likely to use).
Here is our first attempt at an implementation of the interface
Table
as given previously:
import java.io.*; class Table5 implements Table { // Table class, data stored in hash table // Does not handle collisions, except to print warning! private KeyObject[] array; final static int SIZE=41; public Table5() { array = new KeyObject[SIZE]; } public boolean delete(String key) { int hash = hash(key); if(array[hash]==null) return false; else { array[hash]=null; return true; } } public void add(KeyObject obj) { int hash = hash(obj.getKey()); if(array[hash]!=null) System.out.println("COLLISION with "+obj.getKey()+ " and "+array[hash].getKey()); array[hash]=obj; } public KeyObject retrieve(String key) { return array[hash(key)]; } public void print(PrintStream out) { for(int i=0; i<SIZE; i++) if(array[i]!=null) out.println(array[i]); } private static int hash(String key) { int val=0; key = key.toLowerCase(); for(int i=0; i<key.length(); i++) if(Character.isLetter(key.charAt(i))) val=(val*32+key.charAt(i)-'a')%SIZE; return val; } }
Here there has been no attempt to deal with collisions. Rather, if when
some new item is added to the table the hash function maps to an array cell
that already holds an item, the old item is replaced by the new one, and
hence lost. To demonstrate this happens, a message is printed. The size
of the array has been set to an unrealistically small value so that
collisions can be easily demonstrated. For example, the surnames
"Wilson", "Peters" and "Henry" all map to the same cell using the
hash function and size of array given. An array cell set to null
indicates no data is stored there. Initially, when the table is newly
created and empty, all array cells are set to null
. This
does not have to be done explicitly, since when a new array object is created
whose component type is not a primitive type, all its cells are set
automatically to null
. So the constructor:
public Table5() { array = new KeyObject[SIZE]; }does this. An array cell is set to
null
when the item whose
key maps to the index of the cell using the hash function is removed from
the table.
null
) cell in the array.
So, start with the index given by the hash function on the key of the item.
If the cell that gives is not set to null
increase the
index by 1 (or to 0 if it becomes equal to the size of the array), until
a cell set to null
is found. To retrieve a data item given
its key, look at the item in the cell indexed by the hash function applied
to the key. If it is null
, the item is not in the database,
if the item in that position has an identical key, we have found the item
required and return a reference to it. Otherwise, we increase the index
by 1 (again, going to 0 if the end of the array is reached) until we have
either found at item with an identical key to that being searched for,
or a cell set to null
. This is known as
linear probing.
There is one problem with what has been described above. Suppose, as in
our example previously, the names "Wilson", "Peters" and "Henry", used
as keys, all map to the same number k
using the hash function.
Now suppose we put in first the record for Wilson, that will go in cell
array[k]
. Next we put in the record for Peters, we look
at cell a[k]
, but that is occupied, so we put it in cell
a[k+1]
instead. Finally we put in the record for Henry,
but as we find both a[k]
and a[k+1]
occupied, we put it in cell a[k+2]
. Note that if cell
a[k+2]
were already occupied by the record for someone whose
name maps onto k+2
using the hash function, then the
record for Henry would have to go into array[k+3]
or
even further depending on what further data was stored.
Now, after the above, suppose we delete the record for Peters. Then
suppose we want to retrieve the record for Henry. We would first look
at cell array[k]
but find that occupied by the record for
Wilson. We would then look at cell array[k+1]
and finding
taht set to null
after the deletion of the record for
Peters, conclude that there is no record for Henry.
To avoid this, if a record is deleted, we set the cell where it was
not to null
, which now indicates a cell that was never
used, but to a special value indicating a cell from which something has
been deleted. Then, if we are trying to retrieve a record given a key,
we look first at the cell its key maps to using the hash function, and then
at further cells, moving on to the next either if the cell is occupied
with a record of some other name, or it is set to the special "deleted"
value. So in the above scenario, when we look at cell array[k+1]
we find it set to the special "deleted" value adn so go on to look at
cell array[k+2]
.
Here is some code which implements this:
import java.io.*; class Table6 implements Table { // Table class, data stored in hash table // Handles collisions with linear probing private KeyObject[] array; final static int SIZE=41; public Table6() { array = new KeyObject[SIZE]; } public boolean delete(String key) { int i = hash(key); while(array[i]!=null&&!array[i].equalKey(key)) i=(i+1)%SIZE; if(array[i]==null) return false; else { array[i] = new Deleted(); return true; } } public void add(KeyObject obj) { int i = hash(obj.getKey()); while(array[i]!=null && !(array[i] instanceof Deleted)) i=(i+1)%SIZE; array[i]=obj; } public KeyObject retrieve(String key) { int i = hash(key); while(array[i]!=null&&!array[i].equalKey(key)) i=(i+1)%SIZE; return array[i]; } public void print(PrintStream out) { for(int i=0; i<SIZE; i++) if(array[i]!=null&&!(array[i] instanceof Deleted)) out.println(array[i]); } private static int hash(String key) { int val=0; key = key.toLowerCase(); for(int i=0; i<key.length(); i++) if(Character.isLetter(key.charAt(i))) val=(val*32+key.charAt(i)-'a')%SIZE; return val; } private static class Deleted extends KeyObject { public String getKey() { return ""; } } }We have an array cell set to
null
to indicate
it has never been used, but how are we to represent the "special value"
indicating "used for something now deleted"? We cannot use null
,
but as this is an array of KeyObject
s, we must use a value
which is a KeyObject
. We resolve this by having a special
subclass of KeyObject
called Deleted
which
is only used internally in the code for tables, so is given as a nested
class. We can test whether an array cell has been set to this special
value by using instanceof
.
null
is reached (in which case the
item is not in the database) or the item is found. A good hash function
will ensure collisions are rare, and so the linked lists in a hash table
which uses separate chaining will mostly be of length one (or zero, i.e.
null
for a cell where no data added maps to), occasionally two
and only very rarely more than that. Manipulation of linked lists to
cut out list cells, as we considered before,
is still needed to implement deletion from the table. Simply because a
condition is rare doesn't mean we shouldn't program for it. When writing
our code we should assume the linked lists could be of any length.
In this case it means the code would cope even if, for example, the table
were used to store much more data than originally envisaged. In the
case of our linear probing implementation, the table could never store
more data items than the size of the array, and an attempt to do so would
cause program execution to
hang
as it cycles forever looking for a non-existent free cell in the array.
Here is the code for tables implemented using separate chaining:
import java.io.*; class Table7 implements Table { // Table class, data stored in hash table // Handles collisions with separate chaining private Cell[] array; final static int SIZE=41; public Table7() { array = new Cell[SIZE]; } public boolean delete(String key) { int hash = hash(key); if(array[hash]==null) return false; else if(array[hash].first.equalKey(key)) { array[hash]=array[hash].next; return true; } else { Cell ptr; for(ptr=array[hash]; ptr!=null&&!(ptr.next.first.equalKey(key)); ptr=ptr.next); if(ptr==null) return false; else { ptr.next = ptr.next.next; return true; } } } public void add(KeyObject obj) { int hash = hash(obj.getKey()); array[hash] = new Cell(obj,array[hash]); } public KeyObject retrieve(String key) { int hash = hash(key); if(array[hash]==null) return null; else { Cell ptr; for(ptr=array[hash]; ptr!=null&&!ptr.first.equalKey(key); ptr=ptr.next); if(ptr==null) return null; else return ptr.first; } } public void print(PrintStream out) { for(int i=0; i<SIZE; i++) for(Cell ptr=array[i]; ptr!=null; ptr=ptr.next) out.println(ptr.first); } private static int hash(String key) { int val=0; key = key.toLowerCase(); for(int i=0; i<key.length(); i++) if(Character.isLetter(key.charAt(i))) val=(val*32+key.charAt(i)-'a')%SIZE; return val; } private static class Cell { KeyObject first; Cell next; Cell(KeyObject f, Cell n) { first=f; next=n; } } }
Last modified: 19th October 2001