Introduction to Abstract Data Types: Sets using Arrays, part 1

A simple set manipulation program

Suppose we want to keep a collection of items. We are not interested in the order the items are stored, and there is no concept of an item occurring multiple times in the collection. We are just interested in adding things to the collection, removing things from the collection, and checking whether a particular item is in the collection.

As before, for simplicity we will consider collections of integers rather than some more complex object type, but the principles discussed here could extend to collections of things other than integers. What we are discussing here is a collection which is like the mathematical idea of a set.

Here is a program which maintains a set of integers, with commands that you can use to add an integer to the set, remove an integer from the set, check whether an integer is in the set, or print the set:

import java.io.*;

class UseIntSets1
{
// Simplistic "single method" set maintenance program

 public static void main(String[] args)
 throws IOException
 {
  final int MAX = 100;
  int[] array = new int[MAX];
  int i,n=0,count=0;
  char ch;
  String line;
  BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
  do {
      System.out.print(": ");
      line = in.readLine();
      ch = line.charAt(0);
      if(line.length()>1)
         n = Integer.parseInt(line.substring(1).trim());
      switch(ch)
         {
          case 'q' : 
            break;
          case 'd' :
            for(i=0; i<count&&array[i]!=n; i++) {}
            if(i<count)
               {
	        count--;
                array[i]=array[count];
               }
            break;
          case 'a' :
            for(i=0; i<count&&array[i]!=n; i++) {}
            if(i==count)
	       {
                array[count] = n;
		count++;
	       }
            break;
          case 'm' :
            for(i=0; i<count&&array[i]!=n; i++) {}
            if(i==count)
               System.out.println("Not a member");
            else
               System.out.println("Is a member");
            break;
          case 'p':
            System.out.print("{");
            if(count>0)
               {
                System.out.print(array[0]);
                for(i=1;i<count;i++)
                   System.out.print(","+array[i]);
               }
            System.out.println("}");
            break;
          default:
            System.out.print("d - delete, a - add, m - member,");
            System.out.println(" p - print, q - quit");
            break;
         }
     }
  while(ch!='q');
 }
}

There are simple one-letter commands: a to add an integer to the set, d to delete one, m to check if an integer is currently a member of the set (the program causes a message to be printed saying whether it is or isn't), p to print the set, and q to quit. The first three commands must all be followed by an integer.

This program should not be taken as an example of a good program. Firstly, it is not very robust. If commands are not given in exactly the right format, the program will terminate printing incomprehensible error messages on the screen. In program intended for real use, that should never happen. If the user enters anything not in the correct format the program should always be able to handle it, printing an appropriate error message and carrying on.

But secondly, the program is poorly structured. It is just one long single method. This means it is difficult to see how the code works, and you cannot easily re-use it in other programs. Programs become clearer and more easy to re-use and modify when the code is broken up into separate classes and methods. If this fairly trivial program is already too complicated to understand in one piece, what about programs of a much greater size?

Sets represented by an array and count

The program above uses a form of representation common in languages like Java where the size of an array is fixed when it is created, and it is required to store a set of data which may vary in size during the course of execution of the program. An array is created of the maximum size thought necessary, but only part of that array is used. Here the array is called array while the variable count holds the number of cells in the array currently in use. At any time during execution, the cells a[0] up to and including a[count-1] store the data in the set, while the contents of the rest of the array are irrelevant.

The size of the array is set by a final integer variable called MAX. The final notation means the value of this integer can never be changed during program execution, so it is really a constant. But it is better to declare the maximum just once like this and after that always use the name given to the constant than use the actual figure in the program. In the program above, MAX is set to 100, but if we were careful always to refer to MAX and never otherwise directly to 100, we could change that to some other value without having to change the rest of the program if circumstances should change, maybe we discover we may need to store sets of more than 100 elements. It is a convention in Java that constants are given names which are all capital letters. By "convention" we mean a standard which Java programmers usually keep to, but which is not actually part of the Java language.

Note that we did not put any code in which checks whether the maximum size of the array is being exceeded. If the program was run and an attempt was made to put more than 100 integers in the set, it would halt with an error message that would be meaningless to someone who was not familiar with the code of the program. Again, this is done only in order to make the program short for a simple illustration. No program that was intended for real use should just crash like this, although a common source of bugs in real-life programs is where bounds that the programmer assumed would never be exceeded, are exceeded in practice, but as the programmer thought this would never happen the error is not handled appropriately.

In order to give the program a better structure, we could separate out the sections of code that manipulate the array into separate methods:

import java.io.*;

class UseIntSets2
{
// Set maintenance program with methods in same class

 public static void main(String[] args)
 throws IOException
 {
  final int MAX = 100;
  int[] array = new int[MAX];
  int n=0,count=0;
  char ch;
  String line;
  BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
  do {
      System.out.print(": ");
      line = in.readLine();
      ch = line.charAt(0);
      if(line.length()>1)
         n = Integer.parseInt(line.substring(1).trim());
      switch(ch)
         {
          case 'q' : 
            break;
          case 'd' :
            count = delete(n,array,count);
            break;
          case 'a' :
            count = add(n,array,count);
            break;
          case 'm' :
            if(member(n,array,count))
               System.out.println("Is a member");
            else
               System.out.println("Not a member");
            break;
          case 'p':
            print(array,count);
            break;
          default:
            System.out.print("d - delete, a - add, m - member,");
            System.out.println(" p - print, q - quit");
            break;
         }
     }
  while(ch!='q');
 }

 public static int delete(int n, int[] array, int count)
 {
  int i;
  for(i=0; i<count&&array[i]!=n; i++) {}
  if(i<count)
     {
      array[i]=array[count-1];
      return count-1;
     }
  return count;
 }

 public static int add(int n, int[] array, int count)
 {
  int i;
  for(i=0; i<count&&array[i]!=n; i++) {}
  if(i==count)
     {
      array[count] = n;
      return count+1;
     }
  return count;
 }

 public static boolean member(int n, int[] array, int count)
 {
  int i;
  for(i=0; i<count&&array[i]!=n; i++) {}
  return i<count;
 }
 
 public static void print(int[] array, int count)
 {
  int i;
  System.out.print("{");
  if(count>0)
     {
      System.out.print(array[0]);
      for(i=1;i<count;i++)
         System.out.print(","+array[i]);
     }
  System.out.println("}");
 }
}

Now it can be seen more easily how the data structure is handled to give the desired effect. The method delete changes the array to represent deleting an integer from the set. The place where the integer to be deleted is found is set to the integer stored at the end of the used section of the array. The count is then reduced by one, so that the initial copy of this integer is now not counted as part of the data. The effect is to change the order in which the data is stored, but this does not matter as there is no concept of ordering in sets, and none of the methods to implement them here rely on the data being ordered in any way. Note that if the element to be deleted is not found, the array is left unchanged. Special cases like this must always be considered, in this case it was a design decision that deleting an integer from a set it does not occur in is a permissible operation (so there is no error handling) and leaves the set unchanged. The method delete needs to return an integer giving the new size of the set, as this could be one less than the size before the deletion, or the same size if the item being deleted was not found.

Similarly, when an integer is added to the set, if the integers is already present no change is made. So it is necessary to loop through the array to see if it is present. If we go through it all up to the point where the valid data finishes, and don't find the integer already there, we put the integer in the next available free cell in the array, and return a new value for count to indicate the portion of the array used for storing data has increased by 1 in size.

There is a method called member which tells us whether an integer is in the set. It returns a boolean (i.e. true or false). This method cannot change the representation of the set. There is a method called print which actually prints a representation of the set onto the standard output (which would be the window in which the call to the program is made normally), but does not return anything.

All these methods use a body-less for-loop written:

for(i=0; i<count&&array[i]!=n; i++) {}

to find the position of the integer n in the array array. It first does the initialisation (i=0) and the test, in this case i<count&&array[i]!=n, and if this is true repeatedly does the test and the update (the update is just i++) until the test becomes false. The body of a for-loop is the statement following the header, where { and } are used to combine several statements into one. But {} indicates a statement made up out of no statements, and in this case the {} is the body of the for-loop. Another way of writing the same thing is:

for(i=0; i<count&&array[i]!=n; i++);

It means exactly the same, but it is perhaps easier to miss the semi-colon at the end and think, wrongly, that the following statement is meant to be in the for loop.

Note that the for-loop is not

for(int i=0; i<count&&array[i]!=n; i++) {}

with the variable i declared inside the for-loop. If it were written like this, the variable i would not be accessible outside the for-loop, but we need to know its value once the for loop has finished. If it is equal to the variable count it means it has been increased to cover each of the values from 0 going up by 1 to count, and at no point was the second element of the test array[i]!=n false. That means that the integer n does not occur in the portion of the array currently being used to store data. Otherwise, if the loop terminates because array[i]!=n is false before i<count is false, it means i is at the value where a[i] is equal to n, so i gives the position of the variable in n in the array. Note that array is not a key-word in Java (as it is, for example, in Pascal). We just happen to have chosen to name a variable by this name here.

Sets represented by objects

The second program above, while better structured than the first, is still a very bad example. The problem is that it is not clear that the array and count represent a set. The programmer needs to know they do, and reset the count value when necessary after a call to the delete or add methods. There is nothing stopping the programmer from changing what is stored in the array, so that the assumptions made in the methods (that any integer in the set is stored exactly once in the portion of the array up to and including the cell index by count-1) may not hold.

It was for this reason that the concept of an abstract data type was introduced. In an abstract data type, the way a structure is represented, in this case a set represented by an array and a count, is kept separate from the code which uses it. The code which uses it accesses it only by calling the methods which provide the desired operations. Here is the new code for the main program:

import java.io.*;

class UseIntSets3
{
// Set maintenance program with separate set class

 public static void main(String[] args)
 throws IOException
 {
  int n=0;
  char ch;
  String line;
  BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
  IntSet1 mySet = new IntSet1();
  do {
      System.out.print(": ");
      line = in.readLine();
      ch = line.charAt(0);
      if(line.length()>1)
         n = Integer.parseInt(line.substring(1).trim());
      switch(ch)
         {
          case 'q' : 
            break;
          case 'd' :
            mySet.delete(n);
            break;
          case 'a' :
            mySet.add(n);
            break;
          case 'm' :
            if(mySet.member(n))
               System.out.println("Is a member");
            else
               System.out.println("Not a member");
            break;
          case 'p':
            mySet.print();
            break;
          default:
            System.out.print("d - delete, a - add, m - member,");
            System.out.println(" p - print, q - quit");
            break;
         }
     }
  while(ch!='q');
 }

}

It creates an object referred to by variable mySet of a type we have called IntSet1. But the details of how a IntSet1 object works are kept in another file. The person who wrote the code above which uses it need have no knowledge of how a IntSet1 object works internally, and cannot alter the way an IntSet1 object is stored in any way except calling the methods that are provided as public methods on an IntSet1 object. The code executed when a method is called on an intSet1 object does all the work in making the necessary changes to that object to represent adding or deleting an integer.

Here is the code for IntSet1:

class IntSet1
{
// Set class, sets stored as unordered partly-filled arrays

 private int[] array;
 private int count;
 final static int MAX=100;

 public IntSet1()
 {
  array = new int[MAX];
 }

 public void delete(int n)
 {
  int i;
  for(i=0; i<count&&array[i]!=n; i++) {}
  if(i<count)
     {
      count--;
      array[i]=array[count];
     }
 }

 public void add(int n)
 {
  int i;
  for(i=0; i<count&&array[i]!=n; i++) {}
  if(i==count)
     {
      array[count] = n;
      count++;
     }
 }

 public boolean member(int n)
 {
  int i;
  for(i=0; i<count&&array[i]!=n; i++) {}
  return i<count;
 }

 public void print()
 {
  int i;
  System.out.print("{");
  if(count>0)
     {
      System.out.print(array[0]);
      for(i=1;i<count;i++)
         System.out.print(","+array[i]);
     }
  System.out.println("}");
 }
}

It will generally be kept in a separate file called IntSet1.java. The fields array and count are declared as private. This means that only the code inside the class itself can access them. Each IntSet1 object will have its own array and count field. The field MAX, however, is declared as static, meaning there is just one variable called MAX which is shared by all IntSet1 objects.

If s is a variable of type IntSet1, it can be set to refer to a new IntSet1 object by s = new IntSet1(). The method with signature public IntSet1(), known as the constructor for IntSet1 says what happens when a new IntSet1 object is created. Its field array is set to refer to a new array of integers of length MAX. There is no need to have a statement count = 0, because if fields are not mentioned in the constructor, they are set automatically to zero if they are numerical fields, or to null if they are object fields.

Following this, s.add(20) will add the number 20 to the set represented by the object referred to by s, and so on for any integer or expression which evaluates to an integer. Similarly, s.delete(10) will remove 10 from the set represented by the object referred to by s. These methods have void as their return type, so no value is returned from them, but they change the value of the set represented by the object referred to by s. The method print also has return type void, but s.print() does not cause any change in the object referred to by s, rather it causes a representation of that object to be printed.

To the person writing a method which makes use of these methods on sets, IntSet1 might as well be a standard part of the Java language. There is a clear separation between the code that implements IntSet1 objects and the code which uses these objects. This is a key point for structuring programs both so they are easy to understand and easy to modify. Making the array and count which represent a set inaccessible to any code except that for the methods which give the set operations is an example of the principle known as information hiding. In any situation, it is easier to understand the situation if irrelevant information is not mixed up with relevant information. Lack of access to the array and count mean they can't be changed in ways that don't fit in with how they represent sets, and in general if we don't have direct access to things we don't need to change directly, we will avoid errors caused by making changes to them in unexpected ways.

An important thing here is that we think of sets, when we write programs that make use of them, not in terms of the way they are represented on the computer, but in terms of the operations possible on them, add, delete, member and print. Implementing sets and providing methods which give these operations is a separate issue. It is as if we have built a wall between the code that uses sets and the code that implements them, with the public methods of the sets being the gate in the wall.

As a minor point, it is generally not a good idea to have a method which directly prints a representation of an object. That is too inflexible as it can only be used in a limited way. In general, there should be a method which returns the representation of an object in the form of a string. Here is such a method appropriate for the class IntSet1:

 public String toString()
 {
  int i;
  String str="{";
  if(count>0)
     {
      str+=array[0];
      for(i=1;i<count;i++)
         str+=","+array[i];
     }
  return str+"}";
 }

In fact Java makes a special use of the method called toString with no arguments in any class. If an attempt is made to print an object, the method toString is called automatically on it even if it is not mentioned. So with the above method, we can alter the lines implementing the 'p' command in the switch statement of the program that uses sets to:

case 'p':
    System.out.println(mySet);
    break;

Calling System.out.println(mySet) automatically becomes System.out.println(mySet.toString()).

Matthew Huntbach

Last modified: 27 January 2004