On the computer I'm using at present, if I run the insertion sort programs on collections of 10,000 integers, they are quick enough that the sorting seems to take place instantly. But if I run them on collections of 20,000 integers, the time taken to sort the collection is long enough to notice. If I run them on collections of 100,000 integers, I have to wait over a minute for the collection to be sorted.
With the quick sort algorithms, I find I only have to wait a few seconds
to sort a collection of 1,000,000 integers. Consistently, quick sort
and merge sort work faster than insertion sort and selection sort, and
the difference is very noticeable for large amounts of data. The sort
methods provided by the Java system take about the same amount of time
as the quick sort and merge sort algorithms. So we can guess there isn't
some other sorting algorithm which is much faster, otherwise those who
put together the Java code library would have used it for the library's
own sort methods.
So is this the way to choose between algorithms - program them, run them, and see how they perform? We should always test our code by running it, to see if it seems to work correctly and within reasonable time limits. But with algorithms, just as we can use mathematical techniques to prove they work correctly, so we can use mathematical techniques to show how their efficiency compares.
In this course, the emphasis is on understanding and coding algorithms.
A more theoretical course would place the emphasis on analysis of them,
proving their correctness and efficiency. In this set of notes we will
just show very briefly and quite informally the sort of reasoning
that could be taken much further in a course on the analysis of algorithms.
Binary Search
Before considering the efficiency of sorting algorithms, let us consider
the simpler problem of searching for an item in a collection.
We have previously seen methods
for searching for an item in an array, that is finding whether a given
item is in an array or not. Code to search an array (adapted for an array
with base type Integer
, but otherwise similar to before) is:
public static boolean search(Integer[] a, Integer n) {
    int i;
    for (i = 0; i < a.length && !a[i].equals(n); i++) {}
    return i < a.length;
}

You can find this code and support code to demonstrate it in the files Find1.java and Find2.java, the first printing the array contents, the second giving timing figures. All it does is look through each of the items in the array in turn, until either it has found the one it is searching for, or it has gone through the entire array.
But suppose we are searching for an item in an array where the items are stored in order? We might note that if we are looking through the items one at a time starting at the first one, which will be the smallest one, and find one greater than the one we are looking for before we find the one we are looking for, then we can stop searching, knowing that the item cannot be in the array. Here is code which does this:
public static boolean search(Integer[] a, Integer n) {
    int i;
    for (i = 0; i < a.length && a[i].compareTo(n) < 0; i++) {}
    return (i < a.length && a[i].equals(n));
}

You can find this code and supporting code to demonstrate it in the files Find3.java and Find4.java.
On average, you now only have to go through half the array to show an item is not present, whereas previously you had to go through the whole array. So there is some improvement in efficiency.
But a better improvement in efficiency when searching for an item in a sorted array is to consider another algorithm, called binary search. Suppose we look at the middle item in an array first. If it is the item we are searching for, we have found it. If the item we are searching for is less than the middle item, we can continue searching, but this time we search only those items before the middle item. If the item we are searching for is greater than the middle item, we can continue searching, but this time searching only those items after the middle item. We know the item is not in the array if the portion of the array being searched narrows down to nothing. Here is code to do that:
public static boolean search(Integer[] a, Integer n) {
    return search(a, n, 0, a.length);
}

private static boolean search(Integer[] a, Integer n, int from, int to) {
    if (from == to)
        return false;
    else {
        int mid = (from + to) / 2;
        int res = n.compareTo(a[mid]);
        if (res == 0)
            return true;
        else if (res < 0)
            return search(a, n, from, mid);
        else
            return search(a, n, mid + 1, to);
    }
}

You can see that the work is done in the four-argument method, which takes the additional arguments of the indexes giving the portion of the array being searched. An alternative way of expressing this using iteration rather than recursion is:
public static boolean search(Integer[] a, Integer n) {
    int from = 0, to = a.length;
    while (from != to) {
        int mid = (from + to) / 2;
        int res = n.compareTo(a[mid]);
        if (res == 0)
            return true;
        else if (res < 0)
            to = mid;
        else
            from = mid + 1;
    }
    return false;
}

You can find the recursive code and supporting code in the files Find5.java and Find6.java, and the iterative code and supporting code in the files Find7.java and Find8.java.
You will find this binary search algorithm much quicker than the linear search. Even when searching an array of a million items, it appears to achieve it instantaneously, quickly enough on my computer for the timing to be less than one millisecond.
The reason for the quickness is that this algorithm doesn't just cut the amount of the array being searched by half, it does that repeatedly. It starts off with an array, cuts it in half to search half the array, cuts that in half to search a quarter of the array, and so on. It doesn't have to do any work to cut the array in half, it just changes the values of the integers giving the bounds. The main work done is to look at the middle item each time and compare it with the item being searched for. So the number of times the item being searched for is compared against an item in the array is, at most, the number of times the size of the array can be cut in half. Even with 1,000,000 (a million), cutting that in half gives 500,000, cutting that in half gives 250,000, cutting that in half gives 125,000, and so on - reducing to 1 after 20 halvings. So that means at most 20 comparisons to find if an item is in an ordered array of a million items. Compare that with the algorithm which looks at each item in turn until either the item is found or the item looked at is greater than the item being looked for. On average it would look at 500,000 items to determine whether an item is in the array or not.
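To check this halving argument in practice, here is a sketch (the class and method names are illustrative, not from the course files) that instruments the iterative binary search above to count comparisons. Searching for an item that is not present forces the maximum number of halvings, yet for an array of a million items the count never goes above about 20.

```java
// Sketch: count how many comparisons binary search makes.
public class BinarySearchCount {
    // Returns the number of compareTo calls made while searching for n.
    public static int countComparisons(Integer[] a, Integer n) {
        int from = 0, to = a.length, count = 0;
        while (from != to) {
            int mid = (from + to) / 2;
            count++;                       // one comparison per halving
            int res = n.compareTo(a[mid]);
            if (res == 0) return count;
            else if (res < 0) to = mid;
            else from = mid + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        Integer[] a = new Integer[1_000_000];
        for (int i = 0; i < a.length; i++) a[i] = i;
        // -1 is smaller than every item, so the search narrows to nothing.
        System.out.println(countComparisons(a, -1)); // at most 20
    }
}
```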
Note the efficiency has nothing to do with the fact that binary search can be expressed using recursion. Linear search can be expressed using recursion as well:
public static boolean search(Integer[] a, Integer n) {
    return search(a, n, 0);
}

private static boolean search(Integer[] a, Integer n, int pos) {
    if (pos >= a.length)
        return false;
    else if (a[pos].equals(n))
        return true;
    else
        return search(a, n, pos + 1);
}

Here, the work is done in the three-argument method, where search(a,n,pos) can be interpreted as "return true if the integer n is in the portion of the array a starting at position pos and going to the end, return false otherwise". If pos is equal to or greater than the length of the array, there is no portion of the array to search, so the answer must be false. If it happens that the item indexed by pos in the array is the integer n, then we can say right away that it is in the portion of the array starting at the position indexed by pos. Otherwise it is in the portion of the array starting at the position indexed by pos if and only if it is in the portion of the array starting at the position indexed by pos+1.

The files Find9.java and Find10.java can be used to run this code. You will find that if you run the code for a large array (for example, 30000 elements) it will halt with a java.lang.StackOverflowError. The reason for this is that in practice Java places a limit on the depth of recursive calling. Since here, unlike binary search, each recursive call reduces the size of the portion of the array being searched by only one, the depth of recursion gets to be equal to the size of the array.
The Efficiency of the Sorting Algorithms

Now consider selection sort applied to an array of N items. To find the lowest item we have to go through the whole array making N-1 comparisons between the lowest item found so far and the next item in the array. Then the lowest item is put in the first place, and we have to find the second lowest item from the rest of the array, which involves N-2 comparisons. We put that in the second place and have to make N-3 comparisons to find the third placed item, and so on, down to the point where we are making just 1 comparison to determine which is the second highest item and which is the highest. So, in all, that's (N-1)+(N-2)+(N-3)+...+3+2+1 comparisons.
If you're already familiar with the mathematical idea of series, you will know this adds up to approximately N^2/2 comparisons. If you're not, consider the following: adding the (N-1) and the 1 gives N, adding the (N-2) and the 2 gives N, and so on, so there are about N/2 pairs each of which adds to N, producing a total sum of about N^2/2.
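The pairing argument can be checked directly. Here is a sketch (the class and method names are illustrative) of the selection sort described, instrumented to count its comparisons; the count always comes out as exactly N*(N-1)/2, which is roughly N^2/2.

```java
public class SelectionSortCount {
    // Selection sort that returns the number of comparisons it made.
    public static int sortCountingComparisons(int[] a) {
        int count = 0;
        for (int i = 0; i < a.length - 1; i++) {
            int lowest = i;
            for (int j = i + 1; j < a.length; j++) {
                count++;                  // compare lowest-so-far with next item
                if (a[j] < a[lowest]) lowest = j;
            }
            int tmp = a[i]; a[i] = a[lowest]; a[lowest] = tmp;
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] a = new java.util.Random(42).ints(n).toArray();
        int count = sortCountingComparisons(a);
        // (N-1)+(N-2)+...+1 = N*(N-1)/2
        System.out.println(count == n * (n - 1) / 2); // prints true
    }
}
```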
Selection sort is particularly easy to analyse, as the number of comparisons is always the same no matter what the data. Insertion sort is a bit more complex to analyse, as the number of comparisons depends on the data. First there is one comparison to see if the second item should come before the first item. Then the third item must be compared with the second item, and only if it is smaller also compared with the first item. Then the fourth item must be compared with the third item, and if it is smaller with the second item, and if it is smaller than that with the first item. So if insertion sort is applied to an array which is already in order, each item will only have to be compared with the one before it, and there will be N-1 comparisons if there is a total of N items. If insertion sort is applied to an array which is in reverse order, each item will be compared with all the items before it, so there will be 1+2+3+...+(N-3)+(N-2)+(N-1) comparisons, which makes a total of about N^2/2 comparisons, as we showed above. If the data in the array is in random order, each item will be compared on average with half the items before it, making a total of about N^2/4 comparisons. So with insertion sort we have a best case, a worst case and an average case. It is important to note that the best case and worst case only apply when the data is in some sort of order. If the data is in a random order, then the efficiency is very unlikely to be anything but the average case.
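The best and worst cases can be demonstrated with a sketch (class and method names are illustrative) of an insertion sort instrumented to count comparisons: an already ordered array gives N-1 comparisons, a reverse-ordered one gives N*(N-1)/2, roughly N^2/2.

```java
public class InsertionSortCount {
    // Insertion sort that returns the number of comparisons it made.
    public static int sortCountingComparisons(int[] a) {
        int count = 0;
        for (int i = 1; i < a.length; i++) {
            int item = a[i];
            int j = i - 1;
            while (j >= 0) {
                count++;                  // compare item with a[j]
                if (a[j] <= item) break;  // found its place: stop early
                a[j + 1] = a[j];          // shift larger item rightwards
                j--;
            }
            a[j + 1] = item;
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] sorted = new int[n], reversed = new int[n];
        for (int i = 0; i < n; i++) { sorted[i] = i; reversed[i] = n - i; }
        System.out.println(sortCountingComparisons(sorted));   // best case: N-1 = 999
        System.out.println(sortCountingComparisons(reversed)); // worst case: N*(N-1)/2 = 499500
    }
}
```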
Now let us consider merge sort. This works by dividing the array into two equally sized pieces, sorting them, and merging the two sorted halves into one. If the data was originally in random order, merging the two halves will require nearly N comparisons, though the best case is if the data was originally ordered, in which case merging would require only N/2 comparisons. We also need to consider the number of comparisons required to sort each half. Each half is divided into two pieces of length N/4, and the two pieces are sorted and merged. The merging of two pieces of size N/4 to make an ordered piece of size N/2 requires up to N/2 comparisons, and there are two mergers of N/4 pieces to make N/2 pieces, so that's a total of up to N comparisons at this level. Following on from this, you can see there are four mergers of N/8 pieces to make N/4 pieces, each requiring up to N/4 comparisons, so again a total of up to N comparisons, and so on. This goes down to the point where the pieces are of length 1, so no sorting or merging is required. This point is reached when the original array has been divided a total of M times, where M is the number of times N can be halved. So in total there are up to M times N comparisons to sort an array of length N using merge sort.
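The bound of M times N comparisons can be checked with a sketch (class and method names are illustrative) of a merge sort instrumented to count the comparisons made in all its mergers. For an array of length 1024, which is 2^10, the count never exceeds 10 times 1024.

```java
public class MergeSortCount {
    // Merge sort that returns the number of comparisons made in all mergers.
    public static int sortCountingComparisons(int[] a) {
        if (a.length <= 1) return 0;
        int[] left = java.util.Arrays.copyOfRange(a, 0, a.length / 2);
        int[] right = java.util.Arrays.copyOfRange(a, a.length / 2, a.length);
        int count = sortCountingComparisons(left) + sortCountingComparisons(right);
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            count++;                      // one comparison per merge step
            if (left[i] <= right[j]) a[k++] = left[i++];
            else a[k++] = right[j++];
        }
        while (i < left.length) a[k++] = left[i++];
        while (j < right.length) a[k++] = right[j++];
        return count;
    }

    public static void main(String[] args) {
        int n = 1024; // 1024 = 2^10, so M = 10
        int[] a = new java.util.Random(42).ints(n).toArray();
        System.out.println(sortCountingComparisons(a) <= 10 * n); // prints true
    }
}
```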
As we noted before, 1,000,000 (one million) can be cut in half twenty times before it reaches the size of 1. So to sort an array of a million items using merge sort, up to 20,000,000 (twenty million) comparisons will have to be made. But compare this with insertion sort, where we noted that with randomly ordered data an array of length N would require about N^2/4 comparisons. For an array of length 1,000,000 that's 250,000,000,000 comparisons (two hundred and fifty thousand million, or two hundred and fifty billion). So you can see that merge sort is much more efficient than insertion sort.
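The arithmetic above can be checked directly; this small sketch works out both figures and their ratio.

```java
public class SortCostComparison {
    public static void main(String[] args) {
        long n = 1_000_000;
        long mergeSortCost = 20 * n;        // about M*N comparisons, M = 20 for a million
        long insertionSortCost = n * n / 4; // about N^2/4 comparisons on random data
        System.out.println(mergeSortCost);                     // 20000000
        System.out.println(insertionSortCost);                 // 250000000000
        System.out.println(insertionSortCost / mergeSortCost); // 12500
    }
}
```

So for a million items, insertion sort makes 12,500 times as many comparisons as merge sort.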
To work out the efficiency of quick sort we think in a similar way to working out the efficiency of merge sort. With quick sort, the comparisons take place before the recursive calls, not after, but otherwise the reasoning is the same. There are N comparisons required to divide the original array of N elements into two pieces of size N/2. A similar process applies to each of these, so there are two lots of N/2 comparisons to divide the two N/2 sized pieces into four N/4 sized pieces, and so on. So long as each division is into approximately equal sized pieces, as with merge sort there are a total of about M times N comparisons to sort an array of length N, where M is the number of times N can be halved to reach 1.
Quick sort, however, has a bad worst case. Suppose it is applied to an array which is already ordered, and, as we have done, the first item is taken as the pivot. Then the array will be divided into two unequal parts: one empty (all those items less than the pivot), and one of length N-1 (all those items greater than the pivot). This requires N-1 comparisons; then the N-1 part is similarly divided, requiring N-2 comparisons, and so on down to one comparison. So in this worst case, quick sort requires about N^2/2 comparisons.
The sort algorithm which the Java library uses in its built-in methods is in fact merge sort, and this is in order to avoid execution becoming unexpectedly slow if the sort method is applied to an array which is already sorted.
Logarithms

Recall that M was the number of times we could halve N before we reached 1. If N is 8, then halving it once gives 4, halving it twice gives 2, and halving it three times gives 1. If N is 16, halving it four times gives 1. You can see that if N is 2^M then M is the number of times N can be halved. Remember that 2^M is 2*2*...*2*2 where M 2s are multiplied together.

If N is in between 2^(M-1) and 2^M, then repeated halving will eventually lead to a point where an odd number has to be halved, and since we are dealing with integers only, that will be into two pieces with one piece one greater than the other. Consider halving the larger piece each time this happens; then if N is less than or equal to 2^M and greater than 2^(M-1), it can be halved M times.
In mathematical terms, M is "the logarithm of N to the base 2", which is written as log_2 N. In general, if a^b = c then the value b is described as "the logarithm to the base a of c", which is written log_a c.
From this we say that binary search is "of order log N", meaning that for an array of N elements it takes around log N steps to execute. We say that linear search is "of order N", meaning it takes around N steps to execute. Selection sort and insertion sort are "of order N^2" because they take around N^2 steps to execute. Merge sort and quick sort are "of order N log N", meaning N times the logarithm of N, because that is a measure of the number of steps taken to execute. Note that when we give these figures, we ignore constant multiplying factors. We don't need to give the base of the logarithm, since the logarithm to any base of a number is the logarithm to another base of the same number multiplied by a constant factor.
Note that if an algorithm is described as O(1), it means the time taken to execute does not depend on the amount or size of the data. In Java, accessing any element of an array given its index (that is, getting or setting a[i]) is O(1). Algorithms which make use of getting or setting components of an array rely on knowing that this can be done in a single step. We could not, for example, apply the analysis we have applied above to similar algorithms on Lisp lists, since we know that to find the ith element of a Lisp list we have to go through i steps.
In general, an O(1) algorithm is the most efficient, followed by O(log N), followed by O(N), followed by O(N log N), followed by O(N^2). Algorithms of O(log N) are also known as "logarithmic". Algorithms of O(N) are also known as "linear". Algorithms of O(N^2) are also known as "quadratic". You may also come across algorithms of O(N^3), or "cubic", which are particularly inefficient. An even more inefficient class of algorithms is those of O(2^N), known as "exponential". Recall that 2^20 is over a million, so an algorithm of O(2^N) would be taking around a million steps for just 20 data items.
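For just N = 20 the difference between these orders is already dramatic, as this small sketch shows:

```java
public class GrowthRates {
    public static void main(String[] args) {
        int n = 20;
        System.out.println((long) (Math.log(n) / Math.log(2))); // log N: about 4
        System.out.println(n);                                  // N: 20
        System.out.println((long) n * n);                       // N^2: 400
        System.out.println(1L << n);                            // 2^N: 1048576
    }
}
```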
Last modified: 24 February 2006