==============================================================================
C-Scene Issue #05
A Simple Hash Table Implementation
Platform: Non-specific
Charles P. Wright
==============================================================================


A Simple Hash Table Implementation

Charles P. Wright

Hash tables are pretty cool things, and quite versatile. They have a wide range of uses including associative arrays in perl (I am sure we all love these) and many others. I apologize in advance for the differences in code between the article and the included files. I already wrote this program and have added comments into the article to make it comprehensible, some of the code (the hash algorithm is really nasty to read). I have tried to add the comments back into the original code, but I could have screwed up.

To effectively use a hash table you need to know approximately how many data items you will have. This is because if you under estimate you will get a lot of collisions (these are explained later in the article), and if you over estimate you will waste memory. You can get around this by using an array of pointers to your hash structure as this should greatly reduce the memory used by unoccupied elements.

A hash table is an array that has a prime number of elements. (See prime.txt for a rather large list of prime numbers) When you wish to add an piece of data to the hash table you need to take the key (usually characters) and turn it into a number. This is the hash algorithm, the only really neccessary requirement is that the hash algorithm will always produce the same results for the same key. It is also desirable to distribute the results for different keys nicely. You then modulus the number produced by the hash algorith by the number of elements in the hash table. Then you put the data in that element of the array.

unsigned int hash (char *s); // Starts on line #9 of hash.c

unsigned int hash (char *s) {
   int i, n;			// Our temporary variables
   unsigned int hashval;
   unsigned int ival;
   char *p;

   p = (char *) &ival; 		// p lets us reference the integer value as a
				// character array instead of an integer.  This 				// is actually pretty bad style even though it
				// works.  You should try a union if you know
				// how to use them.

   hashval = ival = 0;		// Initialize our variables

   // Figure out how many characters are in an integer the correct answer is 4 
   // (on an i386), but this should be more cross platform.
   n = (((log10((double)(UINT_MAX)) / log10(2.0))) / CHAR_BIT) + 0.5;
 
   // loop through s four characters at a time
   for(i = 0; i < strlen(s); i += n) {
      // voodoo to put the string in an integer don't try and use strcpy, it
      // is a very bad idea and you will corrupt something.  
      strncpy(p, s + i, n);
      // accumulate our values in hashval
      hashval += ival;
   }

   // divide by the number of elements and return our remainder
   return hashval % HASHELEMENTS;
}

If we could just shove the data in the array it would be pretty simple, but what if there are two keys that hash to the same value? This is a collision, and slows down our lookup. A good hash algorithm won't have many of these, but the are inevitable. In this case I have implemented a linked list. This will deal with collisions pretty well.

To look up an item we simply hash the search key and then go straight to that element of the array. If it matches then everything is cool and we can extract the data, otherwise we must traverse the linked list. If we run out of items, then the data obviously doesn't exist.

The hash table is a great method for lookups as it will take order c operations to find an element. If I knew any advanced math then I could explain this better, but essentially it means in theory you should only have to perform one operation to find an element. You can try and glean more information from The art of Computer Programming, Volume 3, Sorting and Searching, by Knuth. As you get more elements compared to slots you begin to deviate from theory. You can use the profiling code to experiment with this.

Just keep in mind when to use this and when to use other searching methods, if you don't know how much data you will have then I would suggest you should try other things like a binary tree. You can resize your hash table, but that requires allocating new memory and then rehashing every element, a binary tree can have as many nodes as you need without the need for extra allocation.

These are the basics of hash tables, to find out more I suggest reading the included code. It really doesn't do anything useful, and for real stuff I suggest having a pointer to your data. The code should be reasonably cross platform, but I will only claim that it should work with gcc-2.7.2.3-11 on RedHat 5.1 on an Intel Pentium Processor. YMMV.

I hope this has been useful to you, have fun hashing.


This page is Copyright © 1998 By C Scene. All Rights Reserved