Hash Tables in Java LG #57

Hash tables in Java

A hash table is conceptually a contiguous section of memory with a number of addressable elements, commonly called bins, in which data can be quickly inserted, deleted and found. Hash tables represent a sacrifice of memory for the sake of speed – they are certainly not the most memory efficient means of storing data, but they provide very fast lookup times. Hash tables are a common means of organising data, so the designers of the Java programming language have provided a number of classes for easily creating and manipulating instances of hash tables.

Hashtable is the class which provides hash tables in Java. Hashtable inherits directly from Dictionary and implements the Map, Cloneable and Serializable interfaces. The implementation of the Map interface is new with Java 1.2. You can view the documentation on Hashtable here.

A key is a value that can be mapped to one of the addressable elements of the hash table. The Java programming language provides an interface for generating keys for a number of the core classes: as an example, the snippet below prints out the key representation of a string for later use in a hash table.

     String abc = new String("abc");
     System.out.println("Key for \"abc\" is "+ abc.hashCode());

A hashing function is a function that performs some operation on a set of data such that a key is generated from that data with the key being used as the means of identifying which memory element of the hash table to place the data in. There are a number of properties that it is desirable for a hashing function to possess in order that the hash table be effectively used:

  • Data should be dispersed as randomly as possible across the hash table to minimise the chances of a collision. For example, a good hashing function would place the letter « b » fairly far from the letter « a ».
  • The hashing function should execute in a reasonable period.

Unfortunately, as we shall see below, the hashing functions provided by Java do not satisfy the first criterion.

The load factor of a hash table is defined as the ratio of the number of filled bins to the total number of bins available. A bin is full when it points to or contains a data element. The load factor is a useful parameter to use in estimating the likelihood of a collision occurring. The Java programming language will allocate more memory to an existing hash table if the load factor exceeds 75%. The user can also choose to set the initial capacity of the hash table with the aim of reducing the number of rehashing methods required. The code snippet below demonstrate how this can be achieved.

       int initialCapacity = 1000;
       float loadFactor = 0.80;
       Hashtable ht = new Hashtable(initialCapacity, loadFactor);

If you want to allocate more space for your hash table before the load factor reaches the specified value then use the rehash() method like this:

       ht.rehash();

A collision occurs when two pieces of data are denoted with the same key by the hashing function. Since the point of using a hash table is to maximise the efficiency with which data is inserted, deleted or found, collisions are to be avoided as much as possible. If you know the hashing function used to create a key then it can be very easy to create collisions. For example, the Java code illustrates how two different strings can have the same hashcode. (text version)

import Java.util.*;
import Java.lang.*;

// x + 31x = x(31 + 1) = x + 31 + 31(x-1)
public class Identical
{
    public static void main(String[] args)
    {
        String s1 = new String("BB");
        String s2 = new String("Aa");
        System.out.println(s1.hashCode());
        System.out.println(s2.hashCode());
    }
}

This code generates the following output on my RedHat 6.2 box using the kaffe compiler.

[bash]$ javac Identical.java
[bash]$ java Identical
2112
2112
[bash]$

Chaining is a method of dealing with collisions in a hash table by imagining that each bin is a pointer to a linked list of data elements. When a collision occurs, the new data element is simply inserted into this linked list in some manner. Similarly, when attempting to remove an element from a bin with more than one entry, the list is followed until the element matching the one to be deleted is found.Actually, there is no need for collisions to be dealt with by solely using a linked list – a data structure such as a binary tree could also be used. The Hashtable class in the Java foundation classes uses this method to insert elements into a hash table. Interestingly, using chaining means that a hashtable can have a load factor that exceeds 100%.

READ  News Bytes LG #38

Open addressing occurs when all of the elements in a hash table are stored in the table itself – no pointers to linked lists: each bin holds one piece of data. This is the simplest means of implementing a hash table, since the table reduces to being an array where the elements can be inserted or deleted from any index position at any time.

Linear probing is a means of implementing open addressing by choosing the next free bin to insert a data element into if a collision occurs while trying to insert data. Each of the subsequent bins is checked for a collision before the data is inserted.

The String class contains a method hashCode() which is used to generate a key which can used as the key for a hash table. The hashcode for a String object is computed as
s[0]*31^(n-1)+s[1]*31^(n-2)+…+s[n-1]
using integer arithmetic and where s[i] is the ith character of a string of length n. The hash value of an empty string is defined as zero.

I’ve included a small sample program called CloseWords which finds words in the system dictionary which are « close » to the command line argument. To do this the program explicitly exploits one of the traits of the class String’s hashing function which is that the hashcode generated tends to cluster together words which are of similar alphanumeric composition. This is actually an undesirable trait, since if the input data is comprised of a limited set of characters then there will tend to be a large number of collisions. The ideal hashing function would distribute the data randomly over the hash table, with trends in the data not leading to an all over tendency to cluster.

READ  Colours with Linux terminals: Do I have it at all?

Another limitation of the hashCode method is that by making the key of type integer the designers of Java unnaturally limited the possible magnitude of the key to just 2^32 -1 meaning that the probability of a collision occurring is much larger than if the key was represented by a 64 bit data type.

The Hashtable class and methods supplied in the Java Foundation Classes are a powerful tool for data manipulation – particularly when rapid data retrieval, searching or deleting are required. For large data sets, however, the implementation of the hashing functions in Java will cause a tendency for clustering – which will unnecessarily slow down execution. A better implementation of a hash table would involve a hashing function which distributed data more randomly and a longer data type used for the key.

References and links

For a more complete discussion of the limitations of hash tables in Java and a better implementation see.

Java is a superbly documented language – check it out at SUN.

For information on the open source Kaffe compiler visit the website.

CloseWords

Note: you can grab the source code here


import
java.
lang.
*;
import
java.
util.
*;
import
java.
io.
*;
public
class
CloseWords {
Hashtable
ht;
String
currString;
param
args
public
static
void
main(
String[]
args) { ht =
new
Hashtable();
try {
DataInputStream
in =
new
DataInputStream(
new
BufferedInputStream(
new
FileInputStream(
"/usr/dict/words")));
while((currString = in.readLine())!=
null) ht.put(
new
Integer(currString.hashCode()), currString);
int
i = args[0].hashCode();
int
found=0;
while(found < 5) { i--;
if(ht.get(
new
Integer(i))!=
null) { System.out.println(ht.get(
new
Integer(i))); found++; } } i = args[0].hashCode(); found=0;
while(found < 5) { i++;
if(ht.get(
new
Integer(i))!=
null) { System.out.println(ht.get(
new
Integer(i))); found++; } } }
catch(
IOException
ioe) { System.out.println(
"IO Exception"); } } }
Copyright © 2000, Ben Tindale
Published in Issue 57 of Linux Gazette, September 2000