C TOPICS: March 2010

What do you think if I want to search any file among lakhs of files in data base of my hard disk so what data structure algorithm should I use?? As we saw with binary search, certain data structures such as a binary search tree can help improve the efficiency of searches. From linear search to binary search, we improved our search efficiency from O(n) to O(logn) . We now present a new data structure, called a hash table, that will increase our efficiency to O(1) , or constant time. The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large (thousands or more).

What is a Hash Table?

A hash table is one way to create a table (sometimes called a dictionary). Typical operations are Empty, Insert, and Retrieve. The idea is that a table is a place to store data records, each containing a key value as well as other data fields (or field), so as to easily retrieve the data later by supplying a key value. The Retrieve function supplies the entire matching record of data. One way to create a table, either on disk or in main memory, is to use a hash table. The goal of a hash table is to get O(1) lookup time, that is, constant lookup time. Rather than having to proceed through a O(n) sequence of comparisons as in linear search, or a O(log n) sequence of comparisons as in binary search, a hash table will usually complete a lookup after just one comparison!

What is Hash Function?

The idea is that we store records of data in the hash table, which is either an array (when main memory is used), or a file (if disk space is used). Each record has a key field and an associated data field. The record is stored in a location that is based on its key. The function that produces this location for each given key is called a hash function.

The hash function is used to index the original value or key and then used later each time the data associated with the value or key is to be retrieved. Thus, hashing is always a one-way operation. There's no need to "reverse engineer" the hash function by analyzing the hashed values. In fact, the ideal hash function can't be derived by such analysis. A good hash function also should not produce the same hash value from two different inputs. If it does, this is known as a collision. A hash function that offers an extremely low risk of collision may be considered acceptable.

Collision:-

When inserting items into a hash table, it sometimes happens that an item to be inserted hashes to the same location as an item already in the table. (Note that some method is needed to distinguish locations in use from unused locations.) The new item typically is inserted somewhere else, though we must be sure that there is still a way (and a fairly fast way) to look up the item. There are several ways of handling collisions like rehashing function, separate chaining, coalesced hashing, and buckets and so on.

Example Of Hashing:-

As a simple example of the using of hashing in databases, a group of people could be arranged in a database like this:

Avinash, Mani, Santosh, Vilas, Patil (and many more sorted into alphabetical order)

Each of these names would be the key in the database for that person's data. A database search mechanism would first have to start looking character-by-character across the name for matches until it found the match (or ruled the other entries out). But if each of the names were hashed, it might be possible (depending on the number of names in the database) to generate a unique four-digit key for each name. For example:

7864 - Avinash, 9802 - Mani ,1990 - Santosh, 2476 – vilas and 8822 - Patil

A search for any name would first consist of computing the hash value (using the same hash function used to store the item) and then comparing for a match using that value. It would, in general, be much faster to find a match across four digits.

Let's take another simple example. First, we start with a hash table array of strings (we'll use strings as the data being stored and searched in this example). Let's say the hash table size is 12:

Figure : The empty hash table of strings

Next we need a hash function. There are many possible ways to construct a hash function. For now, let's assume a simple hash function that takes a string as input. The returned hash value will be the sum of the ASCII characters that make up the string mod the size of the table:

int hash(char *str, int table_size)

               int sum;

               /* Make sure a valid string passed in */

               if (str==NULL) return -1;

               /* Sum up all the characters in the string */

               for( ; *str; str++) sum += *str;

               /* Return the sum mod the table size */

               return sum % table_size;

Now that we have a framework in place, let's try using it. First, let's store a string into the table: "Steve". We run "Steve" through the hash function, and find that hash("Steve",12) yields 3:

Figure : The hash table after inserting "Steve"

Let's try another string: "Spark". We run the string through the hash function and find that hash("Spark",12) yields 6. Fine. We insert it into the hash table:

Figure : The hash table after inserting "Spark"

Let's try another: "Notes". We run "Notes" through the hash function and find that hash("Notes",12) is 3. Ok. We insert it into the hash table:

Figure : A hash table collision

What happened? A hash function doesn't guarantee that every input will map to a different. There is always the chance that two inputs will hash to the same output. This indicates that both elements should be inserted at the same place in the array, and this is impossible. This phenomenon is known as a collision.

There are many algorithms for dealing with collisions, such as linear probing and separate chaining. While each of the methods has its advantages, we will only discuss separate chaining here.

Separate chaining requires a slight modification to the data structure. Instead of storing the data elements right into the array, they are stored in linked lists. Each slot in the array then points to one of these linked lists. When an element hashes to a value, it is added to the linked list at that index in the array. Because a linked list has no limit on length, collisions are no longer a problem. If more than one element hashes to the same value, then both are stored in that linked list.

Let's look at the above example again, this time with our modified data structure:

Figure : Modified table for separate chaining

Again, let's try adding "Steve" which hashes to 3:

Figure : After adding "Steve" to the table

And "Spark" which hashes to 6:

Figure : After adding "Spark" to the table

Now we add "Notes" which hashes to 3, just like "Steve":

Figure : Collision solved - "Notes" added to table

Uses of Hashing:-

Associative arrays:- Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like AWK, Perl, and PHP.

Database indexing:- Hash tables may also be used for disk-based persistent data structures and database indices (such as dbm) although balanced trees are more popular in these applications.

Caches:- Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries — usually the one that is currently stored in the table.

C TOPICS

Tuesday, March 2, 2010

HASHING

What is a Hash Table?

Monday, March 1, 2010

VOLATILE KEYWORD

counter