DSAA: Open Hash (Part 1)

1. General Idea

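　　I had always heard people talk about hashing and assumed it was something very sophisticated. In fact, I had already used a hash in "Multithreading: Mutex and Condition Variable Experiment (Part 3)". Now let's look at a more precise definition of hashing: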

The common convention is to have the table run from 0 to H_SIZE-1;

Each key is mapped into some number in the range 0 to H_SIZE - 1 and placed in the appropriate cell.

The mapping is called a hash function, which ideally should be simple to compute and should ensure that any two distinct keys get different cells. Since there are a finite number of cells and a virtually inexhaustible supply of keys, this is clearly impossible, and thus we seek a hash function that distributes the keys evenly among the cells.

The only remaining problems deal with choosing a function, deciding what to do when two keys hash to the same value (this is known as a collision), and deciding on the table size.

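　　The third point above is the crucial one: the hard parts of hashing are implementing the hash function and choosing hash_table_size, and both have been discussed at length elsewhere.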
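As a small illustration of the mapping described above, the following sketch hashes unsigned integer keys by taking the remainder modulo the table size and counts how many keys land in one cell. The value of `H_SIZE` and the helper names `cell_of` and `keys_in_cell` are assumptions made up for this example; they do not come from DSAA.

```c
#include <stddef.h>

#define H_SIZE 7u  /* assumed table size, chosen only for illustration */

/* Map a non-negative key to a cell index in the range 0 .. H_SIZE - 1. */
unsigned int cell_of(unsigned int key) {
    return key % H_SIZE;
}

/* Count how many of the n keys land in the given cell. */
size_t keys_in_cell(const unsigned int *keys, size_t n, unsigned int cell) {
    size_t i, count = 0;
    for (i = 0; i < n; i++)
        if (cell_of(keys[i]) == cell)
            count++;
    return count;
}
```

With the keys 3, 10, 17 and 4, the first three all map to cell 3, which is exactly the collision the quoted text mentions, while 4 maps to cell 4.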

2. Prime Hash Size

For reasons we shall see later, and to avoid situations like the one above, it is usually a good idea to ensure that the table size is prime.

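　　There has been a lot of discussion about why the hash table size should be prime. One fairly approachable treatment is the post "算法分析：哈希表的大小为何是素数" (Algorithm Analysis: Why Is the Hash Table Size a Prime), whose author argues by example. That is still not very intuitive, so I thought about it a bit more; perhaps a short mathematical derivation can show the reason directly.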
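　　Suppose the keys being hashed form an arithmetic sequence $a_{n}=a_{1}+(n-1)*d$, the table size is $m$, and $m$ has a factor $m_{1}$ with $1<m_{1}<m$. The cell each key maps to is:

$a_{n}\%m=(a_{1}\%m+(n-1)*d\%m)\%m$

　　When $m_{1}=d$, that is, when the common difference of the sequence equals a factor of $m$, this becomes:

$a_{n}\%m=(a_{1}\%m+(n-1)*m_{1})\%m$

　　Here $a_{1}\%m+(n-1)*m_{1}$ is itself an arithmetic sequence with common difference $m_{1}$, so the cells that the $a_{n}$ map to advance in steps of $m_{1}$ and wrap around after $m/m_{1}$ steps. Only $m/m_{1}$ of the $m$ cells are ever used. With $m_{1}=2$ the utilization of the hash table is only 50 percent, and it drops further as $m_{1}$ grows. This analysis assumes throughout that the input keys form an arithmetic sequence.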
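The arithmetic-sequence argument can be checked numerically. The sketch below hashes the first `n` terms of such a sequence with `key % m` and counts the distinct cells that get hit. The function name `cells_used` and its parameters are hypothetical, introduced only for this check, and it assumes `m` is small enough that adding two values below `m` does not overflow `unsigned int`.

```c
#include <stdlib.h>

/* Hash the first n terms of a1, a1 + d, a1 + 2d, ... with key % m and
   return how many distinct cells are hit.
   Returns 0 if m is 0 or the scratch array cannot be allocated. */
unsigned int cells_used(unsigned int a1, unsigned int d,
                        unsigned int n, unsigned int m) {
    unsigned char *seen;
    unsigned int k, pos, step, used = 0;

    if (m == 0)
        return 0;
    seen = calloc(m, 1);          /* one zero-initialised flag per cell */
    if (seen == NULL)
        return 0;

    pos = a1 % m;                 /* cell of the first term */
    step = d % m;                 /* each term moves this many cells on */
    for (k = 0; k < n; k++) {
        if (!seen[pos]) {
            seen[pos] = 1;
            used++;
        }
        pos = (pos + step) % m;   /* both operands are below m */
    }
    free(seen);
    return used;
}
```

For example, `cells_used(1, 2, 100, 10)` reaches only 5 of the 10 cells, while the same sequence in a table of size 11 reaches all 11. With a prime table size this loss can only happen when `d` is a multiple of `m` itself.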

3. Open Hashing (Separate Chaining)

　　I implemented a fairly simple open hash myself. The code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <err.h>
#define handle_error(msg) do{ perror(msg); exit(-1);}while(0)
typedef struct node {
  int key;
  struct node * next;
} NODE;
struct hash_tbl
{
    unsigned int table_size;
    NODE * * the_lists;
};
typedef NODE * LIST;
typedef struct hash_tbl * HASH_TABLE;



int hash(int key, HASH_TABLE H);
void insert( int key, HASH_TABLE H );
NODE *  find( int key, HASH_TABLE H);
HASH_TABLE initialize_table( unsigned int table_size );
void delete(int key, HASH_TABLE H);
NODE *  prefind( int key, HASH_TABLE H);


int main (){
    int i,n,num;
    HASH_TABLE hash_table;
    LIST ptr;
    printf("input the table_size :\n");
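    /* reject non-numeric input and sizes that would make key % table_size invalid */
    if(scanf("%d",&n) != 1 || n <= 0)
        errx(1,"table_size must be a positive integer");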
    hash_table=initialize_table(n);
    printf("insert 1..n to your hash_table\n");
    for ( i=1;i<=n;i++)
        insert(i,hash_table);
    printf("done\n");

    //print the hash table that was built
    printf("hash_table:\n");
    for(i=0;i<n;i++){
        printf("[%d] ",i);
        for(ptr=(hash_table->the_lists)[i];ptr != NULL;ptr=ptr->next)
            printf("%d ",ptr->key);
        printf("\n");
    }
    printf("\n");

    //look up a user-supplied key in the hash table
    printf("please input the key you want to find:\n");
    scanf("%d",&num);
    if(find(num,hash_table) == NULL)
        printf("can't find your key %d\n",num);
    else
        printf("find the key %d\n",num);

    //delete a user-supplied key from the hash table
    printf("please input the key you want to delete:\n");
    scanf("%d",&num);
    delete(num, hash_table);

    //print the lists again
    printf("hash table after deleted a key:\n");
    for(i=0;i<n;i++){
        printf("[%d] ",i);
        for(ptr=(hash_table->the_lists)[i];ptr != NULL;ptr=ptr->next)
            printf("%d ",ptr->key);
        printf("\n");
    }
    printf("\n");
}

int hash(int key, HASH_TABLE H){
    return key%H->table_size;
}
HASH_TABLE initialize_table( unsigned int table_size ){
    HASH_TABLE H;
    H = malloc ( sizeof (struct hash_tbl) );
    if( H == NULL )
        errx(1,"Out of space!!!"); /* errx appends the newline itself */
    H->table_size=table_size;
    /* calloc zero-fills the array, so on common platforms every list starts out empty (NULL) */
    H->the_lists = calloc( H->table_size, sizeof (LIST));
    if( H->the_lists == NULL )
        errx(1,"Out of space");
    return H;
}
NODE *  find( int key, HASH_TABLE H){
    NODE * ptr;
    LIST list_header =(H->the_lists)[hash(key,H)];
    for(ptr=list_header;ptr!=NULL;ptr=ptr->next){
        if(ptr->key == key)
            break;
    }
    return ptr;
}
/* Return the node just before the one holding key, or NULL when the key
   is absent or sits at the head of its list (no predecessor exists). */
NODE *  prefind( int key, HASH_TABLE H){
    NODE * ptr;
    LIST list_header =(H->the_lists)[hash(key,H)];
    /* the ptr!=NULL test keeps an empty list from being dereferenced */
    for(ptr=list_header;ptr!=NULL && ptr->next!=NULL;ptr=ptr->next){
        if(ptr->next->key == key)
            return ptr;
    }
    return NULL;
}
void insert( int key, HASH_TABLE H ){
    NODE * new_cell;
    int idx = hash( key, H );
    if( find( key, H ) != NULL )
        return; /* key already present; keys stay unique */
    new_cell = malloc(sizeof(NODE));
    if( new_cell == NULL )
        errx(1,"Out of space!!!");
    /* push at the front; this also covers the empty-list case */
    new_cell->key = key; /* Probably need strcpy!! */
    new_cell->next = (H->the_lists)[idx];
    (H->the_lists)[idx] = new_cell;
}
void delete(int key, HASH_TABLE H){
    NODE * tmp;
    NODE * ptr;
    int idx = hash( key, H );
    LIST list_header = (H->the_lists)[idx];
    if(list_header != NULL && list_header->key == key){
        /* the key is at the head: unlink it from the table slot */
        (H->the_lists)[idx] = list_header->next;
        free(list_header);
        return ;
    }
    ptr = prefind(key,H);
    if(ptr == NULL)
        errx(1,"can't find the key");
    /* ptr->next holds the key: unlink and free it */
    tmp=ptr->next;
    ptr->next = tmp->next;
    free(tmp);
}

　　The results are as follows:

[root@localhost ~]# ./4_4           
input the table_size :
10
insert 1..n to your hash_table
done
hash_table:
[0] 10 
[1] 1 
[2] 2 
[3] 3 
[4] 4 
[5] 5 
[6] 6 
[7] 7 
[8] 8 
[9] 9 

please input the key you want to find:
2
find the key 2
please input the key you want to delete:
2
hash table after deleted a key:
[0] 10 
[1] 1 
[2] 
[3] 3 
[4] 4 
[5] 5 
[6] 6 
[7] 7 
[8] 8 
[9] 9 

[root@localhost ~]# ./4_4
input the table_size :
7
insert 1..n to your hash_table
done
hash_table:
[0] 7 
[1] 1 
[2] 2 
[3] 3 
[4] 4 
[5] 5 
[6] 6 

please input the key you want to find:
2
find the key 2
please input the key you want to delete:
1
hash table after deleted a key:
[0] 7 
[1] 
[2] 2 
[3] 3 
[4] 4 
[5] 5 
[6] 6 

[root@localhost ~]#

4. Efficiency Analysis

　　DSAA lists the following points:

We define the load factor $\gamma$ of a hash table to be the ratio of the number of elements in the hash table to the table size. In the example above, $\gamma$ = 1.0. The average length of a list is $\gamma$ . The effort required to perform a search is the constant time required to evaluate the hash function plus the time to traverse the list.

　　This part is straightforward: the load factor is one way to evaluate an open hash table. A load factor of 1 is the ideal case in the sense that, when the keys are spread perfectly evenly, every cell of the table is used and every list has length 1.
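To make the definition concrete, here is a minimal sketch that computes the load factor of a table laid out like the one in section 3. The struct definitions repeat that listing so the block stands alone; the helper names `count_elements` and `load_factor` are my own additions for illustration and are not part of the original program or of DSAA.

```c
#include <stddef.h>

/* Same node and table layout as the listing in section 3. */
typedef struct node {
    int key;
    struct node *next;
} NODE;

struct hash_tbl {
    unsigned int table_size;
    NODE **the_lists;
};

/* Number of elements stored in the whole table. */
size_t count_elements(const struct hash_tbl *H) {
    size_t total = 0;
    unsigned int i;
    const NODE *p;
    for (i = 0; i < H->table_size; i++)
        for (p = H->the_lists[i]; p != NULL; p = p->next)
            total++;
    return total;
}

/* Load factor: element count divided by table size.
   A table of size 0 is reported as 0.0 to avoid dividing by zero. */
double load_factor(const struct hash_tbl *H) {
    if (H->table_size == 0)
        return 0.0;
    return (double)count_elements(H) / (double)H->table_size;
}
```

For the first run shown earlier (keys 1..10 in a table of size 10) this would report 1.0, matching the value quoted from DSAA.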

In an unsuccessful search, the number of links to traverse is $\gamma$ (excluding the final NULL link) on average. A successful search requires that about 1 + ( $\gamma$ /2) links be traversed, since there is a guarantee that one link must be traversed (since the search is successful), and we also expect to go halfway down a list to find our match.

　　Keeping the lists short therefore means keeping $\gamma$ small.

This analysis shows that the table size is not really important, but the load factor is. The general rule for open hashing is to make the table size about as large as the number of elements expected (in other words, let $\gamma$ = 1). It is also a good idea, as mentioned before, to keep the table size prime to ensure a good distribution.

　　So there are two rules for open hashing: first, the table size should be prime; second, the table should be large enough that the ratio of the number of elements to the table size is as close to 1 as possible.
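One way to follow both rules at once is to round the expected number of elements up to the next prime and use that as the table size. The sketch below does this with plain trial division; `is_prime` and `next_prime` are hypothetical helpers written for this post, not functions from DSAA or from the program above.

```c
#include <limits.h>

/* Return 1 if n is prime, 0 otherwise (trial division by odd numbers). */
int is_prime(unsigned int n) {
    unsigned int i;
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    /* i <= n / i is the overflow-safe form of i * i <= n */
    for (i = 3; i <= n / i; i += 2)
        if (n % i == 0)
            return 0;
    return 1;
}

/* Smallest prime >= n, or 0 if no such prime fits in unsigned int. */
unsigned int next_prime(unsigned int n) {
    if (n <= 2)
        return 2;
    while (!is_prime(n)) {
        if (n == UINT_MAX)
            return 0;
        n++;
    }
    return n;
}
```

A caller could then size the table with something like `initialize_table(next_prime(expected_elements))`, where `expected_elements` stands for however many keys the program expects to store. Trial division is slow for very large sizes, but a table is sized rarely, so this is probably acceptable for a sketch.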