Friday, January 1, 2010

nginx - Hashing scheme

The ngx_http_upstream module uses the ngx_hash module to look up, per HTTP header, the function pointer that handles it so the appropriate action can be taken. Initially, when I went through ngx_hash.c, I was confused by the differences between it and the way I have used hash lists before. I have used hash lists for dynamic insertion and deletion in addition to search. To support all of these operations, I am used to developing hash lists as a set of buckets, with each bucket pointing to a linked list or binary tree of some kind. Any operation starts with computing the 'hash' value (a typical hash algorithm in the networking world is Bob Jenkins' algorithm), getting hold of the head of the list/tree, and then performing the operation: addition, deletion, or search. For addition, a node is created and added to the list/tree. For deletion, the node is removed from the bucket's list/tree. For search, a pointer to the matching node is returned.
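For contrast, here is a minimal sketch in C of that conventional chained scheme. The names (htable_t, ht_insert, and so on) and the toy hash function are mine for illustration only; a real implementation would use a stronger hash such as Bob Jenkins' and would handle resizing.

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

/* One node per element; collisions are chained in a singly linked list. */
typedef struct node {
    struct node *next;
    char        *key;
    void        *value;
} node_t;

typedef struct {
    node_t *buckets[NBUCKETS];   /* zero-initialize: htable_t ht = {0}; */
} htable_t;

/* Toy hash; stands in for something like Bob Jenkins' hash. */
static unsigned hash(const char *key)
{
    unsigned h = 5381;
    while (*key) {
        h = h * 33 + (unsigned char) *key++;
    }
    return h % NBUCKETS;
}

/* Addition: create a node and prepend it to the bucket's list. */
int ht_insert(htable_t *ht, const char *key, void *value)
{
    unsigned  b = hash(key);
    node_t   *n = malloc(sizeof(node_t));

    if (n == NULL) {
        return -1;
    }
    n->key = strdup(key);
    n->value = value;
    n->next = ht->buckets[b];
    ht->buckets[b] = n;
    return 0;
}

/* Search: walk the bucket's list and return the matching node's value. */
void *ht_find(htable_t *ht, const char *key)
{
    for (node_t *n = ht->buckets[hash(key)]; n != NULL; n = n->next) {
        if (strcmp(n->key, key) == 0) {
            return n->value;
        }
    }
    return NULL;
}

/* Deletion: unlink the matching node from the bucket's list and free it. */
void ht_delete(htable_t *ht, const char *key)
{
    node_t **pp = &ht->buckets[hash(key)];

    while (*pp != NULL) {
        if (strcmp((*pp)->key, key) == 0) {
            node_t *victim = *pp;
            *pp = victim->next;
            free(victim->key);
            free(victim);
            return;
        }
        pp = &(*pp)->next;
    }
}

As described next, ngx_hash keeps the bucket idea but drops the dynamic insert/delete half of this entirely.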

The nginx hash, though the same in principle as a typical hash list, has significant differences. It appears that ngx_hash is not meant for applications that add and remove elements dynamically. It is specifically designed to hold a set of 'init time' elements arranged in a hash list; the main purpose seems to be speeding up searches over elements that are added exactly once. The ngx_hash utility module provides mainly two functions: ngx_hash_init() and ngx_hash_find(). I did not find any hash destroy function. It may be that the hash never needs to be destroyed at run time; on process exit, Linux reclaims all of the process's memory anyway.
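To make the 'init once, search many times' flow concrete, here is a rough usage sketch of ngx_hash_init() and ngx_hash_find(). It assumes it runs inside an nginx module's configuration phase (so cf->pool and cf->temp_pool are available); the header names, sizes, and placeholder values are mine, not real nginx data.

#include <ngx_config.h>
#include <ngx_core.h>

static ngx_hash_t  example_headers_hash;

static ngx_int_t
example_build_hash(ngx_conf_t *cf)
{
    ngx_hash_init_t  hinit;
    ngx_hash_key_t   names[2];

    /* The full element set must be known up front; nothing can be added
     * or removed after ngx_hash_init(). */
    ngx_str_set(&names[0].key, "content-length");
    names[0].key_hash = ngx_hash_key_lc(names[0].key.data, names[0].key.len);
    names[0].value = "handler A";                 /* placeholder value */

    ngx_str_set(&names[1].key, "content-type");
    names[1].key_hash = ngx_hash_key_lc(names[1].key.data, names[1].key.len);
    names[1].value = "handler B";                 /* placeholder value */

    hinit.hash = &example_headers_hash;           /* table to fill in */
    hinit.key = ngx_hash_key_lc;                  /* lowercase hash function */
    hinit.max_size = 512;                         /* upper bound on bucket count */
    hinit.bucket_size = ngx_align(64, ngx_cacheline_size);  /* bytes per bucket */
    hinit.name = "example_headers_hash";          /* used in error messages */
    hinit.pool = cf->pool;                        /* permanent allocations */
    hinit.temp_pool = cf->temp_pool;              /* scratch space during init */

    return ngx_hash_init(&hinit, names, 2);
}

/* Later, at request time, a lookup is a single call:
 *
 *     value = ngx_hash_find(&example_headers_hash,
 *                           ngx_hash_key_lc(name, len), name, len);
 */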

Digging a little deeper into the hash list implementation: it allocates a hash table of type 'ngx_hash_t', which holds the 'buckets' and the size of the hash table. 'buckets' is a double pointer. It first allocates memory for the bucket array, that is, memory to hold pointers, which is the first dimension of a two-dimensional array; the size of this array is equal to the hash size. The second dimension needs to point to all the collision elements. As I said before, all elements that go into the hash list are known while the hash list itself is being created; no dynamic addition or deletion is possible here. So it allocates memory to hold all the elements across all buckets in one shot, and then stores the right offsets at the appropriate places in the buckets array. Each element is of type 'ngx_hash_elt_t', which holds the 'name' (key), the length of the key, and a pointer to application-specific value information. ngx_hash_init() is the relatively complicated function: it determines the hash table size needed for the buckets array and then figures out how much memory is required to hold the hash list elements. ngx_hash_find() is relatively simple: it finds the right bucket from the hash value and then searches the collision elements, which are laid out as an array.
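The lookup side is easy to picture from this layout. Below is a condensed paraphrase (not a verbatim copy) of the lookup in src/core/ngx_hash.c: pick a bucket from the hash value, then walk the variable-length ngx_hash_elt_t records packed back to back in that bucket until an element with a NULL value marks the end.

#include <ngx_config.h>
#include <ngx_core.h>

void *
sketch_hash_find(ngx_hash_t *hash, ngx_uint_t key, u_char *name, size_t len)
{
    /* First dimension: the hash value selects a bucket. */
    ngx_hash_elt_t  *elt = hash->buckets[key % hash->size];

    if (elt == NULL) {
        return NULL;
    }

    /* Second dimension: collision elements are packed back to back
     * in the bucket; a NULL value terminates the bucket. */
    while (elt->value) {
        if (len == (size_t) elt->len
            && ngx_strncmp(name, elt->name, len) == 0)
        {
            return elt->value;      /* key and length match: found it */
        }

        /* Elements are variable-length, so step over the name bytes
         * and re-align to the next element. */
        elt = (ngx_hash_elt_t *) ngx_align_ptr(&elt->name[0] + elt->len,
                                               sizeof(void *));
    }

    return NULL;
}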

The ngx_hash module also has some logic to search wildcard strings, where the wildcard is present only at the head or tail of the string. I will write that up at a later time.

1 comment:

brablc said...

Very nice article. I would just add my rather obvious discovery:

- Bigger max_size uses more memory and bigger bucket_size uses more CPU cycles during lookup. If you have enough memory, increase max_size and try to keep bucket_size as low as possible.

- If you have max_size less than 10000 and a small bucket_size, you can expect a long loading time because of https://github.com/nginx/nginx/blob/c3aed0a23392a509f64b740064f5f6633e8c89d8/src/core/ngx_hash.c#L289 .

I was studying this for server_names_hash_max_size and server_names_hash_bucket_size.