External Hash is all about building an efficient out-of-core hashing algorithm. Still use Divide and Conquer.
Divide
-
Divide part is a stream process, classifying all pages into partitions.
-
Partition is a set of pages such that the values on the pages have same hash value.
-
Suppose we have memory buffers. Take 1 buffer as the , send page values to the hash function so that they will be put in partitions based upon its hash value.
-
Among these partitions, if one partition’s size exceeds the capacity of B pages (equivalent to the size of memory), we should recursively divide this partition using a different hash function distinct from until it can fit entirely in memory.
Conquer
Then we just build a hashtable for each partition and write back to disk.
Cost
The cost of hashing varies depending on the specific circumstances, such as different memory sizes and partition sizes, which can result in varying costs.
Example
How many IOs does it take to hash a 500 page table with B = 10 buffer pages? Assume that we use perfect hash functions.
- The first partitioning pass divides the 500 pages into 9 partitions. This means that each partition will have pages of data. We had to read in the 500 original pages, but we have to write out a total of because each partition has 56 pages and there are 9 partitions. The total number of IOs for this pass is therefore .
- We cannot fit any partition into memory because they all have 56 pages, so we need to recursively partition all of them. On the next partitioning pass each partition will be divided into 9 new partitions (so total partitions) with pages each. This pass needed to read in the 504 pages from the previous pass and write out pages for a total of .
- Now each partition is small enough to fit into memory. The final conquer pass will read in each partition from the previous pass and write it back out to build the hash table. This means every page from the previous pass is read and written once for a total of .
- Adding up the IOs from each pass gives a total of .