Hash Map in Python - Collision, Load Factor & Rehashing

Introduction:

In this tutorial, we learn about the Hash Map in Python with collision, Load Factor and Rehashing. The Hash Map is an indexed data structure. It stores data as a pair. In the data structure, the data is stored in special indexes. A hash function creates the index. This hash function, which can be built-in or generated, uses a specific key and creates an index to store the data. The key must be immutable and unique. With hashing, insertions, deletions, and searching in nearly constant O(1) time, maybe even O(n) time. For the worst case, the time complexity is O(1). The hash map in Python contains some functions for setting value, getting value, and deleting values from the index. The functions are given below -

set_value(key, value): Insert data according to the value of the unique parameter (key value) obtained from the hash key. If the key already exists, then this function will update the value.
get_value(key): By using this function, we can get the value of the given key from the hash map or index. But when it does not get the key or is unavailable in the hash map, this function returns an appropriate message.
del_value(key): By using this function, we can delete the value of the given key from the hash map or index. But when it is not found, the key that needs to be deleted or the key is not available in the hash map, then this function returns an appropriate message.

Hashing-Hash function:

This function converts the key into a small value to be used as an index in the hash map. The key value is an integer or string. The Hashing-Hash function has two types, which are given below -

a. Built-in Hash function:

The hash() function returns the object's hash value in the built-in hash function.

b. User-defined Hash function:

The developer creates the user-defined hash function.

Collision:

Collision is a part of the hash map in Python. The collision occurs when you want to insert data into the hash map, but the other key already occupies the index. The quality of a good hash map is a smaller number of collisions. By using two methods, we can easily avoid the collision, which are given below -

1. Separate Chaining:

Separate Chaining is also known as open hashing. If the two key hash values are the same, then add the data of the second key to the end of the first key in the chain pattern and continue it. Chaining can be done with the help of other data structures such as linked lists, lists or self-balancing BST (AVL tree, red-black tree). Most of the keys in a connection are allocated outside the hash table. The following database has access time values ranging from O(1) to O(n) for linked lists and List, and O(logn) for the trees.

Program Code:

Now we give the program code of the open hashing or separate chaining in Python, which is given below -

class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None
  
  
class HashTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.size = 0
        self.table = [None] * capacity
  
    def _hash(self, key):
        return hash(key) % self.capacity
  
    def insert(self, key, value):
        index = self._hash(key)
  
        if self.table[index] is None:
            self.table[index] = Node(key, value)
            self.size += 1
        else:
            current = self.table[index]
            while current:
                if current.key == key:
                    current.value = value
                    return
                current = current.next
            new_node = Node(key, value)
            new_node.next = self.table[index]
            self.table[index] = new_node
            self.size += 1
  
    def search(self, key):
        index = self._hash(key)
  
        current = self.table[index]
        while current:
            if current.key == key:
                return current.value
            current = current.next
  
        raise KeyError(key)
  
    def remove(self, key):
        index = self._hash(key)
  
        previous = None
        current = self.table[index]
  
        while current:
            if current.key == key:
                if previous:
                    previous.next = current.next
                else:
                    self.table[index] = current.next
                self.size -= 1
                return
            previous = current
            current = current.next
  
        raise KeyError(key)
  
    def __len__(self):
        return self.size
  
    def __contains__(self, key):
        try:
            self.search(key)
            return True
        except KeyError:
            return False
  

if __name__ == '__main__':
  
    h = HashTable(5)
  
    
    h.insert("Rose", 4)
    h.insert("Lotus", 5)
    h.insert("Dog", 3)
  
    
    print("Rose" in h)  
    print("Red" in h)  
  
    
    print(h.search("Lotus"))  
  
   
    h.insert("Rose", 2)
    print(h.search("Rose"))  
  
    h.remove("Dog")
    
    print(len(h))  

Output:

Now, we compile and run the above program. The result is given below -

True
False
5
2
2

2. Open Addressing:

The open addressing is also known as closed hashing. If another key allocates the index, it will check the next free space in the hash map to give the new key value. In many ways, the probing can be done. There are three types of probing: Linear probing, quadratic probing and double hashing.

a. Linear probing:

Circularly, the search starts from the starting index to the previous index to find the next free slot. The cache performance is very good in linear probing, which is advantageous. But there is also a disadvantage. The disadvantage of linear probing is that the primary and secondary clustering will occur.

b. Quadratic probing:

In the quadratic probing, the searching is done in the quadratic manner for the next free slot in the index. So, the nth iteration is needed for the (n²)th slot. There is no primary clustering, which is an advantage of quadratic probing. But there is also a disadvantage. The disadvantages of quadratic probing are that there will be secondary clustering and very poor cache performance.

c. Double hashing:

In double hashing, we use the second hash function to find the next free space of the index. This index is obtained from the first hash function. There is any primary and secondary clustering, an advantage of double hashing. But there is also a disadvantage. The disadvantage of double hashing is very poor cache performance.

What is Clustering?

In clustering, successive keys or indexes are quickly included when different data (key) is placed for the same hash index.

Load Factor:

The Load Factor is defined as (m/n). Here, m represents the number of elements that can be found in the hash map before increasing the size of the hash map, and n represents the size of the hash map. Load factor is a parameter that determines when to increase the hash map size to find and add as low as possible time complexity 0(1). The default load factor in Java is 0.75f of the hash map size. When the current load factor exceeds the predefined load, the hash map's size must double. Some steps of load factor are given below -

In the first step, the execution time depends on K and the hash function. For example, if we take the key string "abcd", its hash function will depend on the length of the string. However, for very large values of n, the hash calculation can be considered constant time, i.e. O(1), since the number of entries in the map and the length of the keys are almost negligible compared to n.
In the second step, the List of the K-V pairs found in the scale must be crossed. Therefore, the worst case is that all n inputs are at the same scale. Then, the time complexity is O(n). However, enough research has been done to ensure that the hash function evenly distributes the keys in the array so that this situation rarely happens.
The load factor can be defined as (m/n). Here, m represents the number of elements that can be found in the hash map before increasing the size of the hash map, and n represents the size of the hash map.
This load factor should be kept low so that the number of index entries is small and the complexity is almost constant, such as O(1) time complexity.

Rehashing:

The Load factor is a part of the Rehashing. It is a special way to avoid collisions and make the time complexity O(1). This is the process of increasing or doubling the hash map size when it reaches or exceeds a certain point called the load factor. Rehash is done to reduce the difficulty of accessing elements from the hash map when the size is full.

The time complexity of accessing the hash table elements varies from O(1) time complexity at best. Here, all key pairs are in an index, to O(n) in the worst case, when all Key pairs are stored in an index. Index in which items are entered in order. Therefore, it is repeated to reduce the access time and calculate all key points in the hash map. Use the reuse method if the number of points in the hash map exceeds the starting pointer or size of the hash map. The Rehashing is done based on the load factor of the hash map. It will decide when to increase the hash map size.

Why is Rehashing needed in the Hash map?

For collision prevention, rehashing is needed in the Hash map. It also maintains the data structure's efficiency. The load factor increases if an element is added to the hash map. The load factor is the ratio of the number of contents to the number of buckets. When it is greater than a certain threshold (usually set to 0.75), the hash map becomes unusable due to increased conflicts. To avoid this, the hash map can be modified. Also, the contents were recreated in new buckets, reducing the load and collisions. This is known as rehashing. It may take time and space, but it should make an efficient hash map. When rehashing, you need to check the index from the hash function to add the contents from the old hash map to the new hash map. If the hash function uses the size of the hash map to find the index, n the new size will always be considered when rehashing.

What is the procedure of Rehashing?

The following steps can do the Rehashing: -

Check the current load factor for each item inserted into the hash map.
Rehashing if it exceeds the predefined load factor.
To rehash, create a new hash map twice the size of the original hash map.
Then, traverse each element of the old hash map and add it to the new one.
After the rehash is complete, delete the old hash map for space saving.

For example, assuming a 5-dimensional hash map, the hash function is key% size (key%5). We need to add 5 elements like 11, 12, 13, 14, and 15 to the hash map, which makes the time complexity O(1). 0.75 is the initial load factor.

So, the _hash(key) equals to

_hash(11) = 1; here, in the index 1, the element 11 is stored. So, the current load factor of it is 1/5 = 0.2.
_hash(12) = 2, Here, in the index 2, the element 12 is stored. So, the current load factor of it is 2/5 = 0.4.
_hash(13) = 3, Here, in the index 3, the element 13 is stored. So, the current load factor of it is 3/5 = 0.6.
_hash(14) = 4, Here, in the index 4, the element 14 is stored. So, the current load factor of it is 4/5 = 0.8.
_hash(15) = 5, Here, in the index 5, the element 15 is stored. So, the current load factor of it is 15/5 = 1.

Program Code:

Now we give the program code of rehashing the hash map in Python, which is given below -

# Create the class Map here
class Map:

# Here, we create another class MapNode with in the class Map
    class MapNode:

# Here, we define the function _init_ with three parameter self, key and value
        def __init__(self, key, value):
            self.key = key
            self.value = value
            self.next = None
    buckets = list()
 
    # size is also initialized here
    size = 0
 
    # The number of buckets initialized here
    num_Buckets = 0
 
    # Here, we initialized the default load Factor
    DEFAULT_LOAD_FACTOR = 0.75
 
# Here, we define the function _init_
    def __init__(self):
        Map.num_Buckets = 7
 
        Map.buckets = [None]*Map.num_Buckets
 
        print("The Hash Map is created")
        print("The total number of pairs in the Map: " + str(Map.size))
        print("The Map size is: " + str(Map.num_Buckets))
        print("Default Load Factor is: " + str(Map.DEFAULT_LOAD_FACTOR) + "\n")
 
    def getBucketInd(self,key):
        hash_Code = hash(key)
        return (hash_Code % Map.num_Buckets)
 
    def insert(self,key,value):
 
        #  Here, we are getting the index at which we need to be inserted
        bucketInd = self.getBucketInd(key)
        head = Map.buckets[bucketInd]
 
        # Here, we check that all nodes are present already in the index or not
        while (head != None):
 
            # If the node is already present, the value is updated
            if (head.key==key):
                head.value = value
                return
            head = head.next
 
        newElementNode = Map.MapNode(key, value)
 
        # Initialize the head node at the index
        head = Map.buckets[bucketInd]
 
        newElementNode.next = head
        Map.buckets[bucketInd]= newElementNode
 
        print("Pair(\" {} \", \" {} \") insertion is successfully done.".format(key,value))
 
        # Incrementing size of the map
        Map.size+=1
 
        # Here, we calculated the Load factor 
        loadFactor = (1* Map.size) / Map.num_Buckets
 
       # Print the current load factor
        print("Current Load factor = " + str(loadFactor))
 
        # The rehashing is done if the load factor is > 0.75, 
        if (loadFactor > Map.DEFAULT_LOAD_FACTOR):
            print(str(loadFactor) + " is greater than " + str(Map.DEFAULT_LOAD_FACTOR))
            print("Therefore Rehashing will be done.")
 
            self.rehash()
 
            print("The new Map size is: " + str(Map.num_Buckets))
 
        print("The number of pairs in the Map: " + str(Map.size))
        print("The Map size is: " + str(Map.num_Buckets))
 
    def rehash(self):
 
        print("\n***Rehashing Started***\n")
 
        # The temp is made by the present bucket list
        temp = Map.buckets
 
        # bucket is calculated here
        buckets = (2 * Map.num_Buckets)
 
        for i in range(2 * Map.num_Buckets):
            # Initialised to null
            Map.buckets.append(None)
 
        # Now the map size is made zero
        Map.size = 0
        Map.num_Buckets *= 2
 
        for i in range(len(temp)):
 
            # The head of the chain 
            head = temp[i]
 
            while (head != None):
                key = head.key
                value = head.value
 
                self.insert(key, value)
                head = head.next
 
        print("\n***Rehashing Ended***")
 
    def printMap(self):
 
        # The temp is made by the present bucket 
        temp = Map.buckets
 
        print("Current HashMap:")
        for i in range(len(temp)):
            head = temp[i]
 
            while (head != None):
                print("key = \" {} \", val = {}" .format(head.key,head.value))
 
                head = head.next
        print()
 
# Define the main function
 
if __name__ == '__main__':
    # Creating the Map
    map = Map()
 
    # The elements are inserted here
    map.insert(1, "Java")
    map.printMap()
 
    map.insert(2, "Tpoint")
    map.printMap()
 
    map.insert(3, "Is")
    map.printMap()
 
    map.insert(4, "A")
    map.printMap()
 
    map.insert(5, "Online")
    map.printMap()
    
    map.insert(6, "Learning")
    map.printMap()
    
    map.insert(7, "Website")
    map.printMap()

Output:

Now, we compile and run the above program. The result is given below -

The Hash Map is created
The total number of pairs in the Map: 0
The Map size is: 7
Default Load Factor is: 0.75

Pair(" 1 ", " Java ") insertion is successfully done.
Current Load factor = 0.14285714285714285
The number of pairs in the Map: 1
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java

Pair(" 2 ", " Tpoint ") insertion is successfully done.
Current Load factor = 0.2857142857142857
The number of pairs in the Map: 2
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint

Pair(" 3 ", " Is ") insertion is successfully done.
Current Load factor = 0.42857142857142855
The number of pairs in the Map: 3
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is

Pair(" 4 ", " A ") insertion is successfully done.
Current Load factor = 0.5714285714285714
The number of pairs in the Map: 4
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A

Pair(" 5 ", " Online ") insertion is successfully done.
Current Load factor = 0.7142857142857143
The number of pairs in the Map: 5
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A
key = " 5 ", val = Online

Pair(" 6 ", " Learning ") insertion is successfully done.
Current Load factor = 0.8571428571428571
0.8571428571428571 is greater than 0.75
Therefore, Rehashing will be done.

***Rehashing Started***
***Rehashing Ended***
The new Map size is: 14
The number of pairs in the Map: 0
The Map size is: 14
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A
key = " 5 ", val = Online
key = " 6 ", val = Learning

Pair(" 7 ", " Website ") insertion is successfully done.
Current Load factor = 0.07142857142857142
The number of pairs in the Map: 1
The Map size is: 14
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A
key = " 5 ", val = Online
key = " 6 ", val = Learning
key = " 7 ", val = Website

Next TopicHow to Convert Images to NumPy Array?

← prev next →