How to Build a Hash Function in PythonThe hash table is a well-known database structure that has proven to be a fundamental element to programming, invented over half a century ago. Even today, it can solve a variety of real-world problems that require indexing databases tables and caching computed values or even implementing sets. It's often discussed during the job interview as well as Python employs hash tables in a variety of places to make searches for names nearly instantly. Although Python is a program with its hash tables, called the "dict", it is useful to learn the workings of hash tables in the background. A code assessment might ask us to build one. This tutorial will take us through the process of making a hash table starting from scratch as if there was none in Python. As we go, we will be faced with some challenges that will explain fundamental concepts and give us an understanding of why hash tables are quick. Get to Know the Hash Table Data StructureBefore we dive in, we have to be sure to become familiar with the terms used because they are sometimes a bit confusing. In the section, the word "hash table" or the "hash map" is frequently utilized interchangeably with the word "dictionary". There is a slight distinction between these two concepts since the one is more specific in comparison to the latter. Hash Table vs DictionaryIn the field of computer science, the Dictionary is a term used in computer science. The Dictionary is an abstract type of data comprised of key elements as well as values placed in pairs. Additionally, it defines the following actions for the elements it contains:
In a way, the data type is similar to a "bilingual dictionary" with the keywords are foreign terms, and values are definitions, and translations into different languages. However, there doesn't have to be an idea of equivalence between the keys and values. For instance, a telephone book is another instance of a dictionary combining names and numbers for phone calls. Dictionaries are a great source of information. They have several intriguing characteristics. One of these is the ability to consider the Dictionary as a mathematical function that assigns an argument or two to precisely one value. The direct results of that reality are:
Other concepts can be extended to the concept of the Dictionary. For instance, "multimap" that allows the user to have several values for each key. However, Bidirectional Map doesn't just convert keys to values but allows mapping in the opposite direction. This guide will look at the standard Dictionary, which assigns exactly one value for each key. This is a graphic representation of a possible dictionary that maps abstract concepts to the corresponding English words: It's a single-way mapping of keys and values, which are two distinct kinds of components. We immediately find fewer values than keys due to the fact that the bow is a homophone that has numerous meanings. However, the dictionary has four pairs of values. Depending on the method we choose to use, we can reuse values repeatedly to save memory or duplicate the values for simplicity. How do we create a dictionary using the programming language? The answer is that we do not, as the majority of modern languages have dictionary functions as a basic type of data or classes that are part of the library of standard classes. Python comes with the built-in dictionary type that already wraps an extremely optimised data structure that is written in C in order that we do not need to write our dictionary. Python's dictionary allows us to perform all the dictionary-related operations described at the top of this article. Code: Output: 'air bat car' Input: Output: {'abc': 'air bat car'} Using the square bracket syntax ( [ ]), we can create a new key-value pair to the dictionary. We can also change the value of the existing pair that is identified by a key. We can also determine the value that is associated with the key. However, we can have a different question. What is the built-in dictionary? How does it function? What is the process of mapping keys to a variety of different types of data, and how does it accomplish this so quickly? Implementing this type of abstract data is called the "Dictionary problem". The most popular solution is to use the hash table data structure, which we are about to explore. But keep in mind that this isn't the only option to create a dictionary generally. A different popular method is built on the foundation of the black-red tree. Hash Table: An Array with a Hash FunctionThe users might have thought about how the process of accessing the sequence components in Python is so fast regardless of the index they are requesting? If the users were working with an extremely long sequence of characters, such as this: Output: 'ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABC' Input: Output: 2600000000 It is estimated that there are 2.6 billion characters derived from repeated ASCII alphabets within the text variable above that we can calculate using the Python len() function. But, finding an initial, middle, last and any additional character in the string is just as easy: Input: Output: 'A' Input: Output: 'A' Input: Output: 'Z' This is also true for any type of sequence available in Python, including lists and tuples. What is the reason? The reason for such fast speed is the Python's sequences. Python is supported with the array that is a Random-Access information structure. It follows two basic principles:
If we know the address in memory in an array, also known as "offset", it is possible to locate the desired element in the array quickly by using a straightforward formula: Above is the formula to Calculate the Memory Address of a Sequence Element. The array starts at its offset. It is an address that is the primary element, starting at zero. After that, we advance by adding the desired number of bytes that we can get by multiplying the dimension of the element by the element's index. It will always take the same length of time to add up a few numbers. We know how to locate an item within an array is simple regardless of where the element is physically in the physical location. Could we use the same concept and apply it to the dictionary? Yes! Hash tables derive their name because of a method known as hashing that lets them convert any key to an integer number that could serve as an index within an ordinary array. Therefore, instead of searching for a particular value using a numeric index, we will search at it with an arbitrary key with no significant performance drop. That's neat! In reality, hashing doesn't be compatible with every key. However, the majority of built-in types within Python are able to be hashed. If we stick to some rules, we will be able to make our fishable types, too. Learn how to hash in the next section. Understand the Hash FunctionHash functions is a function that is a method of hashing that converts any data into a fixed sequence of bytes, known as "the hash value or hash code". It's a number that could serve as a fingerprint or a digest that is usually larger than a file, which allows us to check its authenticity. If the user have ever browsed huge files from the Internet, for example, the disk image of the Linux distribution or a Linux distribution, they might have observed the MD5 or SHA-2 checksum on the page for downloading. Apart from confirming the integrity of data and resolving a dictionary problem, Hash functions can also be used in other areas, such as security and cryptography. For instance, we generally save hashed passwords in databases to reduce the possibility of information leaks. Digital Signatures use hashing to make an encrypted message digest prior to encryption. Blockchain transactions are yet another great instance of making use of a hash function for security purposes. Although there are a variety of algorithms for hashing, however, they all share the same characteristics, which will be discussed here. Correctly implementing a reliable hash function is a challenging task that may require a deep mastery of complex maths that involves the prime number. Fortunately, we do not require implementing this kind of algorithm manually. Python has an integrated hashlib module that provides many popular cryptographic hash functions in addition to more secure algorithms for checking sums. It also comes with an overall hash() function, which is used to perform a fast element search in sets and dictionaries. We can learn the way it functions to understand the most significant features that hash function functions have. How to Examine Python's Built-in hash()Before the user try creating a hash function from scratch, they should study the Python hash() to distil its features. This will enable them to comprehend the challenges involved in creating a hash function of their own. Try to call the hash() function on a couple of data type literals that are included in Python, like strings and numbers, in order to test what happens: Input: Output: 322818021289917443 Input: Output: 326490430436040707 Input: Output: -3852290318913306444 Input: Output: 8945776761251421587 There are many things we can learn when we look at the results. The first is that Hash functions built into the program could give different results at our end for certain of the inputs mentioned above. While the numerical input appears to return the same hash, the string likely does not. What's the reason? It could appear that the hash() is a non-deterministic function, but it could not be further from reality! If we run the hash() with the same argument inside our current interpreter session, we will get the same results: Input: Output: -3852290318913306444 Input: Output: -3852290318913306444 Input: Output: -3852290318913306444 It's because hash values are unchangeable and never change during the life of an object. But, if we quit Python and restart it again, we will observe different values for hash value across Python executions. This can be tested by using the option -c option to execute the following single-liner program within our terminal: It's normal behaviour. Has been implemented into Python as a measure to counter the Denial-of-Service (DoS) attack that exploited a recognized security flaw of hash algorithms on web servers. The attackers could use an insecure hash algorithm to intentionally create hash collisions, causing server overload and rendering it inaccessible. Ransom was a common motive for this attack, as the majority of victims earned money from the continuous presence on the internet. Nowadays, Python enables hash randomization as a default feature for certain inputs, including strings, which makes the hash value less predictable. This allows hash() a bit more secure and attacks less difficult. It is possible to disable randomization; however, we can set a fixed seed value using the PYTHONHASHSEED environment variable, such as: Input: Output: 3248502820309220970 Input: Output: 3248502820309220970 Input: Output: 3248502820309220970 Overall the Python function hash() is indeed a determinate function and is among the fundamental characteristics of the function. Every Python execution produces the exact hash value of an input known to the program. This could help with splitting or sharing data among an array of dispersed Python interpreters. Be cautious and be aware of the risks associated with turning off randomization of hashes. In addition, hash() seems to be pretty universal since it accepts any input. That is, it can take values of different dimensions and types. The program accepts strings and floating-point numbers without a problem, regardless of their size or length. It is possible to calculate a hash value for the more obscure types as well: Input: Output: -9223363242224569907 Input: Output: 8652138113829 Input: Output: 145274959535 Input: Output: 145274845263 Input: Output: 145274845224 This is where we call the hash function of the Python none object and use the hash() function itself as well as a person class, which has a couple of instances. But some objects do not have the same hash value. The hash() function may generate an error in the event that we attempt to call it against one of these few objects. Input: Error: Traceback (most recent call last): File " The type of input can determine whether we are able to calculate the hash value. In Python, there are instances of built-in mutable types, such as assets, lists, and dicts - are not hashable. There's a hint as to why this is not the case, but we will find out more about it in the subsequent section. In the meantime, it is safe to assume that the majority of data types can work with an algorithm for hashing in general. How to Dive Deeper into Python's hash()Another intriguing feature is that the hash() always generates the same size that is fixed regardless of the size our input size was. In Python, this is an integer that has a moderate size. It may appear as a negative number which is why we must consider that if we intend to rely on the hash value in any way: Input: Output: 9052257963471308498 The normal consequence of producing a standard-sized output is that a lot of the original data is irreparably lost. It's okay since we are looking for the final hash value to be a unifying digest of large amounts of data in the end. But, since the function of hash project a potentially infinite number of values into an undefined space, it can result in the possibility of a collision of the hash when two inputs create the same value. Hash collisions are an important idea in hash tables that we will be able to revisit in the future in greater detail when creating our custom hash table. In the meantime, we may consider them to be highly undesirable. Avoid collisions with hash keys in any way since they could lead to poor lookups, and hackers could utilize them. So, a secure hash function should reduce the possibility of a collision to ensure security and effectiveness. We can see the distribution of the values created by Python's hash() function by plotting the textual histogram on our terminal. This implies that the function hash must assign equally distributed values across the space. The users can copy the following block of code and save it in a file named hash_distribution.py: It uses a counter instance to easily depict the histogram of the hash values for the given items. Hash numbers are distributed across the specified number of containers by wrapping them in the modulo operator. Then, we can select one hundred printed ASCII characters as an example and then determine their hash values and display their distribution. Output: 0 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (56) 1 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (44) Input: Output: 0 ■■■■■■■■■■■■■■■■■■■■ (20) 1 ■■■■■■■■■■■■■■ (14) 2 ■■■■■■■■■■■■■■■■■■■■■■■■ (24) 3 ■■■■■■■■■■■■■■■■■■■ (19) 4 ■■■■■■■■■■■■■■■■■■■■■■■ (23) As we can observe, using the inbuilt hash() function is quite good, but it is not ideal in distributing the hash values evenly. If there are just two containers, expect around a 50-50 distribution. In the event of adding more containers, it will cause them to be filled more or less equally. As a result, the regular distribution of hash value is usually pseudo-random and is particularly crucial in cryptographic functions. This blocks possible attackers from employing statistical analysis to identify the connection between the input and output outputs of the function. Think about changing a single letter in a string and see how it will affect the resultant values for hash in Python: Input: Output: 9052257963471308498 Input: Output: 8158704031393424069 It's a different hash value, even though only one letter is distinct. The hash value is often susceptible to an avalanche effect which means even the smallest change in input is amplified. However, this aspect in hash functions isn't necessary to implement a hash table data structure. In most instances, the Python's hash() exhibits an additional, non-essential feature of the cryptographic hash function resulting from the pigeonhole concept previously mentioned. It is an only-way operation because finding its reverse is almost impossible in most situations. However, there are some notable cases of exceptions: Input: Output: 47 The hash values of small integers are the same as the integers' values, which is a design aspect in the implementation CPython utilizes to ensure simplicity and effectiveness. Keep in mind that actual values of the hash aren't important in the event that we can calculate them in a precise manner. The calculation of a hash number using Python is extremely fast, even with large inputs. Modern computers use the hash() with a string of more than 100 million characters as an argument is returned instantly. If it wasn't that rapid, then the added cost of the computation of hash values will negate the benefits from hashing the process in the beginning. How to Identify Hash Function PropertiesBased on the information we have gathered from a Python hash() function, we can now decide the desirable properties of the function's hash in general. Here's a list of the available features, which compares the normal hash function to its cryptographic version: The objectives of both hash functions are similar, and they have a number of similarities in their characteristics. In contrast, the cryptographic hash function offers additional security assurances. Before creating your hash functions, you'll be looking at another function, built-in Python, that appears to be its most easy alternative. How to Compare an Object's Identity with Its HashPerhaps one of the simplest implementations of hash functions found in Python is a built-in id(), which gives us the identity of an object. In the normal Python interpreter, identity is the same as an object's memory address, which is expressed as an integer. Input: Output: 2324401272432 The id() function has all of the sought-after hash function's properties. In the end, it's extremely efficient, and it works regardless of input. It returns an integer of a fixed size predictably. However, it's impossible to access the original object from its memory addresses. Memory addresses remain constant throughout the life of an object and are also randomly generated in between interpreter runs. Then, why does Python insist on using a different method for hashing? The first thing to note is that the purpose of id() is different from hash(), so different Python distributions can implement identity in different methods. Furthermore, memory addresses are known and do not have a homogeneous distribution. Additionally, identical objects will generally generate the same hash code, even when they have different identities. This is unsafe and inadequate for hashing. How to Make Our Own Hash FunctionThe creation of a hash function that fulfils all requirements isn't easy. However, trying to create a hash function from the beginning is an excellent method of understanding the way it functions. When we are done with this tutorial, we will possess a basic hash function that's not perfect. However, we will have gained important knowledge. In this case, we will be able to restrict ourselves to one type of data at first and then use a hash function. For instance, we could look at strings and then sum our value of the ordinal of the characters within the strings: We repeat the text with a generator, then convert each character into the appropriate Unicode code point using its built-in ord() function to add the ordinal values. The result will be one number for every given text supplied as an argument. Input: Output: 511 Input: Output: 512 Input: Output: 512 We will immediately notice some issues with this method. Not only is it string-specific, however, but it also suffers from a weak distribution of hash codes that tend to create clusters with the same input value. Any slight modification to the input doesn't have any effect on the output. And even more importantly, the function is insensitive to the character order in the text. This is why the anagrams of the same word, for example, Loner as well as Loner, could result in collisions in the hash code. To resolve the issue, Try converting the input into a Unicode string using a call at str(). Our function should now be able to handle any argument. Input: Output: 512 Input: Output: 197 Input: Output: 491 We can invoke the hash_function() with an argument of any data such as a string, floating-point number, a Boolean value. This implementation is only equivalent to the representation of strings. Certain objects might not have a representation that matches the code below. Particularly, custom instances of classes without the specific techniques .__str__() and .__repr__() that are properly implemented are an excellent illustration. Additionally, it is impossible to distinguish between various kinds of data anymore: Input: Output: 197 Input: Output: 197 In actuality, we would like to think of "3.14" and the floating-point number 3.14 as distinct objects, each with a different hash code. One option to reduce the issue is to exchange str() for repr(), which wraps how strings are represented by adding extra an apostrophe ( '): Input: Output: "197" Input: Output: '197' It will enhance our hash function to a degree: Input: Output: 275 Input: Output: 197 Strings are now distinguished from numbers. To address the issue of anagrams such as Loren and Loner, We could alter our hash function to consider the value of the character and position within the text. Here, we compute the total of the products derived by multiplying the ordinal values of characters as well as their indexes. Note that we have to list the indices beginning with one instead of zero. If not, the first character would always be ignored since its value would be divided by zero. Our hash function is pretty universal and doesn't create more collisions than previously, but the output could be large enough due to the fact that more string is, more complex the hash algorithm. Additionally, it's very slow when we input larger amounts of data: Input: Output: 1677 Input: Output: 38150 Input: Output: 66139005681000117 It is possible to tackle unbounded growth by measuring our modulo ( %) of our hash-code against a specified maximum size such as 100: Input: Output: 77 Input: Output: 50 Input: Output: 17 If we aren't sure of the number of input values prior to making our decision, it's best to put off that option for later should we be able to. Making sure to select a smaller number of hash codes will increase the chances of collisions between hash codes. We can also set an upper limit on the hash codes we use by using a reasonable maximum value, like sys.maxsize, the largest number of integers natively supported in Python. If we ignore the function's slow performance for a second, We will discover another unusual problem with the hash function. This causes a suboptimal distribution of hash code through clustering and also by not making the most of the slots available. Input: Output: 0 ■■■■■■■■■■■■■■■ (15) 1 ■■■■■■■■■■■■■■ (14) 2 ■■■■■■■■■■■■■■■ (15) 3 ■■■■■■■■■■■■■ (13) 4 ■■■■■■■■■■■■■■ (14) 5 ■■■■■■■■■■■■■■ (14) 6 ■■■■■■■■■■■■■■■ (15) The distribution of the containers is uneven. In addition, it is true that there are 7 containers to choose from, and one of them is not included in the histogram. This is due to the fact that two apostrophes created to repr() cause virtually all keys, in this case, to give an even hash number. It is possible to avoid this issue by taking out the left apostrophe in the event that it is present. Input: Output: (396, 398, 400) Input: Output: (198, 199, 200) Input: Output: 0 ■■■■■■■■■■■■■■■ (15) 1 ■■■■■■■■■■■■■■■ (15) 2 ■■■■■■■■■■■■■ (13) 3 ■■■■■■■■■■■■■■■ (15) 4 ■■■■■■■■■■■■■■ (14) 5 ■■■■■■■■■■■■■ (13) 6 ■■■■■■■■■■■■■■■ (15) The method str.lstrip() will only impact the string if it begins with the prefix specified to strip. Naturally, we will have the option of developing our hash function to the next level. If we are interested in implementing the hash() for strings and bytes in Python, the current implementation utilizes SipHash. SipHash algorithm could be able to switch to an alternative variant of FNV if former is not available. To determine which algorithm the Python interpreter has chosen to use, look to this system module. Input: Output: 'siphash24' ConclusionAfter completing this tutorial, we have a good understanding of the function of hash and how it's supposed to perform, and the challenges we will face when you implement it. Next TopicYield keywords in python |