Jaro and Jaro-Winkler Similarity in Python

Jaro Similarity

The Jaro Similarity is a measure of how closely two strings resemble each other. Its value lies between 0 and 1, where 1 means the strings are equal and 0 means they have no resemblance.

Examples:

Algorithm:

The following formula is used to compute the Jaro Similarity:

Jaro Similarity = ( 1 / 3 ) * { ( m / |s1| ) + ( m / |s2| ) + ( m - t ) / m }

where:

  • m is the number of matching characters
  • t is the number of transpositions (half the number of matching characters that are out of order)
  • |s1| and |s2| are the lengths of strings s1 and s2, respectively

Two characters are considered matching if they are equal and their positions differ by no more than floor( max(|s1|, |s2|) / 2 ) - 1.
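The matching window can be sketched as a small helper. The name match_window is hypothetical, and clamping the result at 0 for very short strings is an assumption, not something stated in the text.

```python
def match_window(s1: str, s2: str) -> int:
    """Largest position offset at which two equal characters still count as matching."""
    # floor(max(|s1|, |s2|) / 2) - 1, clamped at 0 (assumption) so it never goes negative
    return max(0, max(len(s1), len(s2)) // 2 - 1)


print(match_window("arnab", "raanb"))  # 5 // 2 - 1 = 1
```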

Matching characters that appear in a different order in the two strings are out of order. The number of transpositions, t, is half the number of such characters.

Calculation:

  • Let s1="arnab" and s2="raanb". The greatest distance at which any character may be matched is floor( 5 / 2 ) - 1 = 1.
  • All five characters of each string have a match, so m = 5. Four of the matched characters are out of order, which gives t = 4 / 2 = 2 transpositions.
  • Substituting these values into the formula gives the Jaro Similarity:

Jaro Similarity = ( 1 / 3 ) * { ( 5 / 5 ) + ( 5 / 5 ) + ( 5 - 2 ) / 5 } = 0.86667
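The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula; the function name jaro_from_counts is hypothetical, and returning 0 when there are no matches follows the usual convention rather than anything stated above.

```python
def jaro_from_counts(m: int, t: int, len1: int, len2: int) -> float:
    """Evaluate the Jaro formula from precomputed match and transposition counts."""
    if m == 0:
        # No matching characters: similarity is taken to be 0 (usual convention)
        return 0.0
    return (m / len1 + m / len2 + (m - t) / m) / 3.0


# "arnab" vs "raanb": m = 5, t = 2, both lengths 5
print(round(jaro_from_counts(5, 2, 5, 5), 5))  # 0.86667
```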

Implementation of Jaro Similarity in Python

Below is the implementation of the above approach.

Code:
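A minimal sketch of the approach described above. The function name jaro_similarity is an assumption, and so are the sample strings "CRATE" and "TRACE", which were chosen because they yield the 0.733333 shown under Output.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Return the Jaro Similarity of s1 and s2, a value between 0 and 1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Maximum distance at which two equal characters still count as a match
    max_dist = max(0, max(len1, len2) // 2 - 1)

    # Flags marking which positions in each string have been matched
    matched1 = [False] * len1
    matched2 = [False] * len2

    m = 0
    for i in range(len1):
        lo = max(0, i - max_dist)
        hi = min(len2, i + max_dist + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Walk the matched characters of both strings in order and
    # count the positions where they disagree
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order // 2  # transpositions

    return (m / len1 + m / len2 + (m - t) / m) / 3.0


print(round(jaro_similarity("CRATE", "TRACE"), 6))  # 0.733333
```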

Program Explanation:

This Python program computes the Jaro Similarity between two input strings, s1 and s2. The result is a number between 0 and 1, with 1 denoting a perfect match. The program first checks whether the input strings are equal and, if so, returns the maximum similarity of 1.0. Otherwise, it determines the lengths of the strings, computes the maximum distance at which characters may be matched, and initializes the counters. It then iterates through the characters of the first string and looks for a match in the second string within the permitted distance. Two hash arrays, one per string, record which characters have been matched. Finally, the matched characters are compared in order to count the transpositions, and the formula is applied.

Output:

0.733333
  • Time Complexity: O(N * M), where N and M are the lengths of s1 and s2, respectively.
  • Auxiliary Space: O(N + M)

Jaro-Winkler Similarity

The Jaro-Winkler Similarity is a string metric that measures how close two strings are to each other. It is closely related to the Jaro Similarity. The two differ only when the strings share a common prefix. In that case, Jaro-Winkler Similarity uses a prefix scaling factor P to raise the score in proportion to the length L of the shared prefix, up to a specified maximum length.

Examples:

Calculation:

The Jaro-Winkler Similarity is defined as follows:

Sw = Sj + P * L * ( 1 - Sj )

where:

  • Sj is the Jaro Similarity.
  • Sw is the Jaro-Winkler Similarity.
  • P is the scaling factor (0.1 by default).
  • L is the length of the matching prefix, up to a maximum of four characters.

Example:

  • Let s1="arnab" and s2="aranb". Their Jaro Similarity is 0.933333, computed in the same way as in the Jaro Similarity section (m = 5, t = 1).
  • The scaling factor is 0.1, and the matching prefix "ar" has length 2.
  • Substituting these values into the formula:

Jaro-Winkler Similarity= 0.9333333 + 0.1 * 2 * (1-0.9333333) = 0.946667
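The same substitution can be expressed as a short function. This is a sketch only; the name winkler_adjust and its parameter names are hypothetical.

```python
def winkler_adjust(sj: float, prefix_len: int, p: float = 0.1) -> float:
    """Apply the Winkler prefix adjustment to a Jaro Similarity sj."""
    # The matching prefix length is capped at four characters
    length = min(prefix_len, 4)
    return sj + p * length * (1.0 - sj)


# "arnab" vs "aranb": Sj = 0.9333333, matching prefix "ar" has length 2
print(round(winkler_adjust(0.9333333, 2), 6))  # 0.946667
```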

Implementation of Jaro-Winkler Similarity in Python

Below is the implementation of the above approach:

Code:
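A minimal sketch of the approach. The function names jaro_distance and jaro_Winkler follow the Program Explanation below, while the function bodies and the parameters p and max_prefix are assumptions made for illustration.

```python
def jaro_distance(s1: str, s2: str) -> float:
    """Return the Jaro Similarity of s1 and s2 (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Maximum distance at which two equal characters still count as a match
    max_dist = max(0, max(len1, len2) // 2 - 1)
    hash1, hash2 = [False] * len1, [False] * len2

    m = 0
    for i in range(len1):
        for j in range(max(0, i - max_dist), min(len2, i + max_dist + 1)):
            if not hash2[j] and s1[i] == s2[j]:
                hash1[i] = hash2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Compare matched characters in order; mismatches are out of order
    out_of_order, j = 0, 0
    for i in range(len1):
        if hash1[i]:
            while not hash2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order // 2
    return (m / len1 + m / len2 + (m - t) / m) / 3.0


def jaro_Winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Return the Jaro-Winkler Similarity: Sw = Sj + P * L * (1 - Sj)."""
    sj = jaro_distance(s1, s2)

    # Length of the common prefix, capped at max_prefix characters
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1

    return sj + p * prefix * (1.0 - sj)


# Driver code
print("Jaro-Winkler Similarity =", round(jaro_Winkler("TRATE", "TRACE"), 6))
# Jaro-Winkler Similarity = 0.906667
```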

Program Explanation:

This Python program implements both the Jaro Similarity and the Jaro-Winkler Similarity for comparing two strings. The jaro_distance function computes the Jaro Similarity from the lengths of the strings, the maximum permitted matching distance, and the number of matches and transpositions. The jaro_Winkler function then finds the length of the common prefix and raises the similarity score in proportion to that length. The driver code shows how to call jaro_Winkler on the two sample strings ("TRATE" and "TRACE") and prints the score. Spell checking and record linkage are two typical string-matching applications of these similarity measures.

Output:

Jaro-Winkler Similarity = 0.9066666666666667
  • Time Complexity: O(N * M), where N and M are the lengths of s1 and s2, respectively.
  • Auxiliary Space: O(N + M).