Measure Similarity Between Two Sentences Using Cosine Similarity in Python

An Introduction to Sentence Similarity

Sentence similarity is a key concept in natural language processing (NLP) that measures how alike two sentences are in terms of their meaning or content. This measure is crucial for various applications, including:

  1. Information retrieval
  2. Text summarization
  3. Question answering systems
  4. Plagiarism detection
  5. Recommendation systems

One popular technique for computing sentence similarity is cosine similarity, which we'll focus on in this article.

Understanding Cosine Similarity

Cosine similarity is a metric used to determine how similar two vectors are, irrespective of their magnitude. It calculates the cosine of the angle between the two vectors. In the context of text analysis, these vectors represent sentences in a multi-dimensional space.

The formula for cosine similarity is:

similarity = cos(θ) = (A • B) / (||A|| × ||B||)

Where:

  • A • B is the dot product of vectors A and B
  • ||A|| and ||B|| are the magnitudes (Euclidean norms) of vectors A and B

The resulting value ranges from -1 to 1, where:

  1. 1 indicates perfect similarity
  2. 0 indicates no similarity
  3. -1 indicates perfect dissimilarity (though this is rare in text analysis, since word-count and TF-IDF vectors are non-negative)
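
For example, if A = (1, 1, 0) and B = (1, 0, 1), then A • B = 1 and ||A|| = ||B|| = √2, so the cosine similarity is 1 / (√2 × √2) = 0.5.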

Text Preprocessing

Before calculating similarity, it is often useful to preprocess the text. Common preprocessing steps include:

  1. Lowercasing
  2. Removing punctuation
  3. Removing stop words
  4. Stemming or lemmatization

Let's implement these preprocessing steps:

Example:
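
A minimal sketch of these steps, assuming NLTK is installed and its punkt and stopwords resources are available (the original may have used different libraries):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)      # tokenizer models
nltk.download('stopwords', quiet=True)  # English stop word list

def preprocess(sentence):
    # Lowercase and strip punctuation
    sentence = re.sub(r'[^\w\s]', '', sentence.lower())
    # Tokenize, drop stop words, and stem the remaining words
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(sentence)
    return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)

sentence1 = "The quick brown fox jumps over the lazy dog"
sentence2 = "The lazy dog is jumped over by the quick brown fox"

print("Original sentence 1:", sentence1)
print("Preprocessed sentence 1:", preprocess(sentence1))
print("Original sentence 2:", sentence2)
print("Preprocessed sentence 2:", preprocess(sentence2))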

Output:

 
Original sentence 1: The quick brown fox jumps over the lazy dog
Preprocessed sentence 1: quick brown fox jump lazi dog
Original sentence 2: The lazy dog is jumped over by the quick brown fox
Preprocessed sentence 2: lazi dog jump quick brown fox   

Vectorization Techniques

To apply cosine similarity, we need to convert our text into numerical vectors. There are several techniques for doing this:

  1. Count Vectorization
  2. TF-IDF (Term Frequency-Inverse Document Frequency)
  3. Word Embeddings (e.g., Word2Vec, GloVe)

Let's implement both Count Vectorization and TF-IDF:

Example:
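
A possible sketch using scikit-learn's CountVectorizer and TfidfVectorizer on the preprocessed sentences from the previous step (exact TF-IDF values depend on the vectorizer's normalization and IDF settings):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Preprocessed sentences from the previous step
preprocessed1 = "quick brown fox jump lazi dog"
preprocessed2 = "lazi dog jump quick brown fox"

# Count Vectorization: raw term counts
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform([preprocessed1, preprocessed2])
print("Count Vectorization:")
print(count_matrix.toarray())

# TF-IDF Vectorization: term counts weighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([preprocessed1, preprocessed2])
print("TF-IDF Vectorization:")
print(tfidf_matrix.toarray())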

Output:

 
Count Vectorization:
[[1 1 1 1 1 1]
 [1 1 1 1 1 1]]
TF-IDF Vectorization:
[[0.44943642 0.44943642 0.44943642 0.44943642 0.44943642 0.44943642]
 [0.44943642 0.44943642 0.44943642 0.44943642 0.44943642 0.44943642]]   

Implementing Cosine Similarity in Python

Now that we have our vectorized sentences, let's implement the cosine similarity function:

Example:
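
A minimal sketch that computes the similarity directly from the formula, reusing the preprocessed sentences from the earlier steps (NumPy and scikit-learn are assumed to be installed):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["quick brown fox jump lazi dog", "lazi dog jump quick brown fox"]

count_vectors = CountVectorizer().fit_transform(sentences).toarray()
tfidf_vectors = TfidfVectorizer().fit_transform(sentences).toarray()

def cosine_sim(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Cosine Similarity (Count Vectorization):", cosine_sim(count_vectors[0], count_vectors[1]))
print("Cosine Similarity (TF-IDF Vectorization):", cosine_sim(tfidf_vectors[0], tfidf_vectors[1]))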

Output:

 
Cosine Similarity (Count Vectorization): 1.0
Cosine Similarity (TF-IDF Vectorization): 1.0   

In this case, both techniques yield a perfect similarity score of 1.0 because, after preprocessing, our sentences contain exactly the same words, and bag-of-words representations ignore word order.

Advanced Techniques and Considerations

While the basic implementation works well for simple cases, there are several advanced techniques and considerations to keep in mind:

a) N-grams: Instead of using only individual words, we can use combinations of adjacent words (n-grams) to capture more context.

Example:
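
A possible sketch using scikit-learn's TfidfVectorizer with unigrams and bigrams (the ngram_range of (1, 2) is an assumption, and the exact vector values and similarity score depend on it):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["quick brown fox jump lazi dog", "lazi dog jump quick brown fox"]

# Include both single words and pairs of adjacent words as features
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
ngram_matrix = ngram_vectorizer.fit_transform(sentences)

print("TF-IDF Vectorization with n-grams:")
print(ngram_matrix.toarray())

similarity = cosine_similarity(ngram_matrix[0:1], ngram_matrix[1:2])[0][0]
print("Cosine Similarity (TF-IDF with n-grams):", similarity)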

Output:

 
TF-IDF Vectorization with n-grams:
[[0.27735010 0.27735010 0.27735010 0.27735010 0.27735010 0.27735010
  0.27735010 0.27735010 0.27735010 0.27735010 0.27735010]
 [0.27735010 0.27735010 0.27735010 0.27735010 0.27735010 0.27735010
  0.27735010 0.27735010 0.27735010 0.27735010 0.27735010]]
Cosine Similarity (TF-IDF with n-grams): 1.0   

b) Word Embeddings: Instead of using bag-of-words approaches like Count Vectorization or TF-IDF, we can use pre-trained word embeddings such as Word2Vec or GloVe.

Example:
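
One way to sketch this with gensim's pre-trained vectors; the model name and the simple averaging of word vectors are assumptions, and loading the model requires a large download:

import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained Word2Vec embeddings (large download on first use)
model = api.load("word2vec-google-news-300")

def sentence_vector(sentence):
    # Average the embeddings of the words found in the model's vocabulary
    words = [w for w in sentence.lower().split() if w in model]
    return np.mean([model[w] for w in words], axis=0)

sentence1 = "The quick brown fox jumps over the lazy dog"
sentence2 = "The lazy dog is jumped over by the quick brown fox"

vec1 = sentence_vector(sentence1)
vec2 = sentence_vector(sentence2)

print("Cosine Similarity (Word2Vec):", cosine_similarity([vec1], [vec2])[0][0])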

Output:

 
Cosine Similarity (Word2Vec): 0.9789562821388245   

c) Weighted Word Embeddings: We can combine word embeddings with TF-IDF weights for a richer representation.

Example:
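
A rough sketch of one such scheme, reusing the pre-trained Word2Vec vectors and weighting each word vector by its IDF value from scikit-learn (both the model and the weighting strategy are assumptions):

import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["the quick brown fox jumps over the lazy dog",
             "the lazy dog is jumped over by the quick brown fox"]

# Pre-trained Word2Vec embeddings (large download on first use)
model = api.load("word2vec-google-news-300")

# Fit TF-IDF on the sentences to obtain an IDF weight for each word
tfidf = TfidfVectorizer()
tfidf.fit(sentences)
idf_weights = {word: tfidf.idf_[idx] for word, idx in tfidf.vocabulary_.items()}

def weighted_sentence_vector(sentence):
    # IDF-weighted average of the word vectors found in both vocabularies
    vectors, weights = [], []
    for word in sentence.split():
        if word in model and word in idf_weights:
            vectors.append(model[word])
            weights.append(idf_weights[word])
    return np.average(vectors, axis=0, weights=weights)

vec1 = weighted_sentence_vector(sentences[0])
vec2 = weighted_sentence_vector(sentences[1])

print("Cosine Similarity (Weighted Word2Vec):", cosine_similarity([vec1], [vec2])[0][0])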

Output:

 
Cosine Similarity (Weighted Word2Vec): 0.9789562821388245   

Applications of Cosine Similarity

  1. Information Retrieval and Search Engines
    1. Document Similarity: Cosine similarity helps measure how similar two documents are, which is crucial for search engines and information retrieval systems. It allows these systems to rank documents based on their relevance to a user's query.
    2. Query Matching: When a user enters a query, the search engine converts the query and the documents into vector form and uses cosine similarity to retrieve the documents most similar to the query.
  2. Recommender Systems
    1. Content-Based Recommendations: In e-commerce and streaming platforms, cosine similarity is used to recommend products or content based on descriptions and user profiles. For example, if a user likes a particular kind of book, the system can recommend similar books.
    2. User Similarity: By finding similarities between users' preferences or behaviors, collaborative filtering techniques can provide personalized recommendations.
  3. Text Mining and Natural Language Processing (NLP)
    1. Text Classification: Tasks such as spam detection in emails, sentiment analysis, and topic classification can be improved by using cosine similarity to compare text vectors with predefined category vectors.
    2. Plagiarism Detection: By comparing documents or passages, cosine similarity can help identify copied content by finding text segments that are highly similar.
  4. Social Network Analysis
    1. Community Detection: Cosine similarity helps identify groups or communities within social networks based on user behavior, interactions, or profiles.
    2. Friend Recommendation: Social networks like Facebook or LinkedIn use cosine similarity to suggest friends or connections by comparing user profiles and activities.
  5. Bioinformatics
    1. Gene Expression Analysis: In bioinformatics, cosine similarity is used to compare gene expression profiles and identify genes with similar expression patterns, which can be essential for understanding gene functions and disease mechanisms.
    2. Protein Sequence Analysis: Similarity measures help compare protein sequences to predict their structure and function, supporting drug discovery and development.
  6. Image Processing
    1. Image Retrieval: Content-based image retrieval systems use cosine similarity to find images similar to a query image based on feature vectors extracted from the images.
    2. Face Recognition: In security and authentication systems, cosine similarity helps compare facial feature vectors to identify or verify individuals.

Conclusion:

Cosine similarity is a powerful and flexible tool for measuring the similarity between vectors, particularly in text and data analysis. Its applications span information retrieval, search engines, recommender systems, text mining, social network analysis, bioinformatics, image processing, and market basket analysis. In information retrieval and search engines, cosine similarity improves the relevance of results by comparing document and query vectors. Recommender systems use it to deliver personalized suggestions by comparing user preferences and item descriptions. In text mining and NLP, it supports text classification, sentiment analysis, and plagiarism detection by measuring the similarity between texts. Social network analysis uses it to detect communities and suggest connections based on user similarity. In bioinformatics, it compares gene expression profiles and protein sequences to advance biological research. Image processing benefits from content-based image retrieval and face recognition built on comparisons of image feature vectors. Market basket analysis applies cosine similarity to identify frequently co-purchased products and segment customers for targeted marketing. Practical implementations, such as recommending products based on user queries or finding similar documents in a corpus, demonstrate the effectiveness and versatility of cosine similarity across many domains.