Recommendation System - Machine Learning

A recommendation system is a machine learning algorithm that combines information about users and items to predict what a user is likely to be interested in. These systems are used in a wide range of applications, such as e-commerce, social media, and entertainment, to provide personalized recommendations. The main types of recommendation system are Collaborative Filtering, Content-Based Filtering, and Hybrid approaches.
The choice of recommendation system depends on the specific application and the type of data available. Recommendation systems are widely used and can have a significant impact on businesses and users, so it is also important to consider the ethical issues and biases that may be introduced into the system. In this article, we use the Kaggle dataset Articles Sharing and Reading from CI&T Deskdrop and show how to build Collaborative Filtering, Content-Based Filtering, and Hybrid techniques in Python to give users personalized recommendations.

Details About Dataset

The Deskdrop dataset comes from CI&T's internal communication platform (DeskDrop) and is a real sample of 12 months of logs (March 2016 to February 2017). It contains about 73k recorded user interactions on more than 3k publicly shared articles, split across two CSV files: shared_articles.csv and users_interactions.csv.
Now we will implement it in code.

Importing Libraries

First, we import the Python libraries we will need throughout the project.

Loading the Dataset

Here, we load the dataset so we can perform the machine learning operations on it. As noted above, the dataset consists of two CSV files.

1. shared_articles.csv

It contains the articles shared on the platform. Each article has a timestamp for when it was shared, the original URL, the title, the plain-text content, the language it was shared in (Portuguese: pt or English: en), and information about the user who shared it (author).
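A minimal sketch of these two steps, assuming the CSV files from the Kaggle Deskdrop dataset sit in the working directory:

```python
import math

import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the articles shared on the platform.
articles_df = pd.read_csv('shared_articles.csv')
articles_df.head(5)
```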
For simplicity, we only keep the "CONTENT SHARED" event type here, which makes the (not strictly correct) assumption that all articles were available for the whole one-year period. A more accurate evaluation would only recommend articles that were actually available at a given point in time, but we accept this simplification for the exercise.

Output:

2. users_interactions.csv

It contains the users' interactions with the shared articles. It can be joined to shared_articles.csv through the contentId field. The eventType values are VIEW, LIKE, BOOKMARK, FOLLOW, and COMMENT CREATED. A short loading sketch is shown below.
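A small sketch of these two steps, assuming the eventType and contentId column names mentioned above:

```python
# Keep only the 'CONTENT SHARED' events, per the simplifying assumption above.
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']

# Load the user interaction log; contentId links it back to articles_df.
interactions_df = pd.read_csv('users_interactions.csv')
interactions_df['eventType'].value_counts()
```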
Output:

Data Manipulation

Since there are several kinds of interaction, we assign a weight (strength) to each interaction type. For instance, we assume that a comment on an article signals a stronger interest in the item than a like or a simple view.

Note: User cold-start is a classic problem for recommender systems: it is hard to give personalized recommendations to users with little or no interaction history, because there is not enough data to model their preferences. For this reason, we keep only users with at least five interactions in the dataset.

Output:

Output:

Deskdrop allows users to view an article several times and to engage with it in different ways (e.g., like or comment). We therefore aggregate all the interactions a user had with an item into a weighted sum of interaction-type strengths and apply a log transformation to smooth the distribution; this value models the user's interest in a particular article.

Output:

Evaluation

Evaluation is crucial for machine learning projects because it allows an objective comparison of different methods and model hyperparameter choices. A key part of evaluation is making sure the trained model generalizes to data it was not trained on, using cross-validation techniques. Here we use a simple cross-validation approach known as a holdout, in which a random sample of the data (in this case, 20%) is set aside during training and used only for evaluation. All the evaluation metrics in this article are computed on this test set. A more robust evaluation strategy would split the train and test sets by a reference date, with the train set containing all interactions before that date and the test set containing the interactions after it. For simplicity we use the random split in this notebook, but you may want to try the time-based split to better simulate how the recsys would behave in production when predicting "future" user interactions.

Output:

Several metrics are commonly used to evaluate recommender systems. We use Top-N accuracy metrics, which measure how well the top recommendations made to a user match the test-set items that user actually interacted with. The evaluation works as follows: for each user, and for each item that user interacted with in the test set, we sample 100 other items the user never interacted with, ask the model to rank the interacted item among those 101 candidates, and aggregate the Top-N accuracy over all such ranked lists. A minimal sketch of the preprocessing and holdout steps described above follows.
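The interaction-strength weights and the stratified split below are assumptions, not values prescribed by the article (apart from the 5-interaction threshold and the 20% test size); column names follow the Deskdrop dataset.

```python
# Assign a strength to each interaction type (illustrative weights).
event_type_strength = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 2.5,
    'FOLLOW': 3.0,
    'COMMENT CREATED': 4.0,
}
interactions_df['eventStrength'] = interactions_df['eventType'].map(event_type_strength)

# Keep only users with at least 5 interactions (cold-start mitigation).
interactions_per_user = interactions_df.groupby('personId')['contentId'].nunique()
users_with_enough = interactions_per_user[interactions_per_user >= 5].index
interactions_df = interactions_df[interactions_df['personId'].isin(users_with_enough)]

# Aggregate all of a user's interactions with an item and smooth with a log transform.
interactions_full_df = (interactions_df
                        .groupby(['personId', 'contentId'])['eventStrength'].sum()
                        .apply(lambda x: math.log(1 + x, 2))
                        .reset_index())

# Simple holdout: 20% of the interactions are set aside for evaluation,
# stratified by user so that every user appears in both sets.
interactions_train_df, interactions_test_df = train_test_split(
    interactions_full_df,
    test_size=0.20,
    stratify=interactions_full_df['personId'],
    random_state=42)
```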
We chose Recall@N as the Top-N accuracy metric; it measures whether the interacted item is among the top N items (a hit) in the ranked list of 101 candidates for a user. Other popular ranking metrics are NDCG@N and MAP@N, whose scores also take into account the position of the relevant item in the ranked list (maximum value if the relevant item is in the first position). Now we create a class named "ModelEvaluator", which will be used to evaluate each recommendation model we build.

Popularity Model

The Popularity model is a common baseline and is usually hard to beat. It is not personalized: it simply recommends to a user the most popular items the user has not yet consumed. Because popularity captures the "wisdom of the crowd", it usually produces reasonable recommendations that are broadly interesting to most people. The real goal of a recommender system, which goes well beyond this simple approach, is to surface long-tail items to users with very specific interests.

Output:

Here we evaluate the Popularity model with the methodology described above. It achieved a Recall@5 of 0.2417, meaning that the Popularity model placed about 24% of the test set's interacted items among the top 5 items (from lists with 100 random items). As expected, Recall@10 was higher, at about 37%. It may be surprising that popularity models can perform this well.

Output:

Content-Based Filtering model

Content-based filtering techniques use the descriptions or attributes of the items a user has interacted with to recommend similar items. Because it relies only on the user's own past choices, this approach helps avoid the cold-start problem. For text-based items such as books, articles, and news stories, it is straightforward to build item profiles and user profiles from the raw text. Here we use TF-IDF, a very popular information retrieval (search engine) technique. It converts unstructured text into a vector, where each position represents a word and the value indicates how relevant that word is for the article. Since all items are represented in the same vector space model, articles can be compared directly.

Output:

To model the user profile, we average the profiles of all the items the user has interacted with, weighting the average by the strength of the interactions. The final user profile therefore gives more weight to the articles the user interacted with most strongly (e.g., liked or commented on).

Output:

Let's look at a profile first. It is a unit vector of length 5000, where the value at each position indicates how relevant a token (a unigram or bigram) is for that user. Looking at the profile below, the most relevant tokens really do reflect professional interests in machine learning, deep learning, artificial intelligence, and the Google Cloud Platform, so we can expect solid recommendations here. A minimal sketch of the Popularity and Content-Based models is shown after the results below.

Output:

With the personalized recommendations of the Content-Based Filtering model, we get a Recall@5 of 0.162, which means that about 16% of the test set's interacted items were placed by this model among the top 5 items (from lists with 100 random items). Recall@10 was 0.261 (about 26%).
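The following is a minimal sketch, not the article's exact code, of the two models described above: the popularity baseline and the TF-IDF based Content-Based model. Column names (personId, contentId, title, text, eventStrength) follow the Deskdrop dataset and the earlier sketches; the recommendation helpers are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
import numpy as np

# Popularity baseline: rank items by their total interaction strength.
item_popularity_df = (interactions_full_df.groupby('contentId')['eventStrength']
                      .sum().sort_values(ascending=False).reset_index())

# Content-Based model: TF-IDF item profiles built from title + full text
# (unigrams and bigrams, 5000 features, matching the profile length above).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000, stop_words='english')
item_ids = articles_df['contentId'].tolist()
tfidf_matrix = vectorizer.fit_transform(articles_df['title'].fillna('') + ' ' +
                                        articles_df['text'].fillna(''))

def build_user_profile(person_id):
    """Average the profiles of items the user interacted with, weighted by strength."""
    user_interactions = interactions_train_df[
        (interactions_train_df['personId'] == person_id) &
        (interactions_train_df['contentId'].isin(item_ids))]
    idxs = [item_ids.index(cid) for cid in user_interactions['contentId']]
    item_profiles = tfidf_matrix[idxs]
    strengths = user_interactions['eventStrength'].values.reshape(-1, 1)
    weighted_avg = item_profiles.multiply(strengths).sum(axis=0) / strengths.sum()
    return normalize(np.asarray(weighted_avg))  # unit-length user profile

def recommend_content_based(person_id, topn=10):
    """Rank all articles by cosine similarity to the user's profile."""
    user_profile = build_user_profile(person_id)
    sims = cosine_similarity(user_profile, tfidf_matrix).flatten()
    best = sims.argsort()[::-1][:topn]
    return [(item_ids[i], sims[i]) for i in best]
```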
The fact that the Content-Based model performed worse than the Popularity model suggests that users are not necessarily committed to reading content that is very similar to what they have already read.

Output:

Collaborative Filtering model

Collaborative Filtering has two main implementation approaches: memory-based (neighbourhood) methods and model-based (matrix factorization) methods.
User Neighbourhood-based CF is a common example of the memory-based approach: the top N most similar users for a given user are selected (typically using Pearson correlation) and used to suggest items those similar users liked but that the current user has not yet interacted with. Although this approach is relatively simple to implement, it often does not scale well to many users. Crab offers a good Python implementation of this approach. An illustrative sketch is shown below.
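The article itself moves on to matrix factorization, but as an illustration of the neighbourhood idea just described, a rough sketch could look like this (it reuses interactions_train_df from the earlier sketches and, in practice, does not scale well):

```python
# Dense user-item matrix; missing interactions are treated as 0 for simplicity.
users_items_df = interactions_train_df.pivot_table(index='personId',
                                                   columns='contentId',
                                                   values='eventStrength').fillna(0)

def neighbourhood_recommendations(person_id, n_neighbours=10, topn=10):
    # Pearson correlation between the target user and every other user.
    similarities = users_items_df.T.corr(method='pearson')[person_id].drop(person_id)
    neighbours = similarities.nlargest(n_neighbours)
    # Score items by the neighbours' preferences, weighted by their similarity.
    scores = users_items_df.loc[neighbours.index].T.dot(neighbours)
    # Keep only items the target user has not yet interacted with.
    scores = scores[users_items_df.loc[person_id] == 0]
    return scores.nlargest(topn)
```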
Matrix Factorization

Latent factor models compress the user-item matrix into a low-dimensional representation. The advantage of this approach is that, instead of a high-dimensional matrix with a large number of missing values, we work with a much smaller matrix in a lower-dimensional space. The reduced representation could be used with either the user-based or item-based neighbourhood algorithms described in the previous section. This paradigm has several benefits: it handles the sparsity of the original matrix better than memory-based methods, and comparing similarities in the reduced matrix is much cheaper, especially for large, sparse datasets.

Here we use Singular Value Decomposition (SVD), a well-known latent factor model. You could also use more CF-specific matrix factorization frameworks such as surprise, mrec, or python-recsys. We choose a SciPy implementation of SVD because Kaggle kernels support it. An important decision is how many factors to use in the factorization of the user-item matrix. The more factors, the more exact the reconstruction of the original matrix; as a result, if the model is allowed to retain too many specifics of the original matrix, it may generalize poorly to data it was not trained on. Reducing the number of factors increases the model's ability to generalize.

Output:

After the factorization, we try to reconstruct the original matrix by multiplying its factors. The resulting matrix is no longer sparse, and we use the predictions for items the user has not yet interacted with to produce recommendations.

Output:

Output:

Evaluating the Collaborative Filtering model (SVD matrix factorization), we obtained a Recall@5 of about 33% and a Recall@10 of about 46%, which is much higher than the Popularity model and the Content-Based model.

Output:

Hybrid Recommender

A hybrid recommender combines Collaborative Filtering and Content-Based Filtering. In practice, many studies have shown that hybrid methods outperform the individual approaches, and they are widely adopted by both researchers and practitioners. We build a simple hybridization technique that ranks items by a weighted combination of the normalized CF and Content-Based scores. Because the CF model is significantly more accurate than the CB model, the weights for the CF and CB models are 100.0 and 1.0, respectively.

Output:

Comparing the Methods

Now we compare the methods on Recall@5 and Recall@10.

Output:

A new champion has emerged! Our simple hybrid technique, which combines Collaborative Filtering and Content-Based Filtering, outperforms Collaborative Filtering alone: Recall@5 is now 34.2% and Recall@10 is 47.9%. For a better understanding, we can also plot a comparison chart of the models.

Output:

TESTING

Now we test the best model, the hybrid recommender, for a sample user. Some of the articles this user engaged with in Deskdrop (from the train set) are shown below. It is clear that machine learning, deep learning, artificial intelligence, and the Google Cloud Platform are among the user's key areas of interest.

Output:

Output:

Comparing the recommendations from the hybrid model with the user's actual interests, we find that they match quite well. Before concluding, a minimal sketch of the SVD factorization and hybrid scoring steps is shown below.
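In this sketch, the number of factors is an illustrative choice, the normalization is a simple global min-max scaling, and cb_score_for is a hypothetical helper standing in for the Content-Based score; only the 100.0/1.0 weights come from the text above.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Build the user-item matrix from the training interactions.
users_items_pivot_df = interactions_train_df.pivot_table(index='personId',
                                                         columns='contentId',
                                                         values='eventStrength').fillna(0)
users_items_matrix = csr_matrix(users_items_pivot_df.values)

# Truncated SVD; 15 factors is an illustrative choice, not a value from the article.
NUMBER_OF_FACTORS = 15
U, sigma, Vt = svds(users_items_matrix, k=NUMBER_OF_FACTORS)
sigma = np.diag(sigma)

# Reconstruct the (now dense) predicted-preference matrix and normalize it to [0, 1].
predicted = np.dot(np.dot(U, sigma), Vt)
predicted = (predicted - predicted.min()) / (predicted.max() - predicted.min())
cf_preds_df = pd.DataFrame(predicted,
                           index=users_items_pivot_df.index,
                           columns=users_items_pivot_df.columns)

# Hybrid score: weighted combination of the normalized CF and Content-Based scores.
CF_WEIGHT, CB_WEIGHT = 100.0, 1.0

def hybrid_score(person_id, content_id, cb_score_for):
    # cb_score_for is a hypothetical helper returning the normalized
    # Content-Based score of an item for a user.
    cf_score = cf_preds_df.loc[person_id, content_id]
    return CF_WEIGHT * cf_score + CB_WEIGHT * cb_score_for(person_id, content_id)
```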
Conclusion

In this article, we explored and compared the main recommender system techniques on the CI&T Deskdrop dataset. The results show that Collaborative Filtering (SVD matrix factorization) outperformed Content-Based Filtering for article recommendations, and that combining the two in a Hybrid model gave the highest accuracy of the three approaches.