7 Visualizations with Python to Handle Multivariate Categorical Data

Categorical data is data in which each observation belongs to one of a predetermined set of groups or categories. It appears everywhere: survey responses about marital status, occupation, or level of education are typical examples. Categorical data often has issues that need to be resolved before moving on to other tasks. This article covers several approaches to managing categorical data in a pandas DataFrame. Let's examine a few of these issues and how to handle them.

Categorical data, as previously stated, can only include a limited range of values.
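The idea of a limited set of allowed values can be sketched with the pandas category dtype. The sample values and the allowed_status list below are hypothetical and only illustrate the behaviour; they are not taken from the article's dataset.

```python
import pandas as pd

# Hypothetical allowed categories and sample observations.
allowed_status = ["married", "unmarried", "divorced", "separated"]
status = pd.Series(["married", "divorced", "widowed", "unmarried"])

# Casting to a categorical dtype with fixed categories turns any value
# outside the allowed set ("widowed" here) into NaN.
status_cat = status.astype(pd.CategoricalDtype(categories=allowed_status))

print(status_cat.cat.categories.tolist())
print(status_cat.isna().sum())  # number of values outside the allowed set
```

Declaring the categories up front makes out-of-set values easy to spot, at the cost of silently turning them into missing values.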

Importing Libraries

Python libraries make it easy to manage categorical data in a DataFrame and to carry out both common and intricate operations, often in a single line of code.

Pandas: This library loads data into a two-dimensional DataFrame and offers many methods that complete analytical tasks in a single call.
Numpy: Numpy arrays are fast and support efficient numerical calculations.
Matplotlib/Seaborn: These libraries are used for creating visualisations.
Sklearn: This package contains many modules with pre-implemented methods for tasks such as preparing data and developing and evaluating models.

 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.preprocessing import LabelEncoder   

The dataset is now ready to be loaded into the pandas dataframe.

 
main_data = pd.read_csv('demographics.csv')
main_data.head()   

Output:

 
   Fst_nme  Lst_nme  blood_type  marriage_status  income     device
0    Abdul    Colon          A+          married  145000  AndroidOS
1    Abdul   Pierce          B+          married   85000      MacOS
2  Desirae   Pierce          B+          MARRIED  130000        iOS
3  Shannon   Gibson          A+          married  175000      MacOS
4  Desirae   Little          B+        unmarried  130000      MacOS

Consider the blood type feature to understand membership constraints. We must check whether this feature contains bogus values. First, create a DataFrame that holds every valid blood type.

 
# create a new dataframe with possible values for blood type
blood_type_categories = pd.DataFrame({
    'blood_type': ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']
})
blood_type_categories

Explanation:

Using the Pandas library, the code creates a new DataFrame named blood_type_categories. Its blood_type column holds the valid values: 'A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', and 'O-'. This DataFrame serves as the reference for validating blood type data.

Output:

 
  blood_type
0         A+
1         A-
2         B+
3         B-
4        AB+
5        AB-
6         O+
7         O-

The set difference method can now be used to find the bogus values.

 
# finding bogus categories
unique_blood_types_main = set(main_data['blood_type'])
bogus_blood_types = unique_blood_types_main.difference(blood_type_categories['blood_type'])
bogus_blood_types   

Explanation:

The code snippet finds the invalid blood types in main_data. The distinct blood types are first extracted from main_data and compared with the valid blood types listed in blood_type_categories. Blood types that appear in main_data but not in blood_type_categories are treated as bogus and stored in a set called bogus_blood_types.

Output:

 
{'C+', 'D-'}

Once the bogus values have been identified, the corresponding rows can be removed from the dataset. In some cases, if additional information is available, the values could be corrected instead. Here they are removed, because nothing is known about the real blood types.

 
# extracting records with bogus blood types
bogus_records_index = main_data['blood_type'].isin(bogus_blood_types)
# drop the records with bogus blood types
without_bogus_records = main_data[~bogus_records_index]
without_bogus_records['blood_type'].unique()   

Explanation:

To eliminate records with invalid blood types, first use isin() to flag them, producing the boolean mask bogus_records_index. Then use the inverse selection (~bogus_records_index) to filter these records out of the main dataset. Lastly, use unique() to confirm that only valid blood types remain in the cleaned dataset.

Output:

 
array(['A+', 'B+', 'A-', 'AB-', 'AB+', 'B-', 'O-', 'O+'], dtype=object)
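The membership check above can be wrapped in a small reusable helper. The following is a minimal sketch on toy data; the function name split_by_membership and the sample values are hypothetical and are not part of the original dataset or of pandas.

```python
import pandas as pd

def split_by_membership(df, column, allowed):
    """Return (rows with allowed values, set of disallowed values)."""
    # Set difference: values present in the data but not in the allowed list.
    invalid_values = set(df[column].dropna()) - set(allowed)
    # Keep only the rows whose value is not one of the invalid ones.
    valid_rows = df[~df[column].isin(invalid_values)]
    return valid_rows, invalid_values

# Hypothetical toy data containing two bogus blood types.
toy = pd.DataFrame({"blood_type": ["A+", "C+", "O-", "D-", "B+"]})
allowed = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]

clean, bogus = split_by_membership(toy, "blood_type", allowed)
print(bogus)
print(clean["blood_type"].tolist())
```

The same helper could be pointed at any categorical column for which a list of allowed values is known.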

Handling Inconsistent Categories

Categorical data often contains inconsistencies. Consider the marital status feature and examine all of its distinct values.

 
# exploring inconsistencies in marriage status category
main_data['marriage_status'].unique()   

Explanation:

The unique() function returns all distinct values in the marriage_status column of the main_data DataFrame. This helps reveal discrepancies such as typos, differences in capitalization, or unexpected categories.

Output:

 
array(['married', 'MARRIED', ' married', 'unmarried ', 'divorced',
       'unmarried', 'UNMARRIED', 'separated'], dtype=object)

Differences in capitalization and leading or trailing spaces make it clear that certain categories are duplicates. Let's start with the uppercase letters.

 
# converting all values to lowercase
inconsistent_data = main_data.copy()
inconsistent_data['marriage_status'] = inconsistent_data['marriage_status']\
.str.lower()
inconsistent_data['marriage_status'].unique()   

Explanation:

The code first makes a copy of main_data called inconsistent_data. Using the str.lower() function, the marriage_status column of this copy is converted to lowercase, which makes the capitalization uniform. Lastly, the unique() function displays the column's distinct values to confirm the change.

Output:

 
array(['married', ' married', 'unmarried ', 'divorced', 'unmarried',
       'separated'], dtype=object)

We will address leading and trailing spaces next.

 
inconsistent_data['marriage_status'] = inconsistent_data['marriage_status']\
.str.strip()
inconsistent_data['marriage_status'].unique()   

Explanation:

The marriage_status column of the inconsistent_data DataFrame still contains values with stray whitespace. The code uses the str.strip() function to remove leading and trailing whitespace from each element in the column. Finally, it verifies the cleaned data by retrieving and displaying the column's unique values.

Output:

 
array(['married', 'unmarried', 'divorced', 'separated'], dtype=object)
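The two clean-up steps can also be chained in a single expression. This is a small sketch on a hypothetical Series that mirrors the kinds of inconsistencies shown above.

```python
import pandas as pd

# Hypothetical raw values with mixed case and stray whitespace.
raw = pd.Series(["married", "MARRIED", " married", "unmarried ", "UNMARRIED"])

# Strip whitespace and lowercase in one chained call.
normalised = raw.str.strip().str.lower()

print(normalised.unique())
```

The order of strip() and lower() does not matter here, since neither affects what the other changes.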

Remapping Categories

Numerical data, such as age or income, can be mapped to a set of categories. This helps in gaining more insight into the dataset. Let's investigate the income feature.

 
# range of income in the dataset
print(f"Max income - {max(main_data['income'])},\
 Min income - {min(main_data['income'])}")   

Explanation:

The code snippet computes and prints the range of income values in the dataset. It reads the 'income' column of the main_data DataFrame, finds the highest income with the max() function and the lowest with the min() function, and prints both values in a formatted string.

Output:

 
Max income - 190000, Min income - 40000

Let's now define the bin edges and labels for the income feature. The grouping itself is done with the pandas cut function.

 
# create the groups for income
# (income_bins rather than range, so the built-in range() is not shadowed)
income_bins = [40000, 75000, 100000, 125000, 150000, np.inf]
labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']
remapping_data = main_data.copy()
# include_lowest=True keeps the minimum income (40000) in the first bin
remapping_data['income_groups'] = pd.cut(remapping_data['income'],
                                         bins=income_bins, labels=labels,
                                         include_lowest=True)
remapping_data.head()

Explanation:

In the code above, we define income ranges and names for them so that the data can be grouped. The income_bins variable defines the bin edges, and labels provides the name of each category. pd.cut() then adds a new column called "income_groups" to the remapping_data DataFrame, assigning each record to one of the groups according to its income. Because the bins are closed on the right by default, include_lowest=True is passed so that the minimum income of 40000 falls into the first group instead of becoming NaN. Finally, remapping_data.head() shows the first rows of the revised DataFrame.

Output:

 
   Fst_nme  Lst_nme  blood_type  marriage_status  income     device  income_groups
0    Abdul    Colon          A+          married  145000  AndroidOS      125k-150k
1    Abdul   Pierce          B+          married   85000      MacOS       75k-100k
2  Desirae   Pierce          B+          MARRIED  130000        iOS      125k-150k
3  Shannon   Gibson          A+          married  175000      MacOS          150k+
4  Desirae   Little          B+        unmarried  130000      MacOS      125k-150k
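How pd.cut treats values that sit exactly on a bin edge is worth checking on a small example. The sketch below uses a hypothetical handful of incomes with the same bin edges and labels as above, and compares the default behaviour with include_lowest=True.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, including values exactly on bin edges.
incomes = pd.Series([40000, 75000, 85000, 150000, 190000])
bins = [40000, 75000, 100000, 125000, 150000, np.inf]
labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']

# Default: bins are (left, right], so the lowest edge itself is excluded.
default_cut = pd.cut(incomes, bins=bins, labels=labels)

# include_lowest=True closes the first bin on the left as well.
inclusive_cut = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

With the default settings, the smallest income is left without a group, while an income of exactly 75000 is placed in the lower of its two neighbouring groups.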

The distribution is now easy to visualise.

Explanation:

The code snippet is remapping_data['income_groups'].value_counts().plot.bar(). It draws a bar plot showing the distribution of the income groups in the remapping_data DataFrame. The value_counts() function counts the occurrences of each value in the income_groups column, and plot.bar() uses these counts to create a bar chart. This helps in understanding how frequent each income group is in the dataset.
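A self-contained sketch of this kind of plot is given here. The income_groups Series is a hypothetical stand-in for remapping_data['income_groups'], and the axis labels are illustrative choices rather than part of the original snippet.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for remapping_data['income_groups'].
income_groups = pd.Series(
    ['125k-150k', '75k-100k', '125k-150k', '150k+', '125k-150k'])

# Count how often each group occurs, then draw one bar per group.
counts = income_groups.value_counts()
ax = counts.plot.bar()
ax.set_xlabel('income group')
ax.set_ylabel('number of records')
plt.tight_layout()
# Call plt.show() to display the figure when running as a script.
```

Note that value_counts() sorts the groups by frequency, so the bars appear in order of count rather than in order of income.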

Output:

[Figure: bar chart of the number of records in each income group]

Cleaning Categorical Data in Python

To illustrate data cleaning, we create a new DataFrame whose only feature is a phone number.

 
import random

phone_numbers = []
for i in range(100):
    # phone numbers could be of length 9 or 10
    number = random.randint(100000000, 9999999999)
    # +91 code is inserted in some cases
    if i % 2 == 0:
        phone_numbers.append('+91 ' + str(number))
    else:
        phone_numbers.append(str(number))
phone_numbers_data = pd.DataFrame({
    'phone_numbers': phone_numbers
})
phone_numbers_data.head()

Explanation:

This code produces a list of 100 random phone numbers, each with nine or ten digits. The country code "+91" is prepended to every other phone number. The numbers are then stored in a pandas DataFrame with a single column named "phone_numbers", and head() shows its first few rows.

Output:

 
    phone_numbers
0   +91 707631849
1      6315742874
2  +91 1584173083
3      3389343099
4  +91 3970692379

The country code in front of the numbers may be added or removed depending on the use case. Here it is removed. Likewise, phone numbers with fewer than ten digits should be dropped.

 
# remove the '+91 ' prefix (literal match, not a regular expression)
phone_numbers_data['phone_numbers'] = phone_numbers_data['phone_numbers']\
.str.replace('+91 ', '', regex=False)
num_digits = phone_numbers_data['phone_numbers'].str.len()
# drop the rows whose numbers have fewer than 10 digits
invalid_numbers_index = phone_numbers_data[num_digits < 10].index
phone_numbers_data = phone_numbers_data.drop(invalid_numbers_index)
phone_numbers_data = phone_numbers_data.dropna()
phone_numbers_data.head()

Explanation:

The code first removes the "+91 " prefix from the phone numbers. It then finds the phone numbers with fewer than ten digits and drops them. Any remaining rows with missing values are then dropped as well. Finally, it shows the first few rows of the cleaned phone_numbers_data DataFrame.

Output:

 
  phone_numbers
0    5377617628
2    7152234401
3    2839400071
4    7651215019
5    4451571165

Lastly, we can verify that the data is clean.

 
# no number may still contain the '+91 ' prefix
assert not phone_numbers_data['phone_numbers'].str.contains('+91 ', regex=False).any()
# every remaining number must have exactly 10 digits
assert (phone_numbers_data['phone_numbers'].str.len() == 10).all()
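The cleaning and validation steps can be combined into one function so that they are easy to rerun on new data. This is only a sketch: the function name clean_phone_numbers, its parameters, and the three sample numbers are hypothetical.

```python
import pandas as pd

def clean_phone_numbers(numbers, prefix='+91 ', required_length=10):
    """Strip a literal prefix and keep only numbers of the required length."""
    # regex=False treats the prefix as plain text, so '+' needs no escaping.
    stripped = numbers.str.replace(prefix, '', regex=False)
    is_valid = stripped.str.len() == required_length
    return stripped[is_valid]

# Hypothetical sample: the first number has only nine digits.
sample = pd.Series(['+91 707631849', '6315742874', '+91 1584173083'])
cleaned = clean_phone_numbers(sample)

print(cleaned.tolist())
```

Filtering with a boolean mask keeps the original index labels, which makes it possible to trace each cleaned number back to its source row.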

Python Pandas: Visualising Categorical Data

Categorical data can be visualised with a variety of graphs to gain more insight. Let's see how many people have each blood type. To do this, we use the Seaborn library.
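One way to draw such a chart is Seaborn's countplot, sketched here on a small hypothetical DataFrame. The choice of countplot and the sample values are assumptions; the plotting code used for the article's figure is not shown in the source.

```python
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

# Hypothetical sample standing in for the cleaned dataset.
blood = pd.DataFrame({'blood_type': ['A+', 'B+', 'B+', 'A+', 'B+', 'O-']})

# countplot draws one bar per category, with height equal to its frequency.
fig, ax = plt.subplots()
sb.countplot(x='blood_type', data=blood, ax=ax)
ax.set_ylabel('number of people')
# Call plt.show() to display the figure when running as a script.
```

Unlike value_counts().plot.bar(), countplot works directly on the raw column, so no separate counting step is needed.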

Output:

[Figure: count plot of the number of people per blood type]

Encoding Categorical Data in Python

Several learning algorithms, such as neural networks and regression, require numerical input. For these algorithms to work, categorical data must be transformed into numerical values. Let's examine a few encoding techniques.

Label Encoding in Python

With label encoding, the categories are numbered from 0 to num_categories - 1. Let's apply label encoding to the blood type feature.

 
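le = LabelEncoder()
# work on a copy so the assignment does not modify a slice of main_data
without_bogus_records = without_bogus_records.copy()
without_bogus_records['blood_type'] = le.fit_transform(
    without_bogus_records['blood_type'])
without_bogus_records['blood_type'].unique()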

Explanation:

The code snippet transforms the categorical blood type data into numerical form by using LabelEncoder from the sklearn.preprocessing package. The fit_transform function converts the categories in the blood_type column of the without_bogus_records DataFrame into encoded integers. Finally, the unique() function displays the distinct encoded values, one for each blood type.

Output:

 
array([0, 4, 1, 3, 2, 5, 7, 6])
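To see which integer stands for which blood type, the fitted encoder's classes_ attribute and inverse_transform method can be inspected. The sketch below refits a LabelEncoder on a hypothetical Series that lists the eight blood types in the order they appear in the output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Series holding the eight valid blood types.
blood = pd.Series(['A+', 'B+', 'A-', 'AB-', 'AB+', 'B-', 'O-', 'O+'])

le = LabelEncoder()
codes = le.fit_transform(blood)

# classes_ holds the sorted categories; position i corresponds to code i.
mapping = {label: code for code, label in enumerate(le.classes_)}
# inverse_transform converts the integer codes back to the original labels.
decoded = le.inverse_transform(codes)

print(mapping)
print(codes.tolist())
```

Because the codes follow the sorted order of the labels, they carry no meaning beyond identity, which is the limitation that one-hot encoding addresses.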

One-hot Encoding in Python

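One-hot encoding addresses a drawback of label encoding: the integer labels imply an order and magnitude that the categories do not have. Instead, one-hot encoding creates a separate binary column for each category.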

 
inconsistent_data = pd.get_dummies(inconsistent_data, columns=['marriage_status'])
inconsistent_data.head()   
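The effect of pd.get_dummies is easier to see on a tiny example. The DataFrame below is hypothetical; dtype=int is passed only so that the indicator columns print as 0 and 1.

```python
import pandas as pd

# Hypothetical cleaned marital-status values.
status = pd.DataFrame(
    {'marriage_status': ['married', 'unmarried', 'divorced', 'married']})

# Each distinct category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(status, columns=['marriage_status'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one indicator set to 1, so no artificial ordering between the categories is introduced.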

Ordinal Encoding in Python

Ordinal data is categorical data in which the order of the categories is significant. For such features, we want to preserve the order after encoding. We will encode the income groups using ordinal encoding. Our goal is to maintain the order 40k-75k < 75k-100k < 100k-125k < 125k-150k < 150k+.

 
custom_map = {'40k-75k': 1, '75k-100k': 2, '100k-125k': 3,
              '125k-150k': 4, '150k+': 5}
remapping_data['income_groups'] = remapping_data['income_groups']\
    .map(custom_map)
remapping_data.head()   

Explanation:

The code creates a custom mapping dictionary called custom_map that assigns an integer to each income group. The 'income_groups' column of the remapping_data DataFrame is then transformed according to this mapping using the map method, so that every income group label is replaced by its corresponding integer. Lastly, the head function shows the change in the first few rows of the updated DataFrame.

Output:

 
   Fst_nme  Lst_nme  blood_type  marriage_status  income     device  income_groups
0    Abdul    Colon          A+          married  145000  AndroidOS              4
1    Abdul   Pierce          B+          married   85000      MacOS              2
2  Desirae   Pierce          B+          MARRIED  130000        iOS              4
3  Shannon   Gibson          A+          married  175000      MacOS              5
4  Desirae   Little          B+        unmarried  130000      MacOS              4
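An alternative to a hand-written dictionary is an ordered pandas categorical, whose integer codes follow the declared order. This is a sketch of that swapped-in technique on hypothetical values, not the approach used in the article; 1 is added to the codes only so that they match the 1 to 5 numbering of custom_map.

```python
import pandas as pd

labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']

# Hypothetical income groups in arbitrary order.
groups = pd.Series(['125k-150k', '75k-100k', '150k+', '40k-75k'])

# Declare the categories as ordered so that comparisons respect the ranking.
ordered = groups.astype(pd.CategoricalDtype(categories=labels, ordered=True))

# cat.codes starts at 0; shift by one to obtain the values 1..5.
codes = ordered.cat.codes + 1

print(codes.tolist())
print((ordered < '150k+').tolist())
```

Keeping the column as an ordered categorical preserves the readable labels while still allowing order-aware comparisons and sorting.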
