7 Visualizations with Python to Handle Multivariate Categorical Data

Categorical data is data in which each observation belongs to one of a predetermined set of groups or categories. It appears everywhere, for example in survey responses about marital status, occupation, or level of education. Categorical data often has issues that need to be resolved before moving on to other tasks. This article covers several approaches to managing categorical data in a DataFrame.

Let's examine a few issues raised by categorical data and how a DataFrame can handle them. As stated above, categorical data can only take a limited range of values.

Importing Libraries

Python libraries such as pandas let us manage categorical data in a DataFrame and carry out both common and intricate operations, often with a single line of code.
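A minimal sketch of the setup, assuming pandas is the only library needed at this point. The article does not name its source file, so main_data is built here by hand as a hypothetical stand-in; the five rows and the column names are copied from the article's printed output.

```python
import pandas as pd

# Hypothetical stand-in for the article's dataset: the real source file is
# not named, so these rows are taken from the article's printed output.
main_data = pd.DataFrame({
    "Fst_nme": ["Abdul", "Abdul", "Desirae", "Shannon", "Desirae"],
    "Lst_nme": ["Colon", "Pierce", "Pierce", "Gibson", "Little"],
    "Blood_type": ["A+", "B+", "B+", "A+", "B+"],
    "Mrg_status": ["married", "married", "MARRIED", "married", "unmarried"],
    "income": [145000, 85000, 130000, 175000, 130000],
    "device": ["AndroidOS", "MacOS", "iOS", "MacOS", "MacOS"],
})

# Show the first rows to confirm the frame looks as expected.
print(main_data.head())
```

With a real file, the hand-built frame would typically be replaced by a call such as pd.read_csv() on that file's path.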
The dataset can now be loaded into a pandas DataFrame.

Output:

   Fst_nme Lst_nme Blood_type Mrg_status  income     device
0    Abdul   Colon         A+    married  145000  AndroidOS
1    Abdul  Pierce         B+    married   85000      MacOS
2  Desirae  Pierce         B+    MARRIED  130000        iOS
3  Shannon  Gibson         A+    married  175000      MacOS
4  Desirae  Little         B+  unmarried  130000      MacOS

Consider the blood type feature to understand membership constraints. We must check whether the blood type feature contains invalid values. First, we create a DataFrame containing every valid blood type.

Explanation: Using pandas, the code creates a new DataFrame named blood_type_categories. Its blood type column holds the valid values 'A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', and 'O-'. This reference list is what the blood type data is checked against.

Output:

  Blood_type
0         A+
1         A-
2         B+
3         B-
4        AB+
5        AB-
6         O+
7         O-

The set difference method can now be used to find the invalid values.

Explanation: The code snippet finds the invalid blood types in the main_data dataset. The distinct blood types are first extracted from main_data and compared with the valid blood types listed in blood_type_categories. Blood types that appear in main_data but not in blood_type_categories are treated as invalid and stored in a set called bogus_blood_types.

Output: {'C+', 'D-'}

Once the invalid values have been identified, the corresponding rows can be removed from the dataset. In some cases, when more information is available, the values could be corrected instead. Here they are removed, since nothing is known about the real blood type.

Explanation: To remove records with invalid blood types, first use isin() to flag the records whose blood type is invalid, producing bogus_records_index.
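The membership check described so far can be sketched as follows. This is an illustration under assumptions, not the article's original code: the six-row main_data is a hypothetical sample in which 'C+' and 'D-' play the role of the invalid entries, and the column name Blood_type follows the printed output.

```python
import pandas as pd

# Hypothetical sample; 'C+' and 'D-' stand in for the invalid entries.
main_data = pd.DataFrame({"Blood_type": ["A+", "B+", "C+", "O-", "D-", "AB+"]})

# Reference frame listing every valid blood type.
blood_type_categories = pd.DataFrame({
    "Blood_type": ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]
})

# Values present in the data but absent from the valid list.
bogus_blood_types = set(main_data["Blood_type"]).difference(
    blood_type_categories["Blood_type"]
)

# Boolean mask of rows holding an invalid value; its inverse keeps the rest.
bogus_records_index = main_data["Blood_type"].isin(bogus_blood_types)
without_bogus_records = main_data[~bogus_records_index].copy()

print(bogus_blood_types)
print(without_bogus_records["Blood_type"].unique())
```

Taking a copy of the filtered frame avoids pandas chained-assignment warnings if columns are modified later.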
Then, use the inverse selection (~bogus_records_index) to filter these invalid records out of the main dataset. Lastly, use unique() to confirm that only valid blood types remain in the cleaned dataset.

Output: array(['A+', 'B+', 'A-', 'AB-', 'AB+', 'B-', 'O-', 'O+'], dtype=object)

Handling Inconsistent Categories

Categorical data often contains inconsistencies. Consider the marital status feature. Let's look at all of its distinct values.

Explanation: Calling .unique() on the marriage_status column of the main_data DataFrame returns every distinct value in that column. This helps reveal discrepancies such as typos, differences in capitalization, or unexpected categories.

Output: array(['married', 'MARRIED', ' married', 'unmarried ', 'divorced', 'unmarried', 'UNMARRIED', 'separated'], dtype=object)

Differences in capitalization and leading or trailing spaces make it clear that some categories are duplicates of one another. Let's start with the uppercase letters.

Explanation: Working on a copy of the data (inconsistent_data), the str.lower() function converts the marriage_status column to lowercase so that capitalization is uniform. The unique() function then displays the column's distinct values to confirm the change.

Output: array(['married', ' married', 'unmarried ', 'divorced', 'unmarried', 'separated'], dtype=object)

We will address leading and trailing spaces next.

Explanation: The 'marriage_status' column of the inconsistent_data DataFrame still contains values that need to be cleaned and standardised. The code uses the str.strip() function to remove any leading or trailing whitespace from each element in the 'marriage_status' column. Finally, it verifies the cleaned data by retrieving and displaying the unique values in the 'marriage_status' column.
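The two cleaning steps above can be sketched as follows, assuming a column named marriage_status as in the explanations. The inconsistent_data frame is a hypothetical sample built from the eight raw values listed in the article.

```python
import pandas as pd

# Hypothetical sample holding the eight raw values shown in the article.
inconsistent_data = pd.DataFrame({
    "marriage_status": [
        "married", "MARRIED", " married", "unmarried ",
        "divorced", "unmarried", "UNMARRIED", "separated",
    ]
})

# Step 1: normalise capitalization.
inconsistent_data["marriage_status"] = inconsistent_data["marriage_status"].str.lower()

# Step 2: remove leading and trailing whitespace.
inconsistent_data["marriage_status"] = inconsistent_data["marriage_status"].str.strip()

# Distinct values left after both steps.
cleaned_values = inconsistent_data["marriage_status"].unique()
print(cleaned_values)
```

The two string methods could also be chained in one statement (.str.lower().str.strip()); they are kept separate here to mirror the article's step-by-step order.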
Output: array(['married', 'unmarried', 'divorced', 'separated'], dtype=object)

Remapping Categories

Numerical data, such as age or wealth, can be mapped to a set of categories. This helps in gaining more insight into the dataset. Let's look at the income feature.

Explanation: The code snippet computes and prints the range of income values in the dataset. It reads the 'income' column of the main_data DataFrame, using the max() method to find the highest income and the min() method to find the lowest. The two values are then formatted into a string and printed.

Output: Max income - 190000, Min income - 40000

Let's now define the ranges and labels for the income feature. The grouping itself is done with the pandas cut function.

Explanation: The code defines income ranges and a name for each of them so that the data can be grouped. The range variable holds the income boundaries, while labels holds the name of each category. pd.cut() then adds a new column called "income_groups" to the remapping_data DataFrame, assigning each row to one of the defined groups according to its income. Finally, remapping_data.head() displays the first rows of the updated DataFrame.

Output:

   Fst_nme Lst_nme Blood_type Mrg_status  income     device income_groups
0    Abdul   Colon         A+    married  145000  AndroidOS     125k-150k
1    Abdul  Pierce         B+    married   85000      MacOS      75k-100k
2  Desirae  Pierce         B+    MARRIED  130000        iOS     125k-150k
3  Shannon  Gibson         A+    married  175000      MacOS         150k+
4  Desirae  Little         B+  unmarried  130000      MacOS     125k-150k

The distribution is now easy to visualise.

Explanation: The code snippet remapping_data['income_groups'].value_counts().plot.bar() generates a bar plot showing the distribution of the income categories in the remapping_data DataFrame.
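As a minimal sketch of the binning and counting steps in this section: the bin edges below are an assumption inferred from the group labels in the article's output, and the seven incomes are a hypothetical sample. The variable is named income_range rather than range to avoid shadowing the Python built-in.

```python
import pandas as pd

# Hypothetical sample of incomes within the article's 40000-190000 span.
remapping_data = pd.DataFrame(
    {"income": [145000, 85000, 130000, 175000, 130000, 40000, 110000]}
)

# Assumed bin edges, inferred from the labels; the last bin is open-ended.
income_range = [40000, 75000, 100000, 125000, 150000, float("inf")]
labels = ["40k-75k", "75k-100k", "100k-125k", "125k-150k", "150k+"]

# include_lowest=True keeps the minimum income (40000) inside the first bin.
remapping_data["income_groups"] = pd.cut(
    remapping_data["income"],
    bins=income_range,
    labels=labels,
    include_lowest=True,
)

# Count how many rows fall into each group.
group_counts = remapping_data["income_groups"].value_counts()
print(group_counts)
```

With matplotlib installed, group_counts.plot.bar() would draw these counts as a bar chart.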
The value_counts() function counts how often each value occurs in the income_groups column, and plot.bar() uses these counts to create a bar chart. This shows how frequent each income group is in the dataset.

Output: [bar chart of the number of records per income group]

Cleaning Categorical Data in Python

To better understand this issue, a new DataFrame is created whose only feature is phone numbers.

Explanation: This code produces a list of 100 random phone numbers, each with nine or ten digits. The country code "+91" is prepended to every second phone number. The phone numbers are stored in a pandas DataFrame with a single column named "phone_numbers", and head() displays its first few rows.

Output:

     phone_numbers
0    +91 707631849
1       6315742874
2   +91 1584173083
3       3389343099
4   +91 3970692379

The country code prefix can be added or removed depending on the use case. Similarly, phone numbers with fewer than ten digits should be removed.

Explanation: Before processing the phone numbers, the code removes the "+91 " prefix. It then identifies the phone numbers with fewer than ten digits and removes them. Any remaining rows with missing values are dropped. Finally, it displays the first few rows of the cleaned phone_numbers_data DataFrame.

Output:

  phone_numbers
0    5377617628
2    7152234401
3    2839400071
4    7651215019
5    4451571165

Lastly, we can confirm that the data is clean.

Python Pandas: Visualising Categorical Data

Categorical data can be visualised with a variety of graphs to gain more insight into it. Let's see how many people belong to each blood type. To do this, we use the Seaborn library.

Output: [plot of the number of people per blood type]

Encoding Categorical Data in Python

Several learning algorithms, such as neural networks and regression, require numerical input. For these algorithms to work, categorical data must be transformed into numerical values. Let's examine a few encoding techniques.
Python Label Encoding

With label encoding, the categories are numbered from 0 to num_categories - 1. Let us apply label encoding to the blood type feature.

Explanation: The code snippet converts the categorical blood type data into numerical form using LabelEncoder from the sklearn.preprocessing package. The fit_transform function converts the categorical values in the blood_type column of the without_bogus_records DataFrame into encoded integers. Finally, the .unique() function displays the distinct encoded values in the blood_type column, one for each blood type.

Output: array([0, 4, 1, 3, 2, 5, 7, 6])

One-hot Encoding in Python

One-hot encoding addresses some of the drawbacks associated with label encoding.

Ordinal Encoding in Python

Ordinal data is a type of categorical data in which the order is significant. For such features, we want to preserve the order after encoding. We are going to encode the income groups using ordinal encoding. Our goal is to maintain the order 40K-75K < 75K-100K < 100K-125K < 125K-150K < 150K+.

Explanation: The code creates a custom mapping dictionary called custom_map that assigns an integer value to each income group. The map method then transforms the 'income_groups' column of the remapping_data DataFrame according to this mapping, replacing every income range string with its corresponding integer from the dictionary. Lastly, the head function shows the first few rows of the updated DataFrame.

Output:

   Fst_nme Lst_nme Blood_type Mrg_status  income     device  incm_grp
0    Abdul   Colon         A+    married  145000  AndroidOS         4
1    Abdul  Pierce         B+    married   85000      MacOS         3
2  Desirae  Pierce         B+    MARRIED  130000        iOS         4
3  Shannon  Gibson         A+    married  175000      MacOS         5
4  Desirae  Little         B+  unmarried  130000      MacOS         4
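The three encodings discussed in this section can be sketched side by side using pandas alone. Several things here are assumptions: the sample frame is hypothetical, pandas category codes stand in for sklearn's LabelEncoder (both number the sorted categories from 0, so they should assign the same codes), pd.get_dummies is one common way to one-hot encode, and the integers in custom_map are illustrative, since the article's exact dictionary is not shown and its printed output implies a different numbering.

```python
import pandas as pd

# Hypothetical sample: blood types in the order of the article's cleaned
# output, paired with made-up income groups.
df = pd.DataFrame({
    "Blood_type": ["A+", "B+", "A-", "AB-", "AB+", "B-", "O-", "O+"],
    "income_groups": [
        "125k-150k", "75k-100k", "125k-150k", "150k+",
        "40k-75k", "100k-125k", "150k+", "75k-100k",
    ],
})

# Label encoding: number the sorted categories 0 .. num_categories - 1.
# (pandas category codes, used here in place of sklearn's LabelEncoder.)
df["blood_type_code"] = df["Blood_type"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, so no artificial
# order is implied between blood types.
one_hot = pd.get_dummies(df["Blood_type"], prefix="blood")

# Ordinal encoding: an explicit mapping that preserves the group order.
# The integers are an assumption, not the article's dictionary.
custom_map = {
    "40k-75k": 1, "75k-100k": 2, "100k-125k": 3,
    "125k-150k": 4, "150k+": 5,
}
df["incm_grp"] = df["income_groups"].map(custom_map)

print(df[["Blood_type", "blood_type_code", "income_groups", "incm_grp"]])
print(one_hot.head())
```

Any label missing from custom_map would become NaN after map(), so checking the result for missing values is a cheap safeguard against typos in the dictionary keys.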