Wine Quality Predicting with Python ML

Introduction to Wine Classification

Around the world, a wide variety of wines are accessible, such as sparkling wines, dessert wines, pop wines, table wines, and vintage wines.

You could be wondering how one determines what wine is good and what isn't. Machine learning is the solution to this query!

There are many different ways to classify wines. Several of them are mentioned below:

Logistic Regression
SVM
Naïve Bayes
CART
Random forest
Perception
KNN

Implementing Wine Classification in Python

Now let's go into a very rudimentary Python wine classification implementation. This will provide you with an introduction to classifiers and show you how to use them in Python for a variety of real-world applications.

1. Modules import

Importing the required modules and libraries into the application is the initial step. A few foundational modules are required for the grouping. Importing each model into the application that uses the Sklearn library is the next step. A few more sklearn library functions will be included as well.

import pandas as pd  #for data manipulation operations
import numpy as np  #for numeric operations on data
import seaborn as sns  #for data visualization operations
import matplotlib.pyplot as plt  #for data visualization operations
from sklearn.preprocessing import LabelEncoder # for encoding
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler #for standardization
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import scipy.stats as st
from termcolor import colored

#from markupsafe import escape
#!pip install pandas-profiling
#import pandas_profiling

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import plot_confusion_matrix
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
#!pip install lightgbm
from lightgbm import LGBMClassifier

#ignore warnings
import warnings
warnings.filterwarnings("ignore")

2. Dataset Preparation

The next step is to get our dataset ready. Let me start by providing an overview of the dataset before importing it into our application.

2.1 Introduction to Dataset

There are 12 features overall and 6497 observations in the dataset. None of the variables have NAN values. The data is simply downloadable.

The following are the names and descriptions of the 12 features:

Fixed acidity: The wine's fixed acidity level
The wine's volatile acidity refers to the amount of acetic acid, the amount of citric acid, the amount of residual sugar left over after fermentation, and the amount of salts or chlorides present.
The quantity of sulfur dioxide in its free form. The quantity of sulfur dioxide in its whole form, including both bound and free forms.
Density: The wine's mass/volume density
pH: The wine's pH ranges from 0 to 14.
Sulfurates: The amount of sulfur dioxide gas (S02) in the wine;
Alcohol: The amount of alcohol in the wine;
Quality: The wine's ultimate quality as indicated.

2.2 Loading the Dataset

Load the dataset and print the basic information of the dataset like column names, and data types.

data=pd.read_csv("./wine_dataset.csv")
data.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
-  -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -  -   -  -  -  -  
 0   fixed acidity         1599 non-null   float 64
 1   volatile acidity      1599 non-null   float 64
 2   citric acid           1599 non-null   float 64
 3   residual sugar        1599 non-null   float 64
 4   chlorides             1599 non-null   float 64
 5   free sulfur dioxide   1599 non-null   float 64
 6   total sulfur dioxide  1599 non-null   float 64
 7   density               1599 non-null   float 64
 8   pH                    1599 non-null   float 64
 9   sulphates             1599 non-null   float 64
 10  alcohol               1599 non-null   float 64
 11  quality               1599 non-null   int 64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

2.3 Cleaning of Data

Cleaning of the dataset includes dropping the unnecessary columns and the NaN values with the help of the code mentioned below:

data=data.drop('Unnamed: 0',axis=1)
data.dropna()

2.4 Data Visualization

An important step is to first visualize the data before processing it any further. The visualization is done in two forms namely,

Histographs
Scatterplot Graph

Plotting Histograms

fig, axes = plt.subplots(1, 3, figsize = (40, 10))

sns.histplot(ax = axes[0], x = df["fixed_acidity"],
             bins = 10,
             kde = True,
             cbar = True,
             color = "#CA96EC").set(title = "Distribution of 'fixed_acidity'");

sns.histplot(ax = axes[1], x = df["volatile_acidity"],
             bins = 10,
             cbar = True,
             kde = True,
             color = "#A163CF").set(title = "Distribution of 'volatile_acidity'");

sns.histplot(ax = axes[2], x = df["citric_acid"],
             bins = 10,
             kde = True,   
             cbar = True,
             color = "#29066B").set(title = "Distribution of 'citric_acid'");

Output:

The distributions of all the variables' values are displayed below. The figures demonstrate that the "pH" and "density" variable values follow a somewhat regular distribution.

The majority of the "fixed_acidity" variable's values fall between 7 and 8;
The majority of the "volatile_acidity" variable's values fall between 0.4 and 0.7;
The "citric_acid" variable's majority of values fall between 0.0 and 0.1;
The majority of the "residual sugar" variable's values fall between 1 and 2.5;
The "chlorides" variable's majority of values fall between 0.085 and 0.15;
The "free_sulfur_dioxide" variable's majority.
The majority of valuesThe variables "total_sulfur_dioxide" fall between 0 - 30;.
The range of \mark>\b>0.996 - 0.998; contains the majority of the values of the "density" variable
The majority of 's valuesThe range of includes the "pH" variable.3.2-3.4;
The range of 0.50 - 0.75; contains the majority of the values of the "sulphates" variable.
The range of 9 - 10; contains the majority of the values of the "alcohol" variable.
The majority of values are the "quality" variable.5 and 6.

Plotting Scatterplot

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

sns.scatterplot(ax = axes[0],
                x = "residual_sugar",
                y = "quality", hue = "quality",
                data = df).set(title = "Relationship between 'residual_sugar' and 'quality'");

sns.scatterplot(ax = axes[1],
                x = "alcohol",
                y = "quality", hue = "quality",
                data = df).set(title = "Relationship between 'alcohol' and 'quality'");

sns.scatterplot(ax = axes[2],
                x = "pH",
                y = "quality", hue = "quality",
                data = df).set(title = "Relationship between 'pH' and 'quality'");

sns.scatterplot(ax = axes[3],
                x = "density",
                y = "quality", hue = "quality",
                data = df).set(title = "Relationship between 'density' and 'quality'");

Output:

In a statistical setting, two or more variables are said to be connected \mark>if their values fluctuate in a way that causes the second variable's value to change along with the value of the first (though it might do so in the other direction). For instance, there is a relationship between the variables "hours worked" and "income earned" if a rise in hours worked is linked to an increase in income earned. If "price" and "purchasing power" are taken into account, then an individual's capacity to purchase items diminishes as their price rises (assuming a constant income).

A statistical measure that indicates the strength and direction of a link between two or more variables is called correlation, and it is represented as a number.

However, a correlation between two variables does not always imply that changes in one variable are the result of changes in the values of the other.

There is a causal link between the two occurrences, as evidenced by the fact that one event results from the occurrence of the other. Another name for this is cause and effect.

The distinction between the two kinds of relationships should be apparent in theory: either an event or an action can cause another (smoking raises the risk of lung cancer, for example) or it can correlate with another (smoking is correlated with alcoholism, but it does not cause alcoholism). In actuality, though, it's still challenging to determine cause and effect with clarity.

2.5 Train-Test Split and Data Normalization

To split the data into training and testing data, there is no optimal splitting percentage.

But one of the fair splitting rules is the 80/20 rule where 80% of the data goes to training data and the rest 20% goes to testing data.

This step also involves normalizing the dataset.

split=int(0.8*data.shape[0])
print("Split of data is at: ",split)
print("\n- - - - - - - AFTER SPLITTING - - - - - - -")
train_data=data[:split]
test_data=data[split:]
print('Shape of train data:',train_data.shape)
print('Shape of train data:',test_data.shape)
print("\n - - - - CREATING X AND Y TRAINING TESTING DATA - - - -")
y_train=train_data['quality']
y_test=test_data['quality']
x_train=train_data.drop('quality',axis=1)
x_test=test_data.drop('quality',axis=1)
print('Shape of x train data:',x_train.shape)
print('Shape of y train data:',y_train.shape)
print('Shape of x test data:',x_test.shape)
print('Shape of y test data:',y_test.shape)
 
nor_train=normalize(x_train)
nor_test=normalize(x_test)

3. Wine Classification Model

In this program we have used two algorithms namely, SVM and Logistic Regression.

3.1 Support Vector Machine (SVM) Algorithm

clf = svm.SVC(kernel='linear')
clf.fit(nor_train, y_train)
y_pred_svm = clf.predict(nor_test)
print("Accuracy (SVM) :",metrics.accuracy_score(y_test, y_pred_svm)*100)

The accuracy is around 50% of the model.

3.2 Logistic Regression Algorithm

logmodel = LogisticRegression()
logmodel.fit(nor_train, y_train)
y_pred_LR= logmodel.predict(nor_test)
print('Mean Absolute Error(Logistic Regression):', metrics.mean_absolute_error(y_test, y_pred_LR)*100)

Output:

In this instance, the accuracy also comes out to be around 50%. The model we have utilized or developed is the primary cause of this.

Next TopicPowershell vs python

← prev next →