Why Do We Use an 80-20 Split for Training and Test Data?

When developing a machine learning model, it is important to divide the data into two parts: training and testing. The training data is used to fit the model, that is, to learn its parameters and the patterns in the data, while the test data is used to evaluate how well the trained model performs.

This data splitting helps prevent overfitting, a situation in which a model becomes too complex and too dependent on the training data, compromising its ability to perform well on new, unseen data. To avoid such issues, it is recommended to use about 70-80% of the data for training, while the remaining 20-30% is set aside for testing.

This division is based largely on empirical practice, which shows that this ratio tends to give good results across many problems. By allocating the larger portion of the data for training, we ensure that the model has a substantial amount of information to learn from. Conversely, the smaller portion is reserved for testing, so that the model's ability to generalize to novel data is thoroughly assessed.

Why Do We Use an 80-20 Split for Data?

There are several reasons why an 80-20 split for training and test data is commonly used in machine learning.

Pareto Principle:

The 80-20 division is derived from the Pareto principle, commonly known as the 80/20 rule. This principle suggests that approximately 80% of the effects stem from 20% of the causes. Although not an absolute certainty, this pattern appears in numerous real-life situations, including the distribution of information within data.

Using 80% of the data for training enables the model to learn most of the patterns and relationships within the data without needing the entire dataset.

Balance Between Training and Testing:

The 80-20 division strikes a balance between giving the model enough data to learn from and keeping enough data aside for an unbiased evaluation. By dedicating 80% of the data to training, the model sees ample samples covering the various aspects and features of the problem. Conversely, the remaining 20% consists of data the model has never seen, enabling a more precise evaluation of its ability to predict unknown cases.

Practicality and Efficiency:

This split is simple and computationally efficient, especially for large datasets. Having fewer data points in the test set reduces the time and resources required for evaluation.

Additionally, many machine learning libraries and tools include functions that let you specify an 80-20 partition directly, making this split easy to apply.
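As a minimal sketch of such a library function (assuming scikit-learn and a small toy array; the random_state value is an added assumption), the commonly used train_test_split helper accepts the split ratio directly:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 labels

# Hold out 20% of the rows for testing; the remaining 80% are used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)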

Simple and Easy to Understand:

The 80-20 division is a simple, intuitive notion, making it easy for newcomers to grasp and to explain how the data is used for training and testing. This simplicity of understanding and implementation is key to its widespread adoption in the field.

Scaling Law:

In machine learning, the prevalent approach to dividing datasets uses an 80/20 ratio: 80% of the data is used for training and the remaining 20% for testing. Scaling laws are mathematical relationships that describe how quantities change with respect to one another. By analysing such scaling laws, we can gain insight into how model performance behaves as the dataset size changes during the machine learning process.

Guyon's Scaling Law

Isabelle Guyon introduced this statistical scaling law in 1997. Its purpose is to guide how data should be divided between training and validation sets for neural networks in order to avoid overfitting. Unlike the Pareto principle, Guyon's law bases the division on the number of adjustable parameters in the model.

As per the law, the fraction of the dataset allocated for validation should be inversely proportional to the square root of the number of adjustable parameters. Therefore, as model complexity increases, a relatively smaller share of the data needs to be reserved for validation, leaving more data available for training.
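The following is only a rough, illustrative sketch of that relationship (the proportionality constant is assumed to be 1 for simplicity, which is not part of the original article):

import math

# Illustration of Guyon's scaling law: the validation fraction shrinks as the
# number of adjustable parameters grows, in proportion to 1 / sqrt(parameters).
for params in (100, 10_000, 1_000_000):
    validation_fraction = 1 / math.sqrt(params)  # proportionality constant assumed to be 1
    print(f"{params} parameters -> validation fraction ~ {validation_fraction:.4f}")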

Why Do We Need to Split the Data Set?

To accurately assess the behaviour of a machine learning model, it must be evaluated on data that was not used during its training. Failing to do so leads to a biased evaluation of the model.

  • Evaluating a model on its training examples is like testing students on the exact questions they practised in class: it is unclear whether they understood the topic or simply memorised the answers.
  • The simplest remedy is to split the entire dataset into two groups. One set is used for training, while the other is held back for evaluating the model. This technique is often referred to as the holdout technique. Let's look at what each data subset means.
  • Training set: we use these observations to fit our machine learning models, adjusting the model's parameters during the learning phase.
  • Test set: once the training phase is complete, we evaluate the model on these observations. This enables us to gauge the model's performance when faced with new data. It is important that the train and test sets come from the same distribution.
  • By employing the holdout technique, we reduce the risks of data leakage and overfitting. As a result, we can be more confident that the trained model will generalize to unknown data.

Train-Test Split Evaluation

Evaluation using the train-test split method measures the effectiveness of a machine learning model by dividing your data into separate parts, then training the model on one part and assessing it on the other. This step is important in machine learning because it reduces overfitting and checks that your model can generalize to unfamiliar data.


The simplest method is the train-test split, in which the dataset is divided into two sets.

  1. Training Set: Typically, around 60-80% of the data is used to train the model.
  2. Test Set: A portion of the data, usually between 20% and 40%, is held back to assess the model's performance on unseen data.
    This offers a basic assessment of the model's ability to generalize, but it can be susceptible to variance caused by the randomness of the initial split.
  3. Validation Set: It is also standard practice to tune the model's hyperparameters and regularly evaluate its performance while it is being trained, usually with a further portion of the data of approximately 10-20%. The purpose of this validation set is primarily to prevent overfitting and to obtain a more trustworthy assessment than a simple train-test division. Alternatively, a k-fold cross-validation method splits the dataset into k folds of equal size.
  • K-1 folds are used for training, while the remaining fold is held out for testing.
  • This procedure is repeated k times, so that each data point lands in the test set exactly once. This approach gives a statistically more reliable estimate of model performance than a single train-test split.
  • Regardless of the split method, it is important to use specific metrics to assess the model's effectiveness. Common evaluation metrics include accuracy, the proportion of all predictions that are correct.
  • Precision is the proportion of positive predictions that are actually positive.
  • Recall is the proportion of actual positive instances that are correctly predicted.
  • The F1 Score combines precision and recall using their harmonic mean. (A small sketch of computing these metrics follows this list.)
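The sketch below shows how these metrics can be computed with scikit-learn; the true and predicted labels are made-up values for illustration only:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]   # model predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))    # share of all predictions that are correct
print("Precision:", precision_score(y_true, y_pred))   # share of predicted positives that are correct
print("Recall   :", recall_score(y_true, y_pred))      # share of actual positives that were found
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall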

How to Split the Data Set?

When we look for the optimal proportion between train and test data, the usual starting point is 80:20: we allocate 80% of the available data for training while the remaining portion is used for testing. Older references and textbooks suggested a split of 70:30 or even 50:50.


However, sources focusing on deep learning or big data indicate a significant shift towards splits as extreme as 99:1. It is important to note that, as with many other questions in machine learning, there is no single correct answer. When the training data are insufficient, the parameter estimates become highly unstable. Conversely, a shortage of testing data leads to high variance in the performance estimates.

Import the Dataset

We will use the India Housing dataset. To import the dataset, you can employ Pandas to load the data into a DataFrame. The pip command will help you install the pandas library.

Now, load the dataset into a Pandas DataFrame in Python.
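Since the original listing is not reproduced here, the following is a minimal sketch of this step; the file name india_housing.csv is taken from the explanation later in this article, and the file is assumed to be saved locally:

# Install pandas first if needed:  pip install pandas
import pandas as pd

# Load the dataset into a DataFrame and preview the first rows.
df = pd.read_csv("india_housing.csv")
print(df.head())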



Suppose the Bedrooms column is the output (Y).

Then, we must drop the Bedrooms column from the dataset to form the input matrix.

Now, use Pandas's .head() method to see what the input and output look like.
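Continuing the sketch above, a hedged snippet for this step could look as follows:

# Target (output) vector: the Bedrooms column.
Y = df["Bedrooms"]

# Input matrix: every column except Bedrooms.
X = df.drop("Bedrooms", axis=1)

print(X.head())
print(Y.head())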



Output

0    3
1    4
2    6	
3    8
4    6
Name: Bedrooms, dtype: int64

Split the Data Using sklearn

We will split the data using the train_test_split function from the sklearn library. To use it, install the scikit-learn library with the pip command. Run the command below in the command prompt for a successful installation.
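pip install scikit-learn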

The function randomly splits your data into training and testing sets according to the given ratio. Now, let's take a closer look at the Python implementation.

We are using an 80:20 division ratio here. The value 0.2 passed as the test size signifies the testing set, which accounts for 20% of the data.

Run the below code to get and compare the shapes of the different training and testing sets.
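A hedged sketch of this step, continuing from the snippets above (the random_state value is an added assumption; the shapes shown below assume a 10-row, 4-column example file):

from sklearn.model_selection import train_test_split

# 80:20 split of the inputs and the output.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print("shape of original dataset :", df.shape)
print("shape of input - training set", X_train.shape)
print("shape of output - training set", Y_train.shape)
print("shape of input - testing set", X_test.shape)
print("shape of output - testing set", Y_test.shape)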

Output:

shape of original dataset : (10, 4)
shape of input - training set (8, 3)
shape of output - training set (8,)
shape of input - testing set (2, 3)
shape of output - testing set (2,)

Complete Code
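Since the original listing is not reproduced here, the following is a minimal reconstruction assembled from the steps above and the explanation below; the file name india_housing.csv and the 80:20 ratio come from that explanation, while the random_state value is an added assumption:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV file into a DataFrame and preview it.
df = pd.read_csv("india_housing.csv")
print(df.head())

# Separate the target (number of bedrooms) from the input features.
Y = df["Bedrooms"]
X = df.drop("Bedrooms", axis=1)

# 80% of the rows for training, 20% for testing.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print("shape of original dataset :", df.shape)
print("shape of input - training set", X_train.shape)
print("shape of output - training set", Y_train.shape)
print("shape of input - testing set", X_test.shape)
print("shape of output - testing set", Y_test.shape)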

Explanation

In the above code, we start by importing the pandas library to work with the data and read the CSV file named "india_housing.csv" into a DataFrame. Then, we display the first few rows to preview the data. Next, we separate the target variable to predict (the number of bedrooms) from the other features. After that, we split the dataset into training and testing sets, using 80% for training and 20% for testing. Finally, we print the shapes of both the original and the split datasets.

When Do We Need to Use the Train-Test Split?

  1. The train-test split is a straightforward way to evaluate models for predictive modelling problems. This approach works well when ample data is available to partition into training and testing subsets, such that both subsets still accurately reflect the problem domain. In other words, the original dataset must be large enough that the resulting splits remain representative of the problem.
  2. The problem dataset is suitably represented when there are enough records covering both typical and uncommon situations, including the various combinations of input values observed in real-life scenarios. This may require a substantial number of examples, ranging from thousands to millions. Conversely, the train-test methodology is unsuitable when the dataset is small: once it is segregated into train and test sets, the training portion will not contain enough information for the model to map inputs to outputs effectively, and the test portion will not contain enough data to accurately evaluate the model's performance. The estimated performance could then be overly optimistic or pessimistic.
  3. A viable alternative assessment technique when data is limited is k-fold cross-validation (a short sketch follows this list). That said, the train-test split also offers advantages beyond dataset size, most notably greater computational efficiency.
  4. For models that are expensive to train, repeated evaluation is impractical; deep neural network models are an example, and in such cases the typical approach is the train-test method. Likewise, a project may have a vast dataset and an efficient model yet still require a quick assessment of the model's performance; here again, the train-test split is used. In all of these cases, random selection divides the samples of the original dataset into the two subsets.
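As referenced in item 3 above, here is a minimal k-fold cross-validation sketch, assuming scikit-learn, a small synthetic dataset, and logistic regression as a stand-in model (none of these specifics come from the original article):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# A small synthetic classification dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# 5-fold cross-validation: each fold is used exactly once as the test set.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, scoring="accuracy", cv=cv)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))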

Configuring the Train-Test Split

This procedure has one primary configuration parameter: the size of the train and test sets. It is typically expressed as a ratio between 0 and 1 representing the proportion of the data assigned to the train or test set. You must select a division ratio that satisfies the demands of your project, considering the following elements:

  • Training Set Representativeness
  • Computational Cost to Evaluate the Model
  • Test Set Representativeness
  • Computational Cost to Train the Model

The common split percentages consist of:

Train    Test
80%      20%
67%      33%
50%      50%

The Procedure of Train-Test Split in Scikit-Learn

Scikit-learn is a very popular Python machine learning library that provides the train_test_split() function for splitting data into training and testing sets. This function takes a dataset as input and returns it split into two sets.

Furthermore, the original dataset can first be divided into two parts, one for the inputs (X) and the other for the output (y). You can then call the function with both arrays, and it returns the training and testing subsets for each of them.

If you wish to specify the division size, you may use the "test_size" parameter. This parameter accepts either a number of rows (an integer) or a proportion (a float) between 0 and 1. The latter option is more frequently used, with typical values like 0.33: in this case, 33% of the dataset will be assigned to the test set, while the remaining 67% will go to the training set.

Below is an example using a synthetic dataset of one thousand items for a classification task, followed by an outline of the resulting sets.

The dataset is split into train and test sets when the example is run. Below, you can find the size of each new set: as mentioned previously, 330 examples (33%) are allocated to the test set, while 670 examples (67%) go to the training set.
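Since the listing itself is not shown here, the sketch below is reconstructed from the description and the output that follows; make_blobs and the 0.33 test size are taken from the explanation further down:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# 1,000 samples with two features each.
X, y = make_blobs(n_samples=1000)

# 33% of the samples go to the test set, 67% to the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)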

Output:

(670, 2) (330, 2) (670,) (330,)

Explanation

The above code creates a dataset for a machine learning task. It generates 1000 data points using the make_blobs() function, producing clusters of points. Then, it splits this dataset into training and testing sets using a 33% test size. Finally, it displays the sizes of each set, showing the number of samples and features in the training and testing data.

In addition, the train_size parameter allows partitioning the dataset from the other direction; it can be an integer number of rows or a proportion of the original dataset between 0 and 1, such that 0.67 assigns 67% of the data to the training set.

What is Repeatable Train Test Data?

In machine learning, a repeatable train-test split guarantees that the data used to train and test the model does not change between repeated executions of the split. This ensures you can reproduce your work and compare different models on exactly the same data.

How Does It Work?

  1. First, you need to partition the data. The split produces two subsets known as the training and evaluation sets. The training process fits the model, while the evaluation process measures the model's effectiveness on unseen data.
  2. The second ingredient is randomness. The split is usually performed randomly, which means the training and test sets would normally contain different rows each time the split is executed.
  3. Finally, you make it repeatable. To do this, the random element is fixed: the process uses a random number generator whose starting value, known as the seed, is specified explicitly. Supplying the same seed instructs the computer to use the same random number sequence every time it performs the split. As a result, no matter how many times you run the split, you will always get the same training and test sets.

You can achieve this by assigning a whole number to the random_state parameter. It is not a tunable hyperparameter, so any fixed value will suffice.

This concept is depicted in the following example, which demonstrates that the results of two identical splits of the same dataset are identical.
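A hedged sketch of such a repeatable split; the 100-sample make_blobs dataset, the 0.33 test size, and random_state=1 are taken from the explanation further down, and the exact numbers printed will differ from the output below depending on how the data is generated:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# 100 synthetic observations with two features each.
X, y = make_blobs(n_samples=100, random_state=1)

# Two identical splits: the same random_state yields the same rows every time.
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.33, random_state=1)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.33, random_state=1)

print(X_train1[:5])
print(X_train2[:5])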

Output:

[[ 9.51341402 -3.47126511]
 [-8.14017839 -3.59592016]
 [-8.65321875  4.04931456]
 [ 8.9036352  -2.71056751]
 [ 9.56697901 -3.65027215]]
[[ 9.51341402 -3.47126511]
 [-8.14017839 -3.59592016]
 [-8.65321875  4.04931456]
 [ 8.9036352  -2.71056751]
 [ 9.56697901 -3.65027215]]

Explanation

The Python code above generates synthetic data using the make_blobs function from the sklearn.datasets module, producing a dataset of 100 observations. It then splits the dataset into training and test sets using the train_test_split function from the sklearn.model_selection module, allocating 33% of the data for testing and fixing random_state=1 for reproducibility. The code prints the first five rows of the feature matrix of the training set. Because the train_test_split function is called twice with the same parameters, the same training rows are printed twice, demonstrating that the split is repeatable.

Stratified Train Test Split

A stratified division is a specific kind of train-test split used in machine learning to handle imbalanced datasets. Imbalance arises when the different categories in a dataset are unequally represented. For example, suppose we train a model to detect whether an email is spam: if 99% of the emails are spam, a typical train-test division would likely result in a model that excels at classifying spam emails but performs poorly at identifying non-spam emails.

A stratified train-test split works as follows:

  1. Sort the Data: Organising the dataset according to the class label rather than leaving it in a random order helps guarantee a fair and balanced representation of every category, so that all classes are accounted for when the data is divided.
  2. Fold the Data: After arranging it, the data is separated into different segments or folds. The number of folds is usually determined by your k-fold cross-validation setup.
  3. Split Train and Test Data: In each iteration, one fold serves as the test set while the remaining folds are used for training. This procedure repeats once for each fold.

To follow this approach with a simple split, you can pass the y component of the dataset to the stratify parameter. By doing so, the train_test_split() function will ensure that the distribution of examples from each class in the provided y array is maintained in both the training and testing subsets.

First, let's divide the dataset into training and test sets without using the stratify argument. Below, you can see a code example.
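A hedged sketch of this step, consistent with the class counts shown below; the weights argument, the 50:50 split, and random_state=1 are inferred from the output, and the exact per-split counts depend on the random seed:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 100 observations with a 94% / 6% class imbalance.
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))

# Plain 50:50 split without the stratify argument.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
print(Counter(y_train))
print(Counter(y_test))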

Output:

Counter({0: 94, 1: 6})
Counter({0: 45, 1: 5})
Counter({0: 49, 1: 1})

Explanation

The above program creates a dataset of 100 observations with an imbalanced class distribution: 94% in one class and 6% in the other. It divides the dataset into a training set and a test set and then counts the number of samples of each class in the original dataset, in the training set, and in the test set. The counts show that a plain random split can distribute the minority class unevenly across the two sets, illustrating how class imbalance needs to be managed in machine learning.

As the next step, we can repeat the split using the stratify argument and compare the class counts in the training and test sets.
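Repeating the split with stratify=y, under the same hedged setup as above:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same imbalanced dataset as before: 94% / 6% class distribution.
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))

# Stratified 50:50 split: class proportions are preserved in both subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1, stratify=y)
print(Counter(y_train))
print(Counter(y_test))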

Output:

Counter({0: 94, 1: 6})
Counter({0: 47, 1: 3})
Counter({0: 47, 1: 3})

Explanation

The program creates a synthetic dataset with a skewed class distribution (94%-6%) and uses the stratify argument to preserve this distribution when splitting the data. It creates the dataset and then divides it into training and test sets. The class frequencies before and after the split show that the class proportions are preserved in both subsets, which is a key factor for reliable model training and evaluation.

Now that we have familiarized ourselves with the train_test_split function, let's dive into assessing machine learning model performance.

Evaluate Machine Learning Models Using Train Test Split

In this segment, we will explore applying the train-test division technique to evaluate machine learning models on datasets for predictive modelling in regression and traditional classification tasks.

In this demonstration, we will evaluate a random forest algorithm on the sonar dataset by employing the train-test split technique. The sonar dataset, a widely used benchmark in machine learning, contains 208 data instances, each with 60 numeric input variables and a target variable with two distinct class values, making this a binary classification problem.

The objective of the dataset is to determine whether sonar echoes indicate the presence of a rock or a mine (a simulated explosive device).
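The loading step is not reproduced in this article; the sketch below is based on the explanation further down, and the URL points to a commonly used hosted copy of the sonar dataset (an assumption that may need adjusting):

import pandas as pd

# A commonly used hosted copy of the sonar dataset (assumption; adjust if needed).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataframe = pd.read_csv(url, header=None)
data = dataframe.values

# All columns except the last are inputs; the last column is the class label.
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)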

Output:

(208, 60) (208,)

Explanation

The above Python code makes use of the Pandas library to read a CSV file from the given URL. It loads the dataset into a DataFrame called dataframe without assuming the presence of column headers, since header=None is passed. It then extracts the values from the DataFrame into a NumPy array named data. Next, the code separates the features (X) from the target variable (y), where X consists of all rows and all columns except the last one, and y comprises the last column of data. Finally, it prints the shapes of the feature matrix X and the target array y.

First, let's assess a model using a train-test split. The imported data will be divided into input and output components.

Output:

(208, 60) (208,)

Code:
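A hedged sketch matching the explanation below (100 samples, two features, random 0/1 labels, 30% test size; the random_state value is an added assumption):

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 samples with two features and random 0/1 labels.
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)

# 70% of the data for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print("Training data:", X_train.shape, y_train.shape)
print("Testing data:", X_test.shape, y_test.shape)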

Output:

Training data: (70, 2) (70,)
Testing data: (30, 2) (30,)

Explanation

In the above code example, a synthetic dataset is generated using the NumPy library, consisting of 100 samples with two features each. The target variable y is created with random integers 0 or 1. Then, the dataset is split into training and testing sets using the train_test_split function from the scikit-learn library, with 30% of the data reserved for testing and 70% allocated for training. As the last step, the code prints the dimensions of the training and testing datasets.

Next, we split the sonar dataset, allocating 67% for training the model and 33% for evaluating it. The split is done randomly.
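Continuing from the sonar loading sketch above, a hedged snippet for this split (random_state=1 comes from the explanation of the complete code below):

from sklearn.model_selection import train_test_split

# 67% of the rows for training, 33% for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)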

Output:

(139, 60) (69, 60) (139,) (69,)

Next, we can define the model and fit it using the training dataset.

Afterwards, we generate predictions with the trained model and evaluate the results using the classification accuracy performance metric.

Complete Code

Run this code to get a complete result:
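Since the full listing is not reproduced here, the following is a hedged reconstruction assembled from the steps above and the explanation below; the dataset URL is an assumed hosted copy, and the exact accuracy may differ slightly from the value shown:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the sonar dataset (assumed hosted copy; adjust the URL if needed).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataframe = pd.read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

# 67:33 train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Fit a random forest classifier on the training data.
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Predict on the test set and report classification accuracy.
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print("Accuracy: %.3f" % acc)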

Output:

(208, 60) (208,)
(139, 60) (69, 60) (139,) (69,)
Accuracy: 0.783

Explanation

In the above code section, we use a random forest classifier to predict the target labels of a dataset retrieved from a URL. The data is read from the URL into a pandas DataFrame and converted into NumPy arrays, separating the features (X) from the target variable (y). The dimensions of X and y are printed to show the dataset's shape. Then, the dataset is split into training and testing sets using the train_test_split function, with 33% of the data reserved for testing and random_state=1 ensuring reproducibility. A random forest classifier is instantiated, trained on the training data, and then used to predict labels for the testing set. The accuracy of the model is computed with the accuracy_score function by comparing the predicted labels with the actual labels of the testing set, and the result is printed for evaluation.

Train Test Split for Regression

This section describes the process of evaluating a random forest algorithm on the housing dataset using the train-test split technique. The housing dataset includes 506 data entries, each containing 13 numerical input variables and a numerical target variable, and it serves as a standard machine learning benchmark. Its purpose is to forecast the price of houses based on various characteristics of each house and its surroundings.

After acquiring the dataset, we import it into a Pandas DataFrame. The example below prints a summary of its structure.
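The loading step itself is not shown in this article; the sketch below is a minimal reconstruction, and the URL points to a commonly used hosted copy of the housing dataset (an assumption that may need adjusting):

import pandas as pd

# A commonly used hosted copy of the housing dataset (assumption; adjust if needed).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
dataframe = pd.read_csv(url, header=None)
print(dataframe.shape)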

Output:

(506, 14)

We can now evaluate a model using a train-test split. Initially, the loaded dataset must be separated into input and output components.

Afterwards, we will split the dataset, allocating 67% of it to train the model and the remaining 33% to evaluate it. This partition was done randomly.

After that, we define the model below and fit it using the training dataset.

Now, we generate predictions using the trained model and evaluate the outcomes with the Mean Absolute Error (MAE) performance metric.

Run the Complete Code
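As before, the complete listing is not reproduced here; the following is a hedged reconstruction assembled from the steps above and the explanation below, with an assumed hosted copy of the dataset, and the exact MAE may differ slightly from the value shown:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load the housing dataset (assumed hosted copy; adjust the URL if needed).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
dataframe = pd.read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

# 67:33 train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Fit a random forest regressor on the training data.
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Predict house values on the test set and report the mean absolute error.
yhat = model.predict(X_test)
mae = mean_absolute_error(y_test, yhat)
print("MAE: %.3f" % mae)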

Output:

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)
MAE: 2.171

Explanation

In the above code, we create a random forest regressor to predict house values based on a dataset retrieved from the URL. The data from the given URL is read into a pandas DataFrame and converted to NumPy arrays. Then, the dataset is split into training and testing sets using the train_test_split function, with 33% of the data kept for testing and random_state=1 ensuring reproducibility. A random forest regressor is instantiated, trained on the training data, and used to predict house values for the test set. The mean absolute error (MAE) between the predicted and actual values is calculated using the mean_absolute_error() function.

Conclusion

Dividing data 80-20 between training and testing has gained wide popularity in machine learning and data analysis. This approach makes it easy to manage both training the model and evaluating its performance. Splitting the dataset decreases the risk of overfitting and enables a reliable evaluation of model effectiveness. The 80-20 split keeps a useful balance, providing efficient training and trustworthy estimates of predictive performance.