Cross Validation and Its Types

sri hari
Published in Nerd For Tech
3 min read · May 14, 2021


In this blog, let's look at cross-validation and its types.

Suppose we evaluate an ML model and get a certain accuracy, say 95%. We report that figure to our manager, but the same model may give 93% when run in front of the client. Since we cannot pin a single accuracy number on a model from one split, we use cross-validation.

Whenever we do a train-test split, we set a random_state value. When the random_state changes, the accuracy also changes.
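A quick sketch of this effect, assuming scikit-learn's toy Iris dataset and a logistic regression model (both illustrative choices, not from the post):

```python
# Sketch: the train/test split's random_state changes the reported accuracy.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in [0, 1, 42]:
    # Same data, same model — only the split's random_state differs.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

print(accuracies)  # the three accuracies generally differ
```

Each seed shuffles the data differently before splitting, so the test set (and hence the accuracy) changes even though nothing about the model changed.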

Cross-validation is a resampling technique for evaluating ML models: it builds multiple models on different subsets of the data. At the same time, cross-validation helps prevent overfitting.

Types of cross-validation:

Leave-one-out cross validation (LOOCV):


As the name indicates, in this type of cross-validation we take a single observation as the test set and the remaining observations as the training set, repeating this once per observation.

Disadvantages:

  1. If the dataset has 1,000 observations, we need to run 1,000 experiments, which takes a lot of computing time and space.
  2. The accuracy estimate has low bias but high variance, since each test set contains only a single observation.

“Hardly anyone uses this type of CV in practice, but it can still be useful for small datasets”
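A minimal sketch of LOOCV with scikit-learn's LeaveOneOut, assuming the toy Iris dataset and a logistic regression classifier for illustration:

```python
# Sketch: leave-one-out CV — one model per observation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 observations

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

# 150 observations means 150 trained models — this is the cost of LOOCV.
print(len(scores), scores.mean())
```

Even on this tiny dataset, 150 separate models are fit, which shows why LOOCV becomes expensive quickly.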

K-fold cross validation:


In this type of cross-validation, we split the dataset into k folds (parts/sections). In the first iteration, one fold is used as the test set and the other k-1 folds as the training set.

In the second iteration, a fold that was part of the training set in the first iteration becomes the test set, and the remaining k-1 folds form the training set. Since we have k folds, we perform k iterations (experiments), each producing its own accuracy. So we end up with k accuracies, from which we can report the maximum, minimum, and average.
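The procedure above can be sketched with scikit-learn's KFold (the Iris dataset and logistic regression model are illustrative assumptions):

```python
# Sketch: 5-fold CV — k accuracies, then min/max/average.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# One accuracy per fold; summarize across the k experiments.
print(scores.min(), scores.max(), scores.mean())
```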

Disadvantages:

If one class dominates the training data of some experiment, that particular model may become biased. This is where stratified cross-validation comes in.

“Most commonly used cross validation”

Stratified cross validation:

You may wonder why I uploaded the same picture with a different heading for stratified and k-fold cross-validation, but the two types work in a similar way. The only difference is that whenever a train-test split is made, the class proportions are preserved in both the training and test sets, so the split is balanced. Thus the disadvantage of plain k-fold is solved by stratified cross-validation.
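A small sketch of the stratification, assuming an imbalanced toy label array for illustration:

```python
# Sketch: every stratified test fold keeps the dataset's class proportions.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # imbalanced: 80% class 0, 20% class 1

skf = StratifiedKFold(n_splits=5)
fold_counts = [np.bincount(y[test_idx]) for _, test_idx in skf.split(X, y)]

# Each of the 5 test folds has 20 samples in the same 80/20 proportion.
print(fold_counts)  # five folds of [16, 4]
```

A plain KFold on this ordered data would give some folds containing only class 0, which is exactly the bias problem stratification fixes.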

Blocked Time series cross validation:


From the above image, it is clear that we fix a number n of time steps to consider (in the image above, n = 5). Using those n time-series points, we predict the (n+1)-th point, then move on to the next block. This is how blocked cross-validation works.

Pros:

  1. More splits.
  2. Prevents data leakage into the model.

Cons:

  1. May be very computationally expensive.
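scikit-learn has no built-in blocked time-series splitter, so here is a hypothetical generator (written purely for illustration) sketching the idea of fixed, non-overlapping windows of n points, each predicting the next point:

```python
# Sketch: blocked time-series splits — fixed window of n points predicts
# the next point; blocks do not overlap, so no information leaks between them.
def blocked_splits(n_samples, n=5):
    """Yield (train_indices, test_index) pairs for non-overlapping blocks."""
    start = 0
    while start + n < n_samples:
        train = list(range(start, start + n))
        test = start + n          # predict the point right after the window
        yield train, test
        start += n + 1            # jump past this block entirely (no overlap)

for train, test in blocked_splits(16, n=5):
    print(train, "->", test)
# [0, 1, 2, 3, 4] -> 5
# [6, 7, 8, 9, 10] -> 11
```

Because each block starts after the previous block's test point, no observation ever appears in two experiments.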

Split time series cross validation:

At each iteration, divide the dataset into two folds such that the test set always comes after the training set in time. At each iteration, our training data grows.

Pros:

  1. More splits.
  2. We can inspect how the model fares on different days.

Cons:

  1. Data leakage from the future is possible.
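This expanding-window scheme matches scikit-learn's TimeSeriesSplit; a minimal sketch on toy data:

```python
# Sketch: expanding-window splits — test folds are always later in time,
# and the training set grows at each iteration.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print(train_idx, "->", test_idx)
# [0 1 2] -> [3 4 5]
# [0 1 2 3 4 5] -> [6 7 8]
# [0 1 2 3 4 5 6 7 8] -> [9 10 11]
```

Note how each training set contains all previous data, which is what "at each iteration our training data grows" refers to.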

Hope this blog is useful. Thank you!



sri hari is a student at Coimbatore Institute of Technology and an R&D engineer trainee.