5x faster Scikit-Learn parameter tuning in 5 lines of code

Author: Michael Chau | Compiled by: VK | Source: Towards Data Science

Everyone knows Scikit-Learn: it is a staple of the data science toolkit, providing dozens of easy-to-use machine learning algorithms. It also provides two out-of-the-box techniques for hyperparameter tuning: grid search (GridSearchCV) and randomized search (RandomizedSearchCV).

Both techniques are powerful ways to find the right hyperparameter configuration, but the process is expensive and time-consuming.

What if you want to speed this process up?

In this blog post, we introduce tune-sklearn ( github.com/ray-project... ), a library that brings modern hyperparameter tuning techniques to the Scikit-Learn API while making these new algorithms easier to use.

Tune-sklearn is a drop-in replacement for Scikit-Learn's model selection module that uses advanced hyperparameter tuning techniques (Bayesian optimization, early stopping, distributed execution). These techniques provide significant speedups over grid search and random search!

Here are the features tune-sklearn provides:

  • Consistency with the Scikit-Learn API: tune-sklearn is a drop-in replacement for GridSearchCV and RandomizedSearchCV, so you only need to change fewer than 5 lines in a standard Scikit-Learn script to use the API.

  • Modern hyperparameter tuning techniques: tune-sklearn lets you easily leverage Bayesian optimization, HyperBand, and other optimization techniques by simply toggling a few parameters.

  • Framework support: tune-sklearn is primarily used for tuning Scikit-Learn models, but it also supports and provides examples for many other frameworks with Scikit-Learn wrappers, such as Skorch (PyTorch), KerasClassifiers (Keras), and XGBoostClassifiers (XGBoost); see the sketch after this list.

  • Distributed: tune-sklearn leverages Ray Tune, a library for distributed hyperparameter tuning, to efficiently and transparently parallelize cross-validation across multiple cores and even multiple machines.
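
For instance, here is a minimal, hypothetical sketch of the framework support mentioned above: tuning XGBoost's Scikit-Learn wrapper through the same interface. This assumes xgboost is installed; the XGBClassifier parameter grid below is illustrative and not from the original post.

# Hypothetical sketch: tuning XGBoost's sklearn wrapper with tune-sklearn.
# The parameter names and values below are illustrative assumptions.
from tune_sklearn import TuneGridSearchCV
from xgboost import XGBClassifier

xgb_params = {
    'max_depth': [3, 5],
    'learning_rate': [0.01, 0.1]
}
xgb_search = TuneGridSearchCV(XGBClassifier(), xgb_params)
# xgb_search.fit(X_train, y_train) then behaves just like GridSearchCV.fit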

Tune-sklearn is also fast. To see this, we benchmarked tune-sklearn (with early stopping enabled) against native Scikit-Learn on a standard hyperparameter sweep. In our benchmarks we see significant performance differences on both an ordinary laptop and a large workstation with 48 CPU cores.

On the larger 48-core benchmark machine, Scikit-Learn took 20 minutes to search 75 hyperparameter sets on a dataset of 40,000 examples. Tune-sklearn took only 3.5 minutes, while sacrificing minimal accuracy.

[Figures] Left: searching over 6 hyperparameter sets on a personal dual-core i5 laptop with 8 GB of RAM. Right: searching over 75 hyperparameter sets on a 48-core machine with 250 GB of RAM.

Note: for smaller datasets (10,000 or fewer data points), accuracy may be sacrificed when applying early stopping. We do not expect this to matter for users, since the library is designed to speed up large training tasks with large datasets.

A simple 60-second walkthrough

Run pip install tune-sklearn ray[tune] to get started with the sample code in the sections below.

Let's see how it works.

Hyperparameter set 2 is an unpromising set of hyperparameters; it would be detected by Tune's early stopping mechanisms and stopped early to avoid wasting training time and resources.

TuneGridSearchCV example

To start, just change the import statement to get Tune's grid search cross-validation interface:

# from sklearn.model_selection import GridSearchCV
from tune_sklearn import TuneGridSearchCV
 

From here, we proceed just as we would with Scikit-Learn's interface! Let's use a "dummy" custom classification dataset and an SGDClassifier to classify it.

We chose SGDClassifier because it has a partial_fit API, which lets it stop fitting to the data for a given hyperparameter configuration. If the estimator does not support early stopping, we fall back to a parallel grid search.
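
To make the role of partial_fit concrete, here is a minimal sketch of incremental fitting with SGDClassifier, independent of tune-sklearn. The toy data and the stopping check are illustrative assumptions, not part of the original post.

# Minimal sketch of incremental training via partial_fit.
# Because each call performs only one pass over the data, a scheduler
# can compare configurations between passes and abandon unpromising ones.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 5)
y_demo = rng.randint(0, 2, 200)

clf = SGDClassifier()
for step in range(10):
    clf.partial_fit(X_demo, y_demo, classes=np.unique(y_demo))
    score = clf.score(X_demo, y_demo)
    # an early-stopping rule could terminate this loop here
    # if `score` lags behind other hyperparameter configurations

With that context, let's set up the example: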

# Other imports
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Create the dataset and split into train/test sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, 
                           n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameter grid to tune for SGDClassifier
parameters = {
   'alpha': [1e-4, 1e-1, 1],
   'epsilon':[0.01, 0.1]
}
 

As you can see, the setup here is exactly what you would do for Scikit-Learn! Now, let's try fitting a model.

tune_search = TuneGridSearchCV(
    SGDClassifier(),
    parameters,
    early_stopping=True,
    max_iters=10
)
import time  # time the fit for comparison
start = time.time()
tune_search.fit(X_train, y_train)
end = time.time()
print("Tune Fit Time:", end - start)
pred = tune_search.predict(X_test)
accuracy = np.count_nonzero(np.array(pred) == np.array(y_test))/len(pred)
print("Tune Accuracy:", accuracy)
 

Note the slight differences we introduced above:

  1. a new early_stopping parameter, and

  2. a max_iters parameter

early_stopping determines when to stop training early. MedianStoppingRule is a great default, but see Tune's documentation on schedulers for a complete list of options: docs.ray.io/en/master/t...

max_iters is the maximum number of iterations a given hyperparameter set can run for; if the trial is stopped early, it will run fewer iterations.
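
As a hedged sketch of choosing a scheduler explicitly: this assumes tune-sklearn's early_stopping parameter also accepts a scheduler name as a string, per its documentation; check the docs link above for your installed version before relying on it.

# Sketch: selecting an early-stopping scheduler by name.
# The string form of early_stopping is an assumption based on
# tune-sklearn's documentation; verify against your version.
from tune_sklearn import TuneGridSearchCV
from sklearn.linear_model import SGDClassifier

tune_search_median = TuneGridSearchCV(
    SGDClassifier(),
    parameters,  # the same grid defined above
    early_stopping="MedianStoppingRule",  # instead of early_stopping=True
    max_iters=10
)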

Try comparing it with the GridSearchCV equivalent:

from sklearn.model_selection import GridSearchCV
# n_jobs=-1 enables the use of all cores, as Tune does
sklearn_search = GridSearchCV(
   SGDClassifier(),
   parameters,
   n_jobs=-1
)

start = time.time()
sklearn_search.fit(X_train, y_train)
end = time.time()
print("Sklearn Fit Time:", end - start)
pred = sklearn_search.predict(X_test)
accuracy = np.count_nonzero(np.array(pred) == np.array(y_test))/len(pred)
print("Sklearn Accuracy:", accuracy)
 

TuneSearchCV Bayesian optimization example

In addition to the grid search interface, tune-sklearn provides TuneSearchCV, an interface for sampling from distributions of hyperparameters.

Moreover, with just a few lines of code changes, you can easily enable Bayesian optimization over the distributions in TuneSearchCV.

Run pip install scikit-optimize to try the following example:

from tune_sklearn import TuneSearchCV

# Other imports
import scipy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Create the dataset and split into train/test sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, 
                           n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameter distributions to tune for SGDClassifier
# Note the use of tuples instead of lists for Bayesian optimization
param_dists = {
   'alpha': (1e-4, 1e-1),
   'epsilon': (1e-2, 1e-1)
}

tune_search = TuneSearchCV(SGDClassifier(),
   param_distributions=param_dists,
   n_iter=2,
   early_stopping=True,
   max_iters=10,
   search_optimization="bayesian"
)

tune_search.fit(X_train, y_train)
print(tune_search.best_params_) 
 

The tuple-valued parameter distributions and the search_optimization="bayesian" argument are the code changes needed to enable Bayesian optimization.
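
From here you can evaluate the tuned model. A brief sketch, assuming TuneSearchCV exposes best_estimator_ the way Scikit-Learn's search classes do, in keeping with its drop-in compatibility:

# Evaluate the best configuration found by the search above.
# best_estimator_ mirrors Scikit-Learn's search API (an assumption
# based on tune-sklearn's drop-in compatibility).
best_model = tune_search.best_estimator_
print("Test accuracy:", best_model.score(X_test, y_test))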

As you can see, it is very simple to integrate tune-sklearn into existing code. You can take a look at more detailed examples here: github.com/ray-project...

Also, take a look at Ray's replacement for joblib, which lets users parallelize training across multiple nodes (not just one), further speeding up training.
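
A brief sketch of that joblib integration, based on the Ray joblib backend described in Ray's documentation (ray.util.joblib); treat the exact import path as version-dependent.

# Sketch: running a Scikit-Learn search through Ray via joblib.
# register_ray / parallel_backend("ray") follow Ray's documented joblib
# integration; verify against the Ray version you have installed.
import joblib
from ray.util.joblib import register_ray

register_ray()
with joblib.parallel_backend("ray"):
    sklearn_search.fit(X_train, y_train)  # the GridSearchCV from above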

Documentation and examples

Note: importing from ray.tune as shown in the linked documentation is currently only available in the nightly Ray wheels, and will be available via pip soon.

Original link: towardsdatascience.com/5x-faster-s...
