User guide: create your own scikit-learn estimator


The central piece of every transformer, regressor, and classifier is sklearn.base.BaseEstimator. All estimators in scikit-learn derive from this class. More specifically, this base class provides the get_params and set_params methods, which get and set the parameters of the estimator. It can be imported as:

>>> from sklearn.base import BaseEstimator

Once imported, you can create a class which inherits from this base class:

>>> class MyOwnEstimator(BaseEstimator):
...     pass
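
To make the parameter handling concrete, here is a minimal sketch. The class name ScaledEstimator and its scale and shift parameters are hypothetical and only serve as an illustration. It assumes the usual scikit-learn convention that every constructor argument is stored, unchanged, under an attribute of the same name, which is what get_params and set_params rely on:

```python
from sklearn.base import BaseEstimator


class ScaledEstimator(BaseEstimator):
    """Hypothetical estimator used only to illustrate parameter handling."""

    def __init__(self, scale=1.0, shift=0.0):
        # Store each constructor argument under the same attribute name,
        # without validating or modifying it.
        self.scale = scale
        self.shift = shift


est = ScaledEstimator(scale=2.0)
params = est.get_params()    # dictionary mapping parameter names to values
est.set_params(shift=-1.0)   # updates the attribute and returns the estimator
```

No get_params or set_params method is written in the class itself: both are inherited from BaseEstimator, which discovers the parameter names from the signature of __init__.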


Transformers are scikit-learn estimators which implement a transform method. The use case is the following:

  • at fit, some parameters can be learned from X and y;

  • at transform, X will be transformed, using the parameters learned during fit.

In addition, scikit-learn provides a mixin, sklearn.base.TransformerMixin, which implements fit_transform, the combination of fit and transform. One can import the mixin class as:

>>> from sklearn.base import TransformerMixin

Therefore, when creating a transformer, you need to create a class which inherits from both sklearn.base.BaseEstimator and sklearn.base.TransformerMixin. The scikit-learn API requires fit to return ``self``. This allows fit and transform to be chained, which is what the fit_transform method provided by sklearn.base.TransformerMixin does. The fit method is expected to take X and y as inputs. Note that transform takes only X as input and is expected to return the transformed version of X:

>>> class MyOwnTransformer(BaseEstimator, TransformerMixin):
...     def fit(self, X, y=None):
...         return self
...     def transform(self, X):
...         return X

We build a basic example to show that our MyOwnTransformer is working within a scikit-learn pipeline:

>>> from sklearn.datasets import load_iris
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> pipe = make_pipeline(MyOwnTransformer(),
...                      LogisticRegression(random_state=10,
...                                         solver='lbfgs'))
>>> pipe.fit(X, y)
>>> pipe.predict(X)
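
MyOwnTransformer returns X unchanged, so it does not show what "learning parameters at fit" looks like. The sketch below is one possible illustration, not part of the original guide: MeanCenterer is a hypothetical transformer that learns the per-column mean during fit and subtracts it during transform. The attribute name mean_ follows the scikit-learn convention of a trailing underscore for learned state:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that subtracts the per-column mean."""

    def fit(self, X, y=None):
        # Learn the column means from the training data.
        self.mean_ = np.mean(X, axis=0)
        # Returning self allows calls such as fit(X).transform(X).
        return self

    def transform(self, X):
        # Apply the statistics learned during fit.
        return np.asarray(X) - self.mean_


X_demo = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = MeanCenterer()
# fit_transform is inherited from TransformerMixin.
X_centered = centerer.fit_transform(X_demo)
```

Here fit_transform gives the same result as calling fit followed by transform on the same data.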



Similarly, regressors are scikit-learn estimators which implement a predict method. The use case is the following:

  • at fit, some parameters can be learned from X and y;

  • at predict, predictions will be computed from X, using the parameters learned during fit.

In addition, scikit-learn provides a mixin, sklearn.base.RegressorMixin, which implements a score method computing the R^2 score of the predictions.

One can import the mixin as:

>>> from sklearn.base import RegressorMixin

Therefore, we create a regressor, MyOwnRegressor, which inherits from both sklearn.base.BaseEstimator and sklearn.base.RegressorMixin. The fit method takes X and y as input and should return self. The class should also implement the predict method, which outputs the predictions of your regressor:

>>> import numpy as np
>>> class MyOwnRegressor(BaseEstimator, RegressorMixin):
...     def fit(self, X, y):
...         return self
...     def predict(self, X):
...         return np.mean(X, axis=1)

We illustrate that this regressor is working within a scikit-learn pipeline:

>>> from sklearn.datasets import load_diabetes
>>> X, y = load_diabetes(return_X_y=True)
>>> pipe = make_pipeline(MyOwnTransformer(), MyOwnRegressor())
>>> pipe.fit(X, y)
>>> pipe.predict(X)

Since MyOwnRegressor inherits from sklearn.base.RegressorMixin, we can call the score method, which returns the R^2 score:

>>> pipe.score(X, y)  
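
MyOwnRegressor ignores y, so a regressor that actually stores something at fit may be a useful complement. The following sketch is an assumption-laden illustration rather than a recommended model: MeanRegressor is a hypothetical class that learns the mean of the training targets and predicts it for every sample. On the data it was fitted on, such a constant predictor gets an R^2 score of zero, up to floating-point error:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical regressor that always predicts the training mean."""

    def fit(self, X, y):
        # The only learned parameter is the mean of the targets.
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        # One constant prediction per sample in X.
        return np.full(np.asarray(X).shape[0], self.mean_)


X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 2.0, 3.0, 6.0])
reg = MeanRegressor().fit(X_demo, y_demo)
y_pred = reg.predict(X_demo)
# score is inherited from RegressorMixin and computes the R^2 score.
r2 = reg.score(X_demo, y_demo)
```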


Similarly to regressors, classifiers implement predict. In addition, they output the probabilities of the prediction using the predict_proba method:

  • at fit, some parameters can be learned from X and y;

  • at predict, predictions will be computed from X, using the parameters learned during fit. The output corresponds to the predicted class for each sample;

  • at predict_proba, a 2D array is returned in which each row corresponds to a sample, each column corresponds to a class, and each entry is the predicted probability of that class for that sample.

In addition, scikit-learn provides a mixin, sklearn.base.ClassifierMixin, which implements a score method computing the accuracy score of the predictions.

One can import this mixin as:

>>> from sklearn.base import ClassifierMixin

Therefore, we create a classifier, MyOwnClassifier, which inherits from both sklearn.base.BaseEstimator and sklearn.base.ClassifierMixin. The fit method takes X and y as input and should return self. The class should also implement the predict method, which outputs the class inferred by the classifier. The predict_proba method outputs class probabilities instead:

>>> class MyOwnClassifier(BaseEstimator, ClassifierMixin):
...     def fit(self, X, y):
...         self.classes_ = np.unique(y)
...         return self
...     def predict(self, X):
...         # Draw a random class index per sample, then map it to a label.
...         idx = np.random.randint(0, self.classes_.size,
...                                 size=X.shape[0])
...         return self.classes_[idx]
...     def predict_proba(self, X):
...         pred = np.random.rand(X.shape[0], self.classes_.size)
...         return pred / np.sum(pred, axis=1)[:, np.newaxis]

We illustrate that this classifier is working within a scikit-learn pipeline:

>>> X, y = load_iris(return_X_y=True)
>>> pipe = make_pipeline(MyOwnTransformer(), MyOwnClassifier())
>>> pipe.fit(X, y)

Then, you can call predict and predict_proba:

>>> pipe.predict(X)  
>>> pipe.predict_proba(X)  

Since our classifier inherits from sklearn.base.ClassifierMixin, we can compute the accuracy by calling the score method:

>>> pipe.score(X, y)  
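
Because MyOwnClassifier predicts at random, its outputs change from run to run. As a deterministic complement, the sketch below uses a hypothetical PriorClassifier (the class name and the class_prior_ attribute are illustrative assumptions). It learns the class frequencies at fit, returns them from predict_proba for every sample, and predicts the most frequent class:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class PriorClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical classifier that only uses the class frequencies."""

    def fit(self, X, y):
        # Store the sorted class labels and their relative frequencies.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.class_prior_ = counts / counts.sum()
        return self

    def predict_proba(self, X):
        # One row per sample, one column per class in self.classes_.
        n_samples = np.asarray(X).shape[0]
        return np.tile(self.class_prior_, (n_samples, 1))

    def predict(self, X):
        # Map the most probable column back to its class label.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


X_demo = np.zeros((4, 2))
y_demo = np.array(["a", "b", "b", "b"])
clf = PriorClassifier().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)
labels = clf.predict(X_demo)
# score is inherited from ClassifierMixin and computes the accuracy.
accuracy = clf.score(X_demo, y_demo)
```

Indexing self.classes_ with the argmax keeps predict consistent with predict_proba, and it returns the original labels even when they are strings rather than integers starting at zero.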