Meta-modelling API

Meta-modelling API concepts are required for a modular implementation of higher-order modelling features such as hyper-parameter optimization or ensemble methods. Following the principle of composition, the meta-algorithms are implemented as modular compositions of simpler algorithms. Technically, the meta-strategies are realised by meta-estimators, i.e. estimator-like objects that perform certain methods with given estimators. They hence take estimator-type objects as some of their initializing inputs and, when initialized, exhibit the fit-predict logic that implements the meta-algorithm on the wrapped estimators.
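
To illustrate the pattern, here is a minimal sketch of such a meta-estimator; the ExampleMetaEstimator class is hypothetical and only shows the delegation structure, it is not part of skpro or scikit-learn:

from sklearn.base import BaseEstimator


class ExampleMetaEstimator(BaseEstimator):
    # Hypothetical meta-estimator: takes an estimator-type object
    # as an initializing input ...
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # ... and implements its meta-strategy by delegating to
        # (or modifying) the wrapped estimator's fit-predict logic
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)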

Hyperparameter optimization

The optimization of model hyperparameters can, for instance, be implemented using scikit's grid or random search meta-estimators:

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import GridSearchCV

from skpro.parametric import ParametricEstimator
from skpro.parametric.estimators import Constant

# Parametric model: point predictions from a random forest,
# a constant standard deviation estimated from the training labels
model = ParametricEstimator(
    point=RandomForestRegressor(),
    std=Constant('mean(y)')
)

# Initialize the GridSearchCV meta-estimator
parameters = {'point__max_depth': [None, 5, 10, 15]}
clf = GridSearchCV(model, parameters)

# Optimize hyperparameters
X, y = load_boston(return_X_y=True)
clf.fit(X, y)

print('Best score is %f for parameter: %s' % (clf.best_score_, clf.best_params_))
# >>> Best score is -4.058729 for parameter: {'point__max_depth': 15}
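
Random search works analogously; a minimal sketch, reusing the model, parameter grid, X and y from above, with scikit's RandomizedSearchCV sampling the candidates:

from sklearn.model_selection import RandomizedSearchCV

# Sample three parameter candidates at random
# instead of searching the full grid
random_search = RandomizedSearchCV(model, parameters, n_iter=3)
random_search.fit(X, y)

print(random_search.best_params_)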

Read the scikit documentation for more information.

Pipelines

Probabilistic estimators work well with scikit-learn's Pipeline meta-estimator, which allows multiple estimators to be combined into one, as in the sketch below. Read the pipeline documentation to learn more.
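
A minimal sketch, reusing X and y from the grid search example above (the choice of scaler and model is illustrative, not prescriptive):

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from skpro.parametric import ParametricEstimator
from skpro.parametric.estimators import Constant

# Chain a preprocessing step and a probabilistic estimator;
# the pipeline then behaves like a single probabilistic estimator
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('model', ParametricEstimator(
        point=RandomForestRegressor(),
        std=Constant('mean(y)')
    ))
])

y_pred = pipeline.fit(X, y).predict(X)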

Ensemble methods

The framework provides experimental support for ensemble methods. Currently, this includes bagging in a regression setting, implemented by the BaggingRegressor estimator in the ensemble module. The meta-estimator fits base regressors (i.e. probabilistic estimators) on random subsets of the original dataset and then aggregates their individual predictions in a distribution interface to form a final prediction. The implementation is based on scikit's meta-estimator of the same name but adds support for the probabilistic setting.

The following example demonstrates the use of the bagging procedure:

from sklearn.tree import DecisionTreeRegressor

from skpro.ensemble import BaggingRegressor as SkproBaggingRegressor
from skpro.metrics import log_loss as loss
from skpro.parametric import ParametricEstimator
from skpro.workflow.manager import DataManager


def prediction(model, data):
    # Fit the model on the training split and predict the test split
    return model.fit(data.X_train, data.y_train).predict(data.X_test)


data = DataManager('boston')
clf = DecisionTreeRegressor()

# Unbagged baseline: a single probabilistic estimator
baseline_prediction = prediction(
    ParametricEstimator(point=clf),
    data
)

# Bagged variant: ten base estimators fitted on random subsets,
# aggregated into a single predicted distribution
skpro_bagging_prediction = prediction(
    SkproBaggingRegressor(
        ParametricEstimator(point=clf),
        n_estimators=10,
        n_jobs=-1
    ),
    data
)

# Compare the log loss of both predictions
l1 = loss(data.y_test, baseline_prediction)
l2 = loss(data.y_test, skpro_bagging_prediction)

print('Baseline: ', l1)
print('Bagged model:', l2)

To learn more, you may also read scikit’s documentation of the BaggingRegressor.