Getting Started
Attention
This is a work-in-progress that shows how the pre-alpha version of Algoneer works. Please let us know if you should encounter any problems.
Algoneer provides black-box tests that work on Model objects. A Model is an Algorithm that has been trained with a specific DataSet.
To get started we need to install Algoneer.
Note
You need Python version >=3.6 to run Algoneer.
pip3 install algoneer
Algoneer aims to be technology-agnostic and provides wrappers for the most popular data processing and machine learning libraries. In this tutorial, we are going to use Algoneer in conjunction with pandas and scikit-learn. In case you have not installed them already, you can do so by running:
pip3 install pandas scikit-learn
Algoneer also provides a separate package with several example datasets that make it easy to get started. We can also install them using pip:
pip3 install algoneer_datasets
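Before continuing, it can help to confirm that all four packages are importable. The following is a minimal sketch that uses only the standard library; the helper name `missing_modules` is made up for this example, and the module names are the ones imported later in this tutorial (note that scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names used in this tutorial. scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["algoneer", "algoneer_datasets", "pandas", "sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules, install them with pip3:", ", ".join(missing))
else:
    print("All required modules are available.")
```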
That’s it, we’re good to go! Let’s start using Algoneer by loading an example dataset and running a test on it.
Note
You can find the whole example code on GitHub.
from algoneer.dataschema import DataSchema, AttributeSchema as AS
# we define the data schema for the bike dataset, which helps Algoneer to automatically
# run tests on the dataset and any models derived from it
class BikeSchema(DataSchema):
    # these are the regressors (independent variables), which have the "x" role
    instant = AS(type=AS.Type.Integer, roles=["x"])
    season = AS(type=AS.Type.Categorical, roles=["x"])
    yr = AS(type=AS.Type.Integer, roles=["x"])
    mnth = AS(type=AS.Type.Integer, roles=["x"])
    holiday = AS(type=AS.Type.Boolean, roles=["x"])
    weekday = AS(type=AS.Type.Integer, roles=["x"])
    workingday = AS(type=AS.Type.Boolean, roles=["x"])
    weathersit = AS(type=AS.Type.Categorical, roles=["x"])
    temp = AS(type=AS.Type.Numerical, roles=["x"])
    atemp = AS(type=AS.Type.Numerical, roles=["x"])
    hum = AS(type=AS.Type.Numerical, roles=["x"])
    windspeed = AS(type=AS.Type.Numerical, roles=["x"])
    # this is the regressand (dependent variable), which has the "y" role
    cnt = AS(type=AS.Type.Integer, roles=["y"])
from algoneer_datasets.bike_sharing import path
from algoneer.dataset.pandas import PandasDataset
# we read the CSV data into a pandas dataframe
import pandas as pd
df = pd.read_csv(path+'/data.csv.gz')
# we wrap the dataframe with an Algoneer dataset using the bike schema
ds = PandasDataset(BikeSchema(), df)
This creates a PandasDataset that contains the bike sharing data. This dataset is just a thin wrapper around a pandas dataframe and adds functionality that is helpful when using the dataset for testing. Notably, it includes a DataSchema that contains information about all attributes in the dataset.
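To illustrate why the roles in the schema are useful, here is a minimal sketch of how role information can be used to split a dataset into inputs and a target. It uses plain dictionaries rather than Algoneer classes; the `ROLES` mapping, the `columns_with_role` helper and the two data rows are made up for this example and do not reflect Algoneer's internal implementation.

```python
# Hypothetical plain-dict stand-in for a few of the roles declared in BikeSchema.
# This is an illustration only, not Algoneer's internal representation.
ROLES = {
    "temp": ["x"],
    "hum": ["x"],
    "windspeed": ["x"],
    "cnt": ["y"],
}


def columns_with_role(roles, role):
    """Return the attribute names that carry the given role, in declaration order."""
    return [name for name, attribute_roles in roles.items() if role in attribute_roles]


# Two made-up rows with the same column names as the schema excerpt above.
rows = [
    {"temp": 0.34, "hum": 0.80, "windspeed": 0.16, "cnt": 985},
    {"temp": 0.36, "hum": 0.69, "windspeed": 0.25, "cnt": 801},
]

x_cols = columns_with_role(ROLES, "x")
y_cols = columns_with_role(ROLES, "y")

# Inputs are the "x" columns of each row, the target is the single "y" column.
X = [[row[col] for col in x_cols] for row in rows]
y = [row[y_cols[0]] for row in rows]
```

With the roles declared once in the schema, a test or an algorithm wrapper does not need to be told separately which columns are inputs and which is the target.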
Now, to test a machine learning model with Algoneer we first need to train one. To do this, we wrap a scikit-learn algorithm and train it with our dataset:
from sklearn.ensemble import RandomForestRegressor
from algoneer.algorithm.sklearn import SklearnAlgorithm
# we wrap the random forest regressor using the SklearnAlgorithm class
algo = SklearnAlgorithm(RandomForestRegressor, n_estimators=100)
# we produce a model by training the algorithm with a dataset
model = algo.fit(ds)
Again, the Algorithm class is just a thin wrapper around existing algorithms, in this case a scikit-learn random forest regressor.
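The distinction between an algorithm (an untrained recipe) and a model (the result of training it with data) can be sketched in a few lines. The classes `MeanRegressor` and `WrappedAlgorithm` below are hypothetical and only illustrate the wrapper pattern; they are not Algoneer's actual classes, and the real wrapper takes an Algoneer dataset rather than raw lists.

```python
class MeanRegressor:
    """Toy estimator with a scikit-learn-style fit/predict interface."""

    def __init__(self, shift=0.0):
        self.shift = shift

    def fit(self, X, y):
        # "Training" just memorizes the mean of the targets plus a fixed shift.
        self.mean_ = sum(y) / len(y) + self.shift
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class WrappedAlgorithm:
    """Hypothetical thin wrapper: stores an estimator class and its keyword arguments."""

    def __init__(self, estimator_class, **kwargs):
        self.estimator_class = estimator_class
        self.kwargs = kwargs

    def fit(self, X, y):
        # A fresh estimator is created for every call, so one algorithm
        # can produce several independent models from different datasets.
        estimator = self.estimator_class(**self.kwargs)
        return estimator.fit(X, y)


algo = WrappedAlgorithm(MeanRegressor, shift=1.0)
model = algo.fit([[0.1], [0.2], [0.3]], [10, 20, 30])
predictions = model.predict([[0.5], [0.9]])
```

Passing the estimator class together with its keyword arguments, as in the `SklearnAlgorithm(RandomForestRegressor, n_estimators=100)` call above, keeps the untrained configuration separate from any trained model.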
Now that we have trained our model, we can run a simple black box test on it:
from algoneer.methods.blackbox.shap import SHAP
shap = SHAP()
SHAP (SHapley Additive exPlanations) is a test that quantifies the contribution that each attribute makes to the predictions of a machine learning model. You can read more about the test here.
Let’s run it on our model:
result = shap.run(model, ds, max_datapoints=100)
Here, max_datapoints specifies the number of datapoints that we use to average the effect of the attribute.
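To give an intuition for what is being averaged, the following sketch computes exact Shapley values for a tiny hand-written model and then averages their absolute values over a subsample of the data. It is an illustration under stated assumptions, not Algoneer's implementation: practical SHAP implementations use approximations instead of enumerating all attribute subsets, the helper names are made up, and treating `max_datapoints` as the size of a random subsample is an assumption based on the description above.

```python
import itertools
import math
import random


def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x, relative to the background rows."""
    n = len(x)

    def value(subset):
        # Average prediction when the attributes in `subset` are fixed to x
        # and all other attributes take their values from a background row.
        total = 0.0
        for b in background:
            z = [x[i] if i in subset else b[i] for i in range(n)]
            total += predict(z)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_shap(predict, rows, max_datapoints, seed=0):
    """Average absolute Shapley value per attribute over at most max_datapoints rows."""
    rng = random.Random(seed)
    sample = rows if len(rows) <= max_datapoints else rng.sample(rows, max_datapoints)
    totals = [0.0] * len(rows[0])
    for x in sample:
        for i, v in enumerate(shapley_values(predict, x, rows)):
            totals[i] += abs(v)
    return [t / len(sample) for t in totals]


def predict(z):
    # Toy linear model standing in for a trained regressor.
    return 3.0 * z[0] + 0.5 * z[1]


rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
phi = shapley_values(predict, [2.0, 4.0], rows)
importance = mean_abs_shap(predict, rows, max_datapoints=2)
```

For each datapoint, the Shapley values sum to the difference between the model's prediction for that datapoint and the average prediction over the background rows. Using fewer datapoints makes the computation cheaper at the cost of a noisier average.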