Jump-Start Your Supervised Learning Task With This Amazing Library

Supervised machine learning made more accessible for beginners

Bharti Prasad
3 min readApr 23, 2021
Photo by Lukas from Pexels

A better understanding of the data and task, as well as more appropriate features, most often lead to significant performance improvements.

dabl has several tools that make it easy to clean and inspect your data and create strong baseline models. This library tries to help make supervised machine learning more accessible for beginners, and reduce boilerplate for common tasks. It accomplishes this by automating the process of iterating through various data preprocessing, feature engineering, parameter tuning, and model building techniques in order to generate effective baseline models.

There are two main packages that dabl takes inspiration from and that dabl builds upon- scikit-learn and auto-sklearn.

!pip install dabl
import dabl

Data cleaning

import pandas as pd
data = pd.read_csv(dabl.datasets.data_path("adult.csv.gz"))
data_clean = dabl.clean(data)

Data cleaning is the first step in any data analysis. dabl will try to detect the types of your data and apply appropriate conversions. Its goal is to get the data “clean enough” to create useful visualizations and models and to allow us to perform custom cleaning operations.

data_clean = dabl.clean(data, type_hints={"capital-gain": "continuous"})types = dabl.detect_types(data_clean)
print(types)
type output

Exploratory Data analysis

Next comes exploratory data analysis. dabl provides a high-level interface that summarizes several common high-level plots. For low dimensional datasets, all features are shown; for high dimensional datasets, only the most informative features for the given task are shown. However, dabl does not guarantee to provide all the interesting aspects of the data. Overall, dabl will provide you with quick insight into what are the important features, their interactions, and how hard the problem might be.

With dabl.plot(), we can create plot of the features deemed most important for our task.

dabl.plot(data, target_col="income")
part 1 of the whole EDA
part 2 of the whole EDA
part 3 of the whole EDA
part 4 of the whole EDA
part 5 of the whole EDA

Initial Model Building

Finally, we can build an initial model for our data. The SimpleClassifier does all the work for us. It implements the familiar scikit-learn API of fit and predict.

model = dabl.SimpleClassifier(random_state=0).fit(data, target_col="income")

The SimpleClassifier first tries several baseline and instantaneous models, potentially on subsampled data, to get an idea of what a low baseline should be. Preprocessing operations such as missing value imputation and one-hot encoding are also performed by the SimpleClassifier.

We can inspect the model using:

dabl.explain(model)
output of model explanation

This can lead to additional insights and help guide custom data processing and cleaning.

Jump-start with your ML journey today with dabl.

Read the documentation for more information about the full list of APIs, limitations, and future works.

--

--

Bharti Prasad

Data Science Research Intern at Indian School of Business | Kaggle Expert | IIITDM-Jabalpur’22