Ensemble Methods

So far we’ve looked at a number of different machine learning algorithms. The one thing all these algorithms have in common is that they are all strong learners. A strong learner is a model that can reach an arbitrarily high level of accuracy given enough resources and data. In contrast, a weak learner is a model that is only slightly better than random guessing. For example, a weak learner might classify correctly 55% of the time, while a strong learner might classify correctly 95% of the time. The basic idea behind ensemble methods is to combine multiple weak learners into a single strong learner.

This paradigm is akin to asking a crowd of people to guess the number of jelly beans in a jar, then taking the average as our final answer. The hope is that the average of the guesses will be more accurate than any single guess. I first saw this in a video, but was unable to find it again.

This section assumes the reader is familiar with decision trees as most ensemble methods are based on decision trees. If you are unfamiliar with decision trees, I recommend reading the Decision Trees section.

Bootstrap Resampling

Before we discuss any ensemble methods, we need a way to train several independent models. One way to do this is bootstrap resampling. The idea behind bootstrap resampling is to take a random sample of the data with replacement, then train a model on that sample. We can repeat this process multiple times to get multiple models.

The following is a toy example of a single bootstrap resampling process:
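For concreteness, here is a minimal numpy sketch of one bootstrap resample (the data values here are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

# hypothetical dataset of 10 observations
data = np.array([2.3, 1.7, 4.1, 3.3, 0.9, 5.2, 2.8, 3.9, 1.1, 4.6])

# one bootstrap resample: draw n observations *with replacement*
resample = rng.choice(data, size=len(data), replace=True)
print(resample)

# fraction of the distinct original observations that made it in
print(np.isin(data, resample).mean())  # ~0.63 on average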

Note that each resample only keeps about \(1-e^{-1}\approx 63.21\%\) of the distinct observations (in expectation, for large datasets). However, the power of the bootstrap comes from doing several resamples. With \(k\) resamples, we expect to keep \((1-e^{-k})\times 100\%\) of the data; just \(k=5\) keeps \(99.33\%\), so as long as we use enough models, this shouldn’t be an issue. However, it does show that stacking several layers of bootstrapping is risky, as each layer keeps an even smaller fraction of the data.

Weighted Bootstrap

In some cases, we may want to give more weight to some of the observations. For this, we can augment our data with sampling weights; each observation is then drawn with probability proportional to its weight, so higher-weight observations appear more often in the resample. This is called a weighted bootstrap.

The following is a toy example of a single weighted bootstrap resampling process. Observe how the observations with higher weights are more likely to be included in the resample.
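A minimal numpy sketch of a single weighted resample (the weights here are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

data = np.arange(10)  # hypothetical observations 0..9
weights = np.array([5., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# sampling probabilities proportional to the weights
p = weights / weights.sum()

# weighted bootstrap: heavier observations show up more often
resample = rng.choice(data, size=len(data), replace=True, p=p)
print(resample)  # observation 0 appears disproportionately often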

Bagging

Bagging stands for Bootstrap Aggregating. The key idea of bagging is to train several models in parallel, each on a different bootstrap resample of the data. We then combine the models’ predictions to make the final prediction: averaging for regression problems, or majority voting for classification problems. It is possible to mix different types of models, but bagging ensembles are typically homogeneous. The most popular bagging algorithm is the random forest, in which the base models are decision trees.
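As a minimal sketch on synthetic data, scikit-learn’s BaggingClassifier wraps the whole resample-train-aggregate loop (in versions before 1.2 the estimator argument is named base_estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 25 trees, each trained on its own bootstrap resample of (X, y);
# predictions are aggregated by voting
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X, y)
print(bagging.score(X, y))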

The following diagram shows a bagging process with 3 base models: KNN, SVM, and decision tree.

Voting

Voting is a variant of bagging; however, unlike bagging, the base models are all trained on the original dataset. When bagging is available it is usually the better choice, but voting can be useful if the models are already trained. A minimal scikit-learn sketch is shown below, followed by a diagram of a voting process with 3 base models: KNN, SVM, and decision tree.
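The sketch uses synthetic data; soft voting averages the predicted class probabilities, which is why SVC needs probability=True:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# unlike bagging, every base model sees the same, original dataset
voter = VotingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('svm', SVC(probability=True)),
        ('tree', DecisionTreeClassifier()),
    ],
    voting='soft',
)
voter.fit(X, y)
print(voter.score(X, y))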

Limitation

One disadvantage of bagging is that it is slower to train than a single model. Also, a bagging model can generally only be about as good as its best base model. These are limitations to consider when using bagging: the limits of the base models become the limits of the bagging model.

Boosting

Boosting is a sequential ensemble method where each model is trained to correct the errors of the previous one. The first model is trained on the original data, and each subsequent model is trained on data where the previous model’s errors are emphasized via weighted bootstrapping. This process continues until a stopping criterion is met. The final prediction is made by combining the predictions of all the models using a weighted sum, where the weights are determined by each model’s performance on the training data.

AdaBoost

AdaBoost is a popular boosting algorithm that uses decision trees (typically decision stumps, i.e., depth-1 trees) as base models. The algorithm trains a series of trees, where each tree is trained to correct the errors of the previous one. The final prediction combines the predictions of all the trees using a weighted sum, with weights determined by each tree’s performance on the training data. An overview of the AdaBoost algorithm is shown below; a code sketch follows it:

  1. Augment the dataset with sample weights, initially set to \(\frac{1}{N}\), where \(N\) is the number of observations.

  2. Fit a weak learner, \(M_i\), using a weighted bootstrap.

  3. Obtain the predictions on the original/full dataset.

  4. Calculate \(E\), the weighted error: \[ E = 1 - \text{Weighted Accuracy} \]

  5. Compute the importance of the model: \[ \lambda_i = \frac{1}{2} \log \left( \frac{1 - E}{E} \right) \]

  6. Rescale the sample weights by:

    • \(e^{\lambda_i}\) if incorrectly classified
    • \(e^{-\lambda_i}\) if correctly classified
  7. Normalize the weights such that the sum equals 1.

  8. Repeat Steps 2-7 until a stopping condition is met.

In the end, the ensemble prediction is given by: \[ M_\text{Final}(x) = \sum_{i=1}^{T} \lambda_i M_i(x) \] where \(T\) is the number of models trained. In the 2-class case with labels \(\pm 1\), the predicted class is the sign of this sum.
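The following is a minimal numpy/scikit-learn sketch of Steps 1-8 for the 2-class case on synthetic data. To keep it short, it reweights samples directly through sample_weight instead of drawing an explicit weighted bootstrap:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                        # relabel the classes as -1 / +1

n, T = len(y), 25
w = np.full(n, 1 / n)                  # Step 1: uniform sample weights
models, lambdas = [], []

for _ in range(T):
    # Step 2: fit a weak learner (a depth-1 stump) on the weighted data
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)            # Step 3: predictions on the full data
    E = w[pred != y].sum()             # Step 4: weighted error (assumes 0 < E < 1)
    lam = 0.5 * np.log((1 - E) / E)    # Step 5: model importance
    w *= np.exp(-lam * y * pred)       # Step 6: e^{+lam} if wrong, e^{-lam} if right
    w /= w.sum()                       # Step 7: normalize
    models.append(stump)
    lambdas.append(lam)

# final prediction: sign of the importance-weighted sum
F = sum(lam * m.predict(X) for lam, m in zip(lambdas, models))
print((np.sign(F) == y).mean())        # training accuracy of the ensemble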

The following graph shows why the importance has that formula: it gives a high positive importance to models with a low error rate, zero importance to models that are 50/50, and a high negative importance to models with a high error rate (take the opposite of their prediction). While this formula focuses only on the 2-class case, it is enough to understand the intuition behind it.
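The curve itself takes only a couple of lines to reproduce:

import numpy as np
import matplotlib.pyplot as plt

E = np.linspace(0.01, 0.99, 200)
plt.plot(E, 0.5 * np.log((1 - E) / E))
plt.xlabel('weighted error $E$')
plt.ylabel(r'importance $\lambda$')
plt.show()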

The following image shows how two weak learners are combined to form a strong learner:

Gradient Boosting

Gradient Boosting is a boosting algorithm that uses gradient descent to minimize the loss function. However, unlike gradient descent for neural networks, where the weights are updated to minimize the loss, gradient boosting adds a model to the ensemble in each iteration, chosen so that the addition decreases the loss. An overview of the Gradient Boosting algorithm is shown below; a toy implementation follows it:

  1. Start by fitting a not-so-weak learner to the dataset, denoted as \(M_1\).

  2. Choose a differentiable loss function, \(\mathcal{L}(y, M_i(x))\).

  3. Calculate the pseudo-residuals: \[ \hat{r}_{in} = \frac{\partial \mathcal{L}(y_n, M_i(x_n))}{\partial M_i(x_n)} \] Note these are the gradients of the loss; the usual minus sign is absorbed into the update in Step 6.

  4. Fit a new learner with \(x\) as the features and \(\hat{r}_{in}\) as the labels, denoted as \(m_{i+1}\).

  5. Determine the step size (learning rate), \(\hat{\gamma}_{i+1}\): \[ \hat{\gamma}_{i+1} = \arg \min_\gamma \mathcal{L}\left(y, M_i(x) - \gamma m_{i+1}(x)\right) \]

    • Note: Some suggest skipping this minimization and setting \(\gamma\) to a fixed small value (e.g., \(0.001\)) for all \(i\).
  6. Update the model: \[ M_{i+1}(x) = M_i(x) - \hat{\gamma}_{i+1} m_{i+1}(x) \]

  7. Repeat Steps 2-6 until a stopping condition is met.

The last model, \(M_T\), is the final model.
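The following is a toy implementation of the loop above for regression with the squared loss \(\mathcal{L} = \frac{1}{2}(y - M_i(x))^2\), whose gradient is simply \(M_i(x) - y\). It uses small regression trees as the learners and a fixed learning rate (per the note in Step 5); the data is synthetic:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # toy regression data

# Step 1: start from a simple first model (here: predict the mean)
F = np.full_like(y, y.mean())
gamma, T, models = 0.1, 100, []

for _ in range(T):
    r = F - y                                         # Step 3: gradient of the loss
    m = DecisionTreeRegressor(max_depth=2).fit(X, r)  # Step 4: fit to gradients
    models.append(m)
    F = F - gamma * m.predict(X)                      # Step 6: step against the gradient

# the final model composes all the trees
predict = lambda X_new: y.mean() - gamma * sum(m.predict(X_new) for m in models)
print(np.mean((y - predict(X)) ** 2))                 # training MSE shrinks as T grows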

The following image illustrates the Gradient Boosting algorithm:

Gradient boosting is most useful when the base learner doesn’t have clear differentiable parameters, such as decision trees.

XGBoost

XGBoost (eXtreme Gradient Boosting) is a variant of Gradient Boosting that uses heuristic approaches to optimize the loss function and improve computational efficiency. It also introduces regularization terms to prevent overfitting. It is a controversial algorithm, so it will only be mentioned briefly here.
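If you want to try it anyway, the xgboost package (assuming it is installed) exposes a scikit-learn-style interface; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)

# regularized gradient boosting; reg_lambda is the L2 penalty on leaf weights
model = XGBClassifier(n_estimators=100, max_depth=3,
                      learning_rate=0.1, reg_lambda=1.0)
model.fit(X, y)
print(model.score(X, y))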

Stacking

Stacking is a technique that logically combines independently trained models to improve performance. It is a meta-algorithm that can be used with any combination of base models.
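As a point of reference, scikit-learn’s StackingClassifier implements the classic variant, where a meta-model is trained on the base models’ predictions; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# the base models' predictions become features for the final estimator
stack = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('tree', DecisionTreeClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))

The architecture I use below is a different, hand-rolled flavor of stacking, but the spirit is the same: models feeding into models.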

I had trouble training a good CNN to predict the labels of the CIFAR-10 dataset; my model confused cats and dogs a little too often. As such, I propose a stacking model as shown below:

The key idea is to generate metaclasses that combine similar classes:

  • Cats and dogs
  • Other animals
  • Vehicles

Then I trained a model that predicts the metaclass, a model for each metaclass that predicts the label (or flags that the image is in the wrong metaclass), and a final catch-all model that predicts the label when the image was routed to the wrong metaclass.

Cascading

Cascading is a variant of stacking where each model is applied sequentially to the data. Unlike the previous example, there are no split paths. The following image shows how cascading can be used to tackle the classification of the CIFAR-10 dataset:

Now each model acts as a one-vs-all model. The first model predicts whether the label should be “dog”, the second model predicts whether the label should be “cat”, and so on. As such, it is important to train each model with recall and specificity in mind. A minimal sketch of this cascade logic is shown below.
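In this PyTorch sketch, the models, their 2-logit output convention, and the 0.5 acceptance threshold are all hypothetical:

import torch

def cascade_predict(models, images, fallback_label):
    # models: list of (label, model) pairs; each model is a one-vs-all
    # classifier with 2 output logits ("not my class" / "my class")
    predictions = torch.full((len(images),), fallback_label)
    undecided = torch.ones(len(images), dtype=torch.bool)
    for label, model in models:
        if not undecided.any():
            break
        model.eval()
        with torch.no_grad():
            # positive-class probability for the still-undecided images
            scores = model(images[undecided]).softmax(dim=1)[:, 1]
        fired = scores > 0.5                    # hypothetical threshold
        idx = undecided.nonzero(as_tuple=True)[0]
        predictions[idx[fired]] = label         # first model to fire wins
        undecided[idx[fired]] = False
    return predictions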

Not PyTorch

Here I’ll show the results of the ensemble models I trained with data I used throughout the semester. The dataset for Bagging is a 10-dimensional latent space of the MNIST dataset, obtained through a variational autoencoder. The dataset for Stacking is the base CIFAR-10 dataset.

Bagging

To classify my latent space of the MNIST dataset, I used 5 instances each of several classification models I’ve learned throughout my career, plus the ensemble tree models:

Previous models:

  • Logistic Regression
  • K-Nearest Neighbors
  • Support Vector Machine
  • Naive Bayes
  • MLP Classifier

Ensemble models:

  • Random Forest
  • AdaBoost
  • Gradient Boosting

Fortunately, Scikit-Learn implements all of these models, as well as a Voting ensembler. With prior bootstrapping of the data, I could aggregate all models into a single Bagging model. Note that I used some very wishy-washy parameter tuning for the models, as I wanted to keep the code as simple as possible. Also, I wouldn’t use a bagging model with any of the ensemble models, as the parameter tuning for these can be very time-consuming and it adds a second layer of bootstrapping.

Libraries and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from SckitLearnWrapper import SckitLearnWrapper, get_all_model_names

X_train = np.loadtxt('data/train_data.csv', delimiter=',')
y_train = np.loadtxt('data/train_labels.csv', delimiter=',')
X_test = np.loadtxt('data/test_data.csv', delimiter=',')
y_test = np.loadtxt('data/test_labels.csv', delimiter=',')

I had to use some questionable code to aggregate the models into the VotingClassifier class. The SckitLearnWrapper class is a wrapper that handles the bootstrapping, training, parameter tuning, and saving/loading of the models. It can be found in the SckitLearnWrapper.py file in the github repository.

# dummy models used to train the voting classifier
dummy_models = [
    ('dummy1', KNeighborsClassifier()), 
    ('dummy2', KNeighborsClassifier())
]

# train real models that will be hijacked into the voting classifier
# formatted in the shape the voting classifier expects
# list[(name: str, model)]
models = [
    (f'{model}_{i}', SckitLearnWrapper(model, i, X_train, y_train).train('models/'))
    for model in get_all_model_names()
    for i in range(5)
]

# train with dummy models
bagger = VotingClassifier(estimators=dummy_models, voting='soft')
bagger.fit(X_train, y_train)

# assign the pre-trained models; VotingClassifier predicts with the
# fitted `estimators_` list, so overwrite that as well
bagger.estimators = models
bagger.estimators_ = [model for _, model in models]

Now that the Bagger is trained, we can use it to predict the labels of the test set. Let’s take a look at the confusion matrix and the accuracy score of the Bagger model:

Confusion Matrix and accuracy score
y_pred = bagger.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues, 
          text_kw={'fontsize': 8, 'ha': 'center', 'va': 'center'})

# Format numbers as integers without scientific notation
for text in disp.text_.ravel():
    text.set_text(f'{int(float(text.get_text()))}')

plt.title('Confusion Matrix')
plt.show()

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9569

As we can observe, the Bagger model did pretty well. Let’s take a look at how each of the models performed individually:

Accuracy of each model
# get a list with all the models
all_models = [*models, ('Bagger_0', bagger)]

# make that list into a dataframe with the model name, model type, model index, and accuracy
accuracies = pd.DataFrame({
    'name': name,
    'model_type': name.split('_')[0],
    'model_index': name.split('_')[1],
    'accuracy': accuracy_score(y_test, model.predict(X_test)),
} for name, model in all_models)

# Show the accuracies from highest to lowest
accuracies.sort_values('accuracy', ascending=False)
name model_type model_index accuracy
16 RBF_1 RBF 1 0.9582
43 MLPClassifier_3 MLPClassifier 3 0.9571
45 Bagger_0 Bagger 0 0.9569
15 RBF_0 RBF 0 0.9565
18 RBF_3 RBF 3 0.9563
17 RBF_2 RBF 2 0.9561
19 RBF_4 RBF 4 0.9557
40 MLPClassifier_0 MLPClassifier 0 0.9555
41 MLPClassifier_1 MLPClassifier 1 0.9553
42 MLPClassifier_2 MLPClassifier 2 0.9545
44 MLPClassifier_4 MLPClassifier 4 0.9537
25 KNeighborsClassifier_0 KNeighborsClassifier 0 0.9533
27 KNeighborsClassifier_2 KNeighborsClassifier 2 0.9527
28 KNeighborsClassifier_3 KNeighborsClassifier 3 0.9510
26 KNeighborsClassifier_1 KNeighborsClassifier 1 0.9509
29 KNeighborsClassifier_4 KNeighborsClassifier 4 0.9503
9 RandomForestClassifier_4 RandomForestClassifier 4 0.9353
5 RandomForestClassifier_0 RandomForestClassifier 0 0.9349
8 RandomForestClassifier_3 RandomForestClassifier 3 0.9347
6 RandomForestClassifier_1 RandomForestClassifier 1 0.9340
7 RandomForestClassifier_2 RandomForestClassifier 2 0.9339
13 SVM_3 SVM 3 0.9335
11 SVM_1 SVM 1 0.9332
14 SVM_4 SVM 4 0.9331
10 SVM_0 SVM 0 0.9320
12 SVM_2 SVM 2 0.9320
31 GradientBoostingClassifier_1 GradientBoostingClassifier 1 0.9122
32 GradientBoostingClassifier_2 GradientBoostingClassifier 2 0.9098
33 GradientBoostingClassifier_3 GradientBoostingClassifier 3 0.9089
30 GradientBoostingClassifier_0 GradientBoostingClassifier 0 0.9077
34 GradientBoostingClassifier_4 GradientBoostingClassifier 4 0.9068
23 LogisticRegression_3 LogisticRegression 3 0.9055
22 LogisticRegression_2 LogisticRegression 2 0.9053
20 LogisticRegression_0 LogisticRegression 0 0.9039
21 LogisticRegression_1 LogisticRegression 1 0.9037
24 LogisticRegression_4 LogisticRegression 4 0.9031
3 GaussianNB_3 GaussianNB 3 0.8867
0 GaussianNB_0 GaussianNB 0 0.8863
2 GaussianNB_2 GaussianNB 2 0.8861
4 GaussianNB_4 GaussianNB 4 0.8859
1 GaussianNB_1 GaussianNB 1 0.8857
37 AdaBoostClassifier_2 AdaBoostClassifier 2 0.7522
36 AdaBoostClassifier_1 AdaBoostClassifier 1 0.7351
38 AdaBoostClassifier_3 AdaBoostClassifier 3 0.7329
35 AdaBoostClassifier_0 AdaBoostClassifier 0 0.7286
39 AdaBoostClassifier_4 AdaBoostClassifier 4 0.7196

While it wasn’t the best, the Bagger model managed to get the third highest accuracy without any insight about model performance! Let’s also take a look at the average accuracy of each model type:

Average accuracy of each model type
# group the accuracies by model type and compute the mean
# after reset_index() the columns are already 'model_type' and 'accuracy',
# so no extra wrapping or renaming is needed
accuracies.groupby('model_type') \
          .accuracy.mean() \
          .sort_values(ascending=False) \
          .reset_index()
model_type accuracy
0 Bagger 0.95690
1 RBF 0.95656
2 MLPClassifier 0.95522
3 KNeighborsClassifier 0.95164
4 RandomForestClassifier 0.93456
5 SVM 0.93276
6 GradientBoostingClassifier 0.90908
7 LogisticRegression 0.90430
8 GaussianNB 0.88614
9 AdaBoostClassifier 0.73368

Now, the Bagger model is the clear winner! Hopefully this shows that bagging is a very powerful technique for improving the performance of a model.

Stacking

To classify the images of the CIFAR-10 dataset, I used the architecture described in the Stacking segment of this section.

More libraries and data
import os
import torch
import torch.nn as nn
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    transforms.RandomAffine(degrees=(-10, 10)),
])

train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)


items, y_test = zip(*test_data)
X_test = torch.stack(items)
Files already downloaded and verified
Files already downloaded and verified

The following is the architecture used for all the submodels of the stacking model.

class CIFAR10Classifier(nn.Module):
    def __init__(self, n_classes):
        super(CIFAR10Classifier, self).__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(4, 8, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(8, 16, kernel_size=5, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=5, stride=1, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes)
        )
        
    def forward(self, x):
        x = self.layers(x)
        return x

This metaclass approach requires a significant amount of boilerplate code to train the models. However, for brevity’s sake, I’ll just load the pre-trained models and use them for prediction. The omitted code can be found in the Stacking.ipynb file in the repository.

Load pre-trained models
# model 1: meta class predictor
meta_model = CIFAR10Classifier(3)
meta_model.load_state_dict(torch.load('models2/model_meta.pth', weights_only=False))

# model 2: vehicle predictor
submodel_1 = CIFAR10Classifier(5)
submodel_1.load_state_dict(torch.load('models2/model_meta_class_0.pth', weights_only=False))

# model 3: general animal predictor
submodel_2 = CIFAR10Classifier(5)
submodel_2.load_state_dict(torch.load('models2/model_meta_class_1.pth', weights_only=False))

# model 4: cat-dog predictor
submodel_3 = CIFAR10Classifier(3)
submodel_3.load_state_dict(torch.load('models2/model_meta_class_2.pth', weights_only=False))

# model 5: catch-all predictor
catchall_model = CIFAR10Classifier(10)
catchall_model.load_state_dict(torch.load('models2/model_general.pth', weights_only=False))
<All keys matched successfully>

Then, we can use the following function to predict the class of an image using the ensemble of models. The arguments of the function are as follows:

  • meta_model: the meta-classifier model
  • sub_models: a list containing the sub-classifier models
  • general_model: the general catch-all classifier model
  • input_images: the input images
  • sub_prediction_maps: mappings between sub-classifier labels and general class labels, packed into a list of dictionaries, one per submodel
def stacked_predictions(meta_model, 
                        sub_models, 
                        general_model, 
                        input_images, 
                        sub_prediction_maps):
    # validate input
    assert len(sub_models) == len(sub_prediction_maps)
    
    # eval mode
    meta_model.eval()
    general_model.eval()
    
    for submodel in sub_models:
        submodel.eval()
    
    # predictions
    with torch.no_grad():
        
        # predict metaclass
        meta_features = meta_model(input_images).argmax(dim=1)
        predictions = torch.zeros_like(meta_features)
        
        # predict subclasses
        for k, submodel in enumerate(sub_models):
            # predict the classes for the observations predicted to belong to this metaclass
            mask = meta_features == k
            sub_prediction = submodel(input_images[mask]).argmax(dim=1)
            
            # map the predictions to the general class labels
            # use -1 to indicate that the metaclass prediction was predicted to be incorrect
            prediction_map = sub_prediction_maps[k]
            temp = [prediction_map.get(pred.item(), -1) for pred in sub_prediction]
            predictions[mask] = torch.tensor(temp)

        # correct incorrect metaclass predictions
        # report the amount of incorrect predictions
        incorrect_mask = predictions == -1
        print(f"Incorrect predictions: {incorrect_mask.sum().item()}")
        predictions[incorrect_mask] = general_model(input_images[incorrect_mask]).argmax(dim=1)

    # return final predictions     
    return predictions

Now, we can compare the performance of the ensemble model vs the general model:

Confusion Matrix for the ensemble model
# predict
y_pred = stacked_predictions(
    meta_model=meta_model, 
    sub_models=[submodel_1, submodel_2, submodel_3],
    general_model=catchall_model,
    input_images=X_test, 
    sub_prediction_maps=[
        {1:0, 2:1, 3:8, 4:9},
        {1:2, 2:4, 3:6, 4:7},
        {1:3, 2:5}, 
    ],
)

# display confusion matrix
cm = confusion_matrix(y_test, y_pred)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
disp.plot(cmap=plt.cm.Blues, 
          text_kw={'fontsize': 8, 'ha': 'center', 'va': 'center'})

# Format numbers as integers without scientific notation
for text in disp.text_.ravel():
    text.set_text(f'{int(float(text.get_text()))}')

plt.title('Confusion Matrix for Stacking Model')
plt.show()

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
Incorrect predictions: 648
Accuracy: 0.6734

Confusion Matrix for the single CNN model
# predict
y_pred = catchall_model(X_test).argmax(dim=1).detach()


# display confusion matrix
cm = confusion_matrix(y_test, y_pred)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
disp.plot(cmap=plt.cm.Blues, 
          text_kw={'fontsize': 8, 'ha': 'center', 'va': 'center'})

# Format numbers as integers without scientific notation
for text in disp.text_.ravel():
    text.set_text(f'{int(float(text.get_text()))}')

plt.title('Confusion Matrix for single CNN Model')
plt.show()

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6404

As we can observe, the stacked model slightly outperforms the single CNN model. However, this is not always the case; which model came out ahead varied with the number of training epochs. This is a cherry-picked example to show that stacking models can outperform single models.


Ensemble methods are a different paradigm from the ones we have seen so far. They combine multiple models into a single model that is more robust and accurate than any of the individual models.