Ensemble Methods

So far we’ve looked at a number of different machine learning algorithms. The one thing all these algorithms have in common is that they are all strong learners. A strong learner is a model that can reach an arbitrarily high level of accuracy given enough resources and data. In contrast, a weak learner is a model that is only slightly better than random guessing. For example, a weak learner might classify correctly 55% of the time, while a strong learner might classify correctly 95% of the time. The basic idea behind ensemble methods is to combine multiple weak learners into a single strong learner.

This paradigm is akin to asking a crowd of people to guess the number of jelly beans in a jar, then taking the average as our final answer. The hope is that the average of the guesses will be more accurate than any single guess. I first saw this in a video, but was unable to find it again.

This section assumes the reader is familiar with decision trees as most ensemble methods are based on decision trees. If you are unfamiliar with decision trees, I recommend reading the Decision Trees section.

Bootstrap Resampling

Before we discuss any ensemble methods, we need a way to train several independent models. One way to do this is bootstrap resampling. The idea behind bootstrap resampling is to take a random sample of the data with replacement, then train a model on that sample. We can repeat this process multiple times to get multiple models.

The following is a toy example of a single bootstrap resampling process:
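For concreteness, here is a minimal numpy sketch of one bootstrap resample (the data values here are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

# hypothetical dataset of 10 observations
data = np.array([2.3, 1.7, 4.1, 3.3, 0.9, 5.2, 2.8, 3.9, 1.1, 4.6])

# one bootstrap resample: draw n observations *with replacement*
resample = rng.choice(data, size=len(data), replace=True)
print(resample)

# fraction of the distinct original observations that made it in
print(np.isin(data, resample).mean())  # ~0.63 on average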

Note that each resample only keeps about \(1-e^{-1}\approx 63.21\%\) of the distinct observations (in expectation, for large datasets). However, the power of the bootstrap comes from doing several resamples. With \(k\) resamples, we expect to keep \((1-e^{-k})\times 100\%\) of the data; just \(k=5\) keeps \(99.33\%\), so as long as we use enough models, this shouldn’t be an issue. However, it does show that stacking several layers of bootstrapping is risky, as each layer keeps an even smaller fraction of the data.

Weighted Bootstrap

In some cases, we may want to give more weight to some of the observations. For this, we can augment our data with sampling weights; each observation is then drawn with probability proportional to its weight, so higher-weight observations appear more often in the resample. This is called a weighted bootstrap.

The following is a toy example of a single weighted bootstrap resampling process. Observe how the observations with higher weights are more likely to be included in the resample.
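A minimal numpy sketch of a single weighted resample (the weights here are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

data = np.arange(10)  # hypothetical observations 0..9
weights = np.array([5., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# sampling probabilities proportional to the weights
p = weights / weights.sum()

# weighted bootstrap: heavier observations show up more often
resample = rng.choice(data, size=len(data), replace=True, p=p)
print(resample)  # observation 0 appears disproportionately often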

Bagging

Bagging stands for Bootstrap Aggregating. The key idea of bagging is to train several models in parallel, each on a different bootstrap resample of the data. We then combine the models’ predictions to make the final prediction: averaging for regression problems, or majority voting for classification problems. It is possible to mix different types of models, but bagging ensembles are typically homogeneous. The most popular bagging algorithm is the random forest, in which the base models are decision trees.
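As a minimal sketch on synthetic data, scikit-learn’s BaggingClassifier wraps the whole resample-train-aggregate loop (in versions before 1.2 the estimator argument is named base_estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 25 trees, each trained on its own bootstrap resample of (X, y);
# predictions are aggregated by voting
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X, y)
print(bagging.score(X, y))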

The following diagram shows a bagging process with 3 base models: KNN, SVM, and decision tree.

Voting

Voting is a variant of bagging; however, unlike bagging, the base models are all trained on the original dataset. When bagging is available it is usually the better choice, but voting can be useful if the models are already trained. A minimal scikit-learn sketch is shown below, followed by a diagram of a voting process with 3 base models: KNN, SVM, and decision tree.
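The sketch uses synthetic data; soft voting averages the predicted class probabilities, which is why SVC needs probability=True:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# unlike bagging, every base model sees the same, original dataset
voter = VotingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('svm', SVC(probability=True)),
        ('tree', DecisionTreeClassifier()),
    ],
    voting='soft',
)
voter.fit(X, y)
print(voter.score(X, y))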

Limitation

One disadvantage of bagging is that it is slower to train than a single model. Also, a bagging model can generally only be about as good as its best base model. These are limitations to consider when using bagging: the limits of the base models become the limits of the bagging model.

Boosting

Boosting is a sequential ensemble method where each model is trained to correct the errors of the previous one. The first model is trained on the original data, and each subsequent model is trained on data where the previous model’s errors are emphasized via weighted bootstrapping. This process continues until a stopping criterion is met. The final prediction is made by combining the predictions of all the models using a weighted sum, where the weights are determined by each model’s performance on the training data.

AdaBoost

AdaBoost is a popular boosting algorithm that uses decision trees (typically decision stumps, i.e., depth-1 trees) as base models. The algorithm trains a series of trees, where each tree is trained to correct the errors of the previous one. The final prediction combines the predictions of all the trees using a weighted sum, with weights determined by each tree’s performance on the training data. An overview of the AdaBoost algorithm is shown below; a code sketch follows it:

  1. Augment the dataset with sample weights, initially set to \(\frac{1}{N}\), where \(N\) is the number of observations.

  2. Fit a weak learner, \(M_i\), using a weighted bootstrap.

  3. Obtain the predictions on the original/full dataset.

  4. Calculate \(E\), the weighted error: \[ E = 1 - \text{Weighted Accuracy} \]

  5. Compute the importance of the model: \[ \lambda_i = \frac{1}{2} \log \left( \frac{1 - E}{E} \right) \]

  6. Rescale the sample weights by:

    • \(e^{\lambda_i}\) if incorrectly classified
    • \(e^{-\lambda_i}\) if correctly classified
  7. Normalize the weights such that the sum equals 1.

  8. Repeat Steps 2-7 until a stopping condition is met.

In the end, the ensemble prediction is given by: \[ M_\text{Final}(x) = \sum_{i=1}^{T} \lambda_i M_i(x) \] where \(T\) is the number of models trained. In the 2-class case with labels \(\pm 1\), the predicted class is the sign of this sum.
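The following is a minimal numpy/scikit-learn sketch of Steps 1-8 for the 2-class case on synthetic data. To keep it short, it reweights samples directly through sample_weight instead of drawing an explicit weighted bootstrap:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                        # relabel the classes as -1 / +1

n, T = len(y), 25
w = np.full(n, 1 / n)                  # Step 1: uniform sample weights
models, lambdas = [], []

for _ in range(T):
    # Step 2: fit a weak learner (a depth-1 stump) on the weighted data
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)            # Step 3: predictions on the full data
    E = w[pred != y].sum()             # Step 4: weighted error (assumes 0 < E < 1)
    lam = 0.5 * np.log((1 - E) / E)    # Step 5: model importance
    w *= np.exp(-lam * y * pred)       # Step 6: e^{+lam} if wrong, e^{-lam} if right
    w /= w.sum()                       # Step 7: normalize
    models.append(stump)
    lambdas.append(lam)

# final prediction: sign of the importance-weighted sum
F = sum(lam * m.predict(X) for lam, m in zip(lambdas, models))
print((np.sign(F) == y).mean())        # training accuracy of the ensemble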

The following graph shows why the importance has that formula: it gives a high positive importance to models with a low error rate, zero importance to models that are 50/50, and a high negative importance to models with a high error rate (take the opposite of their prediction). While this formula focuses only on the 2-class case, it is enough to understand the intuition behind it.
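The curve itself takes only a couple of lines to reproduce:

import numpy as np
import matplotlib.pyplot as plt

E = np.linspace(0.01, 0.99, 200)
plt.plot(E, 0.5 * np.log((1 - E) / E))
plt.xlabel('weighted error $E$')
plt.ylabel(r'importance $\lambda$')
plt.show()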

The following image shows how two weak learners are combined to form a strong learner:

Gradient Boosting

Gradient Boosting is a boosting algorithm that uses gradient descent to minimize the loss function. However, unlike gradient descent for neural networks, where the weights are updated to minimize the loss, gradient boosting adds a model to the ensemble in each iteration, chosen so that the addition decreases the loss. An overview of the Gradient Boosting algorithm is shown below; a toy implementation follows it:

  1. Start by fitting a not-so-weak learner to the dataset, denoted as \(M_1\).

  2. Choose a differentiable loss function, \(\mathcal{L}(y, M_i(x))\).

  3. Calculate the pseudo-residuals: \[ \hat{r}_{in} = \frac{\partial \mathcal{L}(y_n, M_i(x_n))}{\partial M_i(x_n)} \] Note these are the gradients of the loss; the usual minus sign is absorbed into the update in Step 6.

  4. Fit a new learner with \(x\) as the features and \(\hat{r}_{in}\) as the labels, denoted as \(m_{i+1}\).

  5. Determine the step size (learning rate), \(\hat{\gamma}_{i+1}\): \[ \hat{\gamma}_{i+1} = \arg \min_\gamma \mathcal{L}\left(y, M_i(x) - \gamma m_{i+1}(x)\right) \]

    • Note: Some suggest skipping this minimization and setting \(\gamma\) to a fixed small value (e.g., \(0.001\)) for all \(i\).
  6. Update the model: \[ M_{i+1}(x) = M_i(x) - \hat{\gamma}_{i+1} m_{i+1}(x) \]

  7. Repeat Steps 2-6 until a stopping condition is met.

The last model, \(M_T\), is the final model.
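The following is a toy implementation of the loop above for regression with the squared loss \(\mathcal{L} = \frac{1}{2}(y - M_i(x))^2\), whose gradient is simply \(M_i(x) - y\). It uses small regression trees as the learners and a fixed learning rate (per the note in Step 5); the data is synthetic:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # toy regression data

# Step 1: start from a simple first model (here: predict the mean)
F = np.full_like(y, y.mean())
gamma, T, models = 0.1, 100, []

for _ in range(T):
    r = F - y                                         # Step 3: gradient of the loss
    m = DecisionTreeRegressor(max_depth=2).fit(X, r)  # Step 4: fit to gradients
    models.append(m)
    F = F - gamma * m.predict(X)                      # Step 6: step against the gradient

# the final model composes all the trees
predict = lambda X_new: y.mean() - gamma * sum(m.predict(X_new) for m in models)
print(np.mean((y - predict(X)) ** 2))                 # training MSE shrinks as T grows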

The following image illustrates the Gradient Boosting algorithm:

Gradient boosting is most useful when the base learner doesn’t have clear differentiable parameters, such as decision trees.

XGBoost

XGBoost (eXtreme Gradient Boosting) is a variant of Gradient Boosting that uses heuristic approaches to optimize the loss function and improve computational efficiency. It also introduces regularization terms to prevent overfitting. It is a controversial algorithm, so it will only be mentioned briefly here.
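If you want to try it anyway, the xgboost package (assuming it is installed) exposes a scikit-learn-style interface; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)

# regularized gradient boosting; reg_lambda is the L2 penalty on leaf weights
model = XGBClassifier(n_estimators=100, max_depth=3,
                      learning_rate=0.1, reg_lambda=1.0)
model.fit(X, y)
print(model.score(X, y))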

Stacking

Stacking is a technique that logically combines independently trained models to improve performance. It is a meta-algorithm that can be used with any combination of base models.
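As a point of reference, scikit-learn’s StackingClassifier implements the classic variant, where a meta-model is trained on the base models’ predictions; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# the base models' predictions become features for the final estimator
stack = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('tree', DecisionTreeClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))

The architecture I use below is a different, hand-rolled flavor of stacking, but the spirit is the same: models feeding into models.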

I had trouble training a good CNN to predict the labels of the CIFAR-10 dataset; my model confused cats and dogs a little too often. As such, I propose a stacking model as shown below:

The key idea is to generate metaclasses that combine similar classes:

  • Cats and dogs
  • Other animals
  • Vehicles

Then I trained a model that predicts the metaclass, a model for each metaclass that predicts the label (or flags that the image is in the wrong metaclass), and a final catch-all model that predicts the label when the image was routed to the wrong metaclass.

Cascading

Cascading is a variant of stacking where each model is applied sequentially to the data. Unlike the previous example, there are no split paths. The following image shows how cascading can be used to tackle the classification of the CIFAR-10 dataset:

Now each model acts as a one-vs-all model. The first model predicts whether the label should be “dog”, the second model predicts whether the label should be “cat”, and so on. As such, it is important to train each model with recall and specificity in mind. A minimal sketch of this cascade logic is shown below.
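In this PyTorch sketch, the models, their 2-logit output convention, and the 0.5 acceptance threshold are all hypothetical:

import torch

def cascade_predict(models, images, fallback_label):
    # models: list of (label, model) pairs; each model is a one-vs-all
    # classifier with 2 output logits ("not my class" / "my class")
    predictions = torch.full((len(images),), fallback_label)
    undecided = torch.ones(len(images), dtype=torch.bool)
    for label, model in models:
        if not undecided.any():
            break
        model.eval()
        with torch.no_grad():
            # positive-class probability for the still-undecided images
            scores = model(images[undecided]).softmax(dim=1)[:, 1]
        fired = scores > 0.5                    # hypothetical threshold
        idx = undecided.nonzero(as_tuple=True)[0]
        predictions[idx[fired]] = label         # first model to fire wins
        undecided[idx[fired]] = False
    return predictions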

Not PyTorch

Here I’ll show the results of the ensemble models I trained with data I used throughout the semester. The dataset for Bagging is a 10-dimensional latent space of the MNIST dataset, obtained through a variational autoencoder. The dataset for Stacking is the base CIFAR-10 dataset.

Bagging

To classify my latent space of the MNIST dataset, I used 5 instances each of several classification models I’ve learned throughout my career, plus the ensemble tree models:

Previous models:

  • Logistic Regression
  • K-Nearest Neighbors
  • Support Vector Machine
  • Naive Bayes
  • MLP Classifier

Ensemble models:

  • Random Forest
  • AdaBoost
  • Gradient Boosting

Fortunately, Scikit-Learn implements all of these models, as well as a Voting ensembler. With prior bootstrapping of the data, I could aggregate all models into a single Bagging model. Note that I used some very wishy-washy parameter tuning for the models, as I wanted to keep the code as simple as possible. Also, I wouldn’t use a bagging model with any of the ensemble models, as the parameter tuning for these can be very time-consuming and it adds a second layer of bootstrapping.

Libraries and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from SckitLearnWrapper import SckitLearnWrapper, get_all_model_names

X_train = np.loadtxt('data/train_data.csv', delimiter=',')
y_train = np.loadtxt('data/train_labels.csv', delimiter=',')
X_test = np.loadtxt('data/test_data.csv', delimiter=',')
y_test = np.loadtxt('data/test_labels.csv', delimiter=',')

I had to use some questionable code to aggregate the models into the VotingClassifier class. The SckitLearnWrapper class is a wrapper that handles the bootstrapping, training, parameter tuning, and saving/loading of the models. It can be found in the SckitLearnWrapper.py file in the github repository.

# dummy models used to train the voting classifier
dummy_models = [
    ('dummy1', KNeighborsClassifier()), 
    ('dummy2', KNeighborsClassifier())
]

# train real models that will be hijacked into the voting classifier
# formatted in the shape the voting classifier expects
# list[(name: str, model)]
models = [
    (f'{model}_{i}', SckitLearnWrapper(model, i, X_train, y_train).train('models/'))
    for model in get_all_model_names()
    for i in range(5)
]

# train with dummy models
bagger = VotingClassifier(estimators=dummy_models, voting='soft')
bagger.fit(X_train, y_train)

# assign the pre-trained models; VotingClassifier predicts with the
# fitted `estimators_` list, so overwrite that as well
bagger.estimators = models
bagger.estimators_ = [model for _, model in models]

Now that the Bagger is trained, we can use it to predict the labels of the test set. Let’s take a look at the confusion matrix and the accuracy score of the Bagger model:

Confusion Matrix and accuracy score
y_pred = bagger.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues, 
          text_kw={'fontsize': 8, 'ha': 'center', 'va': 'center'})

# Format numbers as integers without scientific notation
for text in disp.text_.ravel():
    text.set_text(f'{int(float(text.get_text()))}')

plt.title('Confusion Matrix')
plt.show()

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9569

As we can observe, the Bagger model did pretty well. Let’s take a look at how each of the models performed individually:

Accuracy of each model
# get a list with all the models
all_models = [*models, ('Bagger_0', bagger)]

# make that list into a dataframe with the model name, model type, model index, and accuracy
accuracies = pd.DataFrame({
    'name': name,
    'model_type': name.split('_')[0],
    'model_index': name.split('_')[1],
    'accuracy': accuracy_score(y_test, model.predict(X_test)),
} for name, model in all_models)

# Show the accuracies from highest to lowest
accuracies.sort_values('accuracy', ascending=False)
name model_type model_index accuracy
16 RBF_1 RBF 1 0.9582
43 MLPClassifier_3 MLPClassifier 3 0.9571
45 Bagger_0 Bagger 0 0.9569
15 RBF_0 RBF 0 0.9565
18 RBF_3 RBF 3 0.9563
17 RBF_2 RBF 2 0.9561
19 RBF_4 RBF 4 0.9557
40 MLPClassifier_0 MLPClassifier 0 0.9555
41 MLPClassifier_1 MLPClassifier 1 0.9553
42 MLPClassifier_2 MLPClassifier 2 0.9545
44 MLPClassifier_4 MLPClassifier 4 0.9537
25 KNeighborsClassifier_0 KNeighborsClassifier 0 0.9533
27 KNeighborsClassifier_2 KNeighborsClassifier 2 0.9527
28 KNeighborsClassifier_3 KNeighborsClassifier 3 0.9510
26 KNeighborsClassifier_1 KNeighborsClassifier 1 0.9509
29 KNeighborsClassifier_4 KNeighborsClassifier 4 0.9503
9 RandomForestClassifier_4 RandomForestClassifier 4 0.9353
5 RandomForestClassifier_0 RandomForestClassifier 0 0.9349
8 RandomForestClassifier_3 RandomForestClassifier 3 0.9347
6 RandomForestClassifier_1 RandomForestClassifier 1 0.9340
7 RandomForestClassifier_2 RandomForestClassifier 2 0.9339
13 SVM_3 SVM 3 0.9335
11 SVM_1 SVM 1 0.9332
14 SVM_4 SVM 4 0.9331
10 SVM_0 SVM 0 0.9320
12 SVM_2 SVM 2 0.9320
31 GradientBoostingClassifier_1 GradientBoostingClassifier 1 0.9122
32 GradientBoostingClassifier_2 GradientBoostingClassifier 2 0.9098
33 GradientBoostingClassifier_3 GradientBoostingClassifier 3 0.9089
30 GradientBoostingClassifier_0 GradientBoostingClassifier 0 0.9077
34 GradientBoostingClassifier_4 GradientBoostingClassifier 4 0.9068
23 LogisticRegression_3 LogisticRegression 3 0.9055
22 LogisticRegression_2 LogisticRegression 2 0.9053
20 LogisticRegression_0 LogisticRegression 0 0.9039
21 LogisticRegression_1 LogisticRegression 1 0.9037
24 LogisticRegression_4 LogisticRegression 4 0.9031
3 GaussianNB_3 GaussianNB 3 0.8867
0 GaussianNB_0 GaussianNB 0 0.8863
2 GaussianNB_2 GaussianNB 2 0.8861
4 GaussianNB_4 GaussianNB 4 0.8859
1 GaussianNB_1 GaussianNB 1 0.8857
37 AdaBoostClassifier_2 AdaBoostClassifier 2 0.7522
36 AdaBoostClassifier_1 AdaBoostClassifier 1 0.7351
38 AdaBoostClassifier_3 AdaBoostClassifier 3 0.7329
35 AdaBoostClassifier_0 AdaBoostClassifier 0 0.7286
39 AdaBoostClassifier_4 AdaBoostClassifier 4 0.7196

While it wasn’t the best, the Bagger model managed to get the third highest accuracy without any insight about model performance! Let’s also take a look at the average accuracy of each model type:

Average accuracy of each model type
# group the accuracies by model type and compute the mean
# after reset_index() the columns are already 'model_type' and 'accuracy',
# so no extra wrapping or renaming is needed
accuracies.groupby('model_type') \
          .accuracy.mean() \
          .sort_values(ascending=False) \
          .reset_index()
model_type accuracy
0 Bagger 0.95690
1 RBF 0.95656
2 MLPClassifier 0.95522
3 KNeighborsClassifier 0.95164
4 RandomForestClassifier 0.93456
5 SVM 0.93276
6 GradientBoostingClassifier 0.90908
7 LogisticRegression 0.90430
8 GaussianNB 0.88614
9 AdaBoostClassifier 0.73368

Now, the Bagger model is the clear winner! Hopefully this shows that bagging is a very powerful technique for improving the performance of a model.

Stacking

To classify the images of the CIFAR-10 dataset, I used the architecture described in the Stacking segment of this section.

More libraries and data
import os
import torch
import torch.nn as nn
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    transforms.RandomAffine(degrees=(-10, 10)),
])

train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)


items, y_test = zip(*test_data)
X_test = torch.stack(items)
Files already downloaded and verified
Files already downloaded and verified

The following is the architecture used for all the submodels of the stacking model.

class CIFAR10Classifier(nn.Module):
    def __init__(self, n_classes):
        super(CIFAR10Classifier, self).__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(4, 8, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(8, 16, kernel_size=5, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=5, stride=1, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes)
        )
        
    def forward(self, x):
        x = self.layers(x)
        return x

This metaclass approach requires a significant amount of boilerplate code to train the models. However, for brevity’s sake, I’ll just load the pre-trained models and use them for prediction. The omitted code can be found in the Stacking.ipynb file in the repository.

Load pre-trained models
# model 1: meta class predictor
meta_model = CIFAR10Classifier(3)
meta_model.load_state_dict(torch.load('models2/model_meta.pth', weights_only=False))

# model 2: vehicle predictor
submodel_1 = CIFAR10Classifier(5)
submodel_1.load_state_dict(torch.load('models2/model_meta_class_0.pth', weights_only=False))

# model 3: general animal predictor
submodel_2 = CIFAR10Classifier(5)
submodel_2.load_state_dict(torch.load('models2/model_meta_class_1.pth', weights_only=False))

# model 4: cat-dog predictor
submodel_3 = CIFAR10Classifier(3)
submodel_3.load_state_dict(torch.load('models2/model_meta_class_2.pth', weights_only=False))

# model 5: catch-all predictor
catchall_model = CIFAR10Classifier(10)
catchall_model.load_state_dict(torch.load('models2/model_general.pth', weights_only=False))
<All keys matched successfully>

Then, we can use the following function to predict the class of an image using the ensemble of models. The arguments of the function are as follows:

  • meta_model: the meta-classifier model
  • sub_models: a list containing the sub-classifier models
  • general_model: the general catch-all classifier model
  • input_images: the input images
  • sub_prediction_maps: mappings between sub-classifier labels and general class labels, packed into a list of dictionaries, one per submodel
def stacked_predictions(meta_model, 
                        sub_models, 
                        general_model, 
                        input_images, 
                        sub_prediction_maps):
    # validate input
    assert len(sub_models) == len(sub_prediction_maps)
    
    # eval mode
    meta_model.eval()
    general_model.eval()
    
    for submodel in sub_models:
        submodel.eval()
    
    # predictions
    with torch.no_grad():
        
        # predict metaclass
        meta_features = meta_model(input_images).argmax(dim=1)
        predictions = torch.zeros_like(meta_features)
        
        # predict subclasses
        for k, submodel in enumerate(sub_models):
            # predict the classes for the observations predicted to belong to this metaclass
            mask = meta_features == k
            sub_prediction = submodel(input_images[mask]).argmax(dim=1)
            
            # map the predictions to the general class labels
            # use -1 to indicate that the metaclass prediction was predicted to be incorrect
            prediction_map = sub_prediction_maps[k]
            temp = [prediction_map.get(pred.item(), -1) for pred in sub_prediction]
            predictions[mask] = torch.tensor(temp)

        # correct incorrect metaclass predictions
        # report the amount of incorrect predictions
        incorrect_mask = predictions == -1
        print(f"Incorrect predictions: {incorrect_mask.sum().item()}")
        predictions[incorrect_mask] = general_model(input_images[incorrect_mask]).argmax(dim=1)

    # return final predictions     
    return predictions

Now, we can compare the performance of the ensemble model vs the general model:

Confusion Matrix for the ensemble model
# predict
y_pred = stacked_predictions(
    meta_model=meta_model, 
    sub_models=[submodel_1, submodel_2, submodel_3],
    general_model=catchall_model,
    input_images=X_test, 
    sub_prediction_maps=[
        {1:0, 2:1, 3:8, 4:9},
        {1:2, 2:4, 3:6, 4:7},
        {1:3, 2:5}, 
    ],
)

# display confusion matrix
cm = confusion_matrix(y_test, y_pred)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
disp.plot(cmap=plt.cm.Blues, 
          text_kw={'fontsize': 8, 'ha': 'center', 'va': 'center'})

# Format numbers as integers without scientific notation
for text in disp.text_.ravel():
    text.set_text(f'{int(float(text.get_text()))}')

plt.title('Confusion Matrix for Stacking Model')
plt.show()

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
Incorrect predictions: 648
Accuracy: 0.6734

Confusion Matrix for the single CNN model
# predict
y_pred = catchall_model(X_test).argmax(dim=1).detach()


# display confusion matrix
cm = confusion_matrix(y_test, y_pred)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
disp.plot(cmap=plt.cm.Blues, 
          text_kw={'fontsize': 8, 'ha': 'center', 'va': 'center'})

# Format numbers as integers without scientific notation
for text in disp.text_.ravel():
    text.set_text(f'{int(float(text.get_text()))}')

plt.title('Confusion Matrix for single CNN Model')
plt.show()

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6404

As we can observe, the stacked model slightly outperforms the single CNN model. However, this is not always the case; which model came out ahead varied with the number of training epochs. This is a cherry-picked example to show that stacking models can outperform single models.


Ensemble methods are a different paradigm from the ones we have seen so far. They combine multiple models into a single model that is more robust and accurate than any of the individual models.