This document aims to explain why bagging is useful, using a very simple use case. It was inspired by one of my colleagues, Stéphanie (to whom this simulation study is dedicated), who made a very good remark during the last coffee break. At my lab, two coffee machines are used. Usually, people pick one at random and have a cup from it. But Stéphanie admitted to having the weird habit of filling her cup with coffee coming from both machines (evenly mixed). She didn’t know it (because Stéphanie has not signed up for my course on “Big Data Analytics”) but Stéphanie is bagging: she is averaging two quantities whose quality varies a lot in order to reduce the variability (and thus to reduce the probability of having a very bad coffee). This is exactly the purpose of bagging: averaging to reduce the variability of your prediction function, hence improving its generalization ability (because it is more robust and does not overfit your specific data).
Let’s simulate this example.
Suppose that the coffee quality can be simulated with a Gaussian law (zero mean and variance equal to 1). A large value indicates a good quality (which can be, depending on the person, the fact that the coffee tastes like tea or that it is very strong). Over a year, we have approximately \(45 \times 5\) coffee breaks, each with coffee available from 2 coffee machines:
set.seed(2807) # this sets the random seed to obtain reproducible results
all_coffees <- matrix(rnorm(2*5*45), ncol=2)
colnames(all_coffees) <- c("machine1", "machine2")
head(all_coffees)
## machine1 machine2
## [1,] -1.1624644 0.3729421
## [2,] -0.1750661 -2.1022511
## [3,] 0.8881384 1.6653870
## [4,] -2.2589691 -1.3589362
## [5,] -0.4766109 -1.2399446
## [6,] 1.6897525 -1.7716457
summary(all_coffees)
## machine1 machine2
## Min. :-2.37226 Min. :-3.02504
## 1st Qu.:-0.57081 1st Qu.:-0.60086
## Median : 0.08466 Median : 0.03284
## Mean : 0.09512 Mean :-0.03527
## 3rd Qu.: 0.68639 3rd Qu.: 0.65883
## Max. : 2.77168 Max. : 2.45368
So, on average, this year, the first coffee machine’s quality is 0.095 and the second coffee machine’s quality is -0.035.
The first strategy - let’s call it the sheep strategy - is to pick a machine at random every day and to fill your cup with the coffee coming from this machine:
sheep <- all_coffees[cbind(1:nrow(all_coffees),
sample(1:2, nrow(all_coffees), replace=TRUE))]
head(sheep)
## [1] -1.1624644 -0.1750661 1.6653870 -1.3589362 -1.2399446 -1.7716457
The second strategy - let’s call it the obsessive strategy - is to always pick the first machine and to fill your cup from it:
obsessive <- all_coffees[ ,1]
The last strategy - let’s call it Stéphanie’s strategy - is to mix the coffee coming from the two coffee machines:
stephanie <- rowMeans(all_coffees) # even mix of the two machines at each break
Now, we analyze the results by computing the average quality for all strategies as well as the standard deviation of the quality (in the output, the first row is the mean and the second row is the standard deviation). A boxplot is also provided to ease the visualization of this very interesting result (which is going to change the coffee break forever):
all_strategies <- cbind(sheep, obsessive, stephanie)
colnames(all_strategies) <- c("sheep", "obsessive", "stephanie")
apply(all_strategies, 2, function(acol) c(mean(acol), sd(acol)))
## sheep obsessive stephanie
## [1,] 0.09269486 0.09511518 0.02992478
## [2,] 0.97618265 0.97509965 0.73945482
boxplot(all_strategies)
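The introduction mentions the probability of having a very bad coffee. A minimal sketch of how this could be checked is given below. It re-simulates the data so that it runs on its own, and it assumes a cut-off of -1 for a "very bad" coffee: this threshold and the names `bad_threshold`, `observed` and `theoretical` are illustrative choices of mine, not part of the original study. Under the Gaussian model, a single machine falls below -1 roughly 16% of the time, whereas an even mix of two independent machines (variance 1/2) does so roughly 8% of the time.

```r
set.seed(2807) # same seed as above, for reproducibility
all_coffees <- matrix(rnorm(2 * 5 * 45), ncol = 2)

bad_threshold <- -1 # hypothetical cut-off for a "very bad" coffee

one_machine <- all_coffees[, 1]    # always the first machine
mixed <- rowMeans(all_coffees)     # even mix of both machines

# Observed share of very bad cups over the simulated year
observed <- c(one_machine = mean(one_machine < bad_threshold),
              mixed = mean(mixed < bad_threshold))
observed

# Theoretical probabilities: N(0, 1) for one machine, N(0, 1/2) for the mix
theoretical <- c(one_machine = pnorm(bad_threshold),
                 mixed = pnorm(bad_threshold, sd = 1 / sqrt(2)))
theoretical
```

With only 225 coffee breaks, the observed shares fluctuate around the theoretical values, so they should be read as rough estimates.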
On average, Stéphanie has not increased her coffee quality but has really reduced its variability: the standard deviation drops from about 0.98 with a single machine to about 0.74 with the mix. I can only advise her to ask for more coffee machines to improve her results…
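Why would more machines help? If the \(k\) machines are independent with the same variance \(\sigma^2\), the variance of their even mix is \(\mathrm{Var}\left(\frac{1}{k}\sum_{i=1}^{k} X_i\right) = \sigma^2 / k\), so the standard deviation shrinks like \(1/\sqrt{k}\). The sketch below illustrates this under the same assumptions as the simulation above (independent standard Gaussian qualities); the numbers of machines in `n_machines` and the variable names are hypothetical choices made for the illustration.

```r
set.seed(2807) # same seed as above, for reproducibility
n_breaks <- 5 * 45               # coffee breaks in a year, as above
n_machines <- c(1, 2, 4, 8, 16)  # hypothetical numbers of machines

# For each k, simulate k independent machines and mix them evenly at every break
mixed_sd <- sapply(n_machines, function(k) {
  coffees <- matrix(rnorm(k * n_breaks), ncol = k)
  sd(rowMeans(coffees))
})

# Compare the simulated standard deviation with the theoretical 1 / sqrt(k)
sd_by_machines <- data.frame(machines = n_machines,
                             simulated_sd = mixed_sd,
                             theoretical_sd = 1 / sqrt(n_machines))
sd_by_machines
```

Note that the gain relies on the machines being independent. In real bagging, the averaged predictors are trained on bootstrap samples of the same data and are therefore correlated, so the variance reduction is usually smaller than \(1/k\).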