library("lubridate")
library("RColorBrewer")
library("car")
library("scales")
library("FactoMineR")
library("factoextra")
The current dataset comes from Kaggle (https://www.kaggle.com/ralle360/historic-tour-de-france-dataset) and provides information about the Tour de France race, with 8 variables describing 2,236 stages of the race. The data were extracted from Wikipedia and prepared by RasmusFiskerBang; they are made available through a CSV file that you can download on my website (the dataset is licensed under the CC0 license - Public Domain).
Even though CSV files can be opened with Excel, we strongly discourage this practice and we will use R directly for this task:
tdf_data <- read.table("stages_TDF.csv", sep = ",", header = TRUE,
stringsAsFactors = FALSE, encoding = "UTF-8",
quote = "\"")
where:
- sep is used to provide the character separating columns;
- header = TRUE indicates that column names are included in the file (in the first row);
- stringsAsFactors = FALSE indicates that strings must not be converted to the type factor (this is the default behavior since R 4.0.0);
- quote = "\"" indicates which character(s) have to be considered as quotation characters and not part of the data.
Other information and options are available in the help pages ?read.table, ?read.csv, ?read.csv2, ?read.delim, and ?read.delim2.
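Since the file is comma-separated with a header row, the same import can be written more compactly with read.csv, whose defaults already match these options (an equivalent sketch, not the original code):
tdf_data <- read.csv("stages_TDF.csv", stringsAsFactors = FALSE,
                     encoding = "UTF-8") # sep = ",", header = TRUE and quote = "\"" are the defaults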
We can take a first look at the data with:
head(tdf_data)
## Stage Date Distance Origin Destination
## 1 1 2017-07-01 14.0 Düsseldorf Düsseldorf
## 2 2 2017-07-02 203.5 Düsseldorf Liège
## 3 3 2017-07-03 212.5 Verviers Longwy
## 4 4 2017-07-04 207.5 Mondorf-les-Bains Vittel
## 5 5 2017-07-05 160.5 Vittel La Planche des Belles Filles
## 6 6 2017-07-06 216.0 Vesoul Troyes
## Type Winner Winner_Country
## 1 Individual time trial Geraint Thomas GBR
## 2 Flat stage Marcel Kittel GER
## 3 Medium mountain stage Peter Sagan SVK
## 4 Flat stage Arnaud Démare FRA
## 5 Medium mountain stage Fabio Aru ITA
## 6 Flat stage Marcel Kittel GER
summary(tdf_data)
## Stage Date Distance Origin
## Length:2236 Length:2236 Min. : 1.0 Length:2236
## Class :character Class :character 1st Qu.:156.0 Class :character
## Mode :character Mode :character Median :199.0 Mode :character
## Mean :196.8
## 3rd Qu.:236.0
## Max. :482.0
## Destination Type Winner Winner_Country
## Length:2236 Length:2236 Length:2236 Length:2236
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
When missing data are present, this last command also shows how many values are missing (identified as NA) in each column.
The dataset dimension is obtained with:
dim(tdf_data)
## [1] 2236 8
Information on column types can be obtained with:
sapply(tdf_data, class)
## Stage Date Distance Origin Destination
## "character" "character" "numeric" "character" "character"
## Type Winner Winner_Country
## "character" "character" "character"
which indicates that all columns are character except for the third one, which is numeric. Sometimes, numeric variables (and more specifically integers) are used to code for categorical variables but this is not the case in this dataset.
To perform the subsequent analyses properly, it is best to precisely define the types of the columns:
- Stage: is a categorical variable, best converted to a factor, which is the proper type in R for categorical variables with a finite number of possible values:
tdf_data$Stage <- factor(tdf_data$Stage)
- Date: is a date, converted with as.Date; the year of each stage can also be extracted into a new column with lubridate's year():
tdf_data$Date <- as.Date(tdf_data$Date, format = "%Y-%m-%d")
tdf_data$Year <- year(tdf_data$Date)
- Distance: is indeed a numeric variable;
- Origin and Destination: are categorical variables, but the number of possible values (towns of origin and destination of the stages) is so large that there is only a minor benefit in converting them to factors;
- Type: is a categorical variable that is better converted to a factor:
tdf_data$Type <- factor(tdf_data$Type)
- Winner: is the winner's name, so the number of possible values is too large for a factor conversion to bring much benefit;
- Winner_Country: is the country code of the winner, so it is better converted to a factor:
tdf_data$Winner_Country <- factor(tdf_data$Winner_Country)
After these conversions, the summary of the dataset is already more informative:
summary(tdf_data)
## Stage Date Distance Origin
## 4 : 101 Min. :1903-07-01 Min. : 1.0 Length:2236
## 11 : 100 1st Qu.:1938-07-17 1st Qu.:156.0 Class :character
## 2 : 100 Median :1969-07-02 Median :199.0 Mode :character
## 9 : 100 Mean :1966-08-02 Mean :196.8
## 10 : 99 3rd Qu.:1991-07-20 3rd Qu.:236.0
## 6 : 99 Max. :2017-07-23 Max. :482.0
## (Other):1637
## Destination Type Winner
## Length:2236 Plain stage :1053 Length:2236
## Class :character Stage with mountain(s): 530 Class :character
## Mode :character Individual time trial : 205 Mode :character
## Flat stage : 110
## Team time trial : 87
## Hilly stage : 76
## (Other) : 175
## Winner_Country Year
## FRA :691 Min. :1903
## BEL :460 1st Qu.:1938
## ITA :262 Median :1969
## NED :157 Mean :1966
## ESP :125 3rd Qu.:1991
## GER : 71 Max. :2017
## (Other):470
The variable Distance is the distance of the stage (in km):
mean(tdf_data$Distance) # mean
## [1] 196.783
median(tdf_data$Distance) # median
## [1] 199
min(tdf_data$Distance) # minimum
## [1] 1
max(tdf_data$Distance) # maximum
## [1] 482
# quartiles and min/max
quantile(tdf_data$Distance, probs = c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 1 156 199 236 482
The option na.rm = TRUE must be used when you have missing values that you do not want to be taken into account in the computation (otherwise, most of these functions would return the value NA).
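A toy illustration of this behavior (with made-up values):
x <- c(150, NA, 210) # a vector with one missing value
mean(x)              # returns NA
mean(x, na.rm = TRUE) # returns 180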
Exercise: How to interpret these values? More precisely, what do they say about the variable distribution?
Some of these values are also available with
summary(tdf_data$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 156.0 199.0 196.8 236.0 482.0
Dispersion characteristics are obtained with:
var(tdf_data$Distance) # variance
## [1] 8131.78
sd(tdf_data$Distance) # standard deviation
## [1] 90.17639
range(tdf_data$Distance) # range
## [1] 1 482
diff(range(tdf_data$Distance))
## [1] 481
Exercise: How would you compute the inter-quartile range (in just one line of code)?
## 75%
## 80
## [1] 80
Exercise: What is the coefficient of variation (CV) for this variable?
## [1] 0.4582529
Standard modifications of data include:
tdf_data$cut1 <- cut(tdf_data$Distance, breaks = 5)
table(tdf_data$cut1)
##
## (0.519,97.2] (97.2,193] (193,290] (290,386] (386,482]
## 320 676 950 211 79
tdf_data$cut2 <- cut(tdf_data$Distance, breaks = 5, labels = FALSE)
table(tdf_data$cut2)
##
## 1 2 3 4 5
## 320 676 950 211 79
tdf_data$cut3 <- cut(tdf_data$Distance, breaks = seq(0, 500, by = 100))
table(tdf_data$cut3)
##
## (0,100] (100,200] (200,300] (300,400] (400,500]
## 324 820 825 210 57
Exercise: What is the mode of cut3?
## [1] "(200,300]"
tdf_data$DScaled <- as.vector(scale(tdf_data$Distance))
summary(tdf_data$DScaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.17111 -0.45226 0.02458 0.00000 0.43489 3.16288
var(tdf_data$DScaled)
## [1] 1
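As a quick sanity check, scale() is equivalent to subtracting the mean and dividing by the standard deviation (a verification sketch):
manual <- (tdf_data$Distance - mean(tdf_data$Distance)) / sd(tdf_data$Distance)
all.equal(manual, tdf_data$DScaled) # TRUE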
Before we start, a short note on color palettes:
display.brewer.all()
Pie charts are obtained from a table of counts (here for the variable Type; the same syntax would work for a discretized variable such as cut2):
pie(table(tdf_data$Type), col = c(brewer.pal(9, "Set1"), brewer.pal(9, "Set2"))) # 18 levels, so 18 colors are needed
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
barplot(table(tdf_data$Winner_Country), col = "darkgreen", las = 3,
cex.names = 0.6, cex.axis=0.6)
hist(tdf_data$Distance)
hist(tdf_data$Distance, main = "Distribution of stage distances",
xlab = "distance (km)", breaks = 20)
plot(density(tdf_data$Distance))
hist(tdf_data$Distance, main = "Distribution of stage distances",
xlab = "distance (km)", breaks = 20, freq = FALSE)
lines(density(tdf_data$Distance), col = "red", lwd = 2)
boxplot(tdf_data$Distance)
boxplot(tdf_data$Distance, horizontal = TRUE, notch = TRUE)
Contingency tables are obtained with the function table (here on a subset of the winner countries and of the stage types).
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
cont_table <- table(tdf_data$Type[selected_stages],
tdf_data$Winner_Country[selected_stages, drop = TRUE])
cont_table
##
## AUS BEL COL DEN ESP FRA FRG GBR GER ITA LUX
## Flat cobblestone stage 0 0 0 0 0 0 0 0 0 0 0 0
## Flat stage 0 5 2 0 1 5 10 0 30 28 9 0
## Flat Stage 0 0 4 0 0 0 2 0 0 0 2 0
## Half Stage 0 0 2 0 0 1 0 0 0 0 0 0
## High mountain stage 0 1 2 3 0 5 11 0 4 1 4 1
## Hilly stage 0 1 12 3 4 6 20 0 1 2 5 0
## Individual time trial 0 3 37 1 0 19 58 4 10 9 8 4
## Intermediate stage 0 0 1 0 0 0 0 0 0 0 2 0
## Medium mountain stage 0 4 3 0 0 3 5 0 4 4 3 0
## Mountain stage 0 1 0 1 3 9 9 0 0 2 3 4
## Mountain Stage 0 0 2 0 0 2 2 0 0 0 1 1
## Mountain time trial 0 0 1 0 0 3 1 0 1 0 2 1
## Plain stage 1 14 266 0 7 17 369 6 11 21 141 26
## Plain stage with cobblestones 0 0 0 0 0 0 0 0 0 0 0 0
## Stage with mountain 0 0 5 0 0 0 5 0 0 0 1 0
## Stage with mountain(s) 3 0 105 8 3 53 184 4 6 3 81 29
## Team time trial 48 0 18 0 0 0 15 0 0 0 0 4
## Transition stage 0 0 0 0 0 2 0 0 0 1 0 0
##
## NED NOR POR SUI USA
## Flat cobblestone stage 1 1 0 0 0
## Flat stage 2 7 0 0 1
## Flat Stage 1 0 0 0 0
## Half Stage 2 0 0 0 0
## High mountain stage 1 2 2 0 0
## Hilly stage 11 1 0 2 0
## Individual time trial 13 1 1 15 16
## Intermediate stage 0 0 0 0 0
## Medium mountain stage 2 1 2 0 0
## Mountain stage 0 0 0 0 4
## Mountain Stage 3 0 0 0 0
## Mountain time trial 2 0 0 1 0
## Plain stage 100 2 5 26 5
## Plain stage with cobblestones 1 0 0 0 0
## Stage with mountain 0 0 0 0 0
## Stage with mountain(s) 18 1 2 13 10
## Team time trial 0 0 0 0 2
## Transition stage 0 1 0 0 0
# row profiles: each Type row divided by its row total, transposed for display
# (equivalently: t(prop.table(cont_table, margin = 1)))
t(cont_table / rowSums(cont_table))
##
## Flat cobblestone stage Flat stage Flat Stage Half Stage
## 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## AUS 0.000000e+00 2.500000e+00 0.000000e+00 0.000000e+00
## BEL 0.000000e+00 2.000000e-02 2.000000e+00 5.000000e-01
## COL 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## DEN 0.000000e+00 2.000000e-01 0.000000e+00 0.000000e+00
## ESP 0.000000e+00 1.351351e-01 0.000000e+00 1.111111e-01
## FRA 0.000000e+00 1.470588e-01 5.405405e-02 0.000000e+00
## FRG 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## GBR 0.000000e+00 1.000000e+01 0.000000e+00 0.000000e+00
## GER 0.000000e+00 9.032258e-01 0.000000e+00 0.000000e+00
## ITA 0.000000e+00 2.500000e-01 6.451613e-02 0.000000e+00
## LUX 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## NED 9.832842e-04 1.666667e-01 9.090909e-02 5.555556e-02
## NOR 1.000000e+00 6.882989e-03 0.000000e+00 0.000000e+00
## POR 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## SUI 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## USA 0.000000e+00 1.912046e-03 0.000000e+00 0.000000e+00
##
## High mountain stage Hilly stage Individual time trial Intermediate stage
## 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## AUS 1.912046e-03 9.090909e-02 3.000000e+00 0.000000e+00
## BEL 2.298851e-02 2.294455e-02 3.363636e+00 1.000000e+00
## COL 7.500000e-01 3.448276e-02 1.912046e-03 0.000000e+00
## DEN 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00
## ESP 5.000000e-02 3.000000e+00 4.750000e+00 0.000000e+00
## FRA 1.222222e+00 2.000000e-01 2.900000e+01 0.000000e+00
## FRG 0.000000e+00 0.000000e+00 4.000000e-02 0.000000e+00
## GBR 1.081081e-01 2.000000e-01 1.111111e+00 0.000000e+00
## GER 1.470588e-02 5.405405e-02 1.800000e+00 0.000000e+00
## ITA 2.010050e-02 7.352941e-02 2.162162e-01 4.000000e-01
## LUX 3.333333e-01 0.000000e+00 5.882353e-02 0.000000e+00
## NED 3.225806e-02 3.666667e+00 6.532663e-02 0.000000e+00
## NOR 5.555556e-02 3.225806e-02 3.333333e-01 0.000000e+00
## POR 1.818182e-01 0.000000e+00 3.225806e-02 0.000000e+00
## SUI 0.000000e+00 1.818182e-01 4.166667e-01 0.000000e+00
## USA 0.000000e+00 0.000000e+00 1.454545e+00 0.000000e+00
##
## Medium mountain stage Mountain stage Mountain Stage Mountain time trial
## 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## AUS 3.333333e-01 9.090909e-02 0.000000e+00 0.000000e+00
## BEL 2.949853e-03 0.000000e+00 1.818182e-01 2.777778e-02
## COL 0.000000e+00 9.832842e-04 0.000000e+00 0.000000e+00
## DEN 0.000000e+00 3.000000e+00 0.000000e+00 0.000000e+00
## ESP 5.736138e-03 8.181818e-01 2.000000e+00 2.949853e-03
## FRA 5.747126e-02 1.720841e-02 1.818182e-01 1.000000e+00
## FRG 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## GBR 2.000000e+00 0.000000e+00 0.000000e+00 1.912046e-03
## GER 4.000000e-02 1.000000e+00 0.000000e+00 0.000000e+00
## ITA 3.333333e-01 3.000000e-02 5.000000e-01 5.000000e-01
## LUX 0.000000e+00 4.444444e-01 1.000000e-02 5.000000e-01
## NED 5.405405e-02 0.000000e+00 3.333333e-01 2.000000e-02
## NOR 1.470588e-02 0.000000e+00 0.000000e+00 0.000000e+00
## POR 1.005025e-02 0.000000e+00 0.000000e+00 0.000000e+00
## SUI 0.000000e+00 0.000000e+00 0.000000e+00 2.702703e-02
## USA 0.000000e+00 1.333333e+00 0.000000e+00 0.000000e+00
##
## Plain stage Plain stage with cobblestones Stage with mountain
## 5.025126e-03 0.000000e+00 0.000000e+00
## AUS 4.666667e+00 0.000000e+00 0.000000e+00
## BEL 8.580645e+00 0.000000e+00 2.512563e-02
## COL 0.000000e+00 0.000000e+00 0.000000e+00
## DEN 6.363636e-01 0.000000e+00 0.000000e+00
## ESP 1.416667e+00 0.000000e+00 0.000000e+00
## FRA 3.628319e-01 0.000000e+00 4.545455e-01
## FRG 6.000000e+00 0.000000e+00 0.000000e+00
## GBR 1.000000e+00 0.000000e+00 0.000000e+00
## GER 4.015296e-02 0.000000e+00 0.000000e+00
## ITA 1.620690e+00 0.000000e+00 9.090909e-02
## LUX 6.500000e+00 0.000000e+00 0.000000e+00
## NED 5.000000e+01 2.500000e-01 0.000000e+00
## NOR 2.000000e-02 0.000000e+00 0.000000e+00
## POR 5.555556e-01 0.000000e+00 0.000000e+00
## SUI 5.200000e+00 0.000000e+00 0.000000e+00
## USA 1.351351e-01 0.000000e+00 0.000000e+00
##
## Stage with mountain(s) Team time trial Transition stage
## 6.000000e-01 5.333333e+00 0.000000e+00
## AUS 0.000000e+00 0.000000e+00 0.000000e+00
## BEL 1.544118e+00 4.864865e-01 0.000000e+00
## COL 4.020101e-02 0.000000e+00 0.000000e+00
## DEN 1.000000e+00 0.000000e+00 0.000000e+00
## ESP 1.709677e+00 0.000000e+00 1.005025e-02
## FRA 5.111111e+00 4.838710e-01 0.000000e+00
## FRG 3.636364e-01 0.000000e+00 0.000000e+00
## GBR 5.000000e-01 0.000000e+00 0.000000e+00
## GER 2.949853e-03 0.000000e+00 9.090909e-02
## ITA 8.100000e+01 0.000000e+00 0.000000e+00
## LUX 2.636364e+00 4.000000e+00 0.000000e+00
## NED 3.441683e-02 0.000000e+00 0.000000e+00
## NOR 1.149425e-02 0.000000e+00 9.090909e-02
## POR 5.000000e-01 0.000000e+00 0.000000e+00
## SUI 6.500000e+00 0.000000e+00 0.000000e+00
## USA 1.000000e-01 1.000000e+00 0.000000e+00
Exercise: How do we interpret the numbers in the last table (we call them "row profiles")? For instance, what does 0.4545454545, on the row of France and the column "Stage with mountain", mean? Compute the column profiles.
##
## Flat cobblestone stage Flat stage Flat Stage Half Stage
## 0.000000000 0.000000000 0.000000000 0.000000000
## AUS 0.000000000 0.172413793 0.000000000 0.000000000
## BEL 0.000000000 0.004347826 0.008695652 0.004347826
## COL 0.000000000 0.000000000 0.000000000 0.000000000
## DEN 0.000000000 0.055555556 0.000000000 0.000000000
## ESP 0.000000000 0.040000000 0.000000000 0.008000000
## FRA 0.000000000 0.014471780 0.002894356 0.000000000
## FRG 0.000000000 0.000000000 0.000000000 0.000000000
## GBR 0.000000000 0.447761194 0.000000000 0.000000000
## GER 0.000000000 0.394366197 0.000000000 0.000000000
## ITA 0.000000000 0.034351145 0.007633588 0.000000000
## LUX 0.000000000 0.000000000 0.000000000 0.000000000
## NED 0.006369427 0.012738854 0.006369427 0.012738854
## NOR 0.058823529 0.411764706 0.000000000 0.000000000
## POR 0.000000000 0.000000000 0.000000000 0.000000000
## SUI 0.000000000 0.000000000 0.000000000 0.000000000
## USA 0.000000000 0.026315789 0.000000000 0.000000000
##
## High mountain stage Hilly stage Individual time trial Intermediate stage
## 0.000000000 0.000000000 0.000000000 0.000000000
## AUS 0.034482759 0.034482759 0.103448276 0.000000000
## BEL 0.004347826 0.026086957 0.080434783 0.002173913
## COL 0.187500000 0.187500000 0.062500000 0.000000000
## DEN 0.000000000 0.222222222 0.000000000 0.000000000
## ESP 0.040000000 0.048000000 0.152000000 0.000000000
## FRA 0.015918958 0.028943560 0.083936324 0.000000000
## FRG 0.000000000 0.000000000 0.285714286 0.000000000
## GBR 0.059701493 0.014925373 0.149253731 0.000000000
## GER 0.014084507 0.028169014 0.126760563 0.000000000
## ITA 0.015267176 0.019083969 0.030534351 0.007633588
## LUX 0.014285714 0.000000000 0.057142857 0.000000000
## NED 0.006369427 0.070063694 0.082802548 0.000000000
## NOR 0.117647059 0.058823529 0.058823529 0.000000000
## POR 0.166666667 0.000000000 0.083333333 0.000000000
## SUI 0.000000000 0.035087719 0.263157895 0.000000000
## USA 0.000000000 0.000000000 0.421052632 0.000000000
##
## Medium mountain stage Mountain stage Mountain Stage Mountain time trial
## 0.000000000 0.000000000 0.000000000 0.000000000
## AUS 0.137931034 0.034482759 0.000000000 0.000000000
## BEL 0.006521739 0.000000000 0.004347826 0.002173913
## COL 0.000000000 0.062500000 0.000000000 0.000000000
## DEN 0.000000000 0.166666667 0.000000000 0.000000000
## ESP 0.024000000 0.072000000 0.016000000 0.024000000
## FRA 0.007235890 0.013024602 0.002894356 0.001447178
## FRG 0.000000000 0.000000000 0.000000000 0.000000000
## GBR 0.059701493 0.000000000 0.000000000 0.014925373
## GER 0.056338028 0.028169014 0.000000000 0.000000000
## ITA 0.011450382 0.011450382 0.003816794 0.007633588
## LUX 0.000000000 0.057142857 0.014285714 0.014285714
## NED 0.012738854 0.000000000 0.019108280 0.012738854
## NOR 0.058823529 0.000000000 0.000000000 0.000000000
## POR 0.166666667 0.000000000 0.000000000 0.000000000
## SUI 0.000000000 0.000000000 0.000000000 0.017543860
## USA 0.000000000 0.105263158 0.000000000 0.000000000
##
## Plain stage Plain stage with cobblestones Stage with mountain
## 0.019230769 0.000000000 0.000000000
## AUS 0.482758621 0.000000000 0.000000000
## BEL 0.578260870 0.000000000 0.010869565
## COL 0.000000000 0.000000000 0.000000000
## DEN 0.388888889 0.000000000 0.000000000
## ESP 0.136000000 0.000000000 0.000000000
## FRA 0.534008683 0.000000000 0.007235890
## FRG 0.428571429 0.000000000 0.000000000
## GBR 0.164179104 0.000000000 0.000000000
## GER 0.295774648 0.000000000 0.000000000
## ITA 0.538167939 0.000000000 0.003816794
## LUX 0.371428571 0.000000000 0.000000000
## NED 0.636942675 0.006369427 0.000000000
## NOR 0.117647059 0.000000000 0.000000000
## POR 0.416666667 0.000000000 0.000000000
## SUI 0.456140351 0.000000000 0.000000000
## USA 0.131578947 0.000000000 0.000000000
##
## Stage with mountain(s) Team time trial Transition stage
## 0.057692308 0.923076923 0.000000000
## AUS 0.000000000 0.000000000 0.000000000
## BEL 0.228260870 0.039130435 0.000000000
## COL 0.500000000 0.000000000 0.000000000
## DEN 0.166666667 0.000000000 0.000000000
## ESP 0.424000000 0.000000000 0.016000000
## FRA 0.266280753 0.021707670 0.000000000
## FRG 0.285714286 0.000000000 0.000000000
## GBR 0.089552239 0.000000000 0.000000000
## GER 0.042253521 0.000000000 0.014084507
## ITA 0.309160305 0.000000000 0.000000000
## LUX 0.414285714 0.057142857 0.000000000
## NED 0.114649682 0.000000000 0.000000000
## NOR 0.058823529 0.000000000 0.058823529
## POR 0.166666667 0.000000000 0.000000000
## SUI 0.228070175 0.000000000 0.000000000
## USA 0.263157895 0.052631579 0.000000000
The strength of the association can be quantified with Cramér's V:
chisq <- chisq.test(cont_table)$statistic # chi-squared statistic
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
ddl <- min(nrow(cont_table) - 1, ncol(cont_table) - 1) # degrees of freedom ("ddl": degrés de liberté)
cramerV <- sqrt(chisq / (sum(cont_table) * ddl)) # Cramér's V, between 0 and 1
cramerV
## X-squared
## 0.2617944
Barplots are obtained using the contingency table as well:
selected_stages <- selected_stages &
(tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
cont_table <- table(tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Winner_Country[selected_stages, drop = TRUE])
barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
cex.names = 0.6)
barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
beside = TRUE, col = brewer.pal(3, "Set3"), cex.names = 0.6)
Covariances can be obtained using the function cov:
cov(tdf_data$Distance, tdf_data$Year)
## [1] -1332.007
cov(tdf_data$Distance, tdf_data$Year, method = "spearman")
## [1] -181210.6
cov(tdf_data[ ,c("Distance", "Year", "DScaled")])
## Distance Year DScaled
## Distance 8131.78045 -1332.00737 90.17639
## Year -1332.00737 944.15261 -14.77113
## DScaled 90.17639 -14.77113 1.00000
cor(tdf_data$Distance, tdf_data$Year)
## [1] -0.4807206
cor(tdf_data$Distance, tdf_data$Year, method = "spearman")
## [1] -0.4347636
cor(tdf_data[ ,c("Distance", "Year", "DScaled")])
## Distance Year DScaled
## Distance 1.0000000 -0.4807206 1.0000000
## Year -0.4807206 1.0000000 -0.4807206
## DScaled 1.0000000 -0.4807206 1.0000000
Correlations are obtained with the function cor, which takes the same arguments as cov.
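As a quick check, the correlation is simply the covariance of the standardized variables (here reusing the column DScaled created above):
cov(tdf_data$DScaled, as.vector(scale(tdf_data$Year))) # equals cor(Distance, Year), about -0.48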
The partial correlation between Distance and Year given DScaled is obtained with:
cor(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals,
lm(tdf_data$Year ~ tdf_data$DScaled)$residuals)
## [1] -0.03113725
Exercise: Is this an expected result? Look at the residuals of the first model: what can you say?
summary(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.890e-12 -2.640e-16 2.373e-15 0.000e+00 4.936e-15 4.392e-14
Dot plots (scatterplots) between two variables are obtained with:
plot(tdf_data$Year, tdf_data$Distance, pch = 19)
or with scatterplot matrices:
scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")])
scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")],
col = "black", regLine = FALSE, smooth = FALSE, pch = "+")
Exercise: Can you comment on the specific plot between Distance and DScaled?
The within and between variances of Distance with respect to Winner_Country are obtained from the function anova (in the column Mean Sq, the within-group variance corresponds to the row Residuals):
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
anova(lm(tdf_data$Distance[selected_stages] ~
factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
## Analysis of Variance Table
##
## Response: tdf_data$Distance[selected_stages]
## Df Sum Sq
## factor(tdf_data$Winner_Country[selected_stages, drop = TRUE]) 16 2312902
## Residuals 2139 15488666
## Mean Sq F value
## factor(tdf_data$Winner_Country[selected_stages, drop = TRUE]) 144556 19.963
## Residuals 7241
## Pr(>F)
## factor(tdf_data$Winner_Country[selected_stages, drop = TRUE]) < 2.2e-16 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and a correlation ratio can be computed from the column Mean Sq:
cor_ratio <- anova(lm(tdf_data$Distance[selected_stages] ~
factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
sqrt(cor_ratio[1, "Mean Sq"]/(cor_ratio[1, "Mean Sq"] + cor_ratio[2, "Mean Sq"]))
## [1] 0.9758575
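Note that the correlation ratio \(\eta\) is more commonly defined from the sums of squares rather than the mean squares; a quick alternative computation from the same ANOVA table:
sqrt(cor_ratio[1, "Sum Sq"] / (cor_ratio[1, "Sum Sq"] + cor_ratio[2, "Sum Sq"])) # about 0.36 here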
Parallel boxplots are obtained using the same syntax with the ~:
boxplot(tdf_data$Distance[selected_stages] ~
tdf_data$Winner_Country[selected_stages, drop = TRUE], las = 3,
xlab = "winner country", ylab = "distance")
To obtain multiple histograms on the same plot, you can overlay them with add = TRUE and semi-transparent colors, for instance as in the following sketch (the two stage types shown are just examples):
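# a minimal sketch: overlay two histograms with shared breaks and
# semi-transparent colors (the stage types are illustrative choices)
hist(tdf_data$Distance[tdf_data$Type == "Plain stage"],
     breaks = seq(0, 500, by = 25), col = adjustcolor("darkblue", 0.5),
     main = "Distances by stage type", xlab = "distance (km)")
hist(tdf_data$Distance[tdf_data$Type == "Individual time trial"],
     breaks = seq(0, 500, by = 25), col = adjustcolor("darkred", 0.5), add = TRUE)
legend("topright", fill = adjustcolor(c("darkblue", "darkred"), 0.5),
       legend = c("Plain stage", "Individual time trial"))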
To test whether the average of Distance is equal to 200, we first assess normality with a Shapiro-Wilk test:
shapiro.test(tdf_data$Distance)
##
## Shapiro-Wilk normality test
##
## data: tdf_data$Distance
## W = 0.96778, p-value < 2.2e-16
Despite the significant deviation from normality, we will perform a Student test (for the sake of the example):
res <- t.test(tdf_data$Distance, mu = 200, conf.level = 0.99)
res
##
## One Sample t-test
##
## data: tdf_data$Distance
## t = -1.6869, df = 2235, p-value = 0.09176
## alternative hypothesis: true mean is not equal to 200
## 99 percent confidence interval:
## 191.8666 201.6994
## sample estimates:
## mean of x
## 196.783
res$statistic
## t
## -1.686922
res$p.value
## [1] 0.09175796
res$conf.int
## [1] 191.8666 201.6994
## attr(,"conf.level")
## [1] 0.99
res <- wilcox.test(tdf_data$Distance, mu = 200, conf.int = TRUE)
res
##
## Wilcoxon signed rank test with continuity correction
##
## data: tdf_data$Distance
## V = 1186216, p-value = 0.09295
## alternative hypothesis: true location is not equal to 200
## 95 percent confidence interval:
## 194.4999 200.5000
## sample estimates:
## (pseudo)median
## 197.5
res$statistic
## V
## 1186216
res$p.value
## [1] 0.09294784
res$conf.int
## [1] 194.4999 200.5000
## attr(,"conf.level")
## [1] 0.95
For small contingency tables, the independence between rows and columns can be tested with Fisher's exact test (to be preferred) or a \(\chi^2\) test (only when Fisher's exact test is computationally too heavy to run or when the sample size is sufficiently large).
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
selected_stages <- selected_stages &
(tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Type", "Distance")
cont_table <- table(tdf_small$Winner_Country, tdf_small$Type)
cont_table
##
## Individual time trial Plain stage Stage with mountain(s)
## 0 1 3
## AUS 3 14 0
## BEL 37 266 105
## COL 1 0 8
## DEN 0 7 3
## ESP 19 17 53
## FRA 58 369 184
## FRG 4 6 4
## GBR 10 11 6
## GER 9 21 3
## ITA 8 141 81
## LUX 4 26 29
## NED 13 100 18
## NOR 1 2 1
## POR 1 5 2
## SUI 15 26 13
## USA 16 5 10
res <- fisher.test(cont_table)
# Error in fisher.test(cont_table) :
# FEXACT error 7(location). LDSTP=18330 is too small for this problem,
# (pastp=11.752, ipn_0:=ipoin[itp=563]=3932, stp[ipn_0]=10.2479).
# Increase workspace or consider using 'simulate.p.value=TRUE'
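As the error message suggests, a Monte-Carlo approximation of the \(p\)-value is a possible fallback (a sketch; B sets the number of simulated tables):
res <- fisher.test(cont_table, simulate.p.value = TRUE, B = 1e5)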
In addition to being better suited to large contingency tables, the \(\chi^2\) test also provides interesting statistics for interpreting the results:
res <- chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
res
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 241.94, df = 32, p-value < 2.2e-16
res$observed
##
## Individual time trial Plain stage Stage with mountain(s)
## 0 1 3
## AUS 3 14 0
## BEL 37 266 105
## COL 1 0 8
## DEN 0 7 3
## ESP 19 17 53
## FRA 58 369 184
## FRG 4 6 4
## GBR 10 11 6
## GER 9 21 3
## ITA 8 141 81
## LUX 4 26 29
## NED 13 100 18
## NOR 1 2 1
## POR 1 5 2
## SUI 15 26 13
## USA 16 5 10
res$expected
##
## Individual time trial Plain stage Stage with mountain(s)
## 0.4577343 2.339275 1.202990
## AUS 1.9453709 9.941921 5.112708
## BEL 46.6889017 238.606095 122.705003
## COL 1.0299022 5.263370 2.706728
## DEN 1.1443358 5.848189 3.007476
## ESP 10.1845888 52.048879 26.766532
## FRA 69.9189189 357.324324 183.756757
## FRG 1.6020702 8.187464 4.210466
## GBR 3.0897067 15.790109 8.120184
## GER 3.7763082 19.299022 9.924669
## ITA 26.3197240 134.508338 69.171938
## LUX 6.7515814 34.504313 17.744106
## NED 14.9907993 76.611271 39.397930
## NOR 0.4577343 2.339275 1.202990
## POR 0.9154687 4.678551 2.405980
## SUI 6.1794135 31.580219 16.240368
## USA 3.5474411 18.129385 9.323174
res$residuals^2
##
## Individual time trial Plain stage Stage with mountain(s)
## 4.577343e-01 7.667582e-01 2.684348e+00
## AUS 5.717380e-01 1.656421e+00 5.112708e+00
## BEL 2.010645e+00 3.145041e+00 2.554640e+00
## COL 8.681835e-04 5.263370e+00 1.035151e+01
## DEN 1.144336e+00 2.268513e-01 1.858170e-05
## ESP 7.630301e+00 2.360135e+01 2.571102e+01
## FRA 2.031791e+00 3.815061e-01 3.219869e-04
## FRG 3.589148e+00 5.844299e-01 1.052041e-02
## GBR 1.545524e+01 1.453134e+00 5.535811e-01
## GER 7.225829e+00 1.499208e-01 4.831501e+00
## ITA 1.275136e+01 3.133016e-01 2.022541e+00
## LUX 1.121397e+00 2.096067e+00 7.140126e+00
## NED 2.643810e-01 7.140368e+00 1.162171e+01
## NOR 6.424077e-01 4.920662e-02 3.425217e-02
## POR 7.805344e-03 2.208580e-02 6.850435e-02
## SUI 1.259064e+01 9.860235e-01 6.465361e-01
## USA 4.371214e+01 9.508361e+00 4.913489e-02
barplot(t(cont_table), legend.text = TRUE, xlab = "country", ylab = "frequency",
cex.names = 0.6, col = brewer.pal(3, "Set3"))
Exercise: Titanic[ , Sex = "Male", Age = "Adult", ] is the contingency table of male adults on the Titanic for the variables Class and Survived. Are these two variables independent? Which classes deviate the most from independence? Same questions for women.
##
## Fisher's Exact Test for Count Data
##
## data: Titanic[, Sex = "Male", Age = "Adult", ]
## p-value = 1.559e-08
## alternative hypothesis: two.sided
##
## Pearson's Chi-squared test
##
## data: Titanic[, Sex = "Male", Age = "Adult", ]
## X-squared = 37.988, df = 3, p-value = 2.843e-08
## Survived
## Class No Yes
## 1st 118 57
## 2nd 154 14
## 3rd 387 75
## Crew 670 192
## Survived
## Class No Yes
## 1st 139.5171 35.48290
## 2nd 133.9364 34.06359
## 3rd 368.3251 93.67487
## Crew 687.2214 174.77864
## Survived
## Class No Yes
## 1st 3.3184854 13.0481274
## 2nd 3.0055123 11.8175321
## 3rd 0.9468552 3.7229900
## Crew 0.4315569 1.6968612
##
## Pearson's Chi-squared test
##
## data: Titanic[, Sex = "Female", Age = "Adult", ]
## X-squared = 117.31, df = 3, p-value < 2.2e-16
## Survived
## Class No Yes
## 1st 4 140
## 2nd 13 80
## 3rd 89 76
## Crew 3 20
## Survived
## Class No Yes
## 1st 36.931765 107.06824
## 2nd 23.851765 69.14824
## 3rd 42.317647 122.68235
## Crew 5.898824 17.10118
## Survived
## Class No Yes
## 1st 29.3649961 10.1290651
## 2nd 4.9371943 1.7030196
## 3rd 51.4972412 17.7632889
## Crew 1.4245515 0.4913801
Correlation tests are performed with:
cor.test(tdf_data$Distance, tdf_data$Year)
##
## Pearson's product-moment correlation
##
## data: tdf_data$Distance and tdf_data$Year
## t = -25.912, df = 2234, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5119713 -0.4481991
## sample estimates:
## cor
## -0.4807206
cor.test(tdf_data$Distance, tdf_data$Year, method = "spearman")
## Warning in cor.test.default(tdf_data$Distance, tdf_data$Year, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: tdf_data$Distance and tdf_data$Year
## S = 2673279681, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4347636
plot(tdf_data$Distance, tdf_data$Year)
Exercise: The object Soils (from the carData package, loaded with car) contains characteristics of soil samples. Variables 6-13 are numerical characteristics. Make a scatterplot matrix of these variables and pick two variables that you think are linearly correlated. Test your hypothesis.
##
## Pearson's product-moment correlation
##
## data: pH and Ca
## t = 9.3221, df = 46, p-value = 3.614e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6809493 0.8885997
## sample estimates:
## cor
## 0.8086293
Comparing the means of the numeric variable Distance between different countries of origin of the winner (Winner_Country), here for DEN and FRA:
selected_countries <- c("DEN", "FRA")
selected_stages <- tdf_data$Winner_Country %in% selected_countries
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Distance")
t.test(tdf_small$Distance ~ tdf_small$Winner_Country)
##
## Welch Two Sample t-test
##
## data: tdf_small$Distance by tdf_small$Winner_Country
## t = -1.688, df = 20.916, p-value = 0.1063
## alternative hypothesis: true difference in means between group DEN and group FRA is not equal to 0
## 95 percent confidence interval:
## -44.056699 4.584693
## sample estimates:
## mean in group DEN mean in group FRA
## 198.8889 218.6249
t.test(tdf_small$Distance ~ tdf_small$Winner_Country, var.equal = TRUE)
##
## Two Sample t-test
##
## data: tdf_small$Distance by tdf_small$Winner_Country
## t = -0.86449, df = 707, p-value = 0.3876
## alternative hypothesis: true difference in means between group DEN and group FRA is not equal to 0
## 95 percent confidence interval:
## -64.55813 25.08613
## sample estimates:
## mean in group DEN mean in group FRA
## 198.8889 218.6249
How do we know whether two variances are equal? With the Bartlett test or Fisher's F test:
bartlett.test(tdf_small$Distance ~ tdf_small$Winner_Country)
##
## Bartlett test of homogeneity of variances
##
## data: tdf_small$Distance by tdf_small$Winner_Country
## Bartlett's K-squared = 11.104, df = 1, p-value = 0.0008615
var.test(tdf_small$Distance ~ tdf_small$Winner_Country)
##
## F test to compare two variances
##
## data: tdf_small$Distance by tdf_small$Winner_Country
## F = 0.23814, num df = 17, denom df = 690, p-value = 0.001182
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1326281 0.5380715
## sample estimates:
## ratio of variances
## 0.2381388
Comparing the medians of the numeric variable Distance between the two winner countries, FRA and DEN:
wilcox.test(tdf_small$Distance ~ tdf_small$Winner_Country)
##
## Wilcoxon rank sum test with continuity correction
##
## data: tdf_small$Distance by tdf_small$Winner_Country
## W = 5467, p-value = 0.381
## alternative hypothesis: true location shift is not equal to 0
Exercise: In the dataset Soils, does the pH significantly differ between samples on depressions and on slopes (variable Contour)? Visually confirm with the appropriate plot.
##
## Shapiro-Wilk normality test
##
## data: Soils$pH[Soils$Contour == "Depression"]
## W = 0.96308, p-value = 0.7179
##
## Shapiro-Wilk normality test
##
## data: Soils$pH[Soils$Contour == "Slope"]
## W = 0.89835, p-value = 0.07564
##
## Bartlett test of homogeneity of variances
##
## data: pH by Contour
## Bartlett's K-squared = 2.8034, df = 1, p-value = 0.09407
##
## Two Sample t-test
##
## data: pH by Contour
## t = -0.21586, df = 30, p-value = 0.8306
## alternative hypothesis: true difference in means between group Depression and group Slope is not equal to 0
## 95 percent confidence interval:
## -0.5688226 0.4600726
## sample estimates:
## mean in group Depression mean in group Slope
## 4.691875 4.746250
ANOVA is based on the previous fitting of a linear model. For instance, if we want to check whether the means of Distance differ between stage Types (with a restricted dataset), we have to run (under normality and equal variance assumptions):
selected_stages <- (tdf_data$Type %in%
c("Individual time trial", "Plain stage",
"Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
anova(lm(tdf_small$Distance ~ tdf_small$Type))
## Analysis of Variance Table
##
## Response: tdf_small$Distance
## Df Sum Sq Mean Sq F value Pr(>F)
## tdf_small$Type 2 6246184 3123092 580.9 < 2.2e-16 ***
## Residuals 1785 9596640 5376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The nonparametric Kruskal-Wallis test is performed with:
kruskal.test(tdf_small$Distance ~ tdf_small$Type)
##
## Kruskal-Wallis rank sum test
##
## data: tdf_small$Distance by tdf_small$Type
## Kruskal-Wallis chi-squared = 532.8, df = 2, p-value < 2.2e-16
Exercise: In the dataset Soils, does the pH significantly differ between the different contours? Visually confirm with the appropriate plot.
##
## Shapiro-Wilk normality test
##
## data: Soils$pH[Soils$Contour == "Top"]
## W = 0.92004, p-value = 0.1689
##
## Bartlett test of homogeneity of variances
##
## data: pH by Contour
## Bartlett's K-squared = 3.1948, df = 2, p-value = 0.2024
## Analysis of Variance Table
##
## Response: pH
## Df Sum Sq Mean Sq F value Pr(>F)
## Contour 2 0.2607 0.13033 0.2799 0.7572
## Residuals 45 20.9546 0.46566
The sleep data show the effect (extra) of two soporific drugs (group) on 10 patients (ID):
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
tail(sleep)
## extra group ID
## 15 -0.1 2 5
## 16 4.4 2 6
## 17 5.5 2 7
## 18 1.6 2 8
## 19 4.6 2 9
## 20 3.4 2 10
dim(sleep)
## [1] 20 3
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group") # one row per patient, one "extra" column per drug
dim(sleep2)
## [1] 10 3
head(sleep2)
## ID extra.1 extra.2
## 1 1 0.7 1.9
## 2 2 -1.6 0.8
## 3 3 -0.2 1.1
## 4 4 -1.2 0.1
## 5 5 -0.1 -0.1
## 6 6 3.4 4.4
Paired comparison tests are performed with (under normality assumption):
with(sleep2, t.test(extra.1, extra.2, paired = TRUE))
##
## Paired t-test
##
## data: extra.1 and extra.2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean difference
## -1.58
and with a nonparametric test:
with(sleep2, wilcox.test(extra.1, extra.2, paired = TRUE))
## Warning in wilcox.test.default(extra.1, extra.2, paired = TRUE): cannot compute
## exact p-value with ties
## Warning in wilcox.test.default(extra.1, extra.2, paired = TRUE): cannot compute
## exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: extra.1 and extra.2
## V = 0, p-value = 0.009091
## alternative hypothesis: true location shift is not equal to 0
If there is a large number of ties, this \(p\)-value should not be used; it can be trusted when the number of ties is small.
boxplot(extra ~ group, data = sleep)
We can check that these tests are equivalent to tests on the difference:
sleep2$diff <- sleep2$extra.2 - sleep2$extra.1
head(sleep2)
## ID extra.1 extra.2 diff
## 1 1 0.7 1.9 1.2
## 2 2 -1.6 0.8 2.4
## 3 3 -0.2 1.1 1.3
## 4 4 -1.2 0.1 1.3
## 5 5 -0.1 -0.1 0.0
## 6 6 3.4 4.4 1.0
t.test(x = sleep2$diff)
##
## One Sample t-test
##
## data: sleep2$diff
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.7001142 2.4598858
## sample estimates:
## mean of x
## 1.58
wilcox.test(x = sleep2$diff)
## Warning in wilcox.test.default(x = sleep2$diff): cannot compute exact p-value
## with ties
## Warning in wilcox.test.default(x = sleep2$diff): cannot compute exact p-value
## with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: sleep2$diff
## V = 45, p-value = 0.009091
## alternative hypothesis: true location is not equal to 0
A simple linear regression of Distance on Year is fitted with lm:
res <- lm(tdf_data$Distance ~ tdf_data$Year)
res
##
## Call:
## lm(formula = tdf_data$Distance ~ tdf_data$Year)
##
## Coefficients:
## (Intercept) tdf_data$Year
## 2970.493 -1.411
summary(res)
##
## Call:
## lm(formula = tdf_data$Distance ~ tdf_data$Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -209.37 -43.86 16.05 53.42 225.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2970.49317 107.05746 27.75 <2e-16 ***
## tdf_data$Year -1.41080 0.05445 -25.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.09 on 2234 degrees of freedom
## Multiple R-squared: 0.2311, Adjusted R-squared: 0.2307
## F-statistic: 671.4 on 1 and 2234 DF, p-value: < 2.2e-16
coefficients(res)
## (Intercept) tdf_data$Year
## 2970.493172 -1.410797
confint(res)
## 2.5 % 97.5 %
## (Intercept) 2760.550672 3180.435673
## tdf_data$Year -1.517567 -1.304026
plot(res)
# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
# For details on validations plots
tdf_small <- tdf_data[, c("Distance", "Year")]
head(tdf_small)
## Distance Year
## 1 14.0 2017
## 2 203.5 2017
## 3 212.5 2017
## 4 207.5 2017
## 5 160.5 2017
## 6 216.0 2017
res <- lm(Distance ~ ., data = tdf_small)
summary(res)
##
## Call:
## lm(formula = Distance ~ ., data = tdf_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -209.37 -43.86 16.05 53.42 225.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2970.49317 107.05746 27.75 <2e-16 ***
## Year -1.41080 0.05445 -25.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.09 on 2234 degrees of freedom
## Multiple R-squared: 0.2311, Adjusted R-squared: 0.2307
## F-statistic: 671.4 on 1 and 2234 DF, p-value: < 2.2e-16
plot(res)
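Since this second fit uses the data argument, predictions for new values are straightforward (an illustration; the year 2020 is an arbitrary choice):
predict(res, newdata = data.frame(Year = 2020), interval = "confidence")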
selected_stages <- (tdf_data$Type %in%
c("Individual time trial", "Plain stage",
"Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
head(tdf_small)
## Type Distance
## 1 Individual time trial 14.0
## 2 Individual time trial 22.5
## 3 Individual time trial 37.5
## 4 Individual time trial 13.8
## 5 Individual time trial 54.0
## 6 Individual time trial 33.0
res <- lm(Distance ~ Type, data = tdf_small)
summary(res)
##
## Call:
## lm(formula = Distance ~ Type, data = tdf_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -201.30 -39.27 -10.44 25.73 255.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.111 5.121 7.637 3.59e-14 ***
## TypePlain stage 187.160 5.597 33.437 < 2e-16 ***
## TypeStage with mountain(s) 181.789 6.031 30.144 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.32 on 1785 degrees of freedom
## Multiple R-squared: 0.3943, Adjusted R-squared: 0.3936
## F-statistic: 580.9 on 2 and 1785 DF, p-value: < 2.2e-16
anova(res)
## Analysis of Variance Table
##
## Response: Distance
## Df Sum Sq Mean Sq F value Pr(>F)
## Type 2 6246184 3123092 580.9 < 2.2e-16 ***
## Residuals 1785 9596640 5376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Exercise: Explain the pH of Soils by an additive effect of Contour and Ca, and by an additive effect with interaction between Contour and Depth.
##
## Call:
## lm(formula = pH ~ Contour + Ca, data = Soils)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55729 -0.25762 -0.08993 0.16989 1.10483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41955 0.14900 22.951 < 2e-16 ***
## ContourSlope -0.12625 0.12918 -0.977 0.33377
## ContourTop -0.44012 0.13146 -3.348 0.00168 **
## Ca 0.17917 0.01666 10.754 6.76e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3623 on 44 degrees of freedom
## Multiple R-squared: 0.7278, Adjusted R-squared: 0.7092
## F-statistic: 39.21 on 3 and 44 DF, p-value: 1.711e-12
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = pH ~ Contour + Depth, data = Soils)
##
## $Contour
## diff lwr upr p adj
## Slope-Depression 0.054375 -0.2701619 0.3789119 0.9129085
## Top-Depression -0.121875 -0.4464119 0.2026619 0.6356018
## Top-Slope -0.176250 -0.5007869 0.1482869 0.3925120
##
## $Depth
## diff lwr upr p adj
## 10-30-0-10 -0.3933333 -0.8059384 0.01927173 0.0666195
## 30-60-0-10 -1.1191667 -1.5317717 -0.70656160 0.0000000
## 60-90-0-10 -1.4000000 -1.8126051 -0.98739493 0.0000000
## 30-60-10-30 -0.7258333 -1.1384384 -0.31322827 0.0001569
## 60-90-10-30 -1.0066667 -1.4192717 -0.59406160 0.0000004
## 60-90-30-60 -0.2808333 -0.6934384 0.13177173 0.2781588
##
## Call:
## lm(formula = pH ~ Contour + Depth + Contour:Depth, data = Soils)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7125 -0.1775 -0.0125 0.1681 1.3875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3525 0.1951 27.438 < 2e-16 ***
## ContourSlope 0.1550 0.2759 0.562 0.577703
## ContourTop -0.0200 0.2759 -0.072 0.942608
## Depth10-30 -0.4725 0.2759 -1.713 0.095366 .
## Depth30-60 -0.9900 0.2759 -3.589 0.000982 ***
## Depth60-90 -1.1800 0.2759 -4.277 0.000133 ***
## ContourSlope:Depth10-30 0.2475 0.3901 0.634 0.529848
## ContourTop:Depth10-30 -0.0100 0.3901 -0.026 0.979693
## ContourSlope:Depth30-60 -0.2500 0.3901 -0.641 0.525723
## ContourTop:Depth30-60 -0.1375 0.3901 -0.352 0.726571
## ContourSlope:Depth60-90 -0.4000 0.3901 -1.025 0.312085
## ContourTop:Depth60-90 -0.2600 0.3901 -0.666 0.509396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3901 on 36 degrees of freedom
## Multiple R-squared: 0.7417, Adjusted R-squared: 0.6628
## F-statistic: 9.398 on 11 and 36 DF, p-value: 1.188e-07
## Analysis of Variance Table
##
## Response: pH
## Df Sum Sq Mean Sq F value Pr(>F)
## Contour 2 0.2607 0.1303 0.8562 0.4332
## Depth 3 14.9590 4.9863 32.7582 2.162e-10 ***
## Contour:Depth 6 0.5159 0.0860 0.5648 0.7553
## Residuals 36 5.4798 0.1522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A logistic regression (here, of the stage type on the distance) is fitted with glm; since Type has three levels in tdf_small, the binomial family contrasts the first level against the other two:
res <- glm(Type ~ Distance, data = tdf_small, family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(res)
##
## Call:
## glm(formula = Type ~ Distance, family = binomial(link = "logit"),
## data = tdf_small)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1960 0.0001 0.0041 0.0175 2.9673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.94801 0.57108 -10.41 <2e-16 ***
## Distance 0.07949 0.00750 10.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1273.54 on 1787 degrees of freedom
## Residual deviance: 201.89 on 1786 degrees of freedom
## AIC: 205.89
##
## Number of Fisher Scoring iterations: 10
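Fitted probabilities for new distances can be obtained with predict (an illustration; 50 km is an arbitrary value):
predict(res, newdata = data.frame(Distance = 50), type = "response") # P(Type is not the first level)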
If, for all numeric variables in Soils, we test the differences of the means between the levels of Group, we obtain 9 \(p\)-values on which a multiple testing correction can be performed:
all_pvals <- apply(Soils[ ,6:14], 2, function(avar)
anova(lm(avar ~ Soils$Group))[1, "Pr(>F)"])
all_pvals
## pH N Dens P Ca Mg
## 1.187636e-07 5.331419e-10 3.456879e-08 1.861579e-09 1.333073e-10 1.579912e-03
## K Na Conduc
## 2.456518e-06 1.369506e-15 3.425749e-17
p.adjust(all_pvals, method = "BH")
## pH N Dens P Ca Mg
## 1.526961e-07 1.199569e-09 5.185319e-08 3.350841e-09 3.999220e-10 1.579912e-03
## K Na Conduc
## 2.763582e-06 6.162778e-15 3.083174e-16
p.adjust(all_pvals, method = "bonferroni")
## pH N Dens P Ca Mg
## 1.068873e-06 4.798277e-09 3.111191e-07 1.675421e-08 1.199766e-09 1.421921e-02
## K Na Conduc
## 2.210866e-05 1.232556e-14 3.083174e-16
data("USArrests")
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
A principal component analysis of the USArrests data is performed with FactoMineR and visualized with factoextra:
pca_usa <- PCA(USArrests, graph = FALSE)
fviz_eig(pca_usa) # scree plot
fviz_pca_var(pca_usa) # variables (correlation circle)
fviz_pca_ind(pca_usa) # individuals
fviz_pca_ind(pca_usa, col.ind = USArrests$UrbanPop) # colored by urban population
Exercise: Perform a PCA of the dataset athle_records.csv after log-transformation of the data.
scaled_usa <- scale(USArrests) # center and scale the variables
dist_usa <- dist(scaled_usa)^2 # squared Euclidean distances
hc_usa <- hclust(dist_usa, method = "ward.D") # Ward hierarchical clustering
plot(hc_usa)
within_ss <- cumsum(hc_usa$height) # within-cluster inertia after each merge
plot(49:1, within_ss, type = "b") # against the number of clusters
1 - within_ss[46] / within_ss[49] # reproduced inertia with 4 clusters (46 = 50 - 4 merges)
## [1] 0.704374
plot(hc_usa)
rect.hclust(hc_usa, k = 4)
clust_hc <- cutree(hc_usa, k = 4)
clust_hc
## Alabama Alaska Arizona Arkansas California
## 1 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
fviz_pca_ind(pca_usa, col.ind = factor(clust_hc))
init_center <- by(scaled_usa, clust_hc, colMeans) # centroids of the hierarchical clusters
init_center <- Reduce("rbind", init_center) # stacked into a matrix to initialize k-means
clust_kmeans <- kmeans(scaled_usa, centers = init_center)
clust_kmeans
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 0.6950701 1.0394414 0.7226370 1.27693964
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 2 2 1 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 4 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 19.922437 16.212213 11.952463
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_pca_ind(pca_usa, col.ind = factor(clust_kmeans$cluster))
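The agreement between the hierarchical and k-means partitions can be checked with a quick cross-tabulation (an aside):
table(hierarchical = clust_hc, kmeans = clust_kmeans$cluster)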
Exercise: Perform a clustering of the dataset athle_records.csv after log-transformation and scaling of the data.
With hierarchical clustering:
The percentage of explained inertia for 5 clusters is:
## [1] 26
## [1] 0.6657771
With \(k\)-means:
## K-means clustering with 5 clusters of sizes 15, 7, 2, 1, 1
##
## Cluster means:
## X100m X200m X400m X800m X1500m X5000m X10000m
## 1 -0.4023700 -0.3964851 -0.3822649 -0.53348564 -0.4776683 -0.3296392 -0.2340087
## 2 0.6096652 0.5957451 0.5851421 0.98463765 0.9034216 0.6761097 0.4313484
## 3 1.5416791 1.0955510 0.1404281 -0.03487566 -1.1948203 -1.8237604 -1.9512698
## 4 1.0968543 1.8598697 1.9591817 0.43194150 1.5619468 2.5184550 2.8690683
## 5 -2.4123191 -2.2739102 -0.6020593 0.74763079 1.6687663 1.3408854 1.5241629
## SemiMarathon Marathon
## 1 -0.3108179 -0.3790193
## 2 0.3391017 0.3475830
## 3 -1.2471064 -0.9938149
## 4 3.5902911 3.7784209
## 5 1.1924792 1.4614181
##
## Clustering vector:
## Australie Belgique Brésil RoyaumeUni Canada
## 1 1 1 1 1
## Chine Croatie Ethiopie France Allemagne
## 2 2 3 1 1
## Inde Iran Italie Jamaïque Japon
## 2 4 1 5 2
## Kenya Lituanie NouvelleZélande Portugal Russie
## 3 2 2 1 1
## AfriqueduSud Espagne Suède Suisse Ukraine
## 1 1 2 1 1
## USA
## 1
##
## Within cluster sum of squares by cluster:
## [1] 40.471672 23.958026 9.062617 0.000000 0.000000
## (between_SS / total_SS = 67.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
This file has been compiled with the following system configuration:
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] factoextra_1.0.7 ggplot2_3.3.6 FactoMineR_2.7 scales_1.2.0
## [5] car_3.0-13 carData_3.0-5 RColorBrewer_1.1-3 lubridate_1.8.0
##
## loaded via a namespace (and not attached):
## [1] ggrepel_0.9.1 Rcpp_1.0.8.3 mvtnorm_1.1-3
## [4] lattice_0.20-45 tidyr_1.2.0 multcompView_0.1-8
## [7] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
## [10] R6_2.5.1 backports_1.4.1 evaluate_0.15
## [13] coda_0.19-4 highr_0.9 pillar_1.7.0
## [16] rlang_1.0.2 rstudioapi_0.13 jquerylib_0.1.4
## [19] DT_0.27 rmarkdown_2.14 labeling_0.4.2
## [22] stringr_1.4.0 htmlwidgets_1.5.4 munsell_0.5.0
## [25] broom_0.8.0 compiler_4.2.2 xfun_0.30
## [28] pkgconfig_2.0.3 htmltools_0.5.2 flashClust_1.01-2
## [31] tidyselect_1.1.2 tibble_3.1.7 fansi_1.0.3
## [34] ggpubr_0.4.0 crayon_1.5.1 dplyr_1.0.9
## [37] withr_2.5.0 MASS_7.3-58.2 leaps_3.1
## [40] grid_4.2.2 jsonlite_1.8.0 xtable_1.8-4
## [43] gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.3
## [46] magrittr_2.0.3 estimability_1.4.1 cli_3.3.0
## [49] stringi_1.7.8 farver_2.1.0 ggsignif_0.6.3
## [52] scatterplot3d_0.3-42 bslib_0.3.1 ellipsis_0.3.2
## [55] vctrs_0.4.1 generics_0.1.2 tools_4.2.2
## [58] glue_1.6.2 purrr_0.3.4 emmeans_1.8.5
## [61] abind_1.4-5 fastmap_1.1.0 yaml_2.3.5
## [64] colorspace_2.0-3 cluster_2.1.4 rstatix_0.7.0
## [67] knitr_1.39 sass_0.4.1