library("lubridate")
library("RColorBrewer")

library("car")
library("scales")

library("FactoMineR")
library("factoextra")

Introduction

The current dataset comes from Kaggle https://www.kaggle.com/ralle360/historic-tour-de-france-dataset and provides information about the Tour de France race, with 8 variables describing 2,236 stages of the race. The data were extracted from Wikipedia and prepared by RasmusFiskerBang; they are made available through a CSV file that you can download on my website (the dataset is licensed under the CC0 license - Public Domain).

Data importation

Even though CSV files can be opened with Excel, we strongly discourage this practice and we will use R directly for this task:

tdf_data <- read.table("stages_TDF.csv", sep = ",", header = TRUE,
                       stringsAsFactors = FALSE, encoding = "UTF-8",
                       quote = "\"")

where:

  • sep is used to provide the character separating columns;

  • header = TRUE indicates that column names are included in the file (in the first row);

  • stringsAsFactors = FALSE indicates that strings must not be converted to the type factor (this is the default behavior since R 4.0.0);

  • quote = "\"" indicates which character(s) must be treated as quotation characters and not as part of the data.

Other information and options are available in the help page ?read.table or in ?read.csv, ?read.csv2, ?read.delim, ?read.delim2.
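For instance, read.csv is a convenience wrapper around read.table with sep = ",", header = TRUE and quote = "\"" already set, so the import above could also be written as:

tdf_data <- read.csv("stages_TDF.csv", encoding = "UTF-8")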

We can take a first look at the data with:

head(tdf_data)
##   Stage       Date Distance            Origin                  Destination
## 1     1 2017-07-01     14.0        Düsseldorf                   Düsseldorf
## 2     2 2017-07-02    203.5        Düsseldorf                        Liège
## 3     3 2017-07-03    212.5          Verviers                       Longwy
## 4     4 2017-07-04    207.5 Mondorf-les-Bains                       Vittel
## 5     5 2017-07-05    160.5            Vittel La Planche des Belles Filles
## 6     6 2017-07-06    216.0            Vesoul                       Troyes
##                    Type         Winner Winner_Country
## 1 Individual time trial Geraint Thomas            GBR
## 2            Flat stage  Marcel Kittel            GER
## 3 Medium mountain stage    Peter Sagan            SVK
## 4            Flat stage  Arnaud Démare            FRA
## 5 Medium mountain stage      Fabio Aru            ITA
## 6            Flat stage  Marcel Kittel            GER
summary(tdf_data)
##     Stage               Date              Distance        Origin         
##  Length:2236        Length:2236        Min.   :  1.0   Length:2236       
##  Class :character   Class :character   1st Qu.:156.0   Class :character  
##  Mode  :character   Mode  :character   Median :199.0   Mode  :character  
##                                        Mean   :196.8                     
##                                        3rd Qu.:236.0                     
##                                        Max.   :482.0                     
##  Destination            Type              Winner          Winner_Country    
##  Length:2236        Length:2236        Length:2236        Length:2236       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

If missing values (identified with NA) were present, this last command would also report their number in each column. The dataset dimension is obtained with:

dim(tdf_data)
## [1] 2236    8
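
A quick way to count the missing values (NA) in each column is, for instance:

colSums(is.na(tdf_data))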

Information on column types can be obtained with:

sapply(tdf_data, class)
##          Stage           Date       Distance         Origin    Destination 
##    "character"    "character"      "numeric"    "character"    "character" 
##           Type         Winner Winner_Country 
##    "character"    "character"    "character"

which indicates that all columns are character except for the third one, which is numeric. Sometimes, numeric variables (and more specifically integers) are used to code for categorical variables but this is not the case in this dataset.

To perform the subsequent analyses properly, it is best to precisely define the types of the different columns:

  • Stage: is a character variable but it corresponds to the number (or the symbol) of the stage within each year. It is better recoded as a factor, which is the proper type in R for categorical variables with a finite number of possible values
tdf_data$Stage <- factor(tdf_data$Stage)
  • Date: is a date and R has a dedicated type for dates, which is better used here (for instance, to easily extract the year or the month of the date); the year is then stored as a numeric variable
tdf_data$Date <- as.Date(tdf_data$Date, format = "%Y-%m-%d")
tdf_data$Year <- year(tdf_data$Date)
  • Distance: is indeed a numeric variable;

  • Origin and Destination: are categorical variables but the number of possible values (towns of origin and destination of the stages) is so large that there is only a minor benefit in converting them to factors

  • Type: is a categorical variable that is better converted to a factor

tdf_data$Type <- factor(tdf_data$Type)
  • Winner: is the winner's name, so the number of possible values is again so large that there is only a minor benefit in converting it to a factor

  • Winner_Country: is the country code of the winner and is better converted to a factor:

tdf_data$Winner_Country <- factor(tdf_data$Winner_Country)

After these steps, the summary of the dataset is already more informative:

summary(tdf_data)
##      Stage           Date               Distance        Origin         
##  4      : 101   Min.   :1903-07-01   Min.   :  1.0   Length:2236       
##  11     : 100   1st Qu.:1938-07-17   1st Qu.:156.0   Class :character  
##  2      : 100   Median :1969-07-02   Median :199.0   Mode  :character  
##  9      : 100   Mean   :1966-08-02   Mean   :196.8                     
##  10     :  99   3rd Qu.:1991-07-20   3rd Qu.:236.0                     
##  6      :  99   Max.   :2017-07-23   Max.   :482.0                     
##  (Other):1637                                                          
##  Destination                            Type         Winner         
##  Length:2236        Plain stage           :1053   Length:2236       
##  Class :character   Stage with mountain(s): 530   Class :character  
##  Mode  :character   Individual time trial : 205   Mode  :character  
##                     Flat stage            : 110                     
##                     Team time trial       :  87                     
##                     Hilly stage           :  76                     
##                     (Other)               : 175                     
##  Winner_Country      Year     
##  FRA    :691    Min.   :1903  
##  BEL    :460    1st Qu.:1938  
##  ITA    :262    Median :1969  
##  NED    :157    Mean   :1966  
##  ESP    :125    3rd Qu.:1991  
##  GER    : 71    Max.   :2017  
##  (Other):470

Univariate statistics

Numerical characteristics

The variable Distance is the distance of the stage (in km):

mean(tdf_data$Distance) # mean
## [1] 196.783
median(tdf_data$Distance) # median
## [1] 199
min(tdf_data$Distance) # minimum
## [1] 1
max(tdf_data$Distance) # maximum
## [1] 482
# quartiles and min/max
quantile(tdf_data$Distance, probs = c(0, 0.25, 0.5, 0.75, 1))
##   0%  25%  50%  75% 100% 
##    1  156  199  236  482

The option na.rm = TRUE must be used when you have missing values that you don’t want to be taken into account for the computation (otherwise, most of these functions would return the value NA).
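
For instance (a toy example):

mean(c(1, NA, 3))
## [1] NA
mean(c(1, NA, 3), na.rm = TRUE)
## [1] 2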

Exercise: How to interpret these values? More precisely, what do they say about the variable distribution?

Some of these values are also available with

summary(tdf_data$Distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   156.0   199.0   196.8   236.0   482.0

Dispersion characteristics are obtained with:

var(tdf_data$Distance) # variance
## [1] 8131.78
sd(tdf_data$Distance) # standard deviation
## [1] 90.17639
range(tdf_data$Distance) # range
## [1]   1 482
diff(range(tdf_data$Distance))
## [1] 481

Exercise: How would you compute the inter-quartile range (in just one line of code)?

## 75% 
##  80
## [1] 80
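
One possible solution, matching the two outputs above, uses the quartiles directly or the built-in function IQR:

diff(quantile(tdf_data$Distance, probs = c(0.25, 0.75)))
IQR(tdf_data$Distance)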

Exercise: What is the coefficient of variation (CV) for this variable?

## [1] 0.4582529
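
The CV is the ratio of the standard deviation to the mean, so one possible one-liner is:

sd(tdf_data$Distance) / mean(tdf_data$Distance)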

Standard modifications of data include:

  • discretization (binning):
tdf_data$cut1 <- cut(tdf_data$Distance, breaks = 5)
table(tdf_data$cut1)
## 
## (0.519,97.2]   (97.2,193]    (193,290]    (290,386]    (386,482] 
##          320          676          950          211           79
tdf_data$cut2 <- cut(tdf_data$Distance, breaks = 5, labels = FALSE)
table(tdf_data$cut2)
## 
##   1   2   3   4   5 
## 320 676 950 211  79
tdf_data$cut3 <- cut(tdf_data$Distance, breaks = seq(0, 500, by = 100))
table(tdf_data$cut3)
## 
##   (0,100] (100,200] (200,300] (300,400] (400,500] 
##       324       820       825       210        57

Exercise: What is the mode of cut3?

## [1] "(200,300]"
  • centering and scaling
tdf_data$DScaled <- as.vector(scale(tdf_data$Distance))
summary(tdf_data$DScaled)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.17111 -0.45226  0.02458  0.00000  0.43489  3.16288
var(tdf_data$DScaled)
## [1] 1
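
scale() simply subtracts the mean and divides by the standard deviation, which can be checked by hand:

all.equal(tdf_data$DScaled,
          (tdf_data$Distance - mean(tdf_data$Distance)) / sd(tdf_data$Distance))
## [1] TRUE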

Charts

Before we start, a short note on color palettes:

display.brewer.all()

  • For a factor (for instance, Type):
pie(table(tdf_data$Type), col = c(brewer.pal(9, "Set1"), brewer.pal(9, "Set2"))) # 18 levels, so 18 colors are needed
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

barplot(table(tdf_data$Winner_Country), col = "darkgreen", las = 3,
        cex.names = 0.6, cex.axis=0.6)

  • For numeric variables
hist(tdf_data$Distance)

hist(tdf_data$Distance, main = "Distribution of stage distances", 
     xlab = "distance (km)", breaks = 20)

plot(density(tdf_data$Distance))

hist(tdf_data$Distance, main = "Distribution of stage distances", 
     xlab = "distance (km)", breaks = 20, freq = FALSE)
lines(density(tdf_data$Distance), col = "red", lwd = 2)

boxplot(tdf_data$Distance)

boxplot(tdf_data$Distance, horizontal = TRUE, notch = TRUE)

Bivariate statistics

Factor versus factor

Contingency tables are obtained with the function table (here on a subset of the winner countries and of the stage type).

selected_countries <- names(which(table(tdf_data$Winner_Country) > 10)) 
selected_stages <- tdf_data$Winner_Country %in% selected_countries
cont_table <- table(tdf_data$Type[selected_stages], 
                    tdf_data$Winner_Country[selected_stages, drop = TRUE])
cont_table
##                                
##                                     AUS BEL COL DEN ESP FRA FRG GBR GER ITA LUX
##   Flat cobblestone stage          0   0   0   0   0   0   0   0   0   0   0   0
##   Flat stage                      0   5   2   0   1   5  10   0  30  28   9   0
##   Flat Stage                      0   0   4   0   0   0   2   0   0   0   2   0
##   Half Stage                      0   0   2   0   0   1   0   0   0   0   0   0
##   High mountain stage             0   1   2   3   0   5  11   0   4   1   4   1
##   Hilly stage                     0   1  12   3   4   6  20   0   1   2   5   0
##   Individual time trial           0   3  37   1   0  19  58   4  10   9   8   4
##   Intermediate stage              0   0   1   0   0   0   0   0   0   0   2   0
##   Medium mountain stage           0   4   3   0   0   3   5   0   4   4   3   0
##   Mountain stage                  0   1   0   1   3   9   9   0   0   2   3   4
##   Mountain Stage                  0   0   2   0   0   2   2   0   0   0   1   1
##   Mountain time trial             0   0   1   0   0   3   1   0   1   0   2   1
##   Plain stage                     1  14 266   0   7  17 369   6  11  21 141  26
##   Plain stage with cobblestones   0   0   0   0   0   0   0   0   0   0   0   0
##   Stage with mountain             0   0   5   0   0   0   5   0   0   0   1   0
##   Stage with mountain(s)          3   0 105   8   3  53 184   4   6   3  81  29
##   Team time trial                48   0  18   0   0   0  15   0   0   0   0   4
##   Transition stage                0   0   0   0   0   2   0   0   0   1   0   0
##                                
##                                 NED NOR POR SUI USA
##   Flat cobblestone stage          1   1   0   0   0
##   Flat stage                      2   7   0   0   1
##   Flat Stage                      1   0   0   0   0
##   Half Stage                      2   0   0   0   0
##   High mountain stage             1   2   2   0   0
##   Hilly stage                    11   1   0   2   0
##   Individual time trial          13   1   1  15  16
##   Intermediate stage              0   0   0   0   0
##   Medium mountain stage           2   1   2   0   0
##   Mountain stage                  0   0   0   0   4
##   Mountain Stage                  3   0   0   0   0
##   Mountain time trial             2   0   0   1   0
##   Plain stage                   100   2   5  26   5
##   Plain stage with cobblestones   1   0   0   0   0
##   Stage with mountain             0   0   0   0   0
##   Stage with mountain(s)         18   1   2  13  10
##   Team time trial                 0   0   0   0   2
##   Transition stage                0   1   0   0   0
cont_table / rowSums(cont_table) # row profiles: each row of the table is divided by its row sum (each row of the result sums to 1)

Exercise: How can the numbers in this table of “row profiles” be interpreted? For instance, what does 0.4545454545, in the column of France, mean? Compute the column profiles.

##      
##       Flat cobblestone stage  Flat stage  Flat Stage  Half Stage
##                  0.000000000 0.000000000 0.000000000 0.000000000
##   AUS            0.000000000 0.172413793 0.000000000 0.000000000
##   BEL            0.000000000 0.004347826 0.008695652 0.004347826
##   COL            0.000000000 0.000000000 0.000000000 0.000000000
##   DEN            0.000000000 0.055555556 0.000000000 0.000000000
##   ESP            0.000000000 0.040000000 0.000000000 0.008000000
##   FRA            0.000000000 0.014471780 0.002894356 0.000000000
##   FRG            0.000000000 0.000000000 0.000000000 0.000000000
##   GBR            0.000000000 0.447761194 0.000000000 0.000000000
##   GER            0.000000000 0.394366197 0.000000000 0.000000000
##   ITA            0.000000000 0.034351145 0.007633588 0.000000000
##   LUX            0.000000000 0.000000000 0.000000000 0.000000000
##   NED            0.006369427 0.012738854 0.006369427 0.012738854
##   NOR            0.058823529 0.411764706 0.000000000 0.000000000
##   POR            0.000000000 0.000000000 0.000000000 0.000000000
##   SUI            0.000000000 0.000000000 0.000000000 0.000000000
##   USA            0.000000000 0.026315789 0.000000000 0.000000000
##      
##       High mountain stage Hilly stage Individual time trial Intermediate stage
##               0.000000000 0.000000000           0.000000000        0.000000000
##   AUS         0.034482759 0.034482759           0.103448276        0.000000000
##   BEL         0.004347826 0.026086957           0.080434783        0.002173913
##   COL         0.187500000 0.187500000           0.062500000        0.000000000
##   DEN         0.000000000 0.222222222           0.000000000        0.000000000
##   ESP         0.040000000 0.048000000           0.152000000        0.000000000
##   FRA         0.015918958 0.028943560           0.083936324        0.000000000
##   FRG         0.000000000 0.000000000           0.285714286        0.000000000
##   GBR         0.059701493 0.014925373           0.149253731        0.000000000
##   GER         0.014084507 0.028169014           0.126760563        0.000000000
##   ITA         0.015267176 0.019083969           0.030534351        0.007633588
##   LUX         0.014285714 0.000000000           0.057142857        0.000000000
##   NED         0.006369427 0.070063694           0.082802548        0.000000000
##   NOR         0.117647059 0.058823529           0.058823529        0.000000000
##   POR         0.166666667 0.000000000           0.083333333        0.000000000
##   SUI         0.000000000 0.035087719           0.263157895        0.000000000
##   USA         0.000000000 0.000000000           0.421052632        0.000000000
##      
##       Medium mountain stage Mountain stage Mountain Stage Mountain time trial
##                 0.000000000    0.000000000    0.000000000         0.000000000
##   AUS           0.137931034    0.034482759    0.000000000         0.000000000
##   BEL           0.006521739    0.000000000    0.004347826         0.002173913
##   COL           0.000000000    0.062500000    0.000000000         0.000000000
##   DEN           0.000000000    0.166666667    0.000000000         0.000000000
##   ESP           0.024000000    0.072000000    0.016000000         0.024000000
##   FRA           0.007235890    0.013024602    0.002894356         0.001447178
##   FRG           0.000000000    0.000000000    0.000000000         0.000000000
##   GBR           0.059701493    0.000000000    0.000000000         0.014925373
##   GER           0.056338028    0.028169014    0.000000000         0.000000000
##   ITA           0.011450382    0.011450382    0.003816794         0.007633588
##   LUX           0.000000000    0.057142857    0.014285714         0.014285714
##   NED           0.012738854    0.000000000    0.019108280         0.012738854
##   NOR           0.058823529    0.000000000    0.000000000         0.000000000
##   POR           0.166666667    0.000000000    0.000000000         0.000000000
##   SUI           0.000000000    0.000000000    0.000000000         0.017543860
##   USA           0.000000000    0.105263158    0.000000000         0.000000000
##      
##       Plain stage Plain stage with cobblestones Stage with mountain
##       0.019230769                   0.000000000         0.000000000
##   AUS 0.482758621                   0.000000000         0.000000000
##   BEL 0.578260870                   0.000000000         0.010869565
##   COL 0.000000000                   0.000000000         0.000000000
##   DEN 0.388888889                   0.000000000         0.000000000
##   ESP 0.136000000                   0.000000000         0.000000000
##   FRA 0.534008683                   0.000000000         0.007235890
##   FRG 0.428571429                   0.000000000         0.000000000
##   GBR 0.164179104                   0.000000000         0.000000000
##   GER 0.295774648                   0.000000000         0.000000000
##   ITA 0.538167939                   0.000000000         0.003816794
##   LUX 0.371428571                   0.000000000         0.000000000
##   NED 0.636942675                   0.006369427         0.000000000
##   NOR 0.117647059                   0.000000000         0.000000000
##   POR 0.416666667                   0.000000000         0.000000000
##   SUI 0.456140351                   0.000000000         0.000000000
##   USA 0.131578947                   0.000000000         0.000000000
##      
##       Stage with mountain(s) Team time trial Transition stage
##                  0.057692308     0.923076923      0.000000000
##   AUS            0.000000000     0.000000000      0.000000000
##   BEL            0.228260870     0.039130435      0.000000000
##   COL            0.500000000     0.000000000      0.000000000
##   DEN            0.166666667     0.000000000      0.000000000
##   ESP            0.424000000     0.000000000      0.016000000
##   FRA            0.266280753     0.021707670      0.000000000
##   FRG            0.285714286     0.000000000      0.000000000
##   GBR            0.089552239     0.000000000      0.000000000
##   GER            0.042253521     0.000000000      0.014084507
##   ITA            0.309160305     0.000000000      0.000000000
##   LUX            0.414285714     0.057142857      0.000000000
##   NED            0.114649682     0.000000000      0.000000000
##   NOR            0.058823529     0.000000000      0.058823529
##   POR            0.166666667     0.000000000      0.000000000
##   SUI            0.228070175     0.000000000      0.000000000
##   USA            0.263157895     0.052631579      0.000000000
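One possible way to obtain the column profiles shown above (each country's distribution over the stage types):

t(cont_table) / colSums(cont_table)

The strength of the association between the two factors can then be summarized with Cramér's V: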
chisq <- chisq.test(cont_table)$statistic # chi-squared statistic of the table
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
ddl <- min(nrow(cont_table) - 1, ncol(cont_table) - 1) # min(r-1, c-1)
cramerV <- sqrt(chisq / (sum(cont_table) * ddl)) # Cramér's V, between 0 and 1
cramerV
## X-squared 
## 0.2617944

Barplots are obtained using the contingency table as well:

selected_stages <- selected_stages & 
  (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
cont_table <- table(tdf_data$Type[selected_stages, drop = TRUE], 
                    tdf_data$Winner_Country[selected_stages, drop = TRUE])
barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
        cex.names = 0.6)

barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
        beside = TRUE, col = brewer.pal(3, "Set3"), cex.names = 0.6)

Numeric versus numeric

Covariances can be obtained using the function cov:

cov(tdf_data$Distance, tdf_data$Year)
## [1] -1332.007
cov(tdf_data$Distance, tdf_data$Year, method = "spearman")
## [1] -181210.6
cov(tdf_data[ ,c("Distance", "Year", "DScaled")])
##             Distance        Year   DScaled
## Distance  8131.78045 -1332.00737  90.17639
## Year     -1332.00737   944.15261 -14.77113
## DScaled     90.17639   -14.77113   1.00000
cor(tdf_data$Distance, tdf_data$Year)
## [1] -0.4807206
cor(tdf_data$Distance, tdf_data$Year, method = "spearman")
## [1] -0.4347636
cor(tdf_data[ ,c("Distance", "Year", "DScaled")])
##            Distance       Year    DScaled
## Distance  1.0000000 -0.4807206  1.0000000
## Year     -0.4807206  1.0000000 -0.4807206
## DScaled   1.0000000 -0.4807206  1.0000000

and correlations are obtained with the function cor, which takes the same arguments as cov.

Partial correlation between Distance and Year given DScaled is obtained with:

cor(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals,
    lm(tdf_data$Year ~ tdf_data$DScaled)$residuals)
## [1] -0.03113725

Exercise: Is this an expected result? Look at the residuals of the first model: what can you say?

summary(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -4.890e-12 -2.640e-16  2.373e-15  0.000e+00  4.936e-15  4.392e-14

Dot plots (scatterplots) between two variables are obtained with:

plot(tdf_data$Year, tdf_data$Distance, pch = 19)

or with scatterplot matrices:

scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")])

scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")],
                  col = "black", regLine = FALSE, smooth = FALSE, pch = "+")

Exercise: Can you comment on the specific plot between Distance and DScaled?

Numeric versus factor

Within-group and between-group variances of Distance with respect to Winner_Country are obtained from the function anova (in the column Mean Sq; the within-group variance corresponds to the row Residuals):

selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
anova(lm(tdf_data$Distance[selected_stages] ~ 
           factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
## Analysis of Variance Table
## 
## Response: tdf_data$Distance[selected_stages]
##                                                                 Df   Sum Sq
## factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])   16  2312902
## Residuals                                                     2139 15488666
##                                                               Mean Sq F value
## factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])  144556  19.963
## Residuals                                                        7241        
##                                                                  Pr(>F)    
## factor(tdf_data$Winner_Country[selected_stages, drop = TRUE]) < 2.2e-16 ***
## Residuals                                                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and the correlation ratio is the square root of the ratio between the between-group sum of squares and the total sum of squares:

cor_ratio <- anova(lm(tdf_data$Distance[selected_stages] ~ 
           factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
sqrt(cor_ratio[1, "Sum Sq"] / (cor_ratio[1, "Sum Sq"] + cor_ratio[2, "Sum Sq"]))
## [1] 0.3604537

Parallel boxplots are obtained using the same formula syntax (with ~):

boxplot(tdf_data$Distance[selected_stages] ~ 
          tdf_data$Winner_Country[selected_stages, drop = TRUE], las = 3,
        xlab = "winner country", ylab = "distance")

To obtain multiple histograms on the same plot, one possibility (a sketch using semi-transparent colors built with alpha() from the scales package loaded above) is:
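brk <- seq(0, 500, by = 25) # common breaks so that the bars are comparable
hist(tdf_data$Distance[tdf_data$Winner_Country == "FRA"], breaks = brk,
     freq = FALSE, col = alpha("darkred", 0.5), xlab = "distance (km)",
     main = "Distance by winner country")
hist(tdf_data$Distance[tdf_data$Winner_Country == "BEL"], breaks = brk,
     freq = FALSE, col = alpha("darkblue", 0.5), add = TRUE)
legend("topright", fill = alpha(c("darkred", "darkblue"), 0.5),
       legend = c("FRA", "BEL"))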

Tests

With one variable

To test whether the average of Distance is equal to 200:

  • with a parametric test, we first need to test whether the variable follows a Gaussian distribution. Several tests exist for that purpose but we will use the Shapiro-Wilk test:
shapiro.test(tdf_data$Distance)
## 
##  Shapiro-Wilk normality test
## 
## data:  tdf_data$Distance
## W = 0.96778, p-value < 2.2e-16

Despite the significant deviation from normality, we will perform a Student test (for the sake of the example):

res <- t.test(tdf_data$Distance, mu = 200, conf.level = 0.99)
res
## 
##  One Sample t-test
## 
## data:  tdf_data$Distance
## t = -1.6869, df = 2235, p-value = 0.09176
## alternative hypothesis: true mean is not equal to 200
## 99 percent confidence interval:
##  191.8666 201.6994
## sample estimates:
## mean of x 
##   196.783
res$statistic
##         t 
## -1.686922
res$p.value
## [1] 0.09175796
res$conf.int
## [1] 191.8666 201.6994
## attr(,"conf.level")
## [1] 0.99
  • with a non-parametric Wilcoxon test (which tests the median instead of the mean):
res <- wilcox.test(tdf_data$Distance, mu = 200, conf.int = TRUE)
res
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  tdf_data$Distance
## V = 1186216, p-value = 0.09295
## alternative hypothesis: true location is not equal to 200
## 95 percent confidence interval:
##  194.4999 200.5000
## sample estimates:
## (pseudo)median 
##          197.5
res$statistic
##       V 
## 1186216
res$p.value
## [1] 0.09294784
res$conf.int
## [1] 194.4999 200.5000
## attr(,"conf.level")
## [1] 0.95

With two factors

For small contingency tables, the independence between rows and columns can be tested with a Fisher’s exact test (to be preferred) or a \(\chi^2\) test (only when Fisher’s exact test is computationally too heavy to run or when the sample size is sufficiently large).

selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
selected_stages <- selected_stages & 
  (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
                        tdf_data$Type[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Type", "Distance")
cont_table <- table(tdf_small$Winner_Country, tdf_small$Type)
cont_table
##      
##       Individual time trial Plain stage Stage with mountain(s)
##                           0           1                      3
##   AUS                     3          14                      0
##   BEL                    37         266                    105
##   COL                     1           0                      8
##   DEN                     0           7                      3
##   ESP                    19          17                     53
##   FRA                    58         369                    184
##   FRG                     4           6                      4
##   GBR                    10          11                      6
##   GER                     9          21                      3
##   ITA                     8         141                     81
##   LUX                     4          26                     29
##   NED                    13         100                     18
##   NOR                     1           2                      1
##   POR                     1           5                      2
##   SUI                    15          26                     13
##   USA                    16           5                     10
res <- fisher.test(cont_table)

# Error in fisher.test(cont_table) : 
#   FEXACT error 7(location). LDSTP=18330 is too small for this problem,
#   (pastp=11.752, ipn_0:=ipoin[itp=563]=3932, stp[ipn_0]=10.2479).
# Increase workspace or consider using 'simulate.p.value=TRUE'
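
As the error message suggests, a possible workaround (a sketch) is to approximate the \(p\)-value by Monte Carlo simulation:

res <- fisher.test(cont_table, simulate.p.value = TRUE, B = 10000)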

In addition to being better suited to large contingency tables, the \(\chi^2\) test also provides interesting statistics for the interpretation of the results:

res <- chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
res
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 241.94, df = 32, p-value < 2.2e-16
res$observed
##      
##       Individual time trial Plain stage Stage with mountain(s)
##                           0           1                      3
##   AUS                     3          14                      0
##   BEL                    37         266                    105
##   COL                     1           0                      8
##   DEN                     0           7                      3
##   ESP                    19          17                     53
##   FRA                    58         369                    184
##   FRG                     4           6                      4
##   GBR                    10          11                      6
##   GER                     9          21                      3
##   ITA                     8         141                     81
##   LUX                     4          26                     29
##   NED                    13         100                     18
##   NOR                     1           2                      1
##   POR                     1           5                      2
##   SUI                    15          26                     13
##   USA                    16           5                     10
res$expected
##      
##       Individual time trial Plain stage Stage with mountain(s)
##                   0.4577343    2.339275               1.202990
##   AUS             1.9453709    9.941921               5.112708
##   BEL            46.6889017  238.606095             122.705003
##   COL             1.0299022    5.263370               2.706728
##   DEN             1.1443358    5.848189               3.007476
##   ESP            10.1845888   52.048879              26.766532
##   FRA            69.9189189  357.324324             183.756757
##   FRG             1.6020702    8.187464               4.210466
##   GBR             3.0897067   15.790109               8.120184
##   GER             3.7763082   19.299022               9.924669
##   ITA            26.3197240  134.508338              69.171938
##   LUX             6.7515814   34.504313              17.744106
##   NED            14.9907993   76.611271              39.397930
##   NOR             0.4577343    2.339275               1.202990
##   POR             0.9154687    4.678551               2.405980
##   SUI             6.1794135   31.580219              16.240368
##   USA             3.5474411   18.129385               9.323174
res$residuals^2
##      
##       Individual time trial  Plain stage Stage with mountain(s)
##                4.577343e-01 7.667582e-01           2.684348e+00
##   AUS          5.717380e-01 1.656421e+00           5.112708e+00
##   BEL          2.010645e+00 3.145041e+00           2.554640e+00
##   COL          8.681835e-04 5.263370e+00           1.035151e+01
##   DEN          1.144336e+00 2.268513e-01           1.858170e-05
##   ESP          7.630301e+00 2.360135e+01           2.571102e+01
##   FRA          2.031791e+00 3.815061e-01           3.219869e-04
##   FRG          3.589148e+00 5.844299e-01           1.052041e-02
##   GBR          1.545524e+01 1.453134e+00           5.535811e-01
##   GER          7.225829e+00 1.499208e-01           4.831501e+00
##   ITA          1.275136e+01 3.133016e-01           2.022541e+00
##   LUX          1.121397e+00 2.096067e+00           7.140126e+00
##   NED          2.643810e-01 7.140368e+00           1.162171e+01
##   NOR          6.424077e-01 4.920662e-02           3.425217e-02
##   POR          7.805344e-03 2.208580e-02           6.850435e-02
##   SUI          1.259064e+01 9.860235e-01           6.465361e-01
##   USA          4.371214e+01 9.508361e+00           4.913489e-02
barplot(t(cont_table), legend.text = TRUE, xlab = "country", ylab = "frequency",
        cex.names = 0.6, col = brewer.pal(3, "Set3"))

Exercise: Titanic[ , Sex = "Male", Age = "Adult", ] is the contingency table of male adults on the Titanic for the variables Class and Survived. Are these two variables independent? Which classes deviate the most from independence? Same questions for women.

## 
##  Fisher's Exact Test for Count Data
## 
## data:  Titanic[, Sex = "Male", Age = "Adult", ]
## p-value = 1.559e-08
## alternative hypothesis: two.sided
## 
##  Pearson's Chi-squared test
## 
## data:  Titanic[, Sex = "Male", Age = "Adult", ]
## X-squared = 37.988, df = 3, p-value = 2.843e-08
##       Survived
## Class   No Yes
##   1st  118  57
##   2nd  154  14
##   3rd  387  75
##   Crew 670 192
##       Survived
## Class        No       Yes
##   1st  139.5171  35.48290
##   2nd  133.9364  34.06359
##   3rd  368.3251  93.67487
##   Crew 687.2214 174.77864
##       Survived
## Class          No        Yes
##   1st   3.3184854 13.0481274
##   2nd   3.0055123 11.8175321
##   3rd   0.9468552  3.7229900
##   Crew  0.4315569  1.6968612
## 
##  Pearson's Chi-squared test
## 
## data:  Titanic[, Sex = "Female", Age = "Adult", ]
## X-squared = 117.31, df = 3, p-value < 2.2e-16
##       Survived
## Class   No Yes
##   1st    4 140
##   2nd   13  80
##   3rd   89  76
##   Crew   3  20
##       Survived
## Class         No       Yes
##   1st  36.931765 107.06824
##   2nd  23.851765  69.14824
##   3rd  42.317647 122.68235
##   Crew  5.898824  17.10118
##       Survived
## Class          No        Yes
##   1st  29.3649961 10.1290651
##   2nd   4.9371943  1.7030196
##   3rd  51.4972412 17.7632889
##   Crew  1.4245515  0.4913801

With two numeric variables

Correlation tests are performed with:

cor.test(tdf_data$Distance, tdf_data$Year)
## 
##  Pearson's product-moment correlation
## 
## data:  tdf_data$Distance and tdf_data$Year
## t = -25.912, df = 2234, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5119713 -0.4481991
## sample estimates:
##        cor 
## -0.4807206
cor.test(tdf_data$Distance, tdf_data$Year, method = "spearman")
## Warning in cor.test.default(tdf_data$Distance, tdf_data$Year, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  tdf_data$Distance and tdf_data$Year
## S = 2673279681, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.4347636
plot(tdf_data$Distance, tdf_data$Year)

Exercise: The object Soils (available with the car package) contains characteristics of soil samples. Variables 6-13 are numerical characteristics. Make a scatterplot matrix of these variables and pick two variables that you think are linearly correlated. Test your hypothesis.

## 
##  Pearson's product-moment correlation
## 
## data:  pH and Ca
## t = 9.3221, df = 46, p-value = 3.614e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6809493 0.8885997
## sample estimates:
##       cor 
## 0.8086293

With a numeric variable explained by a factor (\(K=2\), independent)

Comparing the means of the numeric variable Distance between two groups defined by the country of the winner (Winner_Country), here DEN and FRA:

selected_countries <- c("DEN", "FRA")
selected_stages <- tdf_data$Winner_Country %in% selected_countries
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Distance")

t.test(tdf_small$Distance ~ tdf_small$Winner_Country)
## 
##  Welch Two Sample t-test
## 
## data:  tdf_small$Distance by tdf_small$Winner_Country
## t = -1.688, df = 20.916, p-value = 0.1063
## alternative hypothesis: true difference in means between group DEN and group FRA is not equal to 0
## 95 percent confidence interval:
##  -44.056699   4.584693
## sample estimates:
## mean in group DEN mean in group FRA 
##          198.8889          218.6249
t.test(tdf_small$Distance ~ tdf_small$Winner_Country, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  tdf_small$Distance by tdf_small$Winner_Country
## t = -0.86449, df = 707, p-value = 0.3876
## alternative hypothesis: true difference in means between group DEN and group FRA is not equal to 0
## 95 percent confidence interval:
##  -64.55813  25.08613
## sample estimates:
## mean in group DEN mean in group FRA 
##          198.8889          218.6249

How do we know whether the two variances are equal? With a Bartlett test or a Fisher test:

bartlett.test(tdf_small$Distance ~ tdf_small$Winner_Country)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  tdf_small$Distance by tdf_small$Winner_Country
## Bartlett's K-squared = 11.104, df = 1, p-value = 0.0008615
var.test(tdf_small$Distance ~ tdf_small$Winner_Country)
## 
##  F test to compare two variances
## 
## data:  tdf_small$Distance by tdf_small$Winner_Country
## F = 0.23814, num df = 17, denom df = 690, p-value = 0.001182
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1326281 0.5380715
## sample estimates:
## ratio of variances 
##          0.2381388

Comparing the medians of the numeric variable Distance between the two winner countries, FRA and DEN:

wilcox.test(tdf_small$Distance ~ tdf_small$Winner_Country)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  tdf_small$Distance by tdf_small$Winner_Country
## W = 5467, p-value = 0.381
## alternative hypothesis: true location shift is not equal to 0

Exercise: In the dataset Soils, does the pH significantly differ between samples on depressions and on slopes (variable Contour)? Visually confirm with the appropriate plot.

## 
##  Shapiro-Wilk normality test
## 
## data:  Soils$pH[Soils$Contour == "Depression"]
## W = 0.96308, p-value = 0.7179
## 
##  Shapiro-Wilk normality test
## 
## data:  Soils$pH[Soils$Contour == "Slope"]
## W = 0.89835, p-value = 0.07564
## 
##  Bartlett test of homogeneity of variances
## 
## data:  pH by Contour
## Bartlett's K-squared = 2.8034, df = 1, p-value = 0.09407
## 
##  Two Sample t-test
## 
## data:  pH by Contour
## t = -0.21586, df = 30, p-value = 0.8306
## alternative hypothesis: true difference in means between group Depression and group Slope is not equal to 0
## 95 percent confidence interval:
##  -0.5688226  0.4600726
## sample estimates:
## mean in group Depression      mean in group Slope 
##                 4.691875                 4.746250

With a numeric variable explained by a factor (\(K > 2\), independent)

ANOVA is based on fitting a linear model first. For instance, if we want to check whether the mean of Distance differs between stage types (Type, on a restricted dataset), we run (under normality and equal-variance assumptions):

selected_stages <- (tdf_data$Type %in% 
                      c("Individual time trial", "Plain stage",
                        "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
anova(lm(tdf_small$Distance ~ tdf_small$Type))
## Analysis of Variance Table
## 
## Response: tdf_small$Distance
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## tdf_small$Type    2 6246184 3123092   580.9 < 2.2e-16 ***
## Residuals      1785 9596640    5376                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Kruskal-Wallis nonparametric test is performed with:

kruskal.test(tdf_small$Distance ~ tdf_small$Type)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  tdf_small$Distance by tdf_small$Type
## Kruskal-Wallis chi-squared = 532.8, df = 2, p-value < 2.2e-16

Exercise: In the dataset Soils, does the pH significantly differ between the different contours? Visually confirm with the appropriate plot.

## 
##  Shapiro-Wilk normality test
## 
## data:  Soils$pH[Soils$Contour == "Top"]
## W = 0.92004, p-value = 0.1689
## 
##  Bartlett test of homogeneity of variances
## 
## data:  pH by Contour
## Bartlett's K-squared = 3.1948, df = 2, p-value = 0.2024
## Analysis of Variance Table
## 
## Response: pH
##           Df  Sum Sq Mean Sq F value Pr(>F)
## Contour    2  0.2607 0.13033  0.2799 0.7572
## Residuals 45 20.9546 0.46566

With a numeric variable in two paired samples (two groups with the same individuals)

The sleep data show the effect (extra) of two soporific drugs (group) on 10 patients (ID):

head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
tail(sleep)
##    extra group ID
## 15  -0.1     2  5
## 16   4.4     2  6
## 17   5.5     2  7
## 18   1.6     2  8
## 19   4.6     2  9
## 20   3.4     2 10
dim(sleep)
## [1] 20  3
# reshape to wide format: one row per patient, one column of extra per drug
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")
dim(sleep2)
## [1] 10  3
head(sleep2)
##   ID extra.1 extra.2
## 1  1     0.7     1.9
## 2  2    -1.6     0.8
## 3  3    -0.2     1.1
## 4  4    -1.2     0.1
## 5  5    -0.1    -0.1
## 6  6     3.4     4.4

Paired comparison tests are performed with (under the normality assumption):

with(sleep2, t.test(extra.1, extra.2, paired = TRUE))
## 
##  Paired t-test
## 
## data:  extra.1 and extra.2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean difference 
##           -1.58

and with a nonparametric test:

with(sleep2, wilcox.test(extra.1, extra.2, paired = TRUE))
## Warning in wilcox.test.default(extra.1, extra.2, paired = TRUE): cannot compute
## exact p-value with ties
## Warning in wilcox.test.default(extra.1, extra.2, paired = TRUE): cannot compute
## exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  extra.1 and extra.2
## V = 0, p-value = 0.009091
## alternative hypothesis: true location shift is not equal to 0

If there is a large number of ties, this \(p\)-value should not be used but it can be trusted if the number of ties is small.

boxplot(extra ~ group, data = sleep)

We can check that these tests are equivalent to tests on the difference:

sleep2$diff <- sleep2$extra.2 - sleep2$extra.1
head(sleep2)
##   ID extra.1 extra.2 diff
## 1  1     0.7     1.9  1.2
## 2  2    -1.6     0.8  2.4
## 3  3    -0.2     1.1  1.3
## 4  4    -1.2     0.1  1.3
## 5  5    -0.1    -0.1  0.0
## 6  6     3.4     4.4  1.0
t.test(x = sleep2$diff)
## 
##  One Sample t-test
## 
## data:  sleep2$diff
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of x 
##      1.58
wilcox.test(x = sleep2$diff)
## Warning in wilcox.test.default(x = sleep2$diff): cannot compute exact p-value
## with ties
## Warning in wilcox.test.default(x = sleep2$diff): cannot compute exact p-value
## with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  sleep2$diff
## V = 45, p-value = 0.009091
## alternative hypothesis: true location is not equal to 0

Linear regression and models

A numeric variable explained by one or several numeric variables

res <- lm(tdf_data$Distance ~ tdf_data$Year)
res
## 
## Call:
## lm(formula = tdf_data$Distance ~ tdf_data$Year)
## 
## Coefficients:
##   (Intercept)  tdf_data$Year  
##      2970.493         -1.411
summary(res)
## 
## Call:
## lm(formula = tdf_data$Distance ~ tdf_data$Year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -209.37  -43.86   16.05   53.42  225.88 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2970.49317  107.05746   27.75   <2e-16 ***
## tdf_data$Year   -1.41080    0.05445  -25.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.09 on 2234 degrees of freedom
## Multiple R-squared:  0.2311, Adjusted R-squared:  0.2307 
## F-statistic: 671.4 on 1 and 2234 DF,  p-value: < 2.2e-16
coefficients(res)
##   (Intercept) tdf_data$Year 
##   2970.493172     -1.410797
confint(res)
##                     2.5 %      97.5 %
## (Intercept)   2760.550672 3180.435673
## tdf_data$Year   -1.517567   -1.304026
plot(res)

# See http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
# for details on validation plots
tdf_small <- tdf_data[, c("Distance", "Year")]
head(tdf_small)
##   Distance Year
## 1     14.0 2017
## 2    203.5 2017
## 3    212.5 2017
## 4    207.5 2017
## 5    160.5 2017
## 6    216.0 2017
res <- lm(Distance ~ ., data = tdf_small)
summary(res)
## 
## Call:
## lm(formula = Distance ~ ., data = tdf_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -209.37  -43.86   16.05   53.42  225.88 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2970.49317  107.05746   27.75   <2e-16 ***
## Year          -1.41080    0.05445  -25.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.09 on 2234 degrees of freedom
## Multiple R-squared:  0.2311, Adjusted R-squared:  0.2307 
## F-statistic: 671.4 on 1 and 2234 DF,  p-value: < 2.2e-16
plot(res)
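
The fitted model can then be used for prediction; for instance (a sketch, for a year within the observed range):

predict(res, newdata = data.frame(Year = 2000), interval = "prediction")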

A numeric variable explained by one (or several) factor(s)

selected_stages <- (tdf_data$Type %in% 
                      c("Individual time trial", "Plain stage",
                        "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
head(tdf_small)
##                    Type Distance
## 1 Individual time trial     14.0
## 2 Individual time trial     22.5
## 3 Individual time trial     37.5
## 4 Individual time trial     13.8
## 5 Individual time trial     54.0
## 6 Individual time trial     33.0
res <- lm(Distance ~ Type, data = tdf_small)
summary(res)
## 
## Call:
## lm(formula = Distance ~ Type, data = tdf_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -201.30  -39.27  -10.44   25.73  255.73 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  39.111      5.121   7.637 3.59e-14 ***
## TypePlain stage             187.160      5.597  33.437  < 2e-16 ***
## TypeStage with mountain(s)  181.789      6.031  30.144  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.32 on 1785 degrees of freedom
## Multiple R-squared:  0.3943, Adjusted R-squared:  0.3936 
## F-statistic: 580.9 on 2 and 1785 DF,  p-value: < 2.2e-16
anova(res)
## Analysis of Variance Table
## 
## Response: Distance
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## Type         2 6246184 3123092   580.9 < 2.2e-16 ***
## Residuals 1785 9596640    5376                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise: Explain the pH of Soils with an additive effect of Contour and Ca, then with an additive effect of Contour and Depth, with and without their interaction.

## 
## Call:
## lm(formula = pH ~ Contour + Ca, data = Soils)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55729 -0.25762 -0.08993  0.16989  1.10483 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.41955    0.14900  22.951  < 2e-16 ***
## ContourSlope -0.12625    0.12918  -0.977  0.33377    
## ContourTop   -0.44012    0.13146  -3.348  0.00168 ** 
## Ca            0.17917    0.01666  10.754 6.76e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3623 on 44 degrees of freedom
## Multiple R-squared:  0.7278, Adjusted R-squared:  0.7092 
## F-statistic: 39.21 on 3 and 44 DF,  p-value: 1.711e-12
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = pH ~ Contour + Depth, data = Soils)
## 
## $Contour
##                       diff        lwr       upr     p adj
## Slope-Depression  0.054375 -0.2701619 0.3789119 0.9129085
## Top-Depression   -0.121875 -0.4464119 0.2026619 0.6356018
## Top-Slope        -0.176250 -0.5007869 0.1482869 0.3925120
## 
## $Depth
##                   diff        lwr         upr     p adj
## 10-30-0-10  -0.3933333 -0.8059384  0.01927173 0.0666195
## 30-60-0-10  -1.1191667 -1.5317717 -0.70656160 0.0000000
## 60-90-0-10  -1.4000000 -1.8126051 -0.98739493 0.0000000
## 30-60-10-30 -0.7258333 -1.1384384 -0.31322827 0.0001569
## 60-90-10-30 -1.0066667 -1.4192717 -0.59406160 0.0000004
## 60-90-30-60 -0.2808333 -0.6934384  0.13177173 0.2781588

## 
## Call:
## lm(formula = pH ~ Contour + Depth + Contour:Depth, data = Soils)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7125 -0.1775 -0.0125  0.1681  1.3875 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5.3525     0.1951  27.438  < 2e-16 ***
## ContourSlope              0.1550     0.2759   0.562 0.577703    
## ContourTop               -0.0200     0.2759  -0.072 0.942608    
## Depth10-30               -0.4725     0.2759  -1.713 0.095366 .  
## Depth30-60               -0.9900     0.2759  -3.589 0.000982 ***
## Depth60-90               -1.1800     0.2759  -4.277 0.000133 ***
## ContourSlope:Depth10-30   0.2475     0.3901   0.634 0.529848    
## ContourTop:Depth10-30    -0.0100     0.3901  -0.026 0.979693    
## ContourSlope:Depth30-60  -0.2500     0.3901  -0.641 0.525723    
## ContourTop:Depth30-60    -0.1375     0.3901  -0.352 0.726571    
## ContourSlope:Depth60-90  -0.4000     0.3901  -1.025 0.312085    
## ContourTop:Depth60-90    -0.2600     0.3901  -0.666 0.509396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3901 on 36 degrees of freedom
## Multiple R-squared:  0.7417, Adjusted R-squared:  0.6628 
## F-statistic: 9.398 on 11 and 36 DF,  p-value: 1.188e-07
## Analysis of Variance Table
## 
## Response: pH
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## Contour        2  0.2607  0.1303  0.8562    0.4332    
## Depth          3 14.9590  4.9863 32.7582 2.162e-10 ***
## Contour:Depth  6  0.5159  0.0860  0.5648    0.7553    
## Residuals     36  5.4798  0.1522                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A binary variable explained by one (or several) numeric predictor(s)

With a factor response, the binomial family treats the first level (here "Individual time trial") as failure and all other levels as success, so the model below explains the probability that a stage is not an individual time trial:

res <- glm(Type ~ Distance, data = tdf_small, family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(res)
## 
## Call:
## glm(formula = Type ~ Distance, family = binomial(link = "logit"), 
##     data = tdf_small)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1960   0.0001   0.0041   0.0175   2.9673  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.94801    0.57108  -10.41   <2e-16 ***
## Distance     0.07949    0.00750   10.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1273.54  on 1787  degrees of freedom
## Residual deviance:  201.89  on 1786  degrees of freedom
## AIC: 205.89
## 
## Number of Fisher Scoring iterations: 10
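
The fitted model can then be used to predict this probability for a given distance, for instance for a 50 km stage:

predict(res, newdata = data.frame(Distance = 50), type = "response")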

Multiple testing correction

If, for each of the 9 numeric variables in Soils, we test whether its mean differs between the levels of Group, we obtain 9 p-values on which a multiple testing correction can be performed:

all_pvals <- apply(Soils[ ,6:14], 2, function(avar) 
  anova(lm(avar ~ Soils$Group))[1, "Pr(>F)"])
all_pvals
##           pH            N         Dens            P           Ca           Mg 
## 1.187636e-07 5.331419e-10 3.456879e-08 1.861579e-09 1.333073e-10 1.579912e-03 
##            K           Na       Conduc 
## 2.456518e-06 1.369506e-15 3.425749e-17
p.adjust(all_pvals, method = "BH")
##           pH            N         Dens            P           Ca           Mg 
## 1.526961e-07 1.199569e-09 5.185319e-08 3.350841e-09 3.999220e-10 1.579912e-03 
##            K           Na       Conduc 
## 2.763582e-06 6.162778e-15 3.083174e-16
p.adjust(all_pvals, method = "bonferroni")
##           pH            N         Dens            P           Ca           Mg 
## 1.068873e-06 4.798277e-09 3.111191e-07 1.675421e-08 1.199766e-09 1.421921e-02 
##            K           Na       Conduc 
## 2.210866e-05 1.232556e-14 3.083174e-16
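
The number of variables that remain significant at the 5% level after each correction is then:

sum(p.adjust(all_pvals, method = "BH") < 0.05)
sum(p.adjust(all_pvals, method = "bonferroni") < 0.05)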

PCA

data("USArrests")
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
pca_usa <- PCA(USArrests, graph = FALSE)
fviz_eig(pca_usa)

fviz_pca_var(pca_usa)

fviz_pca_ind(pca_usa)

fviz_pca_ind(pca_usa, col.ind = USArrests$UrbanPop)
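
The numeric results behind these plots can be inspected directly from the PCA object:

pca_usa$eig                # eigenvalues and percentages of explained variance
pca_usa$var$contrib[, 1:2] # variable contributions to the first two components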

Exercise: Perform PCA of the dataset athle_records.csv after log-transformation of the data.
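
One possible starting point, assuming that athle_records.csv stores the country names in its first column and one record time per column:

athle <- read.table("athle_records.csv", sep = ",", header = TRUE,
                    row.names = 1, encoding = "UTF-8")
pca_athle <- PCA(log(athle), graph = FALSE)
fviz_eig(pca_athle)
fviz_pca_var(pca_athle)
fviz_pca_ind(pca_athle)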

Clustering

scaled_usa <- scale(USArrests)                 # centre and scale each variable
dist_usa <- dist(scaled_usa)^2                 # squared Euclidean distances
hc_usa <- hclust(dist_usa, method = "ward.D")  # "ward.D" expects squared distances
plot(hc_usa)

# cumulated merge heights: within-cluster inertia for 49, 48, ..., 1 clusters
within_ss <- cumsum(hc_usa$height)
plot(49:1, within_ss, type = "b")

1 - within_ss[46] / within_ss[49] # explained inertia with 4 clusters
## [1] 0.704374
plot(hc_usa)
rect.hclust(hc_usa, k = 4)

clust_hc <- cutree(hc_usa, k = 4)
clust_hc
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3
fviz_pca_ind(pca_usa, col.ind = factor(clust_hc))

# initialise k-means with the centroids of the 4 hierarchical clusters
init_center <- by(scaled_usa, clust_hc, colMeans)
init_center <- Reduce("rbind", init_center)
clust_kmeans <- kmeans(scaled_usa, centers = init_center)
clust_kmeans
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
fviz_pca_ind(pca_usa, col.ind = factor(clust_kmeans$cluster))
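
Cross-tabulating the two partitions shows how many states k-means reallocated:

table(hierarchical = clust_hc, kmeans = clust_kmeans$cluster)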

Exercise: Perform clustering of the dataset athle_records.csv after log-transformation and scaling of the data.

With hierarchical clustering:
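
One possible solution, reusing the athle data loaded for the PCA exercise:

athle_scaled <- scale(log(athle))
hc_athle <- hclust(dist(athle_scaled)^2, method = "ward.D")
plot(hc_athle)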

The percentage of explained inertia for 5 clusters is:
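
One way to obtain the two values below (the number of countries, then the proportion of explained inertia):

within_ss <- cumsum(hc_athle$height)
nrow(athle)
1 - within_ss[nrow(athle) - 5] / within_ss[nrow(athle) - 1]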

## [1] 26
## [1] 0.6657771

With \(k\)-means:
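
Mirroring the USArrests analysis, k-means is presumably initialised with the centroids of the 5 hierarchical clusters:

clust_hc_athle <- cutree(hc_athle, k = 5)
init_athle <- Reduce("rbind", by(athle_scaled, clust_hc_athle, colMeans))
kmeans(athle_scaled, centers = init_athle)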

## K-means clustering with 5 clusters of sizes 15, 7, 2, 1, 1
## 
## Cluster means:
##        X100m      X200m      X400m       X800m     X1500m     X5000m    X10000m
## 1 -0.4023700 -0.3964851 -0.3822649 -0.53348564 -0.4776683 -0.3296392 -0.2340087
## 2  0.6096652  0.5957451  0.5851421  0.98463765  0.9034216  0.6761097  0.4313484
## 3  1.5416791  1.0955510  0.1404281 -0.03487566 -1.1948203 -1.8237604 -1.9512698
## 4  1.0968543  1.8598697  1.9591817  0.43194150  1.5619468  2.5184550  2.8690683
## 5 -2.4123191 -2.2739102 -0.6020593  0.74763079  1.6687663  1.3408854  1.5241629
##   SemiMarathon   Marathon
## 1   -0.3108179 -0.3790193
## 2    0.3391017  0.3475830
## 3   -1.2471064 -0.9938149
## 4    3.5902911  3.7784209
## 5    1.1924792  1.4614181
## 
## Clustering vector:
##       Australie        Belgique          Brésil      RoyaumeUni          Canada 
##               1               1               1               1               1 
##           Chine         Croatie        Ethiopie          France       Allemagne 
##               2               2               3               1               1 
##            Inde            Iran          Italie        Jamaïque           Japon 
##               2               4               1               5               2 
##           Kenya        Lituanie NouvelleZélande        Portugal          Russie 
##               3               2               2               1               1 
##    AfriqueduSud         Espagne           Suède          Suisse         Ukraine 
##               1               1               2               1               1 
##             USA 
##               1 
## 
## Within cluster sum of squares by cluster:
## [1] 40.471672 23.958026  9.062617  0.000000  0.000000
##  (between_SS / total_SS =  67.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Session information

This document has been compiled on the following system:

sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] factoextra_1.0.7   ggplot2_3.3.6      FactoMineR_2.7     scales_1.2.0      
## [5] car_3.0-13         carData_3.0-5      RColorBrewer_1.1-3 lubridate_1.8.0   
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.9.1        Rcpp_1.0.8.3         mvtnorm_1.1-3       
##  [4] lattice_0.20-45      tidyr_1.2.0          multcompView_0.1-8  
##  [7] assertthat_0.2.1     digest_0.6.29        utf8_1.2.2          
## [10] R6_2.5.1             backports_1.4.1      evaluate_0.15       
## [13] coda_0.19-4          highr_0.9            pillar_1.7.0        
## [16] rlang_1.0.2          rstudioapi_0.13      jquerylib_0.1.4     
## [19] DT_0.27              rmarkdown_2.14       labeling_0.4.2      
## [22] stringr_1.4.0        htmlwidgets_1.5.4    munsell_0.5.0       
## [25] broom_0.8.0          compiler_4.2.2       xfun_0.30           
## [28] pkgconfig_2.0.3      htmltools_0.5.2      flashClust_1.01-2   
## [31] tidyselect_1.1.2     tibble_3.1.7         fansi_1.0.3         
## [34] ggpubr_0.4.0         crayon_1.5.1         dplyr_1.0.9         
## [37] withr_2.5.0          MASS_7.3-58.2        leaps_3.1           
## [40] grid_4.2.2           jsonlite_1.8.0       xtable_1.8-4        
## [43] gtable_0.3.0         lifecycle_1.0.1      DBI_1.1.3           
## [46] magrittr_2.0.3       estimability_1.4.1   cli_3.3.0           
## [49] stringi_1.7.8        farver_2.1.0         ggsignif_0.6.3      
## [52] scatterplot3d_0.3-42 bslib_0.3.1          ellipsis_0.3.2      
## [55] vctrs_0.4.1          generics_0.1.2       tools_4.2.2         
## [58] glue_1.6.2           purrr_0.3.4          emmeans_1.8.5       
## [61] abind_1.4-5          fastmap_1.1.0        yaml_2.3.5          
## [64] colorspace_2.0-3     cluster_2.1.4        rstatix_0.7.0       
## [67] knitr_1.39           sass_0.4.1