13 Aggregate data with tidyverse
After having seen how to organize and manipulate a dataframe, let’s start analyzing them! In particular, in this chapter we will see how to aggregate data based on values of one or more columns.
The first step, as always, is to load the packages and the data:
# 1. Load packages
library(tidyverse)
# 2. Load data
<- read.csv("data/Supplementary_table_13.csv")
df head(df)
SampleID Diagnosis Braak sex AOD PMI RIN CDR Area dataset
1 1005_TCX AD 6 F 90 8 8.6 NA TL MAYO
2 1019_TCX AD 6 F 86 4 7.8 NA TL MAYO
3 1029_TCX AD 6 F 69 4 9.7 NA TL MAYO
4 1034_TCX AD 6 F 88 5 8.9 NA TL MAYO
5 1045_TCX AD 5 F 90 NA 8.4 NA TL MAYO
6 1046_TCX AD 6 F 72 2 9.0 NA TL MAYO
Then, we will change the column types (as in last chapter):
<- df %>%
df mutate("Diagnosis" = factor(Diagnosis, levels = c("Control", "AD")),
"sex" = factor(sex),
"Area" = factor(Area),
"dataset" = factor(dataset))
str(df)
'data.frame': 1493 obs. of 10 variables:
$ SampleID : chr "1005_TCX" "1019_TCX" "1029_TCX" "1034_TCX" ...
$ Diagnosis: Factor w/ 2 levels "Control","AD": 2 2 2 2 2 2 2 2 2 2 ...
$ Braak : num 6 6 6 6 5 6 5.5 6 6 5 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ AOD : num 90 86 69 88 90 72 85 82 77 90 ...
$ PMI : num 8 4 4 5 NA 2 NA 15 NA NA ...
$ RIN : num 8.6 7.8 9.7 8.9 8.4 9 8 10 8.6 7.9 ...
$ CDR : num NA NA NA NA NA NA NA NA NA NA ...
$ Area : Factor w/ 6 levels "BM10","BM22",..: 6 6 6 6 6 6 6 6 6 6 ...
$ dataset : Factor w/ 3 levels "MAYO","MSBB",..: 1 1 1 1 1 1 1 1 1 1 ...
We are now ready to start our very first analyses!
Table
The very first function we will explore is table
. It is used on vectors or dataframe columns and it returns the numerosity of each different value of them; when used for 2 vectors/columns, it returns a contingency matrix.
Let’s see:
table(df$sex)
F M
890 603
This is the simplest situation, it is not used so much (remember, summary does the same); but, it can be useful if combined with sum
to get the fraction of each value. In fact, dividing the table for its sum, we can get the fraction of each value:
# 1. Create the table
<- table(df$sex)
tbl_sex
# 2. Calculate the sum of the table
<- sum(tbl_sex)
sum_tbl_sex
# 3. Calculate the fraction
/ sum_tbl_sex tbl_sex
F M
0.5961152 0.4038848
So, ~ 60% of samples comes from females and ~ 40% from males. Yes, we could have just written table(df$sex)/sum(table(df$sex))
, one line, less stored variables.
We can do the same with two columns, and you will find doing it a lot during your analysis:
<- table("sex" = df$sex, "Diagnosis" = df$Diagnosis)
tbl_sex_diagnosis tbl_sex_diagnosis
Diagnosis
sex Control AD
F 341 549
M 285 318
We can also specify names associated to the vectors/columns we use, and they will be displayed in the output. This can be usefule when dealing with two variables whose values are 0/1 or TRUE/FALSE in both.
And if I want to see the fraction of females and males that are AD or Control? Right, we can do it. If you try the previous approach, so dividing for the sum of the table, you will get the proportion of each category NOT divided by sex:
/ sum(tbl_sex_diagnosis) tbl_sex_diagnosis
Diagnosis
sex Control AD
F 0.2283992 0.3677160
M 0.1908908 0.2129940
To get what you really want, you have to use one of colSums
or rowSums
depending on what you want to sum. Let’s see how they work:
# 1. rowSums
<- rowSums(tbl_sex_diagnosis)
tbl_sex_diagnosis_rowsum tbl_sex_diagnosis_rowsum
F M
890 603
# 2. colSums
<- colSums(tbl_sex_diagnosis)
tbl_sex_diagnosis_colsum tbl_sex_diagnosis_colsum
Control AD
626 867
They perform the sum of the values in each row or the values in each column. So, we can use these values and divide our initial table to get the proportion of each sex value in each diagnosis and vice versa.
Remember: when dividing a table for a vector, it performs the division in this way: each element of the first row are divided for the first element of the vector, each element of the second row for the second element of the vector and so on.
/ tbl_sex_diagnosis_rowsum tbl_sex_diagnosis
Diagnosis
sex Control AD
F 0.3831461 0.6168539
M 0.4726368 0.5273632
We can see that ~38% of females are controls and ~62% of them have AD, while for men the proportion is quite balanced with ~47% of Control samples and ~53% of AD.
We can also check what is the proportion of males and females in each Diagnosis condition. To do so, we have to apply a little trick: in fact, we have to transpose our initial table, otherwise the division would not be applied as we want (on each level of Diagnosis), but as before (on each level of sex). We then transpose the result to obtain a table with the same orientation as the initial one.
t(t(tbl_sex_diagnosis) / tbl_sex_diagnosis_colsum)
Diagnosis
sex Control AD
F 0.5447284 0.6332180
M 0.4552716 0.3667820
And here it is: ~54% of controls and ~63% of AD are females.
Group by variables and summarize values of another
The previous function is wonderful also to present data, but you will find yourself dealing with contingency table with more that 2 variables, so here I present you one of the greatest tool for analyzing dataframe: the magical combination of group_by()
and summarize()
from tidyverse.
Let’s see an example and then analyze how it works:
# Suppress summarise info
options(dplyr.summarise.inform = FALSE)
# creating a new df with summarized counts
<- df %>%
df_grouped group_by(sex, Diagnosis, dataset) %>%
summarise(counts = n())
df_grouped
# A tibble: 12 × 4
# Groups: sex, Diagnosis [4]
sex Diagnosis dataset counts
<fct> <fct> <fct> <int>
1 F Control MAYO 37
2 F Control MSBB 186
3 F Control ROSMAP 118
4 F AD MAYO 49
5 F AD MSBB 353
6 F AD ROSMAP 147
7 M Control MAYO 41
8 M Control MSBB 170
9 M Control ROSMAP 74
10 M AD MAYO 33
11 M AD MSBB 221
12 M AD ROSMAP 64
Great! We now have the number of samples for each combination of sex, Diagnosis and dataset.
It is important to understand how this code works:
-
options(dplyr.summarise.inform = FALSE)
: it is a option you can insert at the beginning of a markdown/script, after loading packages, that disable some useless messages when using summarize after group_by -
group_by(sex, Diagnosis, dataset)
: it tells dplyr (a tidyverse package) which groups are defined in this dataframe. In this case, we have 12 groups (2 sex * 2 Diagnosis * 3 dataset) -
summarise()
: takes as input a dataframe with pre-defined groups and performs action on each of them -
counts = n()
: creates a new column called counts filled by the result of functionn()
, which calculates the number of elements in each group
Wow, this is cool… but we can do more things with these functions: in fact, we can use other functions inside summarize()
and pass a column in them; the column is divided into pre-defined group and the function is run on each of them.
For example, let’s calculate the mean and sd values of PMI for each combination of sex and Diagnosis:
%>%
df group_by(sex, Diagnosis) %>%
summarise(n = n(), mean = mean(PMI), sd = sd(PMI))
# A tibble: 4 × 5
# Groups: sex [2]
sex Diagnosis n mean sd
<fct> <fct> <int> <dbl> <dbl>
1 F Control 341 NA NA
2 F AD 549 NA NA
3 M Control 285 NA NA
4 M AD 318 NA NA
Oh, what happened? Why are they all NA? That’s because if a vector contains NAs, functions like mean, min, max, sd etc returns NA, unless na.rm = T
is provided to the functions, which remove NAs from the original vector.
%>%
df group_by(sex, Diagnosis) %>%
summarise(n = n(), mean = mean(PMI, na.rm = T), sd = sd(PMI, na.rm = T))
# A tibble: 4 × 5
# Groups: sex [2]
sex Diagnosis n mean sd
<fct> <fct> <int> <dbl> <dbl>
1 F Control 341 7.98 5.81
2 F AD 549 6.21 4.26
3 M Control 285 9.68 6.99
4 M AD 318 7.10 4.72
Great! We can see that AD samples have a lower PMI value in both males and females. This can already be an indication that some differences are present.
I guarantee you will find yourself doing this a lot in your life.
To conclude this section, let’s use these functions to recreate the proportion table we calculate in the previous section:
%>%
df group_by(sex, Diagnosis) %>%
summarize(counts = n()) %>%
group_by(sex) %>%
reframe(Diagnosis = Diagnosis, proportion = counts/sum(counts))
# A tibble: 4 × 3
sex Diagnosis proportion
<fct> <fct> <dbl>
1 F Control 0.383
2 F AD 0.617
3 M Control 0.473
4 M AD 0.527
Just a couple of considerations:
- After the first summarize, we have grouped again but only by sex, to calculate the proportion of each value of Diagnosis in each sex
-
We used
reframe
instead of summarize just because summarize would have give a warning, but it acts the same in this scenario -
We insert
Diagnosis = Diagnosis
in reframe because otherwise we would have lost this information. Remember: summarize and reframe would return only the columns of the last group_by and the ones calculated in them.
Expand dataframe column using pivot wider
Even though the previous code works fine, I know that comparing in a table-like structure is better sometimes. So here is a function that can help us in the interpretation and in the presentation of the results: pivot_larger
.
It is easier to show than to explain:
%>%
df group_by(sex, Diagnosis) %>%
summarize(counts = n()) %>%
group_by(sex) %>%
reframe(Diagnosis = Diagnosis, proportion = counts/sum(counts)) %>%
pivot_wider(id_cols = sex, names_from = Diagnosis, values_from = proportion)
# A tibble: 2 × 3
sex Control AD
<fct> <dbl> <dbl>
1 F 0.383 0.617
2 M 0.473 0.527
That’s exactly the same as before!
Code explanation:
-
id_cols
is used to define the columns used as “rows” -
names_from
is used to define the columns that defines column groups -
values_from
is used to define the columns used to fill the table
Let’s see another example, because the previous one could have been written in 3 lines of code as before.
<- df %>%
df_wider group_by(sex, Diagnosis, dataset) %>%
summarise(n = n(), mean = mean(PMI, na.rm = T), sd = sd(PMI, na.rm = T)) %>%
pivot_wider(id_cols = c(sex, dataset), names_from = Diagnosis, values_from = c(mean, sd), names_vary = "slowest")
df_wider
# A tibble: 6 × 6
# Groups: sex [2]
sex dataset mean_Control sd_Control mean_AD sd_AD
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 F MAYO 7 7.69 6.34 5.80
2 F MSBB 8.60 5.79 5.76 3.91
3 F ROSMAP 7.25 5.18 7.25 4.51
4 M MAYO 5.51 6.42 8.6 5.83
5 M MSBB 11.6 7.44 7.11 4.74
6 M ROSMAP 7.34 4.05 6.58 4.24
Super cool! In this case we used two variables to define rows and two variables to fill values.
Reshape a dataframe with pivot_longer
Although the previous code returns a cool table, it is suitable only to present data, not for any further analysis. As you may remember from the exercise 12.1, this is not how raw data should be organized.
If you find yourself in a situation in which raw data are like that, use pivot_longer()
to fix it.
%>%
df_wider pivot_longer(cols = mean_Control:sd_AD, names_to = c("statistics", "Diagnosis"), names_sep = "_")
# A tibble: 24 × 5
# Groups: sex [2]
sex dataset statistics Diagnosis value
<fct> <fct> <chr> <chr> <dbl>
1 F MAYO mean Control 7
2 F MAYO sd Control 7.69
3 F MAYO mean AD 6.34
4 F MAYO sd AD 5.80
5 F MSBB mean Control 8.60
6 F MSBB sd Control 5.79
7 F MSBB mean AD 5.76
8 F MSBB sd AD 3.91
9 F ROSMAP mean Control 7.25
10 F ROSMAP sd Control 5.18
# ℹ 14 more rows
Let’s explain the code prior to adjust it a bit:
-
cols
defines which value columns to take and reshape -
mean_Control:sd_AD
means “from column mean_Control to column sd_AD” -
names_to
is used to define names for the new columns that stores the column names defined incols
. We set “statistics” and “Diagnosis” -
names_sep
is used when names_to has more that one value, it splits the names based on the given pattern (in our case, as the columns were like statistics_Diagnosis we used “_“)
Now, as it is a bit confusing having statistics column, we can combine the two pivot_
functions in this way:
%>%
df_wider pivot_longer(cols = mean_Control:sd_AD, names_to = c("statistics", "Diagnosis"), names_sep = "_") %>%
pivot_wider(id_cols = c(sex, dataset, Diagnosis), names_from = statistics, values_from = value)
# A tibble: 12 × 5
# Groups: sex [2]
sex dataset Diagnosis mean sd
<fct> <fct> <chr> <dbl> <dbl>
1 F MAYO Control 7 7.69
2 F MAYO AD 6.34 5.80
3 F MSBB Control 8.60 5.79
4 F MSBB AD 5.76 3.91
5 F ROSMAP Control 7.25 5.18
6 F ROSMAP AD 7.25 4.51
7 M MAYO Control 5.51 6.42
8 M MAYO AD 8.6 5.83
9 M MSBB Control 11.6 7.44
10 M MSBB AD 7.11 4.74
11 M ROSMAP Control 7.34 4.05
12 M ROSMAP AD 6.58 4.24
Tadaaaaa!
I think this is enough for this chapter, here are some exercise to practice these functions.
Exercises
Exercise 13.1 This is a multi-question exercise, that resemble a real-life analysis.
Considering our initial dataframe:
- Are there differences between datasets in RIN value? consider max, min and mean
- Which Area has the highest value of AOD? And, is there difference between sex in these values?
- How many NAs are present in CDR for each combination of Diagnosis and dataset?
- Sort the dataframe based on the average vale of PMI for each area, based also on sex
Solution
We will answer one question at a time.
1.
# 1. Calculate mean, min and max values of RIN for each dataset
<- df %>%
df_RIN_dataset group_by(dataset) %>%
summarise(min = min(RIN, na.rm = T), mean = mean(RIN, na.rm = T), max = max(RIN, na.rm = T))
df_RIN_dataset
# A tibble: 3 × 4
dataset min mean max
<fct> <dbl> <dbl> <dbl>
1 MAYO 5.3 8.11 10
2 MSBB 1 6.23 10
3 ROSMAP 5 7.11 9.9
Yes, it seems that MAYO dataset is the one with better RIN values, while MSBB has the worst samples.
2.
# 1. Calculate max value of AOD for each dataset
<- df %>%
df_AOD_dataset group_by(dataset, sex) %>%
summarise(max_AOD = max(AOD, na.rm = T)) %>%
pivot_wider(id_cols = dataset, names_from = sex, values_from = max_AOD)
df_AOD_dataset
# A tibble: 3 × 3
# Groups: dataset [3]
dataset F M
<fct> <dbl> <dbl>
1 MAYO 90 90
2 MSBB 90 90
3 ROSMAP 90 90
No, there is not difference between dataset, even considering sex.
3.
# 1. Calculate number of CDR NAs for each Diagnosis/dataset combination
<- df %>%
df_nas_diag_dataset group_by(Diagnosis, dataset) %>%
summarise(n_nas = sum(is.na(CDR))) %>%
pivot_wider(id_cols = Diagnosis, names_from = dataset, values_from = n_nas)
df_nas_diag_dataset
# A tibble: 2 × 4
# Groups: Diagnosis [2]
Diagnosis MAYO MSBB ROSMAP
<fct> <int> <int> <int>
1 Control 78 0 192
2 AD 82 0 211
It seems that people that have filled MSBB dataset have done a better job in collecting data than the others ahahah
Remember: is.na()
returns a boolean vector, and the sum of a boolean vector is the number of TRUE values in it.
4.
# 1. Sort dataframe based on mean PMI value for each Area/sex
<- df %>%
df_PMI_area_sex group_by(Area, sex) %>%
summarise(avg_PMI = mean(PMI, na.rm = T)) %>%
arrange(desc(avg_PMI))
df_PMI_area_sex
# A tibble: 12 × 3
# Groups: Area [6]
Area sex avg_PMI
<fct> <fct> <dbl>
1 BM44 M 9.25
2 BM36 M 9.16
3 BM10 M 9.11
4 BM22 M 8.69
5 DLPFC F 7.25
6 BM10 F 7.02
7 DLPFC M 6.99
8 BM36 F 6.70
9 TL F 6.66
10 TL M 6.64
11 BM22 F 6.64
12 BM44 F 6.60
Males BM44 samples have the higher mean of PMI.
I’m so proud of you to have reached this chapter. What’s next? Plotting! Data should be visualized for a better understand of them.