13 Aggregate data with tidyverse

After having seen how to organize and manipulate a dataframe, let’s start analyzing them! In particular, in this chapter we will see how to aggregate data based on values of one or more columns.

The first step, as always, is to load the packages and the data:

# 1. Load packages 
library(tidyverse)

# 2. Load data
df <- read.csv("data/Supplementary_table_13.csv")
head(df)
  SampleID Diagnosis Braak sex AOD PMI RIN CDR Area dataset
1 1005_TCX        AD     6   F  90   8 8.6  NA   TL    MAYO
2 1019_TCX        AD     6   F  86   4 7.8  NA   TL    MAYO
3 1029_TCX        AD     6   F  69   4 9.7  NA   TL    MAYO
4 1034_TCX        AD     6   F  88   5 8.9  NA   TL    MAYO
5 1045_TCX        AD     5   F  90  NA 8.4  NA   TL    MAYO
6 1046_TCX        AD     6   F  72   2 9.0  NA   TL    MAYO

Then, we will change the column types (as in last chapter):

df <- df %>%
  mutate("Diagnosis" = factor(Diagnosis, levels = c("Control", "AD")),
         "sex" = factor(sex),
         "Area" = factor(Area),
         "dataset" = factor(dataset))

str(df)
'data.frame':   1493 obs. of  10 variables:
 $ SampleID : chr  "1005_TCX" "1019_TCX" "1029_TCX" "1034_TCX" ...
 $ Diagnosis: Factor w/ 2 levels "Control","AD": 2 2 2 2 2 2 2 2 2 2 ...
 $ Braak    : num  6 6 6 6 5 6 5.5 6 6 5 ...
 $ sex      : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ AOD      : num  90 86 69 88 90 72 85 82 77 90 ...
 $ PMI      : num  8 4 4 5 NA 2 NA 15 NA NA ...
 $ RIN      : num  8.6 7.8 9.7 8.9 8.4 9 8 10 8.6 7.9 ...
 $ CDR      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Area     : Factor w/ 6 levels "BM10","BM22",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ dataset  : Factor w/ 3 levels "MAYO","MSBB",..: 1 1 1 1 1 1 1 1 1 1 ...

We are now ready to start our very first analyses!

Table

The very first function we will explore is table. It is used on vectors or dataframe columns and it returns the numerosity of each different value of them; when used for 2 vectors/columns, it returns a contingency matrix.
Let’s see:

table(df$sex)

  F   M 
890 603 

This is the simplest situation, it is not used so much (remember, summary does the same); but, it can be useful if combined with sum to get the fraction of each value. In fact, dividing the table for its sum, we can get the fraction of each value:

# 1. Create the table
tbl_sex <- table(df$sex)

# 2. Calculate the sum of the table
sum_tbl_sex <- sum(tbl_sex)

# 3. Calculate the fraction
tbl_sex / sum_tbl_sex

        F         M 
0.5961152 0.4038848 

So, ~ 60% of samples comes from females and ~ 40% from males. Yes, we could have just written table(df$sex)/sum(table(df$sex)), one line, less stored variables.
We can do the same with two columns, and you will find doing it a lot during your analysis:

tbl_sex_diagnosis <- table("sex" = df$sex, "Diagnosis" = df$Diagnosis)
tbl_sex_diagnosis
   Diagnosis
sex Control  AD
  F     341 549
  M     285 318

We can also specify names associated to the vectors/columns we use, and they will be displayed in the output. This can be usefule when dealing with two variables whose values are 0/1 or TRUE/FALSE in both.
And if I want to see the fraction of females and males that are AD or Control? Right, we can do it. If you try the previous approach, so dividing for the sum of the table, you will get the proportion of each category NOT divided by sex:

tbl_sex_diagnosis / sum(tbl_sex_diagnosis)
   Diagnosis
sex   Control        AD
  F 0.2283992 0.3677160
  M 0.1908908 0.2129940

To get what you really want, you have to use one of colSums or rowSums depending on what you want to sum. Let’s see how they work:

# 1. rowSums
tbl_sex_diagnosis_rowsum <- rowSums(tbl_sex_diagnosis)
tbl_sex_diagnosis_rowsum
  F   M 
890 603 
# 2. colSums
tbl_sex_diagnosis_colsum <- colSums(tbl_sex_diagnosis)
tbl_sex_diagnosis_colsum
Control      AD 
    626     867 

They perform the sum of the values in each row or the values in each column. So, we can use these values and divide our initial table to get the proportion of each sex value in each diagnosis and vice versa.

Remember: when dividing a table for a vector, it performs the division in this way: each element of the first row are divided for the first element of the vector, each element of the second row for the second element of the vector and so on.

tbl_sex_diagnosis / tbl_sex_diagnosis_rowsum
   Diagnosis
sex   Control        AD
  F 0.3831461 0.6168539
  M 0.4726368 0.5273632

We can see that ~38% of females are controls and ~62% of them have AD, while for men the proportion is quite balanced with ~47% of Control samples and ~53% of AD.
We can also check what is the proportion of males and females in each Diagnosis condition. To do so, we have to apply a little trick: in fact, we have to transpose our initial table, otherwise the division would not be applied as we want (on each level of Diagnosis), but as before (on each level of sex). We then transpose the result to obtain a table with the same orientation as the initial one.

t(t(tbl_sex_diagnosis) / tbl_sex_diagnosis_colsum)
   Diagnosis
sex   Control        AD
  F 0.5447284 0.6332180
  M 0.4552716 0.3667820

And here it is: ~54% of controls and ~63% of AD are females.

Group by variables and summarize values of another

The previous function is wonderful also to present data, but you will find yourself dealing with contingency table with more that 2 variables, so here I present you one of the greatest tool for analyzing dataframe: the magical combination of group_by() and summarize() from tidyverse. Let’s see an example and then analyze how it works:

# Suppress summarise info
options(dplyr.summarise.inform = FALSE)

# creating a new df with summarized counts
df_grouped <- df %>%
  group_by(sex, Diagnosis, dataset) %>%
  summarise(counts = n())

df_grouped
# A tibble: 12 × 4
# Groups:   sex, Diagnosis [4]
   sex   Diagnosis dataset counts
   <fct> <fct>     <fct>    <int>
 1 F     Control   MAYO        37
 2 F     Control   MSBB       186
 3 F     Control   ROSMAP     118
 4 F     AD        MAYO        49
 5 F     AD        MSBB       353
 6 F     AD        ROSMAP     147
 7 M     Control   MAYO        41
 8 M     Control   MSBB       170
 9 M     Control   ROSMAP      74
10 M     AD        MAYO        33
11 M     AD        MSBB       221
12 M     AD        ROSMAP      64
Great! We now have the number of samples for each combination of sex, Diagnosis and dataset.

It is important to understand how this code works:

  1. options(dplyr.summarise.inform = FALSE): it is a option you can insert at the beginning of a markdown/script, after loading packages, that disable some useless messages when using summarize after group_by
  2. group_by(sex, Diagnosis, dataset): it tells dplyr (a tidyverse package) which groups are defined in this dataframe. In this case, we have 12 groups (2 sex * 2 Diagnosis * 3 dataset)
  3. summarise(): takes as input a dataframe with pre-defined groups and performs action on each of them
  4. counts = n(): creates a new column called counts filled by the result of function n(), which calculates the number of elements in each group

Wow, this is cool… but we can do more things with these functions: in fact, we can use other functions inside summarize() and pass a column in them; the column is divided into pre-defined group and the function is run on each of them.
For example, let’s calculate the mean and sd values of PMI for each combination of sex and Diagnosis:

df %>%
  group_by(sex, Diagnosis) %>%
  summarise(n = n(), mean = mean(PMI), sd = sd(PMI))
# A tibble: 4 × 5
# Groups:   sex [2]
  sex   Diagnosis     n  mean    sd
  <fct> <fct>     <int> <dbl> <dbl>
1 F     Control     341    NA    NA
2 F     AD          549    NA    NA
3 M     Control     285    NA    NA
4 M     AD          318    NA    NA

Oh, what happened? Why are they all NA? That’s because if a vector contains NAs, functions like mean, min, max, sd etc returns NA, unless na.rm = T is provided to the functions, which remove NAs from the original vector.

df %>%
  group_by(sex, Diagnosis) %>%
  summarise(n = n(), mean = mean(PMI, na.rm = T), sd = sd(PMI, na.rm = T))
# A tibble: 4 × 5
# Groups:   sex [2]
  sex   Diagnosis     n  mean    sd
  <fct> <fct>     <int> <dbl> <dbl>
1 F     Control     341  7.98  5.81
2 F     AD          549  6.21  4.26
3 M     Control     285  9.68  6.99
4 M     AD          318  7.10  4.72

Great! We can see that AD samples have a lower PMI value in both males and females. This can already be an indication that some differences are present.
I guarantee you will find yourself doing this a lot in your life.
To conclude this section, let’s use these functions to recreate the proportion table we calculate in the previous section:

df %>%
  group_by(sex, Diagnosis) %>%
  summarize(counts = n()) %>%
  group_by(sex) %>%
  reframe(Diagnosis = Diagnosis, proportion = counts/sum(counts))
# A tibble: 4 × 3
  sex   Diagnosis proportion
  <fct> <fct>          <dbl>
1 F     Control        0.383
2 F     AD             0.617
3 M     Control        0.473
4 M     AD             0.527

Just a couple of considerations:

  • After the first summarize, we have grouped again but only by sex, to calculate the proportion of each value of Diagnosis in each sex
  • We used reframe instead of summarize just because summarize would have give a warning, but it acts the same in this scenario
  • We insert Diagnosis = Diagnosis in reframe because otherwise we would have lost this information. Remember: summarize and reframe would return only the columns of the last group_by and the ones calculated in them.

Expand dataframe column using pivot wider

Even though the previous code works fine, I know that comparing in a table-like structure is better sometimes. So here is a function that can help us in the interpretation and in the presentation of the results: pivot_larger.
It is easier to show than to explain:

df %>%
  group_by(sex, Diagnosis) %>%
  summarize(counts = n()) %>%
  group_by(sex) %>%
  reframe(Diagnosis = Diagnosis, proportion = counts/sum(counts)) %>%
  pivot_wider(id_cols = sex, names_from = Diagnosis, values_from = proportion)
# A tibble: 2 × 3
  sex   Control    AD
  <fct>   <dbl> <dbl>
1 F       0.383 0.617
2 M       0.473 0.527
That’s exactly the same as before!

Code explanation:

  • id_cols is used to define the columns used as “rows”
  • names_from is used to define the columns that defines column groups
  • values_from is used to define the columns used to fill the table

Let’s see another example, because the previous one could have been written in 3 lines of code as before.

df_wider <- df %>%
  group_by(sex, Diagnosis, dataset) %>%
  summarise(n = n(), mean = mean(PMI, na.rm = T), sd = sd(PMI, na.rm = T)) %>%
  pivot_wider(id_cols = c(sex, dataset), names_from = Diagnosis, values_from = c(mean, sd), names_vary = "slowest")

df_wider
# A tibble: 6 × 6
# Groups:   sex [2]
  sex   dataset mean_Control sd_Control mean_AD sd_AD
  <fct> <fct>          <dbl>      <dbl>   <dbl> <dbl>
1 F     MAYO            7          7.69    6.34  5.80
2 F     MSBB            8.60       5.79    5.76  3.91
3 F     ROSMAP          7.25       5.18    7.25  4.51
4 M     MAYO            5.51       6.42    8.6   5.83
5 M     MSBB           11.6        7.44    7.11  4.74
6 M     ROSMAP          7.34       4.05    6.58  4.24

Super cool! In this case we used two variables to define rows and two variables to fill values.

Reshape a dataframe with pivot_longer

Although the previous code returns a cool table, it is suitable only to present data, not for any further analysis. As you may remember from the exercise 12.1, this is not how raw data should be organized.
If you find yourself in a situation in which raw data are like that, use pivot_longer() to fix it.

df_wider %>%
  pivot_longer(cols = mean_Control:sd_AD, names_to = c("statistics", "Diagnosis"), names_sep = "_")
# A tibble: 24 × 5
# Groups:   sex [2]
   sex   dataset statistics Diagnosis value
   <fct> <fct>   <chr>      <chr>     <dbl>
 1 F     MAYO    mean       Control    7   
 2 F     MAYO    sd         Control    7.69
 3 F     MAYO    mean       AD         6.34
 4 F     MAYO    sd         AD         5.80
 5 F     MSBB    mean       Control    8.60
 6 F     MSBB    sd         Control    5.79
 7 F     MSBB    mean       AD         5.76
 8 F     MSBB    sd         AD         3.91
 9 F     ROSMAP  mean       Control    7.25
10 F     ROSMAP  sd         Control    5.18
# ℹ 14 more rows

Let’s explain the code prior to adjust it a bit:

  • cols defines which value columns to take and reshape
  • mean_Control:sd_AD means “from column mean_Control to column sd_AD”
  • names_to is used to define names for the new columns that stores the column names defined in cols. We set “statistics” and “Diagnosis”
  • names_sep is used when names_to has more that one value, it splits the names based on the given pattern (in our case, as the columns were like statistics_Diagnosis we used “_“)

Now, as it is a bit confusing having statistics column, we can combine the two pivot_ functions in this way:

df_wider %>%
  pivot_longer(cols = mean_Control:sd_AD, names_to = c("statistics", "Diagnosis"), names_sep = "_") %>%
  pivot_wider(id_cols = c(sex, dataset, Diagnosis), names_from = statistics, values_from = value)
# A tibble: 12 × 5
# Groups:   sex [2]
   sex   dataset Diagnosis  mean    sd
   <fct> <fct>   <chr>     <dbl> <dbl>
 1 F     MAYO    Control    7     7.69
 2 F     MAYO    AD         6.34  5.80
 3 F     MSBB    Control    8.60  5.79
 4 F     MSBB    AD         5.76  3.91
 5 F     ROSMAP  Control    7.25  5.18
 6 F     ROSMAP  AD         7.25  4.51
 7 M     MAYO    Control    5.51  6.42
 8 M     MAYO    AD         8.6   5.83
 9 M     MSBB    Control   11.6   7.44
10 M     MSBB    AD         7.11  4.74
11 M     ROSMAP  Control    7.34  4.05
12 M     ROSMAP  AD         6.58  4.24

Tadaaaaa!


I think this is enough for this chapter, here are some exercise to practice these functions.

Exercises

Exercise 13.1 This is a multi-question exercise, that resemble a real-life analysis.

Considering our initial dataframe:

  1. Are there differences between datasets in RIN value? consider max, min and mean
  2. Which Area has the highest value of AOD? And, is there difference between sex in these values?
  3. How many NAs are present in CDR for each combination of Diagnosis and dataset?
  4. Sort the dataframe based on the average vale of PMI for each area, based also on sex
Solution

We will answer one question at a time.

1.

# 1. Calculate mean, min and max values of RIN for each dataset
df_RIN_dataset <- df %>%
  group_by(dataset) %>%
  summarise(min = min(RIN, na.rm = T), mean = mean(RIN, na.rm = T), max = max(RIN, na.rm = T))
  
df_RIN_dataset
# A tibble: 3 × 4
  dataset   min  mean   max
  <fct>   <dbl> <dbl> <dbl>
1 MAYO      5.3  8.11  10  
2 MSBB      1    6.23  10  
3 ROSMAP    5    7.11   9.9

Yes, it seems that MAYO dataset is the one with better RIN values, while MSBB has the worst samples.

2.

# 1. Calculate max value of AOD for each dataset
df_AOD_dataset <- df %>%
  group_by(dataset, sex) %>%
  summarise(max_AOD = max(AOD, na.rm = T)) %>%
  pivot_wider(id_cols = dataset, names_from = sex, values_from = max_AOD)

df_AOD_dataset
# A tibble: 3 × 3
# Groups:   dataset [3]
  dataset     F     M
  <fct>   <dbl> <dbl>
1 MAYO       90    90
2 MSBB       90    90
3 ROSMAP     90    90

No, there is not difference between dataset, even considering sex.

3.

# 1. Calculate number of CDR NAs for each Diagnosis/dataset combination
df_nas_diag_dataset <- df %>%
  group_by(Diagnosis, dataset) %>%
  summarise(n_nas = sum(is.na(CDR))) %>%
  pivot_wider(id_cols = Diagnosis, names_from = dataset, values_from = n_nas)

df_nas_diag_dataset
# A tibble: 2 × 4
# Groups:   Diagnosis [2]
  Diagnosis  MAYO  MSBB ROSMAP
  <fct>     <int> <int>  <int>
1 Control      78     0    192
2 AD           82     0    211

It seems that people that have filled MSBB dataset have done a better job in collecting data than the others ahahah

Remember: is.na() returns a boolean vector, and the sum of a boolean vector is the number of TRUE values in it.

4.

# 1. Sort dataframe based on mean PMI value for each Area/sex
df_PMI_area_sex <- df %>%
  group_by(Area, sex) %>%
  summarise(avg_PMI = mean(PMI, na.rm = T)) %>%
  arrange(desc(avg_PMI))

df_PMI_area_sex
# A tibble: 12 × 3
# Groups:   Area [6]
   Area  sex   avg_PMI
   <fct> <fct>   <dbl>
 1 BM44  M        9.25
 2 BM36  M        9.16
 3 BM10  M        9.11
 4 BM22  M        8.69
 5 DLPFC F        7.25
 6 BM10  F        7.02
 7 DLPFC M        6.99
 8 BM36  F        6.70
 9 TL    F        6.66
10 TL    M        6.64
11 BM22  F        6.64
12 BM44  F        6.60

Males BM44 samples have the higher mean of PMI.

I’m so proud of you to have reached this chapter. What’s next? Plotting! Data should be visualized for a better understand of them.