9 Matrices

Here we are at the second basic form of data organization in R, and the most important one for expression data: the matrix.
As we have seen in the previous chapter, sometimes, especially for expression data, there is the need of having a more complex data structure than the vector. Matrices are 2-dimensional data objects containing the same type of data: usually we have features (genes, proteins, ecc) on rows and samples on columns.

Create a matrix

There are many ways to create a matrix, we will briefly see most of them (the ones we are going to use in real life).
The first one, that is not so used (I know I just told you the opposite, but…) is to change a vector into a matrix using the function matrix() and setting the number :

# 1. create a vector
my_vector <- 1:12
my_vector

 [1]  1  2  3  4  5  6  7  8  9 10 11 12

# 2. create the matrix
my_matrix <- matrix(my_vector, nrow = 3)
my_matrix

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

There are few things to notice here:

We can set nrow or ncol
If the length of the vector is not a multiple of nrow or ncol, it starts to repeat the vector from the beginning to fill the matrix (try using matrix(my_vector, nrow = 5))
It fills the matrix by column by default, we can change the behavior setting byrow = T, try it

Binding

Another method, usually used, is to “bind” two or more vectors to create a matrix through rbind() or cbind(). It is easier with an example.
Do you remember in exercise 8.2 that we had three patients’ expression levels in three different vectors?! Now we can comine them into a matrix in this way:

# 1. Create the vectors
patient1 <- c("SEC24C" = 12, "CDH7" = 1, "LDB2" = 13, "SEM3A" = 16, "FEZF2" = 21, "NTMT1" = 43, "BIP" = 29, "HOMER" = 22)
patient2 <- c("SEC24C" = 2, "CDH7" = 11, "LDB2" = 13, "SEM3A" = 22, "FEZF2" = 21, "NTMT1" = 31, "BIP" = 12, "HOMER" = 8)
patient3 <- c("SEC24C" = 8, "CDH7" = 3, "LDB2" = 22, "SEM3A" = 14, "FEZF2" = 13, "NTMT1" = 45, "BIP" = 37, "HOMER" = 2)

# 2. Bind vectors
r_patients <- rbind(patient1, patient2, patient3)
c_patients <- cbind(patient1, patient2, patient3)

r_patients

         SEC24C CDH7 LDB2 SEM3A FEZF2 NTMT1 BIP HOMER
patient1     12    1   13    16    21    43  29    22
patient2      2   11   13    22    21    31  12     8
patient3      8    3   22    14    13    45  37     2

c_patients

       patient1 patient2 patient3
SEC24C       12        2        8
CDH7          1       11        3
LDB2         13       13       22
SEM3A        16       22       14
FEZF2        21       21       13
NTMT1        43       31       45
BIP          29       12       37
HOMER        22        8        2

Wow, those matrices are really cool! Let’s analyze the code:

First of all, we have modified the vectors to be the same size and with the same order for the genes. This is crucial, otherwise we will mess up/li>
What would have been messed up? The column names in the first matrix and the row names in the second matrix. We will see these two features in a while. Just keep in mind that when you have to combine different named vectors, they should be in the same order. Try it with vectors not in the same order and see
rbind() combine by rows, putting each vector in a row, while cbind does the opposite

Transform into a matrix

One of the most used function to create a matrix is to transform a data.frame (next chapter) into a matrix using the function as.matrix().
Let’s use mtcars, a data.frame available in R, and transform it in a matrix:

# 1. See class of mtcars
class(mtcars)

[1] "data.frame"

# 2. Transform to a matrix
mt_mat <- as.matrix(mtcars)

# 3. See class of new variable
class(mt_mat)

[1] "matrix" "array"

Here we introduced the concept of class in R. This is not important for us at the moment, neither will be in the future. Just think this way: typeof() is for the type of data inside a variable while the class is the “type” of the structure of the variable.

Ok, now that we’ve seen how to create a matrix, we have to dig into some important concepts.

Rownames and colnames

Matrices can have row names and column names (think as the extension of a named vector). We have already seen these features in [#Binding], and here we will see them in details.
Let’s start by visualizing the first matrix we have created:

my_matrix

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

We can extract row names and column names with rownames() and colnames() functions, respectively:

rownames(my_matrix)

NULL

colnames(my_matrix)

NULL

They have not been set, as we can see by the fact that the output of these functions is NULL and that when we printed the matrix we had “[,1] [,2] [,3] [,4]”.

Set and edit row and column names

You have to think at row and column names as simple vectors, so we can create two vectors (on for rows and one for columns) and assign them to row and column names. Remember: they must be the same length as the rows and the columns.

# 1. Create vectors
my_row_names <- c("First row", "Second row", "Third row")
my_col_names <- c("First column", "Second column", "Third column", "Fourth column")

# 2. Assign vectors to row and column names
rownames(my_matrix) <- my_row_names
colnames(my_matrix) <- my_col_names

# 3. Print result
my_matrix

           First column Second column Third column Fourth column
First row             1             4            7            10
Second row            2             5            8            11
Third row             3             6            9            12

rownames(my_matrix)

[1] "First row"  "Second row" "Third row"

colnames(my_matrix)

[1] "First column"  "Second column" "Third column"  "Fourth column"

Tadaaa. Now we have a more complete matrix. As row names and column names are vectors, you can easily change them all (or part of them). For example:

# 1. Change third row names
rownames(my_matrix)[3] <- "3rd row"

# 2. Print
my_matrix

           First column Second column Third column Fourth column
First row             1             4            7            10
Second row            2             5            8            11
3rd row               3             6            9            12

Indexing

As for vectors, we can extract values of the matrix by indexing: we can get a single value, a vector of values and either smaller matrices.

Slicing

The first method is similar to vector slicing, the only exception here is that we have to set a value for the row/s and one for the column/s in the form [row/s, column/s].
Here some examples:

# Get only a value, from second row and third column
my_matrix[2, 3]

[1] 8

# Get a vector of values, from first and third row and second column
my_matrix[c(1, 3), 2]

First row   3rd row 
        4         6

# Get a matrix, from second and third row and second and third column
my_matrix[2:3, 2:3]

           Second column Third column
Second row             5            8
3rd row                6            9

# Get ALL elements of a row (leave the column part empty)
my_matrix[2:3,]

           First column Second column Third column Fourth column
Second row            2             5            8            11
3rd row               3             6            9            12

You can use every slicing technique seen in slicing section of vectors.

Using names

If we have set rownames and/or column names, we can use them to index a matrix (similarly to named vectors), in the form [row/s, column/s].

# Get only a value, from second row and third column
my_matrix["Second row", "Third column"]

[1] 8

# Get a vector of values, from first and third row and second column
my_matrix[c("First row", "3rd row"), "Second column"]

First row   3rd row 
        4         6

# Get a matrix, from second and third row and second and third column
my_matrix[c("Second row", "3rd row"), c("Second column", "Third column")]

           Second column Third column
Second row             5            8
3rd row                6            9

# Get ALL elements of a row (leave the column part empty)
my_matrix[c("Second row", "3rd row"),]

           First column Second column Third column Fourth column
Second row            2             5            8            11
3rd row               3             6            9            12

Here, you can’t use expansions (:) to select from row with name X to row with name X+3, it is only possible with numbers. In fact, if you remember, the command n:n+m create a vector of numbers from n to n + m. Try it!

Using logicals

Again, as for vectors we can use logicals to extrapolate values of a matrix.
Let’s say we want to get the names of the rows in which the third column has a odd number, we can do as follow:

# 1. Create a boolean vector
is_odd <- my_matrix[, "Third column"] %% 2 == 1

# 2. Slice the matrix based on the boolean vector
sliced_odd <- my_matrix[is_odd,]

# 3. Get row names
rownames(sliced_odd)

[1] "First row" "3rd row"

This will be so useful when talking about expression matrix (e.g. when we want to extrapolate genes that have tot expression in patient X or Y).

Functions for all matrix

Alright, it is time for some practical applications to explore fundamental matrix functions.

Head/Tail

The first functions are head() and tail(). Let’s say we receive an expression table but they didn’t tell us much about it; we can use these two functions to look up the first and the last rows respectively.
Here is how to use them:

# 1. Load data (first time we see it)
expr_data <- read.csv(file = "data/All_counts.csv", # name of the file
                      header = T # usually column names are in the first row
                      )

# 2. Type of
class(expr_data)

[1] "data.frame"

# 3. Head
head(expr_data)

          X wt1 wt2 wt3 wt4 wt5 wt6 ko1 ko2 ko3 ko4 ko5 ko6
1   DDX11L1   4   1   2   1   1   1   0   6   3   1   0   2
2    WASH7P 421  46 342 345 192  18 233 217 125 313 321 726
3 MIR6859-3   0   0   0   0   0   0   0   0   0   0   0   0
4 MIR6859-2   0   0   0   0   0   0   0   0   0   0   0   0
5 MIR6859-1   0   0   0   0   0   0   0   0   0   0   0   0
6 MIR6859-4   0   0   0   0   0   0   0   0   0   0   0   0

Ok, this is a brief example in which we have loaded a data matrix with the function read.csv() (as we know that the file is a .csv, but there are plenty of functions for all different type of files). We then checked the class of the expr_data and we get data.frame; lastly we used head() to see the structure of the first rows and we can say that:

Gene symbols are in a column
We have 6 wt samples and 6 ko samples
Probably these are row counts as we do not have decimal values

We can edit the previous code to fix some issues we encounter

# 1. Gene names as rownames
expr_data <- read.csv(file = "data/All_counts.csv",
                      header = T,
                      row.names = 1 # first column
                      )

# 2. Transform to matrix
expr_data <- as.matrix(expr_data)

# Check
class(expr_data)

[1] "matrix" "array"

head(expr_data, n = 10) # we can specify the number of rows we want

           wt1 wt2 wt3 wt4 wt5 wt6 ko1 ko2 ko3 ko4 ko5 ko6
DDX11L1      4   1   2   1   1   1   0   6   3   1   0   2
WASH7P     421  46 342 345 192  18 233 217 125 313 321 726
MIR6859-3    0   0   0   0   0   0   0   0   0   0   0   0
MIR6859-2    0   0   0   0   0   0   0   0   0   0   0   0
MIR6859-1    0   0   0   0   0   0   0   0   0   0   0   0
MIR6859-4    0   0   0   0   0   0   0   0   0   0   0   0
MIR1302-2    0   0   0   0   0   0   0   0   0   0   0   0
MIR1302-11   0   0   0   0   0   0   0   0   0   0   0   0
MIR1302-9    0   0   0   0   0   0   0   0   0   0   0   0
MIR1302-10   0   0   0   0   0   0   0   0   0   0   0   0

Ok, fine. We are ready to go and explore our first expression matrix.

Dim, nrow, ncol

Head and tail are great friends, but you always want to know the dimensions of the matrix. It can be done using dim() function, which returns the number of rows and columns (this order!), or nrow() and ncol().

# dim
dim(expr_data)

[1] 26475    12

# nrow
nrow(expr_data)

[1] 26475

# ncol
ncol(expr_data)

[1] 12

We see that our matrix has 26475 rows and 12 columns.

Summary and transpose

After having seen the basic structure of the first and last lines, the number of rows and columns, we usually want to have some info about the distribution of the genes, the counts ecc.
Here comes the function summary():

summary(expr_data)

      wt1                wt2                wt3                wt4             wt5          
 Min.   :     0.0   Min.   :     0.0   Min.   :     0.0   Min.   :    0   Min.   :     0.0  
 1st Qu.:     0.0   1st Qu.:     0.0   1st Qu.:     0.0   1st Qu.:    0   1st Qu.:     0.0  
 Median :    19.0   Median :    16.0   Median :    16.0   Median :   15   Median :    16.0  
 Mean   :   411.4   Mean   :   324.4   Mean   :   326.7   Mean   :  274   Mean   :   317.3  
 3rd Qu.:   272.0   3rd Qu.:   214.0   3rd Qu.:   218.0   3rd Qu.:  186   3rd Qu.:   212.0  
 Max.   :327719.0   Max.   :536733.0   Max.   :377069.0   Max.   :90474   Max.   :352034.0  
      wt6                ko1               ko2                ko3                ko4          
 Min.   :     0.0   Min.   :    0.0   Min.   :     0.0   Min.   :     0.0   Min.   :     0.0  
 1st Qu.:     0.0   1st Qu.:    0.0   1st Qu.:     0.0   1st Qu.:     0.0   1st Qu.:     0.0  
 Median :    17.0   Median :   13.0   Median :    18.0   Median :    19.0   Median :    17.0  
 Mean   :   316.6   Mean   :  250.1   Mean   :   378.2   Mean   :   351.4   Mean   :   373.8  
 3rd Qu.:   224.0   3rd Qu.:  179.0   3rd Qu.:   245.0   3rd Qu.:   246.0   3rd Qu.:   255.0  
 Max.   :209954.0   Max.   :61650.0   Max.   :641616.0   Max.   :497463.0   Max.   :236803.0  
      ko5               ko6          
 Min.   :    0.0   Min.   :     0.0  
 1st Qu.:    0.0   1st Qu.:     0.0  
 Median :   17.0   Median :    19.0  
 Mean   :  319.8   Mean   :   423.4  
 3rd Qu.:  233.0   3rd Qu.:   288.0  
 Max.   :62660.0   Max.   :338679.0

For each column it returns some descriptive. But, ehi, on the columns we have the samples, we want this statistics for the genes, how can we do it? Here comes t() that transpose the matrix so that rows became columns and vice versa.

expr_data_t <- t(expr_data)

expr_data_t[1:5, 1:8]

    DDX11L1 WASH7P MIR6859-3 MIR6859-2 MIR6859-1 MIR6859-4 MIR1302-2 MIR1302-11
wt1       4    421         0         0         0         0         0          0
wt2       1     46         0         0         0         0         0          0
wt3       2    342         0         0         0         0         0          0
wt4       1    345         0         0         0         0         0          0
wt5       1    192         0         0         0         0         0          0

Perfect, now we can do summary again (we do a subset of the data because we don’t want to do it for all genes, as it will be huge).

set.seed(123)

# 1. Get random numbers to select random genes
r_numb <- runif(n = 7, min = 1, max = ncol(expr_data_t))

# 2. Floor numbers to get integers
r_numb <- floor(r_numb)

# 3. Get the summary of sliced expr_data_t
summary(expr_data_t[, r_numb])

      GPX2            BTN3A3           KRT17         USP17L2        PHF19            DBT        
 Min.   :0.0000   Min.   :  5.00   Min.   :0.00   Min.   :0.0   Min.   : 10.0   Min.   :  23.0  
 1st Qu.:0.0000   1st Qu.: 15.00   1st Qu.:0.00   1st Qu.:0.0   1st Qu.: 20.0   1st Qu.: 415.2  
 Median :0.5000   Median : 24.00   Median :0.00   Median :0.0   Median : 97.0   Median : 753.0  
 Mean   :0.5833   Mean   : 35.67   Mean   :0.25   Mean   :0.5   Mean   :112.2   Mean   : 642.8  
 3rd Qu.:1.0000   3rd Qu.: 32.50   3rd Qu.:0.00   3rd Qu.:1.0   3rd Qu.:168.5   3rd Qu.: 938.2  
 Max.   :2.0000   Max.   :150.00   Max.   :2.00   Max.   :3.0   Max.   :324.0   Max.   :1107.0  
     PPM1B       
 Min.   :  73.0  
 1st Qu.: 392.5  
 Median : 562.0  
 Mean   : 632.7  
 3rd Qu.: 838.0  
 Max.   :1466.0

Here we have some interesting descriptive for each gene: we can see, for example, that DBT has more counts than GPX2, or that PPM1B has the higher Max. counts of all these 7 genes.
There are two new functions here: set.seed and runif. Let’s start from the latter, it returns n random number uniformly distributed between min and max included. Do you understand why I used max = ncol(expr_data_t)? I’m sure you do. But to be sure that it is clear, I did it because I don’t want to hard code (insert the real number) of the number of columns of the expression matrix, because if I change matrix I have to change this number as well. So, I let R calculate it for me. As it is random, in order to let you have the same number as me, I set a seed. The theory behind “randomness” in computer science is complex, so I’m not going to deep it. If you want, there are plenty of resources on the internet on this topic.
Here, just an example of the use of the seed.

# 1. Two random vectors of 5 numbers without set.seed
no_seed1 <- runif(n = 5, min = 1, max = 20)
no_seed2 <- runif(n = 5, min = 1, max = 20)

# 2. Two random vectors setting the seed
set.seed(123) # here you can place any number, each generate a unique random series
with_seed1 <- runif(n = 5, min = 1, max = 20)
set.seed(123) # here you can place any number, each generate a unique random series
with_seed2 <- runif(n = 5, min = 1, max = 20)

print(no_seed1)

[1] 17.955962 11.477265  9.675680 19.179834  9.613349

print(no_seed2)

[1] 13.873842 11.880035  2.955569 18.096674  5.675667

print(with_seed1)

[1]  6.463973 15.977798  8.770562 17.777331 18.868878

print(with_seed2)

[1]  6.463973 15.977798  8.770562 17.777331 18.868878

Great! Remember, setting the seed is fundamental in analysis in which randomness take place, so they will always return the same results.

Sum, Max, Min, …

Alright, there are plenty of functions that returns useful values of all the data in a matrix, such as the max value, the min one, the sum of all the values, the mean of all the values etc.
There is nothing easier than watch them in action.

# 1. create a random value matrix
set.seed(214)
my_matrix2 <- matrix(runif(n = 24, min = 1, max = 20), nrow = 4)
my_matrix2

          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
[1,]  7.079653 17.000935 12.786656 15.687742  1.260719  1.082183
[2,] 18.702227 14.015268  9.345658  4.225409  9.341111 11.385358
[3,] 10.809797 19.592980 13.632435 13.774924 19.548749  5.461854
[4,]  8.975963  7.808112 15.024569  6.271463  6.136042 16.151705

# 2. Calculate sum
my_sum <- sum(my_matrix2)
print(paste("Sum of the values of my_matrix2:", my_sum))

[1] "Sum of the values of my_matrix2: 265.101512494963"

# 3. Calculate min
my_min <- min(my_matrix2)
print(paste("Min of the values of my_matrix2:", my_min))

[1] "Min of the values of my_matrix2: 1.08218298689462"

# 4. Calculate max
my_max <- max(my_matrix2)
print(paste("Max of the values of my_matrix2:", my_max))

[1] "Max of the values of my_matrix2: 19.5929795864504"

# 5. Calculate mean
my_mean <- mean(my_matrix2)
print(paste("Mean of the values of my_matrix2:", my_mean))

[1] "Mean of the values of my_matrix2: 11.0458963539568"

Here, we calculate these stats for all the data in the matrix, but if I want to calculate it by rows or columns? Here it is

Apply a function to all rows or columns

You can easily think about summary function to answer the previous question. Right, I’ll give it to you, but how if you want to handle the result? How about having only the vector of the mean values for each row? This is achieved through apply. Let’s see it in action, then I’ll explain it.

# 1. Calculate the mean for each column
mean_col <- apply(X = expr_data_t[, r_numb], MARGIN = 1, FUN = mean)

# 2. Calculate the mean for each column
mean_row <- apply(X = expr_data_t[, r_numb], MARGIN = 2, FUN = mean)

mean_col

     wt1      wt2      wt3      wt4      wt5      wt6      ko1      ko2      ko3      ko4      ko5      ko6 
207.1429 178.2857 149.7143 101.8571 202.4286 209.0000 208.5714 231.8571 141.4286 225.2857 243.2857 343.4286

mean_row

       GPX2      BTN3A3       KRT17     USP17L2       PHF19         DBT       PPM1B 
  0.5833333  35.6666667   0.2500000   0.5000000 112.1666667 642.8333333 632.6666667

See? It’s super easy. It requires the matrix (X), the function (mean), and the margin used to calculate over (1 for rows, 2 for columns). You can apply all the functions that take as input numeric vectors.
If the matrix has row names or column names set, it returns a named vactor, otherwise it returns a normal vector.

Exercises

Now it’s you turn to familiarize with matrices.

Exercise 9.1 You have collected expression levels of 4 genes of interest of 3 patients. Calculate the sum of the read for each patient, in order to normalize the values in further analysis.
I’ll give you the data.

CXCR4 <- c("P1" = 25, "P2" = 12, "P3" = 14)
LCT <- c("P3" = 12, "P2" = 10, "P1" = 17)
PTPN7 <- c("P1" = 2, "P3" = 3, "P2" = 4)
LHX9 <- c("P2" = 20, "P1" = 28, "P3" = 11)

Solution

# 1. Reorder the vectors to have the same order of patients
LCT <- LCT[names(CXCR4)]
PTPN7 <- PTPN7[names(CXCR4)]
LHX9 <- LHX9[names(CXCR4)]

# 2. Create the matrix
expr_mat_pat <- rbind(CXCR4, LCT, PTPN7, LHX9)

# 3. Calculate the sum of the reads for each patient
read_sum_pat <- apply(expr_mat_pat, MARGIN = 2, mean)

print(read_sum_pat)

  P1   P2   P3 
18.0 11.5 10.0

Exercise 9.2 We know that genes which mean is < 5, they should be excluded from the analysis (it is an example). Re-do the calculation of exercise 9.1 excluding those genes.

Solution

# 1. Calculate the mean of the reads for each gene
read_sum_gene <- apply(expr_mat_pat, MARGIN = 1, mean)

# 2. Get the name of the genes to keep
gene_2_keep <- names(read_sum_gene)[read_sum_gene > 5]

# 3. Use genes_2_keep to subset matrix and calculate sum
read_sum_pat_filt <- apply(expr_mat_pat[gene_2_keep, ], MARGIN = 2, mean)

print(read_sum_pat_filt)

      P1       P2       P3 
23.33333 14.00000 12.33333

Great, another big chapter has come to an end. We are closer and closer to your real world applications. For your expression (mRNA or proteins) data you could already start some analysis, but wait for the next chapter, in which we will talk about data.frame.