9 Matrices
Here we are at the second basic form of data organization in R, and the most important one for expression data: the matrix.
As we have seen in the previous chapter, sometimes, especially for expression data, there is the need of having a more complex data structure than the vector. Matrices are 2-dimensional data objects containing the same type of data: usually we have features (genes, proteins, ecc) on rows and samples on columns.
Create a matrix
There are many ways to create a matrix, we will briefly see most of them (the ones we are going to use in real life).
The first one, that is not so used (I know I just told you the opposite, but…) is to change a vector into a matrix using the function matrix()
and setting the number :
# 1. create a vector
<- 1:12
my_vector my_vector
[1] 1 2 3 4 5 6 7 8 9 10 11 12
# 2. create the matrix
<- matrix(my_vector, nrow = 3)
my_matrix my_matrix
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
There are few things to notice here:
-
We can set
nrow
orncol
-
If the length of the vector is not a multiple of nrow or ncol, it starts to repeat the vector from the beginning to fill the matrix (try using
matrix(my_vector, nrow = 5)
) -
It fills the matrix by column by default, we can change the behavior setting
byrow = T
, try it
Binding
Another method, usually used, is to “bind” two or more vectors to create a matrix through rbind()
or cbind()
. It is easier with an example.
Do you remember in exercise 8.2 that we had three patients’ expression levels in three different vectors?! Now we can comine them into a matrix in this way:
# 1. Create the vectors
<- c("SEC24C" = 12, "CDH7" = 1, "LDB2" = 13, "SEM3A" = 16, "FEZF2" = 21, "NTMT1" = 43, "BIP" = 29, "HOMER" = 22)
patient1 <- c("SEC24C" = 2, "CDH7" = 11, "LDB2" = 13, "SEM3A" = 22, "FEZF2" = 21, "NTMT1" = 31, "BIP" = 12, "HOMER" = 8)
patient2 <- c("SEC24C" = 8, "CDH7" = 3, "LDB2" = 22, "SEM3A" = 14, "FEZF2" = 13, "NTMT1" = 45, "BIP" = 37, "HOMER" = 2)
patient3
# 2. Bind vectors
<- rbind(patient1, patient2, patient3)
r_patients <- cbind(patient1, patient2, patient3)
c_patients
r_patients
SEC24C CDH7 LDB2 SEM3A FEZF2 NTMT1 BIP HOMER
patient1 12 1 13 16 21 43 29 22
patient2 2 11 13 22 21 31 12 8
patient3 8 3 22 14 13 45 37 2
c_patients
patient1 patient2 patient3
SEC24C 12 2 8
CDH7 1 11 3
LDB2 13 13 22
SEM3A 16 22 14
FEZF2 21 21 13
NTMT1 43 31 45
BIP 29 12 37
HOMER 22 8 2
Wow, those matrices are really cool! Let’s analyze the code:
- First of all, we have modified the vectors to be the same size and with the same order for the genes. This is crucial, otherwise we will mess up/li>
- What would have been messed up? The column names in the first matrix and the row names in the second matrix. We will see these two features in a while. Just keep in mind that when you have to combine different named vectors, they should be in the same order. Try it with vectors not in the same order and see
-
rbind()
combine by rows, putting each vector in a row, whilecbind
does the opposite
Transform into a matrix
One of the most used function to create a matrix is to transform a data.frame
(next chapter) into a matrix using the function as.matrix()
.
Let’s use mtcars
, a data.frame
available in R, and transform it in a matrix:
# 1. See class of mtcars
class(mtcars)
[1] "data.frame"
# 2. Transform to a matrix
<- as.matrix(mtcars)
mt_mat
# 3. See class of new variable
class(mt_mat)
[1] "matrix" "array"
Here we introduced the concept of class
in R. This is not important for us at the moment, neither will be in the future. Just think this way: typeof()
is for the type of data inside a variable while the class
is the “type” of the structure of the variable.
Ok, now that we’ve seen how to create a matrix, we have to dig into some important concepts.
Rownames and colnames
Matrices can have row names and column names (think as the extension of a named vector). We have already seen these features in [#Binding], and here we will see them in details.
Let’s start by visualizing the first matrix we have created:
my_matrix
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
We can extract row names and column names with rownames()
and colnames()
functions, respectively:
rownames(my_matrix)
NULL
colnames(my_matrix)
NULL
They have not been set, as we can see by the fact that the output of these functions is NULL
and that when we printed the matrix we had “[,1] [,2] [,3] [,4]”.
Set and edit row and column names
You have to think at row and column names as simple vectors, so we can create two vectors (on for rows and one for columns) and assign them to row and column names. Remember: they must be the same length as the rows and the columns.
# 1. Create vectors
<- c("First row", "Second row", "Third row")
my_row_names <- c("First column", "Second column", "Third column", "Fourth column")
my_col_names
# 2. Assign vectors to row and column names
rownames(my_matrix) <- my_row_names
colnames(my_matrix) <- my_col_names
# 3. Print result
my_matrix
First column Second column Third column Fourth column
First row 1 4 7 10
Second row 2 5 8 11
Third row 3 6 9 12
rownames(my_matrix)
[1] "First row" "Second row" "Third row"
colnames(my_matrix)
[1] "First column" "Second column" "Third column" "Fourth column"
Tadaaa. Now we have a more complete matrix. As row names and column names are vectors, you can easily change them all (or part of them). For example:
# 1. Change third row names
rownames(my_matrix)[3] <- "3rd row"
# 2. Print
my_matrix
First column Second column Third column Fourth column
First row 1 4 7 10
Second row 2 5 8 11
3rd row 3 6 9 12
Indexing
As for vectors, we can extract values of the matrix by indexing: we can get a single value, a vector of values and either smaller matrices.
Slicing
The first method is similar to vector slicing, the only exception here is that we have to set a value for the row/s and one for the column/s in the form [row/s, column/s]
.
Here some examples:
# Get only a value, from second row and third column
2, 3] my_matrix[
[1] 8
# Get a vector of values, from first and third row and second column
c(1, 3), 2] my_matrix[
First row 3rd row
4 6
# Get a matrix, from second and third row and second and third column
2:3, 2:3] my_matrix[
Second column Third column
Second row 5 8
3rd row 6 9
# Get ALL elements of a row (leave the column part empty)
2:3,] my_matrix[
First column Second column Third column Fourth column
Second row 2 5 8 11
3rd row 3 6 9 12
You can use every slicing technique seen in slicing section of vectors.
Using names
If we have set rownames and/or column names, we can use them to index a matrix (similarly to named vectors), in the form [row/s, column/s]
.
# Get only a value, from second row and third column
"Second row", "Third column"] my_matrix[
[1] 8
# Get a vector of values, from first and third row and second column
c("First row", "3rd row"), "Second column"] my_matrix[
First row 3rd row
4 6
# Get a matrix, from second and third row and second and third column
c("Second row", "3rd row"), c("Second column", "Third column")] my_matrix[
Second column Third column
Second row 5 8
3rd row 6 9
# Get ALL elements of a row (leave the column part empty)
c("Second row", "3rd row"),] my_matrix[
First column Second column Third column Fourth column
Second row 2 5 8 11
3rd row 3 6 9 12
Here, you can’t use expansions (:
) to select from row with name X to row with name X+3, it is only possible with numbers. In fact, if you remember, the command n:n+m create a vector of numbers from n to n + m. Try it!
Using logicals
Again, as for vectors we can use logicals to extrapolate values of a matrix.
Let’s say we want to get the names of the rows in which the third column has a odd number, we can do as follow:
# 1. Create a boolean vector
<- my_matrix[, "Third column"] %% 2 == 1
is_odd
# 2. Slice the matrix based on the boolean vector
<- my_matrix[is_odd,]
sliced_odd
# 3. Get row names
rownames(sliced_odd)
[1] "First row" "3rd row"
This will be so useful when talking about expression matrix (e.g. when we want to extrapolate genes that have tot expression in patient X or Y).
Functions for all matrix
Alright, it is time for some practical applications to explore fundamental matrix functions.
Head/Tail
The first functions are head()
and tail()
. Let’s say we receive an expression table but they didn’t tell us much about it; we can use these two functions to look up the first and the last rows respectively.
Here is how to use them:
# 1. Load data (first time we see it)
<- read.csv(file = "data/All_counts.csv", # name of the file
expr_data header = T # usually column names are in the first row
)
# 2. Type of
class(expr_data)
[1] "data.frame"
# 3. Head
head(expr_data)
X wt1 wt2 wt3 wt4 wt5 wt6 ko1 ko2 ko3 ko4 ko5 ko6
1 DDX11L1 4 1 2 1 1 1 0 6 3 1 0 2
2 WASH7P 421 46 342 345 192 18 233 217 125 313 321 726
3 MIR6859-3 0 0 0 0 0 0 0 0 0 0 0 0
4 MIR6859-2 0 0 0 0 0 0 0 0 0 0 0 0
5 MIR6859-1 0 0 0 0 0 0 0 0 0 0 0 0
6 MIR6859-4 0 0 0 0 0 0 0 0 0 0 0 0
Ok, this is a brief example in which we have loaded a data matrix with the function read.csv()
(as we know that the file is a .csv, but there are plenty of functions for all different type of files). We then checked the class of the expr_data and we get data.frame
; lastly we used head()
to see the structure of the first rows and we can say that:
- Gene symbols are in a column
- We have 6 wt samples and 6 ko samples
- Probably these are row counts as we do not have decimal values
We can edit the previous code to fix some issues we encounter
# 1. Gene names as rownames
<- read.csv(file = "data/All_counts.csv",
expr_data header = T,
row.names = 1 # first column
)
# 2. Transform to matrix
<- as.matrix(expr_data)
expr_data
# Check
class(expr_data)
[1] "matrix" "array"
head(expr_data, n = 10) # we can specify the number of rows we want
wt1 wt2 wt3 wt4 wt5 wt6 ko1 ko2 ko3 ko4 ko5 ko6
DDX11L1 4 1 2 1 1 1 0 6 3 1 0 2
WASH7P 421 46 342 345 192 18 233 217 125 313 321 726
MIR6859-3 0 0 0 0 0 0 0 0 0 0 0 0
MIR6859-2 0 0 0 0 0 0 0 0 0 0 0 0
MIR6859-1 0 0 0 0 0 0 0 0 0 0 0 0
MIR6859-4 0 0 0 0 0 0 0 0 0 0 0 0
MIR1302-2 0 0 0 0 0 0 0 0 0 0 0 0
MIR1302-11 0 0 0 0 0 0 0 0 0 0 0 0
MIR1302-9 0 0 0 0 0 0 0 0 0 0 0 0
MIR1302-10 0 0 0 0 0 0 0 0 0 0 0 0
Ok, fine. We are ready to go and explore our first expression matrix.
Dim, nrow, ncol
Head and tail are great friends, but you always want to know the dimensions of the matrix. It can be done using dim()
function, which returns the number of rows and columns (this order!), or nrow()
and ncol()
.
# dim
dim(expr_data)
[1] 26475 12
# nrow
nrow(expr_data)
[1] 26475
# ncol
ncol(expr_data)
[1] 12
We see that our matrix has 26475 rows and 12 columns.
Summary and transpose
After having seen the basic structure of the first and last lines, the number of rows and columns, we usually want to have some info about the distribution of the genes, the counts ecc.
Here comes the function summary()
:
summary(expr_data)
wt1 wt2 wt3 wt4 wt5
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0
Median : 19.0 Median : 16.0 Median : 16.0 Median : 15 Median : 16.0
Mean : 411.4 Mean : 324.4 Mean : 326.7 Mean : 274 Mean : 317.3
3rd Qu.: 272.0 3rd Qu.: 214.0 3rd Qu.: 218.0 3rd Qu.: 186 3rd Qu.: 212.0
Max. :327719.0 Max. :536733.0 Max. :377069.0 Max. :90474 Max. :352034.0
wt6 ko1 ko2 ko3 ko4
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
Median : 17.0 Median : 13.0 Median : 18.0 Median : 19.0 Median : 17.0
Mean : 316.6 Mean : 250.1 Mean : 378.2 Mean : 351.4 Mean : 373.8
3rd Qu.: 224.0 3rd Qu.: 179.0 3rd Qu.: 245.0 3rd Qu.: 246.0 3rd Qu.: 255.0
Max. :209954.0 Max. :61650.0 Max. :641616.0 Max. :497463.0 Max. :236803.0
ko5 ko6
Min. : 0.0 Min. : 0.0
1st Qu.: 0.0 1st Qu.: 0.0
Median : 17.0 Median : 19.0
Mean : 319.8 Mean : 423.4
3rd Qu.: 233.0 3rd Qu.: 288.0
Max. :62660.0 Max. :338679.0
For each column it returns some descriptive. But, ehi, on the columns we have the samples, we want this statistics for the genes, how can we do it? Here comes t()
that transpose the matrix so that rows became columns and vice versa.
<- t(expr_data)
expr_data_t
1:5, 1:8] expr_data_t[
DDX11L1 WASH7P MIR6859-3 MIR6859-2 MIR6859-1 MIR6859-4 MIR1302-2 MIR1302-11
wt1 4 421 0 0 0 0 0 0
wt2 1 46 0 0 0 0 0 0
wt3 2 342 0 0 0 0 0 0
wt4 1 345 0 0 0 0 0 0
wt5 1 192 0 0 0 0 0 0
Perfect, now we can do summary again (we do a subset of the data because we don’t want to do it for all genes, as it will be huge).
set.seed(123)
# 1. Get random numbers to select random genes
<- runif(n = 7, min = 1, max = ncol(expr_data_t))
r_numb
# 2. Floor numbers to get integers
<- floor(r_numb)
r_numb
# 3. Get the summary of sliced expr_data_t
summary(expr_data_t[, r_numb])
GPX2 BTN3A3 KRT17 USP17L2 PHF19 DBT
Min. :0.0000 Min. : 5.00 Min. :0.00 Min. :0.0 Min. : 10.0 Min. : 23.0
1st Qu.:0.0000 1st Qu.: 15.00 1st Qu.:0.00 1st Qu.:0.0 1st Qu.: 20.0 1st Qu.: 415.2
Median :0.5000 Median : 24.00 Median :0.00 Median :0.0 Median : 97.0 Median : 753.0
Mean :0.5833 Mean : 35.67 Mean :0.25 Mean :0.5 Mean :112.2 Mean : 642.8
3rd Qu.:1.0000 3rd Qu.: 32.50 3rd Qu.:0.00 3rd Qu.:1.0 3rd Qu.:168.5 3rd Qu.: 938.2
Max. :2.0000 Max. :150.00 Max. :2.00 Max. :3.0 Max. :324.0 Max. :1107.0
PPM1B
Min. : 73.0
1st Qu.: 392.5
Median : 562.0
Mean : 632.7
3rd Qu.: 838.0
Max. :1466.0
Here we have some interesting descriptive for each gene: we can see, for example, that DBT has more counts than GPX2, or that PPM1B has the higher Max. counts of all these 7 genes.
There are two new functions here: set.seed
and runif
. Let’s start from the latter, it returns n
random number uniformly distributed between min
and max
included. Do you understand why I used max = ncol(expr_data_t)
? I’m sure you do. But to be sure that it is clear, I did it because I don’t want to hard code (insert the real number) of the number of columns of the expression matrix, because if I change matrix I have to change this number as well. So, I let R calculate it for me. As it is random, in order to let you have the same number as me, I set a seed. The theory behind “randomness” in computer science is complex, so I’m not going to deep it. If you want, there are plenty of resources on the internet on this topic.
Here, just an example of the use of the seed.
# 1. Two random vectors of 5 numbers without set.seed
<- runif(n = 5, min = 1, max = 20)
no_seed1 <- runif(n = 5, min = 1, max = 20)
no_seed2
# 2. Two random vectors setting the seed
set.seed(123) # here you can place any number, each generate a unique random series
<- runif(n = 5, min = 1, max = 20)
with_seed1 set.seed(123) # here you can place any number, each generate a unique random series
<- runif(n = 5, min = 1, max = 20)
with_seed2
print(no_seed1)
[1] 17.955962 11.477265 9.675680 19.179834 9.613349
print(no_seed2)
[1] 13.873842 11.880035 2.955569 18.096674 5.675667
print(with_seed1)
[1] 6.463973 15.977798 8.770562 17.777331 18.868878
print(with_seed2)
[1] 6.463973 15.977798 8.770562 17.777331 18.868878
Great! Remember, setting the seed is fundamental in analysis in which randomness take place, so they will always return the same results.
Sum, Max, Min, …
Alright, there are plenty of functions that returns useful values of all the data in a matrix, such as the max value, the min one, the sum of all the values, the mean of all the values etc.
There is nothing easier than watch them in action.
# 1. create a random value matrix
set.seed(214)
<- matrix(runif(n = 24, min = 1, max = 20), nrow = 4)
my_matrix2 my_matrix2
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 7.079653 17.000935 12.786656 15.687742 1.260719 1.082183
[2,] 18.702227 14.015268 9.345658 4.225409 9.341111 11.385358
[3,] 10.809797 19.592980 13.632435 13.774924 19.548749 5.461854
[4,] 8.975963 7.808112 15.024569 6.271463 6.136042 16.151705
# 2. Calculate sum
<- sum(my_matrix2)
my_sum print(paste("Sum of the values of my_matrix2:", my_sum))
[1] "Sum of the values of my_matrix2: 265.101512494963"
# 3. Calculate min
<- min(my_matrix2)
my_min print(paste("Min of the values of my_matrix2:", my_min))
[1] "Min of the values of my_matrix2: 1.08218298689462"
# 4. Calculate max
<- max(my_matrix2)
my_max print(paste("Max of the values of my_matrix2:", my_max))
[1] "Max of the values of my_matrix2: 19.5929795864504"
# 5. Calculate mean
<- mean(my_matrix2)
my_mean print(paste("Mean of the values of my_matrix2:", my_mean))
[1] "Mean of the values of my_matrix2: 11.0458963539568"
Here, we calculate these stats for all the data in the matrix, but if I want to calculate it by rows or columns? Here it is
Apply a function to all rows or columns
You can easily think about summary
function to answer the previous question. Right, I’ll give it to you, but how if you want to handle the result? How about having only the vector of the mean values for each row? This is achieved through apply
. Let’s see it in action, then I’ll explain it.
# 1. Calculate the mean for each column
<- apply(X = expr_data_t[, r_numb], MARGIN = 1, FUN = mean)
mean_col
# 2. Calculate the mean for each column
<- apply(X = expr_data_t[, r_numb], MARGIN = 2, FUN = mean)
mean_row
mean_col
wt1 wt2 wt3 wt4 wt5 wt6 ko1 ko2 ko3 ko4 ko5
207.1429 178.2857 149.7143 101.8571 202.4286 209.0000 208.5714 231.8571 141.4286 225.2857 243.2857
ko6
343.4286
mean_row
GPX2 BTN3A3 KRT17 USP17L2 PHF19 DBT PPM1B
0.5833333 35.6666667 0.2500000 0.5000000 112.1666667 642.8333333 632.6666667
See? It’s super easy. It requires the matrix (X), the function (mean), and the margin used to calculate over (1 for rows, 2 for columns). You can apply all the functions that take as input numeric vectors.
If the matrix has row names or column names set, it returns a named vactor, otherwise it returns a normal vector.
Exercises
Now it’s you turn to familiarize with matrices.
Exercise 9.1 You have collected expression levels of 4 genes of interest of 3 patients. Calculate the sum of the read for each patient, in order to normalize the values in further analysis.
I’ll give you the data.
<- c("P1" = 25, "P2" = 12, "P3" = 14)
CXCR4 <- c("P3" = 12, "P2" = 10, "P1" = 17)
LCT <- c("P1" = 2, "P3" = 3, "P2" = 4)
PTPN7 <- c("P2" = 20, "P1" = 28, "P3" = 11) LHX9
Solution
# 1. Reorder the vectors to have the same order of patients
<- LCT[names(CXCR4)]
LCT <- PTPN7[names(CXCR4)]
PTPN7 <- LHX9[names(CXCR4)]
LHX9
# 2. Create the matrix
<- rbind(CXCR4, LCT, PTPN7, LHX9)
expr_mat_pat
# 3. Calculate the sum of the reads for each patient
<- apply(expr_mat_pat, MARGIN = 2, mean)
read_sum_pat
print(read_sum_pat)
P1 P2 P3
18.0 11.5 10.0
Exercise 9.2 We know that genes which mean is < 5, they should be excluded from the analysis (it is an example). Re-do the calculation of exercise 9.1 excluding those genes.
Solution
# 1. Calculate the mean of the reads for each gene
<- apply(expr_mat_pat, MARGIN = 1, mean)
read_sum_gene
# 2. Get the name of the genes to keep
<- names(read_sum_gene)[read_sum_gene > 5]
gene_2_keep
# 3. Use genes_2_keep to subset matrix and calculate sum
<- apply(expr_mat_pat[gene_2_keep, ], MARGIN = 2, mean)
read_sum_pat_filt
print(read_sum_pat_filt)
P1 P2 P3
23.33333 14.00000 12.33333
Great, another big chapter has come to an end. We are closer and closer to your real world applications. For your expression (mRNA or proteins) data you could already start some analysis, but wait for the next chapter, in which we will talk about data.frame
.