8 Vectors
I know that in the previous chapters you were thinking “ok, but I rarely have a single value for a variable, I usually have many”, and you are right. So let’s see the first form of data organization in R, the basis of all the others: the vector.
A vector is a collection of data of the same type, for example the weights of a group of people, the names of your genes of interest, or the expression levels of those genes.
Creating a vector in R is simple: just use the function c() and put inside it every value you need:
heights <- c(160, 148, 158, 170)
genes <- c("Adcyp1", "Tle4", "Psd95", "Bip", "Sst")
heights
[1] 160 148 158 170
genes
[1] "Adcyp1" "Tle4" "Psd95" "Bip" "Sst"
I told you that the data inside a vector must be of the same type. In fact:
my_info <- c(14, "Most", 45, 5, TRUE)
my_info
[1] "14" "Most" "45" "5" "TRUE"
R transforms everything into strings because there is a string among the values. To confirm it, we can ask R for the type of data stored in a vector by using the function typeof():
typeof(heights)
[1] "double"
typeof(my_info)
[1] "character"
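The same rule applies to other mixtures of types. As a small sketch (the vectors below are made up for illustration), R converts everything to the most general type present, in the order logical, integer, double, character:

```r
# logical + double: TRUE becomes 1
mixed_num <- c(TRUE, 2.5)
mixed_num             # 1.0 2.5
typeof(mixed_num)     # "double"

# integer + double: everything becomes double
typeof(c(1L, 2.5))    # "double"

# anything + character: everything becomes character
mixed_chr <- c(1L, "a")
typeof(mixed_chr)     # "character"
```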
Named vectors
There is a particular type of vector, called a named vector, that comes in handy especially when creating graphs: every value in the vector has a “name” associated with it. Imagine giving a unique name tag to each value; for example, associating an expression value with each gene.
There are 3 ways of creating a named vector. I will show them here from the quickest to the most laborious:
# 1st method (the best)
# create gene and value vector first
genes <- c("Adcyp1", "Tle4", "Psd95", "Bip", "Sst")
expr_values <- c(12, 200, 40, 1, 129)
# assign names to the vector
names(expr_values) <- genes
expr_values
Adcyp1 Tle4 Psd95 Bip Sst
12 200 40 1 129
You can see here that every expression value has its own name.
# 2nd method (as good as the first)
# create gene and value vector first
genes <- c("Adcyp1", "Tle4", "Psd95", "Bip", "Sst")
expr_values <- c(12, 200, 40, 1, 129)
# create a structure: the values go first, the names go in the names argument
expr_values <- structure(expr_values, names = genes)
expr_values
Adcyp1 Tle4 Psd95 Bip Sst
12 200 40 1 129
This is the preferred method when the values are not in a standalone vector but are, for example, a column of a dataframe.
# 3rd method, the worst
# directly create the named vector
expr_values <- c("Adcyp1" = 12, "Tle4" = 200, "Psd95" = 40, "Bip" = 1, "Sst" = 129)
expr_values
Adcyp1 Tle4 Psd95 Bip Sst
12 200 40 1 129
This takes a long time to write and is rarely used in real analyses, as you will usually already have the values and the names as columns of a dataframe, as individual vectors, or as the output of a function.
But what is the main advantage of using named vectors? The possibility of extracting values of interest by their name. Extracting values from a vector is called indexing.
Indexing
Indexing is one of the most used features, if not the most used one, to retrieve data from vectors, matrices, dataframes, etc. There are many ways to do it; let’s start with named vectors and then move on to other strategies.
But first, a tip: the key element in indexing is a pair of square brackets [], in which you specify what to retrieve. So, remember: parentheses after a function, square brackets to index.
Named vectors
To extract values from a named vector, we can put inside the square brackets a character string (or a character vector, or a variable that contains one) with the name of the value we want to extract:
# one value
expr_values["Tle4"]
Tle4
200
# a vector
expr_values[c("Tle4", "Psd95")]
Tle4 Psd95
200 40
# a variable
to_extract <- "Bip"
expr_values[to_extract]
Bip
1
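One thing worth knowing, shown here as a small sketch on the same vector: asking for a name that is not in the vector does not stop R with an error, it returns NA. The name "Gapdh" is just a made-up example, and %in% (used here only as a quick presence check) is explained later in this chapter.

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# "Gapdh" is not among the names, so the result is NA
missing_gene <- expr_values["Gapdh"]
missing_gene

# check in advance whether a name is present
"Tle4" %in% names(expr_values)    # TRUE
"Gapdh" %in% names(expr_values)   # FALSE
```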
Slicing
Another method is to specify the position of the values we want to extract. First, there are two things to keep in mind:
- In R, index numbering starts at 1, so the first element is at position 1, the second at position 2, etc. (in many other programming languages it starts at 0)
- The length of a vector, meaning the number of items it contains, can be obtained with the function length()
Having set these concepts, let’s do some examples (I know it can be boring, but these are the fundamentals of data analysis; you will really thank me in the future).
# get the length of a vector
print(length(expr_values))
[1] 5
Now that we know that our expression values vector contains 5 elements, we can start to index it:
# get first element
expr_values[1]
Adcyp1
12
# get first and third element
expr_values[c(1, 3)]
Adcyp1 Psd95
12 40
# get from element 2 to 4
expr_values[2:4]
Tle4 Psd95 Bip
200 40 1
# get from element 3 to the end
expr_values[3:length(expr_values)]
Psd95 Bip Sst
40 1 129
# get every element but the third
expr_values[-3]
Adcyp1 Tle4 Bip Sst
12 200 1 129
Ok, now that we have seen some examples, we can look at some of them in more detail:
- expr_values[2:4]: we haven’t seen this yet, but the expression <value1>:<value2> creates a vector with the numbers from value1 to value2
- expr_values[3:length(expr_values)]: since the function length() returns a value, we can use this function directly inside the square brackets
- expr_values[-3]: the minus sign means “except”
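A few more variations on positional indexing, as a sketch on the same vector: the minus sign also works with a vector of positions, and a position beyond the length of the vector gives NA instead of an error.

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# the colon operator builds the vector of positions
2:4                       # 2 3 4

# drop the first and the third element
without_1_3 <- expr_values[-c(1, 3)]
without_1_3               # Tle4 = 200, Bip = 1, Sst = 129

# position 6 does not exist in a vector of length 5
beyond_end <- expr_values[6]
beyond_end                # NA
```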
Using logicals
We can also use logical (boolean) values to index a vector. This is one of the most used ways, and you will use it quite a lot. Why? Let’s see it in action.
expr_values[c(T, F, F, T, T)]
Adcyp1 Bip Sst
12 1 129
What has happened?
When a vector is indexed with booleans, only the values in the positions where the index is TRUE (T) are returned, and this is super cool. Do you remember from the previous chapter how we get a boolean as the result of an expression? Great, so we can write expressions that return TRUE or FALSE and use them to index a vector:
# retrieve values < 59
expr_values[expr_values < 59]
Adcyp1 Psd95 Bip
12 40 1
We can also use a more complicated expression:
# retrieve values < 30 or > 150
expr_values[(expr_values < 30) | (expr_values > 150)]
Adcyp1 Tle4 Bip
12 200 1
Do you see how useful it could be in an analysis? I’m sure you do, so let’s move on!
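To see why this works, it can help to look at the intermediate step. In this sketch the condition is stored in a variable (is_low and in_range are names chosen only for illustration) before being used as the index:

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# the comparison alone returns one TRUE/FALSE per element
is_low <- expr_values < 59
is_low                 # TRUE FALSE TRUE TRUE FALSE

# using it as an index keeps only the TRUE positions
expr_values[is_low]    # Adcyp1 = 12, Psd95 = 40, Bip = 1

# & requires both conditions to be TRUE: values between 10 and 150
in_range <- expr_values[(expr_values > 10) & (expr_values < 150)]
in_range               # Adcyp1 = 12, Psd95 = 40, Sst = 129
```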
One function can be applied to all elements
We’ve just seen a feature of vectors: a function or an operation can be applied to every element of the vector at once. Above, we evaluated whether each element of the vector was < 59 (or < 30 or > 150).
Now, we will see more examples, starting from numeric vectors.
# operations can be performed on each value
expr_values * 2
Adcyp1 Tle4 Psd95 Bip Sst
24 400 80 2 258
expr_values / 10
Adcyp1 Tle4 Psd95 Bip Sst
1.2 20.0 4.0 0.1 12.9
# transform a numeric vector to a character one
as.character(expr_values)
[1] "12" "200" "40" "1" "129"
With character vectors we can do:
# calculate number of characters
nchar(genes)
[1] 6 4 5 3 3
# use grep to see which genes start with letter T
grep(pattern = "^T", x = genes, value = T) # ^ indicates the start of the string, can you guess why we used it?
[1] "Tle4"
Functions specific of vectors
Up to now we have not seen any new function. Now we will see some functions that are specific to vectors, starting, as always, from numbers:
Example 8.1 Let’s say we have a mouse and we want to analyze the time it spends in different areas of a cage. In particular, we want to calculate the sum, the mean and the sd of the time spent outside the center of the cage.
# create the named vector
areas <- c("center", "top", "right", "bottom", "left")
time <- c(14, 22, 29, 12, 2)
names(time) <- areas
# extract data not in center (keep the elements whose name is not "center")
not_center <- time[names(time) != "center"]
# calculate mean, sum and sd
sum_time <- sum(not_center)
mean_time <- mean(not_center)
sd_time <- sd(not_center)
# print results
print(paste("The mouse spent", sum_time,
"seconds not in the center of the cage, with a mean of", mean_time,
"seconds in each area and a sd of", sd_time))
[1] "The mouse spent 65 seconds not in the center of the cage, with a mean of 16.25 seconds in each area and a sd of 11.7862914721581"
This example uses many of the things we have seen so far, and it shows how to calculate sum, mean and sd on numerical vectors. These are not the only functions: there are also var() (variance), min(), max() and others. We are going to see them later when needed.
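As a quick preview of those other functions, here is a minimal sketch on the same cage data (the variable names are chosen only for illustration):

```r
time <- c(center = 14, top = 22, right = 29, bottom = 12, left = 2)

shortest <- min(time)
shortest                 # 2
longest <- max(time)
longest                  # 29

# the variance is the square of the standard deviation
time_var <- var(time)
all.equal(time_var, sd(time)^2)   # TRUE

# name of the area with the longest time
longest_area <- names(time)[time == max(time)]
longest_area             # "right"
```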
Sorting a vector
Another important function is sort(), as it gives us the possibility to sort the values of a vector.
sort(expr_values)
Bip Adcyp1 Psd95 Sst Tle4
1 12 40 129 200
sort(expr_values, decreasing = T)
Tle4 Sst Psd95 Adcyp1 Bip
200 129 40 12 1
By default, it sorts in ascending order; we can change this behavior by setting decreasing = T.
Let’s see a couple of tricks with sorting:
features <- paste0("gene", 1:20)
features
[1] "gene1" "gene2" "gene3" "gene4" "gene5" "gene6" "gene7" "gene8" "gene9" "gene10"
[11] "gene11" "gene12" "gene13" "gene14" "gene15" "gene16" "gene17" "gene18" "gene19" "gene20"
sort(features)
[1] "gene1" "gene10" "gene11" "gene12" "gene13" "gene14" "gene15" "gene16" "gene17" "gene18"
[11] "gene19" "gene2" "gene20" "gene3" "gene4" "gene5" "gene6" "gene7" "gene8" "gene9"
What can we see here? First of all, a cool method to create a vector of words with increasing numbers (the combination of paste0 and 1:20); then, we see that sorting has put “gene2” after all the “gene1X”, because character vectors are sorted in alphabetical order. For this reason, it is recommended to use 01, 02, 03 etc. if we know that we have more than 9 elements (this also works for computer file names).
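One way to build such zero-padded names, shown as a sketch, is sprintf(), a base R function not otherwise covered in this chapter. The format "%02d" means "an integer padded with zeros to a width of 2":

```r
# %02d pads the number with zeros to a width of 2 digits
features_padded <- sprintf("gene%02d", 1:20)
features_padded[1:3]     # "gene01" "gene02" "gene03"

# now alphabetical order and numerical order agree
sorted_padded <- sort(features_padded)
identical(sorted_padded, features_padded)   # TRUE
```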
As an example, here we want to sort the expression levels based on their names.
expr_values[sort(names(expr_values))]
Adcyp1 Bip Psd95 Sst Tle4
12 1 40 129 200
Tadaaa, we used sort on the names of expression levels and used the sorted names to index the named vector.
Unique values
As the title suggests, there is a function, unique(), that returns the unique values of a vector. It is useful in many situations and we will use it a lot. The usage is very simple:
# create a vector with repeated values
my_vector <- c(1:10, 2:4, 3:6) # can you guess the values of this vector without typing it in R?
unique(my_vector)
[1] 1 2 3 4 5 6 7 8 9 10
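A small sketch to check what unique() did: comparing the length before and after shows how many repeated values were dropped. The last line uses a made-up vector to show that unique() keeps the values in order of first appearance rather than sorting them.

```r
my_vector <- c(1:10, 2:4, 3:6)
my_vector                  # 1 2 3 4 5 6 7 8 9 10 2 3 4 3 4 5 6
length(my_vector)          # 17

# number of distinct values
n_distinct <- length(unique(my_vector))
n_distinct                 # 10

# order of first appearance is kept, no sorting happens
first_seen <- unique(c(3, 1, 3, 2, 1))
first_seen                 # 3 1 2
```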
Logical vectors sum and mean
In example 8.1 we have seen how to calculate the sum and the mean of numerical vectors, but the same can be done on logical vectors, and it can be very useful. I’ll show you this example:
Example 8.2 We have a vector representing the response to a treatment of different patients, coded by logicals. Here is the vector: c(T, F, T, T, T, F, T, F, F, T, F, F, F, F, F, T, T). Calculate the number and the percentage of responders (2 decimal places).
# 1. Create the vector
response <- c(T, F, T, T, T, F, T, F, F, T, F, F, F, F, F, T, T)
# 2. Calculate the number of responders
n_responders <- sum(response)
# 3. Calculate the percentage of responders
p_responders <- mean(response) * 100
p_responders <- round(p_responders, 2)
print(paste("There are", n_responders, "responders, corresponding to", p_responders, "% of total patients."))
[1] "There are 8 responders, corresponding to 47.06 % of total patients."
What happened here? The trick is that R interprets TRUE as 1 and FALSE as 0. Remember this also for future applications.
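The same trick combines nicely with the comparisons seen in the section on logical indexing. As a sketch on the expression vector used earlier (the threshold of 30 is arbitrary):

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# TRUE counts as 1 and FALSE as 0
as.numeric(c(TRUE, FALSE, TRUE))   # 1 0 1

# how many genes have an expression value above 30?
n_high <- sum(expr_values > 30)
n_high                             # 3

# which percentage of the genes is that?
p_high <- mean(expr_values > 30) * 100
p_high                             # 60
```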
Operations between vectors
Don’t give up, I know this chapter has been long, but now we will see the last part: the most important operations we can do between vectors.
Mathematical
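First of all, mathematical operations. They are performed element by element: each element of one vector is combined with the corresponding element of the other vector, so the vectors should be the same size. Be careful: if the sizes differ, R does not raise an error but recycles the shorter vector (it repeats it from the start), and it only gives a warning when the longer length is not a multiple of the shorter one.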
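A quick sketch of how R behaves with vectors of equal and of different lengths (the numbers are made up). When the lengths differ, R repeats the shorter vector, which is known as recycling:

```r
# same length: element-by-element sum
same_len <- c(1, 2, 3, 4) + c(10, 20, 30, 40)
same_len        # 11 22 33 44

# length 4 and length 2: the shorter vector is repeated, no warning
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled        # 11 22 13 24

# length 3 and length 2: still computed, but R gives a warning
uneven <- c(1, 2, 3) + c(10, 20)
uneven          # 11 22 13
```

Since recycling can silently produce wrong numbers, it is a good habit to check length() on both vectors before combining them.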
Example 8.3 Let’s say we have 3 vectors representing the total amount of amino acids found in three different samples for 5 proteins. We want to calculate, for each protein, the percentage of amino acids found in each sample.
# 1. Define starting vectors
proteins <- c("SEC24C", "BIP", "CD4", "RSPO2", "LDB2")
sample1 <- c(12, 52, 14, 33, 22)
sample2 <- c(5, 69, 26, 45, 3)
sample3 <- c(8, 20, 5, 39, 48)
names(sample1) <- names(sample2) <- names(sample3) <- proteins
# 2. Calculate sum of aa for each protein
sum_aa <- sample1 + sample2 + sample3
# 3. Calculate the fraction for each sample
sample1_fr <- sample1 / sum_aa * 100
sample2_fr <- sample2 / sum_aa * 100
sample3_fr <- sample3 / sum_aa * 100
# 4. Print the results
sample1_fr
SEC24C BIP CD4 RSPO2 LDB2
48.00000 36.87943 31.11111 28.20513 30.13699
sample2_fr
SEC24C BIP CD4 RSPO2 LDB2
20.000000 48.936170 57.777778 38.461538 4.109589
sample3_fr
SEC24C BIP CD4 RSPO2 LDB2
32.00000 14.18440 11.11111 33.33333 65.75342
Ok, I know it seems difficult, but let’s analyze the steps that introduce something new:
- Step 1: the new thing here is that we can assign the same value to multiple variables by chaining assignment statements
- Step 2: since the vectors have the same size, we can sum them together. IMPORTANT: the operation is performed based on position, NOT names. So if our vectors had had their names in a different order, we would have had to reorder them first
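The point about positions and names is easy to verify. In this sketch two small made-up vectors, a and b, hold the same proteins in a different order; indexing b with the names of a puts it in the right order before summing:

```r
a <- c(SEC24C = 1, BIP = 2, CD4 = 3)
b <- c(CD4 = 30, SEC24C = 10, BIP = 20)

# the sum follows positions: SEC24C is added to CD4, and so on
by_position <- a + b
by_position      # SEC24C = 31, BIP = 12, CD4 = 23 (names taken from a)

# reorder b by the names of a, then sum
by_name <- a + b[names(a)]
by_name          # SEC24C = 11, BIP = 22, CD4 = 33
```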
Different and common elements
Often we want to compare two vectors to find distinct and common elements (e.g. upregulated genes in two analyses). To do it, we can use two functions: intersect(), which finds the elements common to two vectors, and setdiff(), which returns the elements of the first vector that are not present in the second.
# 1. Define 2 vectors
upregulated_1 <- c("NCOA1", "CENPO", "ASXL2", "HADHA", "ADGRF3")
upregulated_2 <- c("ADGRF3", "SLC5A6", "NRBP1", "NCOA1", "HADHA")
# 2. Find common genes
common <- intersect(upregulated_1, upregulated_2)
# 3. Find different genes
only_1 <- setdiff(upregulated_1, upregulated_2)
only_2 <- setdiff(upregulated_2, upregulated_1)
# 4. Print results (cat() already prints, so it does not need to be wrapped in print())
cat("Common genes are:", paste(common, collapse = ", "), "\n",
"Genes specifically upregulated in analysis 1 are:", paste(only_1, collapse = ", "), "\n",
"Genes specifically upregulated in analysis 2 are:", paste(only_2, collapse = ", "), "\n")
Common genes are: NCOA1, HADHA, ADGRF3
Genes specifically upregulated in analysis 1 are: CENPO, ASXL2
Genes specifically upregulated in analysis 2 are: SLC5A6, NRBP1
Here you are. We can add three more notions:
- cat() is like print, but it interprets special characters instead of showing them literally
- when pasting a vector, an additional argument collapse = "<chr>" can be added. It tells R to collapse all the elements of a vector into a single character element, separated by <chr> (", " for us)
- "\n" means add a new line, so it tells R to print the next sentence on a new line. It is a special character, so it works with cat
%in%
A slightly different function (if we can call it this way) is %in%. When comparing two vectors, it returns TRUE or FALSE for each element of the first vector, based on the presence of that element in the second vector:
upregulated_1 %in% upregulated_2
[1] TRUE FALSE FALSE TRUE TRUE
This is useful when we want to index a vector based on another vector. We will see some usages.
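As a first usage, here is a sketch that puts the result of %in% inside square brackets. It gives the same genes as intersect() and setdiff() on these two vectors (in_both and only_first are names chosen for illustration):

```r
upregulated_1 <- c("NCOA1", "CENPO", "ASXL2", "HADHA", "ADGRF3")
upregulated_2 <- c("ADGRF3", "SLC5A6", "NRBP1", "NCOA1", "HADHA")

# keep the genes of the first vector that are also in the second
in_both <- upregulated_1[upregulated_1 %in% upregulated_2]
in_both        # "NCOA1" "HADHA" "ADGRF3"

# ! flips TRUE and FALSE: keep the genes that are NOT in the second
only_first <- upregulated_1[!(upregulated_1 %in% upregulated_2)]
only_first     # "CENPO" "ASXL2"
```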
Match
Last but not least, the match() function. It takes two vectors and returns, for each element of the first vector, the position of that element in the second vector. If an element is not present, it returns NA; we will describe this value in a dedicated chapter.
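Before the full example, a minimal sketch of what match() returns. The two vectors are made up, and "ACTB" is deliberately absent from the second one:

```r
wanted <- c("BIP", "ACTB", "CD4")
available <- c("CD4", "RSPO2", "BIP", "LDB2")

# position of each element of wanted inside available
positions <- match(wanted, available)
positions                # 3 NA 1

# the positions can be used to index the second vector
available[positions]     # "BIP" NA "CD4"
```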
So, how can it be useful? It is usually used to rearrange and reorder a vector to match another vector. For example, let’s say that two vectors of example 8.3 have their names in a different order; before doing any calculation we need to match the order of the names.
# 1. Define starting vectors
proteins1 <- c("SEC24C", "BIP", "CD4", "RSPO2", "LDB2")
proteins2 <- c("CD4", "RSPO2", "BIP", "LDB2", "SEC24C")
sample1 <- c(12, 52, 14, 33, 22)
sample2 <- c(5, 69, 26, 45, 3)
names(sample1) <- proteins1
names(sample2) <- proteins2
sample1
SEC24C BIP CD4 RSPO2 LDB2
12 52 14 33 22
sample2
CD4 RSPO2 BIP LDB2 SEC24C
5 69 26 45 3
As we can see, the names are in different order, so we want to fix this:
idx <- match(names(sample1), names(sample2))
idx
[1] 5 3 1 2 4
We can use these indexes to index our sample2.
sample2 <- sample2[idx]
sample1
SEC24C BIP CD4 RSPO2 LDB2
12 52 14 33 22
sample2
SEC24C BIP CD4 RSPO2 LDB2
3 26 5 69 45
Now they are in the same order, so we can continue the analysis.
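Since both vectors are named here, the same reordering can also be sketched by indexing with the names directly. match() remains the more general tool, because positions also work on data that cannot be indexed by name.

```r
sample1 <- c(SEC24C = 12, BIP = 52, CD4 = 14, RSPO2 = 33, LDB2 = 22)
sample2 <- c(CD4 = 5, RSPO2 = 69, BIP = 26, LDB2 = 45, SEC24C = 3)

# reorder with match(), as in the text
by_match <- sample2[match(names(sample1), names(sample2))]

# reorder by indexing with the names of sample1
by_name <- sample2[names(sample1)]

identical(by_match, by_name)   # TRUE
by_name                        # SEC24C = 3, BIP = 26, CD4 = 5, RSPO2 = 69, LDB2 = 45
```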
Exercises
Great, let’s do some exercises. They wrap up lots of the concepts we’ve just seen. However, I encourage you to try again every function we have studied so far.
It doesn’t matter if you do them in a different way, as long as the results are identical. In these chapters I want you to understand the steps, not to write the perfect and most efficient code.
Exercise 8.1 We have received the expression levels (in reads) of some genes of interest. We are interested in the difference between the expression levels of mitochondrial vs non-mitochondrial genes; in particular, we want to see how many reads map to each category (both counts and percentage).
The starting vector is the following: c("SEC24C" = 52, "MT-ATP8" = 14, "LDB2" = 22, "MT-CO3" = 16, "MT-ND4" = 2, "NTMT1" = 33, "BIP" = 20, "MT-ND5" = 42)
PS: The names of mitochondrial genes start with MT-.
Solution
# 1. Create the vector
expr_levels <- c("SEC24C" = 52, "MT-ATP8" = 14, "LDB2" = 22, "MT-CO3" = 16, "MT-ND4" = 2, "NTMT1" = 33, "BIP" = 20, "MT-ND5" = 42)
# 2. Get names of mitochondrial genes
mito_genes <- grep(pattern = "^MT-", x = names(expr_levels), value = T)
# 3. Calculate total number of counts for each category
total_mito <- sum(expr_levels[mito_genes])
# ! negates the logical vector, keeping only the non-mitochondrial genes
total_no_mito <- sum(expr_levels[!(names(expr_levels) %in% mito_genes)])
# 4. Calculate %
perc_mito <- round(total_mito / sum(expr_levels) * 100, 2)
perc_no_mito <- round(total_no_mito / sum(expr_levels) * 100, 2)
# 5. Print results
cat("Reads mapping to mitochondrial genes are", total_mito, "(", perc_mito, "%), while the ones mapping to other genes are", total_no_mito, "(", perc_no_mito, "%)")
Reads mapping to mitochondrial genes are 74 ( 36.82 %), while the ones mapping to other genes are 127 ( 63.18 %)
Exercise 8.2 You were given the mass spectrometry results of an analysis on 3 patients. These are the results: patient1 c("SEC24C" = 12, "CDH7" = 1, "LDB2" = 13, "SEM3A" = 16, "FEZF2" = 21, "NTMT1" = 43, "BIP" = 29, "HOMER" = 22), patient2 c("SEC24C" = 2, "CDH7" = 11, "SEM5A" = 13, "HCN1" = 22, "NTMT1" = 31, "BIP" = 12, "HOMER" = 8), patient3 c("SEC24B" = 20, "BIP" = 12, "HOMER" = 13, "SEM3A" = 49, "HCN1" = 16, "NTMT1" = 27). Calculate the mean expression of the genes that are common to all three patients.
Solution
# 1. Create the vectors
patient1 <- c("SEC24C" = 12, "CDH7" = 1, "LDB2" = 13, "SEM3A" = 16, "FEZF2" = 21, "NTMT1" = 43, "BIP" = 29, "HOMER" = 22)
patient2 <- c("SEC24C" = 2, "CDH7" = 11, "SEM5A" = 13, "HCN1" = 22, "NTMT1" = 31, "BIP" = 12, "HOMER" = 8)
patient3 <- c("SEC24B" = 20, "BIP" = 12, "HOMER" = 13, "SEM3A" = 49, "HCN1" = 16, "NTMT1" = 27)
# 2. Identify common genes
common1_2 <- intersect(names(patient1), names(patient2))
common_all <- intersect(common1_2, names(patient3))
# 3. Subset for common genes
patient1_sub <- patient1[common_all]
patient2_sub <- patient2[common_all]
patient3_sub <- patient3[common_all]
# 4. Calculate the mean for each gene
common_mean <- (patient1_sub + patient2_sub + patient3_sub) / 3
common_mean
NTMT1 BIP HOMER
33.66667 17.66667 14.33333
You can see that having the info for each patient in a different vector is not very handy; for this reason, for expression data we use matrices. In the next chapter we will talk about them.