8 Vectors

I know that in the previous chapters you were thinking “ok, but I rarely have a single value for a variable, I usually have many”, and you are right. So let’s see the first form of data organization in R, the basis of all the others: the vector.
A vector is a collection of data of the same type, for example all the weights of a group of people, the names of your genes of interest, or their expression levels.
Creating a vector in R is simple: just use the function c() and put inside it every value you need, separated by commas:

heights <- c(160, 148, 158, 170)
genes <- c("Adcyp1", "Tle4", "Psd95", "Bip", "Sst")

heights

[1] 160 148 158 170

genes

[1] "Adcyp1" "Tle4"   "Psd95"  "Bip"    "Sst"

I told you that the data inside a vector must all be of the same type. In fact, look at what happens if we mix them:

my_info <- c(14, "Most", 45, 5, TRUE)
my_info

[1] "14"   "Most" "45"   "5"    "TRUE"

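R transformed everything into strings because there is a string in the vector: when types are mixed, R converts all the values to the most flexible type among them (this is called coercion). To confirm it, we can ask R for the type of data in a vector with the function typeof():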

typeof(heights)

[1] "double"

typeof(my_info)

[1] "character"
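Coercion follows a fixed order: logical, then numeric, then character. Here is a minimal sketch of it (the vector names are made up just for this illustration):

```r
# logical + number: TRUE becomes 1, so the vector is numeric
mixed_lgl_num <- c(TRUE, 2, 3.5)
typeof(mixed_lgl_num)   # "double"

# number + string: everything becomes character
mixed_num_chr <- c(1, "a")
typeof(mixed_num_chr)   # "character"

# whole numbers written with an L are stored as integers
int_vec <- c(1L, 2L)
typeof(int_vec)         # "integer"
```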

Named vectors

There is a particular type of vector, the named vector, that comes in handy especially when creating graphs: every value in the vector has a “name” associated with it. Imagine giving a name tag to each value; for example, associating an expression value with each gene.
There are 3 ways of creating a named vector; I will show them here from the quickest to the most laborious:

# 1st method (the best)
# create gene and value vector first
genes <- c("Adcyp1", "Tle4", "Psd95", "Bip", "Sst")
expr_values <- c(12, 200, 40, 1, 129)

# assign names to the vector
names(expr_values) <- genes
expr_values

Adcyp1   Tle4  Psd95    Bip    Sst 
    12    200     40      1    129

You can see here that every expression value has its own name.

# 2nd method (as good as the first)
# create gene and value vector first
genes <- c("Adcyp1", "Tle4", "Psd95", "Bip", "Sst")
expr_values <- c(12, 200, 40, 1, 129)

# create a structure: the values go first, the names in the names argument
expr_values <- structure(expr_values, names = genes)
expr_values

Adcyp1   Tle4  Psd95    Bip    Sst 
    12    200     40      1    129

This is the preferred method when the values are not in standalone vectors but are, for example, columns of a dataframe.
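As a sketch of the dataframe case (we have not met dataframes yet, so take expr_df and its columns gene and expr as a hypothetical example), the values and the names can be taken straight from two columns:

```r
# hypothetical dataframe: one column with the names, one with the values
expr_df <- data.frame(gene = c("Adcyp1", "Tle4", "Psd95"),
                      expr = c(12, 200, 40))

# values first, names in the names argument
expr_from_df <- structure(expr_df$expr, names = expr_df$gene)
expr_from_df

# setNames() is a shortcut that does the same thing
setNames(expr_df$expr, expr_df$gene)
```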

# 3rd method, the worst
# directly create the named vector
expr_values <- c("Adcyp1" = 12, "Tle4" = 200, "Psd95" = 40, "Bip" = 1, "Sst" = 129)
expr_values

Adcyp1   Tle4  Psd95    Bip    Sst 
    12    200     40      1    129

This takes long to write and is rarely used, as you will usually have the values and the names already available: as columns of a dataframe, as individual vectors, or as the result of a function.
But what is the main advantage of named vectors? The possibility of extracting the values of interest by name. Extracting values from a vector is called indexing.

Indexing

Indexing is one of the most used features, if not the most used, to retrieve data from vectors, matrices, dataframes, etc. There are many ways to do it; let’s start with named vectors and then move on to other strategies.
But first, a tip: the key element in indexing is a pair of square brackets [], in which you specify what to retrieve. So, remember: parentheses after a function, square brackets to index.

Named vectors

To extract values from a named vector, we put inside the square brackets the name of the value we want, as a character; it can also be a character vector or a variable containing the names:

# one value
expr_values["Tle4"]

Tle4 
 200

# a vector
expr_values[c("Tle4", "Psd95")]

 Tle4 Psd95 
  200    40

# a variable
to_extract <- "Bip"
expr_values[to_extract]

Bip 
  1
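One thing to watch out for: asking for a name that is not in the vector does not give an error. A small sketch of what happens ("Gapdh" is just a made-up example of a missing name):

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# a name that does not exist returns NA (a missing value), not an error
missing_gene <- expr_values["Gapdh"]
missing_gene

# is.na() tells us whether the result is a missing value
is.na(missing_gene)
```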

Slicing

Another method is to specify the position of the values we want to extract. First, there are two things to keep in mind:

In R, index numbering starts at 1, so the first element is at position 1, the second at position 2, etc. (in many other programming languages it starts at 0)
The length of a vector, meaning the number of items it contains, can be obtained with the function length()

With these concepts set, let’s do some examples (I know it can be boring, but these are the fundamentals of data analysis. You will thank me in the future).

# get the length of a vector
print(length(expr_values))

[1] 5

Now that we know that our vector of expression values contains 5 elements, we can start to index it:

# get first element
expr_values[1]

Adcyp1 
    12

# get first and third element
expr_values[c(1, 3)]

Adcyp1  Psd95 
    12     40

# get from element 2 to 4
expr_values[2:4]

 Tle4 Psd95   Bip 
  200    40     1

# get from element 3 to the end
expr_values[3:length(expr_values)]

Psd95   Bip   Sst 
   40     1   129

# get every element but the third
expr_values[-3]

Adcyp1   Tle4    Bip    Sst 
    12    200      1    129

Ok, now that we have seen some examples, we can look at some of them in more detail:

expr_values[2:4]: we haven’t seen this yet, but the expression <value1>:<value2> creates a vector with all the integers from value1 to value2
expr_values[3:length(expr_values)]: since the function length() returns a value, we can use it directly inside the square brackets
expr_values[-3]: the minus sign means “all except”
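A few variations on the same ideas, as a minimal sketch (tail() is a base R function we have not met yet):

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# the colon operator on its own just builds a vector of integers
2:4                      # 2 3 4

# it also counts backwards, which reverses the vector when used as an index
expr_values[5:1]

# several positions can be excluded at once
expr_values[-c(1, 3)]    # drops Adcyp1 and Psd95

# tail() is an alternative to 3:length(expr_values): it keeps the last 3 elements
tail(expr_values, 3)
```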

Using logicals

We can also use logical (boolean) values to index a vector. This is one of the most common ways, and you will use it quite a lot. Why? Let’s see it in action.

expr_values[c(T, F, F, T, T)]

Adcyp1    Bip    Sst 
    12      1    129

What has happened?
When a vector is indexed with booleans, only the values in the positions where the index is TRUE (T) are returned, and this is super cool. Do you remember from the previous chapter how we get booleans as the result of an expression? Great, so we can write expressions that return TRUE or FALSE and use them to index a vector:

# retrieve values < 59
expr_values[expr_values < 59]

Adcyp1  Psd95    Bip 
    12     40      1

We can also use a more complicated expression:

# retrieve values < 30 or > 150
expr_values[(expr_values < 30) | (expr_values > 150)]

Adcyp1   Tle4    Bip 
    12    200      1

Do you see how useful it could be in an analysis? I’m sure you do, so let’s move on!
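To make the mechanism more visible, here is a small sketch that stores the logical vector first (is_low is a made-up name) and also shows & ("and") and which(), a base R function that converts TRUE/FALSE into positions:

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# the comparison alone returns a named logical vector
is_low <- expr_values < 59
is_low

# & keeps only the values that satisfy both conditions (here: between 10 and 150)
expr_values[expr_values > 10 & expr_values < 150]

# which() returns the positions of the TRUE values
which(is_low)   # positions 1, 3 and 4
```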

One function can be applied to all elements

We’ve just seen a feature of vectors: we can apply a function or an operation to every element of the vector at once. Previously we evaluated whether each element of the vector was < 59 (or < 30 or > 150).
Now we will see more examples, starting from numeric vectors.

# operations can be performed to each value
expr_values * 2

Adcyp1   Tle4  Psd95    Bip    Sst 
    24    400     80      2    258

expr_values / 10

Adcyp1   Tle4  Psd95    Bip    Sst 
   1.2   20.0    4.0    0.1   12.9

# transform a numeric vector into a character one
as.character(expr_values)

[1] "12"  "200" "40"  "1"   "129"

With character vectors we can do:

# calculate number of characters
nchar(genes)

[1] 6 4 5 3 3

# use grep to see which genes start with letter T
grep(pattern = "^T", x = genes, value = T) # ^ anchors the pattern to the start of the string, can you guess why we used it?

[1] "Tle4"
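A close relative of grep() is grepl(), which returns TRUE or FALSE for each element and so can be used directly for indexing. A minimal sketch (starts_with_t is a made-up name):

```r
genes <- c("Adcyp1", "Tle4", "Psd95", "Bip", "Sst")

# grepl() answers TRUE/FALSE for each element
starts_with_t <- grepl(pattern = "^T", x = genes)
starts_with_t          # FALSE TRUE FALSE FALSE FALSE

# so it can be used inside the square brackets
genes[starts_with_t]   # "Tle4"

# without ^ the pattern is searched anywhere in the string:
# a lowercase "s" is found inside Psd95 and Sst
grep(pattern = "s", x = genes, value = TRUE)
```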

Functions specific of vectors

Up to now we have not seen any new function. Now we will see some functions specific for vectors, starting, as always, from numbers:

Example 8.1 Let’s say we have a mouse and we measured the time it spent in each area of a cage; in particular, we want to calculate the sum, the mean and the standard deviation (sd) of the time spent outside the center of the cage.

# create the named vector
areas <- c("center", "top", "right", "bottom", "left")
time <- c(14, 22, 29, 12, 2)
names(time) <- areas

# extract data not in center: keep the elements whose name is different from "center"
not_center <- time[names(time) != "center"]

# calculate mean, sum and sd
sum_time <- sum(not_center)
mean_time <- mean(not_center)
sd_time <- sd(not_center)

# print results
print(paste("The mouse spent", sum_time, 
            "seconds not in the center of the cage, with a mean of", mean_time, 
            "seconds in each area and a sd of", sd_time))

[1] "The mouse spent 65 seconds not in the center of the cage, with a mean of 16.25 seconds in each area and a sd of 11.7862914721581"

This example puts together lots of things we have seen up to now, and it shows that on numeric vectors we can calculate sum, mean and sd. These are not the only functions: there are also var() (variance), min(), max() and others. We are going to see them later when needed.
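Just to give a first taste of these other functions, here is a minimal sketch on the same cage data (range(), median() and which.max() are base R functions not discussed in the text):

```r
time <- c(center = 14, top = 22, right = 29, bottom = 12, left = 2)

min(time)         # 2
max(time)         # 29
range(time)       # 2 29 (minimum and maximum together)
median(time)      # 14
var(time)         # variance, equal to sd(time)^2
which.max(time)   # name and position of the largest value: right, 3
```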

Sorting a vector

Another important function is sort(), which sorts the values of a vector.

sort(expr_values)

   Bip Adcyp1  Psd95    Sst   Tle4 
     1     12     40    129    200

sort(expr_values, decreasing = T)

  Tle4    Sst  Psd95 Adcyp1    Bip 
   200    129     40     12      1

By default, it sorts in ascending order; we can change this behavior by setting decreasing = T.

Let’s see a couple of tricks with sorting:

features <- paste0("gene", 1:20)
features

 [1] "gene1"  "gene2"  "gene3"  "gene4"  "gene5"  "gene6"  "gene7"  "gene8"  "gene9"  "gene10" "gene11"
[12] "gene12" "gene13" "gene14" "gene15" "gene16" "gene17" "gene18" "gene19" "gene20"

sort(features)

 [1] "gene1"  "gene10" "gene11" "gene12" "gene13" "gene14" "gene15" "gene16" "gene17" "gene18" "gene19"
[12] "gene2"  "gene20" "gene3"  "gene4"  "gene5"  "gene6"  "gene7"  "gene8"  "gene9"

What can we see here? First of all, a cool method to create a vector of words with increasing numbers (the combination of paste0() and 1:20); then, we see that sorting has put “gene2” after all the “gene1X”, because character vectors are sorted alphabetically, one character at a time. For this reason, it is recommended to use 01, 02, 03, etc. if we know that we will have more than 9 elements (this also works for computer file names).
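One possible way to get the zero-padded names is sprintf(), a base R function not covered in this chapter; take this as a sketch (features_padded and shuffled are made-up names):

```r
# %02d pads each number with zeros up to 2 digits
features_padded <- sprintf("gene%02d", 1:20)
head(features_padded, 3)   # "gene01" "gene02" "gene03"

# reverse the vector, then sort it: we get back the expected order
shuffled <- rev(features_padded)
identical(sort(shuffled), features_padded)   # TRUE
```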
Example: here we want to sort the expression levels based on their names.

expr_values[sort(names(expr_values))]

Adcyp1    Bip  Psd95    Sst   Tle4 
    12      1     40    129    200

Tadaaa, we used sort on the names of expression levels and used the sorted names to index the named vector.

Unique values

As the title suggests, there is a function, unique(), that returns the unique values of a vector. It is useful in many situations and we will use it a lot. The usage is simple:

# create a vector with repeated values
my_vector <- c(1:10, 2:4, 3:6) # can you guess the values of this vector without typing it in R?

unique(my_vector)

 [1]  1  2  3  4  5  6  7  8  9 10

Logical vectors sum and mean

In example 8.1 we have seen how to calculate the sum and the mean of numeric vectors, but the same can be done on vectors of booleans, and it can be very useful. I’ll show you this example:

Example 8.2 We have a vector representing the response of different patients to a treatment, coded with logicals. Here is the vector: c(T, F, T, T, T, F, T, F, F, T, F, F, F, F, F, T, T). Calculate the number and the percentage of responders (2 decimal places).

# 1. Create the vector
response <- c(T, F, T, T, T, F, T, F, F, T, F, F, F, F, F, T, T)

# 2. Calculate the number of responders
n_responders <- sum(response)

# 3. Calculate the percentage of responders
p_responders <- mean(response) * 100
p_responders <- round(p_responders, 2)

print(paste("There are", n_responders, "responders, corresponding to", p_responders, "% of total patients."))

[1] "There are 8 responders, corresponding to 47.06 % of total patients."

What happened here? The trick is that R interprets TRUE as 1 and FALSE as 0. Remember this also for future applications.
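The same trick works with any expression that returns booleans, so it can be combined with the comparisons seen before. A minimal sketch on the expression values (the threshold 50 is arbitrary):

```r
expr_values <- c(Adcyp1 = 12, Tle4 = 200, Psd95 = 40, Bip = 1, Sst = 129)

# how many genes have an expression value above 50?
sum(expr_values > 50)    # 2

# which fraction of the genes is that?
mean(expr_values > 50)   # 0.4
```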

Operations between vectors

Don’t give up! I know this chapter has been long, but now we will see the last part: the most important operations we can do between vectors.

Mathematical

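First of all, mathematical operations. An operation between two vectors is performed element by element: the first element of one vector with the first of the other, the second with the second, and so on. For this reason the vectors should have the same length. Be careful: if they do not, R does not raise an error but reuses (“recycles”) the elements of the shorter vector, and it gives a warning only when the longer length is not a multiple of the shorter one.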
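To see how R actually behaves with vectors of equal and different lengths, here is a small sketch (the vector a is made up for the illustration):

```r
a <- c(1, 2, 3, 4)

# same length: element by element
a + c(10, 20, 30, 40)   # 11 22 33 44

# length 1: the single value is reused for every element
a * 2                   # 2 4 6 8

# length 2 divides 4: the shorter vector is silently repeated
a + c(10, 20)           # 11 22 13 24

# length 3 does not divide 4: R still returns a result, but with a warning
a + c(10, 20, 30)       # 11 22 33 14
```

The silent repetition in the third case is the dangerous one, so it is a good habit to check length() before combining vectors.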

Example 8.3 Let’s say we have 3 vectors representing the total amount of amino acids found in three different samples for 5 proteins. We want to calculate, for each protein, the percentage of amino acids found in each sample.

# 1. Define starting vectors
proteins <- c("SEC24C", "BIP", "CD4", "RSPO2", "LDB2")
sample1 <- c(12, 52, 14, 33, 22)
sample2 <- c(5, 69, 26, 45, 3)
sample3 <- c(8, 20, 5, 39, 48)

names(sample1) <- names(sample2) <- names(sample3) <- proteins

# 2. Calculate sum of aa for each protein
sum_aa <- sample1 + sample2 + sample3

# 3. Calculate the fraction for each sample
sample1_fr <- sample1 / sum_aa * 100
sample2_fr <- sample2 / sum_aa * 100
sample3_fr <- sample3 / sum_aa * 100

# 4. Print the results
sample1_fr

  SEC24C      BIP      CD4    RSPO2     LDB2 
48.00000 36.87943 31.11111 28.20513 30.13699

sample2_fr

   SEC24C       BIP       CD4     RSPO2      LDB2 
20.000000 48.936170 57.777778 38.461538  4.109589

sample3_fr

  SEC24C      BIP      CD4    RSPO2     LDB2 
32.00000 14.18440 11.11111 33.33333 65.75342

Ok, I know it seems difficult, but let’s analyze the new steps:

Step 1: the new thing here is that we can assign the same value to multiple variables by chaining assignment statements
Step 2: since the vectors have the same size, we can sum them together. IMPORTANT: the operation is performed based on position, NOT names. So if our vectors had had the names in a different order, we would have had to reorder them first
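The point about position versus names is easy to verify. In this sketch (s1 and s2 are made-up, shortened versions of the samples) the second vector has the same proteins in a different order, and indexing it with the names of the first puts it back in line:

```r
s1 <- c(SEC24C = 12, BIP = 52, CD4 = 14)
s2 <- c(CD4 = 26, SEC24C = 5, BIP = 69)   # same proteins, different order

# wrong: the sum is done by position, so SEC24C is added to CD4
s1 + s2

# right: first put s2 in the order of s1 by indexing with the names
s2_ordered <- s2[names(s1)]
s1 + s2_ordered   # SEC24C 17, BIP 121, CD4 40
```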

Different and common elements

Often we want to compare two vectors to find distinct and common elements (e.g. upregulated genes in two analyses). To do it, we can use two functions: intersect(), which finds the elements common to two vectors, and setdiff(), which returns the elements of the first vector not present in the second.

# 1. Define 2 vectors
upregulated_1 <- c("NCOA1", "CENPO", "ASXL2", "HADHA", "ADGRF3")
upregulated_2 <- c("ADGRF3", "SLC5A6", "NRBP1", "NCOA1", "HADHA")

# 2. Find common genes
common <- intersect(upregulated_1, upregulated_2)

# 3. Find different genes
only_1 <- setdiff(upregulated_1, upregulated_2)
only_2 <- setdiff(upregulated_2, upregulated_1)

# 4. Print results
cat("Common genes are:", paste(common, collapse = ", "), "\n", 
    "Genes specifically upregulated in analysis 1 are:", paste(only_1, collapse = ", "), "\n",
    "Genes specifically upregulated in analysis 2 are:", paste(only_2, collapse = ", "), "\n")

Common genes are: NCOA1, HADHA, ADGRF3 
 Genes specifically upregulated in analysis 1 are: CENPO, ASXL2 
 Genes specifically upregulated in analysis 2 are: SLC5A6, NRBP1 
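To look at paste() with collapse and at the difference between print() and cat() in isolation, here is a minimal sketch (one_string is a made-up name):

```r
common <- c("NCOA1", "HADHA", "ADGRF3")

# without collapse, paste() returns one string per element
paste("gene:", common)

# with collapse, everything is joined into a single string
one_string <- paste(common, collapse = ", ")
one_string   # "NCOA1, HADHA, ADGRF3"

# print() shows \n literally, cat() turns it into a line break
print("first line\nsecond line")
cat("first line\nsecond line\n")
```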

Here you are. We can add three more notions:

cat() is like print(), but it writes the text as it is (no [1] and no quotes) and it interprets special characters. It returns NULL, so wrapping it in print() only adds a NULL line to the output
when pasting a vector, an additional argument collapse = "<chr>" can be added. It tells R to collapse all the elements of the vector into a single character element, separated by the given string (", " for us)
"\n" means new line, so it tells R to print what follows on a new line. It is a special character, so it works with cat()

%in%

A slightly different tool is %in% (strictly speaking it is an operator, not a function). When comparing two vectors, it returns TRUE or FALSE for each element of the first vector based on the presence of that element in the second vector:

upregulated_1 %in% upregulated_2

[1]  TRUE FALSE FALSE  TRUE  TRUE

Since the result is a logical vector, %in% is useful to index a vector based on another vector. We will see some uses of it.
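For instance, a minimal sketch of %in% used inside the square brackets, together with ! (which flips TRUE and FALSE):

```r
upregulated_1 <- c("NCOA1", "CENPO", "ASXL2", "HADHA", "ADGRF3")
upregulated_2 <- c("ADGRF3", "SLC5A6", "NRBP1", "NCOA1", "HADHA")

# keep the elements of the first vector that are found in the second
upregulated_1[upregulated_1 %in% upregulated_2]      # "NCOA1" "HADHA" "ADGRF3"

# ! flips TRUE and FALSE, so this keeps the ones NOT found in the second
upregulated_1[!(upregulated_1 %in% upregulated_2)]   # "CENPO" "ASXL2"
```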

Match

Last but not least, the match() function. It takes two vectors and returns, for each element of the first vector, the position of that element in the second vector. If an element is not present, it returns NA; we will describe this special value in a dedicated chapter.
So, how can it be useful? It is usually used to rearrange and reorder a vector to match another vector. For example, let’s say that two vectors of example 8.3 have names in a different order; before doing any calculation we need to match the order of the names.

# 1. Define starting vectors
proteins1 <- c("SEC24C", "BIP", "CD4", "RSPO2", "LDB2")
proteins2 <- c("CD4", "RSPO2", "BIP", "LDB2", "SEC24C")
sample1 <- c(12, 52, 14, 33, 22)
sample2 <- c(5, 69, 26, 45, 3)

names(sample1) <- proteins1
names(sample2) <- proteins2

sample1

SEC24C    BIP    CD4  RSPO2   LDB2 
    12     52     14     33     22

sample2

   CD4  RSPO2    BIP   LDB2 SEC24C 
     5     69     26     45      3

As we can see, the names are in different order, so we want to fix this:

idx <- match(names(sample1), names(sample2))

idx

[1] 5 3 1 2 4

We can use these indexes to index our sample2.

sample2 <- sample2[idx]

sample1

SEC24C    BIP    CD4  RSPO2   LDB2 
    12     52     14     33     22

sample2

SEC24C    BIP    CD4  RSPO2   LDB2 
     3     26      5     69     45

Now they are in the same order, so we can continue the analysis.
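What about the NA case? Here is a small sketch in which one of the requested names is not present ("ACTB" and the vector wanted are made up for the illustration):

```r
sample2 <- c(CD4 = 5, RSPO2 = 69, BIP = 26, LDB2 = 45, SEC24C = 3)
wanted <- c("BIP", "ACTB", "CD4")   # ACTB is not in sample2

idx <- match(wanted, names(sample2))
idx                         # 3 NA 1

# indexing with NA gives NA, so we drop the missing positions first
sample2[idx[!is.na(idx)]]   # BIP 26, CD4 5
```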

Exercises

Great, let’s do some exercises. They wrap up lots of the concepts we’ve just seen. However, I encourage you to try again every function we have studied so far.
It doesn’t matter if you solve them in a different way, as long as the results are identical. In these chapters I want you to understand the steps, not to write the perfect and most efficient code.

Exercise 8.1 We have received the data of the expression levels (in reads) of some genes of interest. We are interested in the difference between the expression levels of mitochondrial vs non-mitochondrial genes; in particular we want to see how many reads map to each category (both counts and percentage).

The starting vector is the following: c("SEC24C" = 52, "MT-ATP8" = 14, "LDB2" = 22, "MT-CO3" = 16, "MT-ND4" = 2, "NTMT1" = 33, "BIP" = 20, "MT-ND5" = 42)

PS: Mitochondrial gene names start with MT-.

Solution

# 1. Create the vector
expr_levels <- c("SEC24C" = 52, "MT-ATP8" = 14, "LDB2" = 22, "MT-CO3" = 16, "MT-ND4" = 2, "NTMT1" = 33, "BIP" = 20, "MT-ND5" = 42)

# 2. Get names of mitochondrial genes
mito_genes <- grep(pattern = "^MT-", x = names(expr_levels), value = T)

# 3. Calculate total number of counts for each category
total_mito <- sum(expr_levels[mito_genes])
# ! negates the logical vector returned by %in% (a minus sign would not work here)
total_no_mito <- sum(expr_levels[!(names(expr_levels) %in% mito_genes)])

# 4. Calculate %
perc_mito <- round(total_mito / sum(expr_levels) * 100, 2)
perc_no_mito <- round(total_no_mito / sum(expr_levels) * 100, 2)

# 5. Print results
cat("Reads mapping to mitochondrial genes are", total_mito, "(", perc_mito, "%), while the ones mapping to other genes are", total_no_mito, "(", perc_no_mito, "%)")

Reads mapping to mitochondrial genes are 74 ( 36.82 %), while the ones mapping to other genes are 127 ( 63.18 %)
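A useful sanity check for this kind of exercise is that the two categories must add up to the total. Here is a sketch of an alternative solution that uses grepl() instead of grep() and includes the check (is_mito is a made-up name):

```r
expr_levels <- c("SEC24C" = 52, "MT-ATP8" = 14, "LDB2" = 22, "MT-CO3" = 16,
                 "MT-ND4" = 2, "NTMT1" = 33, "BIP" = 20, "MT-ND5" = 42)

# TRUE for mitochondrial genes, FALSE for the others
is_mito <- grepl(pattern = "^MT-", x = names(expr_levels))

total_mito <- sum(expr_levels[is_mito])       # 74
total_no_mito <- sum(expr_levels[!is_mito])   # 127

# the two categories together must give the total number of reads
total_mito + total_no_mito == sum(expr_levels)   # TRUE
```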

Exercise 8.2 You were given the mass spectrometry results of an analysis on 3 patients. These are the results: patient1 c("SEC24C" = 12, "CDH7" = 1, "LDB2" = 13, "SEM3A" = 16, "FEZF2" = 21, "NTMT1" = 43, "BIP" = 29, "HOMER" = 22), patient2 c("SEC24C" = 2, "CDH7" = 11, "SEM5A" = 13, "HCN1" = 22, "NTMT1" = 31, "BIP" = 12, "HOMER" = 8), patient3 c("SEC24B" = 20, "BIP" = 12, "HOMER" = 13, "SEM3A" = 49, "HCN1" = 16, "NTMT1" = 27). Calculate the mean expression of the genes common to all three patients.

Solution

# 1. Create the vectors
patient1 <- c("SEC24C" = 12, "CDH7" = 1, "LDB2" = 13, "SEM3A" = 16, "FEZF2" = 21, "NTMT1" = 43, "BIP" = 29, "HOMER" = 22)
patient2 <- c("SEC24C" = 2, "CDH7" = 11, "SEM5A" = 13, "HCN1" = 22, "NTMT1" = 31, "BIP" = 12, "HOMER" = 8)
patient3 <- c("SEC24B" = 20, "BIP" = 12, "HOMER" = 13, "SEM3A" = 49, "HCN1" = 16, "NTMT1" = 27)

# 2. Identify common genes
common1_2 <- intersect(names(patient1), names(patient2))
common_all <- intersect(common1_2, names(patient3))

# 3. Subset for common genes (indexing by name also puts them in the same order)
patient1_sub <- patient1[common_all]
patient2_sub <- patient2[common_all]
patient3_sub <- patient3[common_all]

# 4. Calculate the mean for each gene
common_mean <- (patient1_sub + patient2_sub + patient3_sub) / 3

common_mean

   NTMT1      BIP    HOMER 
33.66667 17.66667 14.33333

You can see that having the info for each patient in a separate vector is not very handy; for this reason, for expression data we use matrices. We will talk about them in the next chapter.