6 Character
When you analyze a dataset, you don’t have only numbers: you’ll have gene names, protein names, mouse strains and other variables that R calls character (I will alsways call them strings, but remember that for R they are character).
A character can be a single letter, a word or even a whole text. To create a character variable you have to include it in ""
(double quotes) or in ''
(single quotes):
<- "SEC24C"
mychar_d typeof(mychar_d)
[1] "character"
<- 'SEC24C'
mychar_s typeof(mychar_s)
[1] "character"
Even numbers can be considered as character if included by ““:
<- 12
weight_n typeof(weight_n)
[1] "double"
<- "12"
weight_c typeof(weight_c)
[1] "character"
See? Just adding the double/single quotes it changes everything.
IMPORTANT: R interprets everything that is between a pair of single/double quotes as character, so if you forget to close a quote, nothing will work. Moreover, you can’t use a single quote to close a character opened with a double quote and vice versa.
Concatenate strings
Sometimes you want to concatenate different string into one single string, and we can to it with two similar functions: paste()
and paste0()
. The unique difference among them is that in the former you can decide what character to use to separate what you are concatenating (default is a white space), while the latter do not insert any character.
For example, let’s say you have a variable condition
and a variable treatment
, and you want to concatenate them to create the variable sample
; you will do:
<- "control"
condition <- "vector"
treatment
<- paste(condition,
sample
treatment, sep = ".") # default is " "
print(sample)
[1] "control.vector"
<- paste0(condition,
sample0
treatment)
print(sample0)
[1] "controlvector"
You concatenate as many character as you want, just put them all inside that function.
Substring
I have to tell you about substring, but I never used this function in my life. It is used to slice the character and take only a part of it. The function is called substr()
, let’s see how it works:
substr(mychar_s, start = 2, stop = 4)
[1] "EC2"
It needs a start and a stop value, they are both included in the result (in this case, character at position 2, 3 and 4 are sliced).
Remember: in R everything starts at 1, so the first element is at position 1. In other languages it starts from 0, but in R it starts from 1.
Extra funtcion never used but that can be important if we use substr and we don’t know how long is a character is nchar()
nchar(mychar_s)
[1] 6
Let’s say we want to take from position 3 until the end, but we don’t know how long is a character, we can do it with:
substr(mychar_s, start = 3, stop = nchar(mychar_s))
[1] "C24C"
What happened here is that the result of the function nchar(mychar_s)
is used as the value to indicate the stop. We will usually use these method in R, especially when we don’t want to use memory to store a value that is used only once (as in this case).
Substitution
Alright, let’s move to something way more useful for our analysis: the function sub()
, and its big brother gsub()
. They are used to substitute part of a character with another, the only difference is that the former changes only the first occurrence, while the latter every occurrence.
Let’s say we have the sample variable written as "control_vector_3"
and we want to get rid of the underscores, we will do:
<- "control_variable_2"
sample2
<- sub(pattern = "_", # the part of the string to search to substitute
sub_only replacement = " ", # what to use to replace it
x = sample2) # variable in which to search
print(sub_only)
[1] "control variable_2"
<- gsub(pattern = "_", # the part of the string to search to substitute
gsub_all replacement = " ", # what to use to replace it
x = sample2) # variable in which to search
print(gsub_all)
[1] "control variable 2"
This is the basic way of using these functions, but they can be very useful in complex analysis, but we will see them later on.
Grep
Further parents of substitution, are grep()
and its brother grepl()
: they are used to check if a particular pattern is present in a character variable. The first one returns the index (we’ll see this concept in few chapters) or the value of the variable that match the pattern, the second one returns a boolean value (TRUE or FALSE) indicatung the presence of the pattern in the variable.
Let’s see few example right away:
<- "GAPDH"
gene_to_test <- "GA"
pattern_to_check_1 <- "PH"
pattern_to_check_2
<- grep(pattern = pattern_to_check_1,
grep_1 x = gene_to_test,
value = T) # default is FALSE (returns index)
print(grep_1)
[1] "GAPDH"
<- grep(pattern = pattern_to_check_2,
grep_2 x = gene_to_test,
value = T) # default is FALSE (returns index)
print(grep_2)
character(0)
<- grepl(pattern = pattern_to_check_1,
grepl_1 x = gene_to_test)
print(grepl_1)
[1] TRUE
<- grepl(pattern = pattern_to_check_2,
grepl_2 x = gene_to_test)
print(grepl_2)
[1] FALSE
We can see that the grep_2
gives character(0)
as result, meaning that it is an empty character variable.
You are going to use grep a lot during the analysis, even if you find it not so interesting now, trust me 😉.
Transform to type character
As for numbers, we can also convert a variable into a string: we use the function as.character()
. You will use this function when a column of a dataframe is read as numeric while you want it to be read as character instead.
<- 35
ml_to_add typeof(ml_to_add)
[1] "double"
<- as.character(ml_to_add)
ml_to_add typeof(ml_to_add)
[1] "character"
This is an important lesson: in R variables can be overwritten with other types of data (this can’t be done in other languages such as Java, C++, ecc.). This can be both handy and risky at the same time: handy because we can save memory by overwriting useless variables, risky because we can overwrite a variable without checking if the type is maintained (maybe it is important).
Exercises
Exercise 5.2 In a list of genes to test, you found a gene called “NRG_1” and one called “SST R”. In the report you have to present you want to print out “Involved genes are NRG1 and SSTR2”. How to do it?
Solution
<- "NRG_1"
gene1 <- "SST R"
gene2
# correct gene names
<- gsub(pattern = "_", replacement = "", gene1)
gene1 <- gsub(pattern = " R", replacement = " R2", gene1)
gene2
# concatenate all the string
<- paste("Involved genes are", gene1, "and", gene2)
all_string print(all_string)
[1] "Involved genes are NRG1 and NRG1"
Exercise 5.2 In a variable you have the gene name “HCN1” and in another you have its number of aa (890, as numeric!!!). Print out the string “HCN1 protein is 890 aa long”
Solution
<- "HCN1"
gene <- 890
aa_num
# one string solution
<- paste(gene, "protein is", as.character(aa_num), "aa long")
to_print print(to_print)
[1] "HCN1 protein is 890 aa long"
Don’t worry if you had done it in another way or if it didn’t work. We are here to learn, and these exercises are done to challenge you on things you have seen in different pieces.
Head on to the next lesson, in which we are going to talk about booleans.