11 Functions, packages and reproducibility

I’m sorry, I know that last time I told you we would start exploratory data analysis… but there are some crucial concepts that you should know before starting your analysis: functions, packages and reproducibility.

Functions

We have encountered them quite a few times in previous chapters. We know that in R a function is used by writing its name, followed by parentheses containing some values (its arguments).
I’m not here to describe how to write new functions (if you want to know more about it, this is a useful guide), but you have to know that the functions we’ve seen so far (e.g. mean(), max(), median(), length()) are so-called “built-in” functions: they are available as soon as R is installed.
However, there are lots and lots of functions, written by users around the world, that are fundamental for different types of analysis but are not built into R. They are collected in packages.
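
As a quick reminder, this is what calling the built-in functions mentioned above looks like. The vector weights is made up purely for illustration.

```r
# A made-up numeric vector, used only for illustration
weights <- c(12.5, 9.8, 15.2, 11.0)

# Function name, then parentheses containing the values (arguments)
mean(weights)    # average of the values
max(weights)     # largest value
median(weights)  # middle value
length(weights)  # number of elements
```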

Packages

So, you have to imagine a package as a box containing different functions that you want to use.
“Right, and how can I use them?” Great question, let’s see how to install a package.

Install a package

To install a package you have to know in which repository it is stored (imagine a repository as a free App Store/Play Store). Most packages are stored on CRAN (website), while the majority of genomics-related packages are stored on Bioconductor (website).
As we are going to use CRAN packages, here I show you how to install those.
Let’s say we want to install tidyverse (a collection of packages that I love for data analysis). To do so, we use the command install.packages("<name_of_the_package>"):

install.packages("tidyverse")

I strongly suggest that you run this code and follow the instructions that pop up in the console.
The process will take a while and, if everything worked fine, it should print a message like “DONE The downloaded source packages are in….” at the end.
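
If you re-run a script often, you may not want to reinstall a package every time. Here is a minimal sketch of one way to install a package only when it is missing. The helper install_if_missing() is a hypothetical name invented for this example, not a function provided by R.

```r
# Hypothetical helper (not part of R): install a CRAN package
# only if it is not already installed
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded
  if (requireNamespace(pkg, quietly = TRUE)) {
    message(pkg, " is already installed")
    return(invisible(FALSE))
  }
  install.packages(pkg)
  invisible(TRUE)
}

# "stats" ships with R, so this call installs nothing
install_if_missing("stats")
```

In our case you would call install_if_missing("tidyverse"), which triggers the installation only the first time.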

Load a package

Installing a package is not sufficient to be able to use its functions: you have to load it in every session in which you want to use it. I know it sounds like a big deal, but it’s easier than it seems: at the beginning of your script, write library(<name_of_the_package>) for each package you want to load.
For example, in our case we will write:

library(tidyverse)

Now that we have loaded the package, we are able to use its functions.
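
To see the difference that loading makes, here is a small sketch using tools, a package that ships with R but is not loaded by default (so there is nothing to install). The file name is made up. The package::function notation used in the first call is an addition of mine: it lets you use a single function from an installed package without loading the whole package.

```r
# Before loading, a function can still be reached with package::function
tools::file_ext("results.csv")

# Load the package for the current session
library(tools)

# After library(), the function can be called directly by name
file_ext("results.csv")
```

Both calls return the file extension of the (made-up) file name.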

Reproducibility

Here we are at the most important part of this chapter: data reproducibility. I have told you only a few things about functions and packages, I know… but those concepts are important to understand how data reproducibility works.

You know that reproducibility is a key aspect of every experiment and analysis. When you are analyzing data with R, a few things are mandatory for reproducibility:

  1. Write down every step and all the code you run
  2. Use set.seed() for randomization steps
  3. Use the same version of R and of the packages
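
As a short refresher on the second point, this sketch shows that resetting the seed makes a random draw repeatable. The seed value 123 and the variable names are arbitrary choices for this example.

```r
# Set the seed, then draw 5 random numbers between 1 and 100
set.seed(123)
first_draw <- sample(1:100, 5)

# Reset the same seed and draw again
set.seed(123)
second_draw <- sample(1:100, 5)

# The two draws are exactly the same
identical(first_draw, second_draw)
```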

As we have already seen the first two points, we will now discuss the third one.
When you install a package, you are installing a certain version of it. Over time, packages change: new functionality is added, bugs are fixed and so on. For this reason, the results of an analysis done with version 1.0 of a package may differ from those obtained with version 5.2… this should not be the case, but sometimes it happens because the behavior of the same function may change slightly between versions.
So, how can we control it?

Control package versions

To check which version of a package you have installed you can use the command packageVersion("<name_of_the_package>"). For example:

packageVersion("tidyverse")
[1] '2.0.0'

I have installed version 2.0.0 of tidyverse. DON’T worry if your version is not the same as mine (I know it sounds contradictory, but here we are just explaining things, so you should stick with your own version).
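
The value returned by packageVersion() can also be compared with a version written as text, which is handy when you want to check a version inside a script. Below is a small sketch. I use stats here only because it ships with R, so the example runs everywhere, and the version strings are arbitrary thresholds chosen for illustration.

```r
# getRversion() returns the version of R itself
r_version <- getRversion()
print(r_version)

# "stats" ships with R; replace it with the package you care about
pkg_version <- packageVersion("stats")
print(pkg_version)

# Version objects can be compared with version strings
pkg_version >= "1.0.0"
r_version >= "4.1.0"
```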

Session info

Another way is to look at sessionInfo(), which returns the version of R and all the packages loaded in the current session (remember, a session starts when you start R and ends when you exit or restart R).

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.7

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dunn.test_1.3.5 glue_1.6.2      gginnards_0.1.2 vcd_1.4-9       ggmosaic_0.3.3  ggpubr_0.4.0   
 [7] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1     readr_2.1.4    
[13] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] httr_1.4.6         sass_0.4.0         jsonlite_1.8.4     viridisLite_0.4.0  carData_3.0-4     
 [6] bslib_0.3.1        highr_0.9          yaml_2.2.1         ggrepel_0.9.1      pillar_1.9.0      
[11] backports_1.3.0    lattice_0.20-45    digest_0.6.28      RColorBrewer_1.1-2 ggsignif_0.6.3    
[16] colorspace_2.0-2   productplots_0.1.1 cowplot_1.1.1      htmltools_0.5.2    plyr_1.8.6        
[21] pkgconfig_2.0.3    broom_1.0.5        bookdown_0.24      scales_1.2.1       tzdb_0.2.0        
[26] timechange_0.2.0   generics_0.1.1     farver_2.1.0       car_3.0-12         withr_2.5.0       
[31] lazyeval_0.2.2     cli_3.6.1          magrittr_2.0.3     evaluate_0.14      fansi_0.5.0       
[36] MASS_7.3-54        rstatix_0.7.0      tools_4.1.0        data.table_1.14.2  hms_1.1.3         
[41] lifecycle_1.0.3    plotly_4.10.0      munsell_0.5.0      compiler_4.1.0     jquerylib_0.1.4   
[46] rlang_1.1.1        rstudioapi_0.15.0  htmlwidgets_1.5.4  labeling_0.4.2     rmarkdown_2.11    
[51] gtable_0.3.0       abind_1.4-5        R6_2.5.1           gridExtra_2.3      zoo_1.8-9         
[56] knitr_1.36         fastmap_1.1.0      utf8_1.2.2         stringi_1.7.5      Rcpp_1.0.7        
[61] vctrs_0.6.2        tidyselect_1.2.0   xfun_0.28          lmtest_0.9-39     

This output reports the version of R and of all the packages loaded during this session.
You should run this command at the end of all your analyses, especially if you are using Markdown documents (we will see them soon, very soon), and provide its output when you share the analysis with someone else (or publish a paper with an analysis performed in R). In fact, in the “Materials and Methods” section of a paper, you should report the version of R and of the packages used for the analyses.
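
If you want to keep the session information next to your results, one possible approach is to save it to a text file. This is only a sketch: the file name session_info.txt is an arbitrary choice for this example.

```r
# File name is an arbitrary choice for this example
out_file <- "session_info.txt"

# capture.output() collects what sessionInfo() would print,
# one string per line
info_lines <- capture.output(sessionInfo())
writeLines(info_lines, out_file)

# R.version.string holds a one-line description of the R version
R.version.string
```

The saved file can then be shared together with your script.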

IMPORTANT: for all these reasons, you should NOT upgrade R or any package you are using for an analysis, even if R asks you to update them while you are installing other packages. There are better ways to control package versions, but they are outside the scope of this book (if you are interested, go and learn about conda environments here).

So, with these concepts in mind, let’s start our first data analysis in R in the next chapter.
