System analysis: Learning R

Author

David Kneis (firstname.lastname @ tu-dresden.de)

Published

November 10, 2023


1 Motivations to learn R

  • R was originally designed for statistical computing. It allows you to analyze and visualize data in a repeatable manner. That is, you can re-do all steps of the analysis on a single key press. Hence, you can instantly respond to data updates and bug-fixes.

  • An R script is also a documentation of your analysis workflow. This is in contrast to a click-based approach, where you have to take notes (or rely on your ability to remember…).

  • Beyond doing statistics and visualization, R can also be used for process-based modeling in the hydrological and biological context, for example. This is a great means to develop understanding of systems and to facilitate experiment design through pre-simulation.

  • Knowledge about R (and programming in general) can be considered as a partial insurance against unemployment. Data analysts are wanted everywhere.

  • R is free software and there is proper official documentation. There is also good support on the web through a large user community. There is even a conference called “useR”.

  • There exists a wealth of add-on packages and you can contribute yourself. All of these packages are well organized and documented.

  • Of course, there are alternatives to R. Python is nowadays probably the closest competitor which comes with specific pros and cons. If you are proficient in python, stick to it.

2 Working comfortably

2.1 Development environments for R

To work with R productively, you want to use at least a powerful text editor that comes with syntax highlighting and features for code execution. A lightweight solution on Linux is, for example, the geany editor. However, there is nowadays one software that most people use: Rstudio. Rstudio allows you to edit R code, watch generated output and graphics, and access the help system in a single graphical user interface.

Note: You can still run R without any particular development environment on top in a non-interactive mode. To learn about this option, try Rscript --help at a shell terminal.

2.2 Basic use of Rstudio

  1. Create an empty text file at a convenient place (e.g. your personal folder on the local computer or your personal network drive). To do so, use the Rstudio menu (file -> new R script) followed by (file -> save as). If you have a good file manager, you could also create a plain text file from scratch and then open the latter in Rstudio. By convention, the file name extension should be either “.R” or “.r”.

  2. Populate the file with R commands. For the purpose of testing, you could put the following two statements supposed to yield information about your computer’s operating system, the user, and the current time. Hint: If you want to learn more about these or other statements, put the cursor within the name of a function, e.g. print(), and press the F1 key.

print(Sys.info()["sysname"])
print(Sys.info()["user"])
print(Sys.time())
  3. Next, you want to “run” the script, i.e. you want the statements to be executed. There are two options:

The run option: Place the cursor on a statement (e.g. the first one) and press the “run” button. This will execute the particular statement. It will also advance the cursor such that, on a repeated push of the button, the next statement will be executed. In that way you could go through all the statements step by step, one after another.

The source option: I personally hardly use the “run” button to execute individual statements. I always use the “source” button instead (or the respective combination of keys). As opposed to “run”, the “source” button causes execution of all statements in the script in a single sweep.

In any case, you should obtain output similar to the one displayed below in the console window.

sysname 
"Linux" 
    user 
"dkneis" 
[1] "2023-11-10 11:36:06 CET"

The reason I prefer “source” over “run”: In a script, the successful execution of a statement often depends on the completion of preceding statements (found closer to the top of the file). Thus, if you want to execute a particular statement, you actually want to execute all statements before this one too. With “source” that is what happens on a single button hit. But what if the statement of interest is somewhere in the middle of a script and you really want to run the script up to this line only? The simple solution: Put a call to the stop() function just below that line. The execution triggered with “source” will end at the desired line.
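As a minimal sketch of this trick (in a real script you would place a bare stop() call right after the last statement of interest; the tryCatch() wrapper is only here so that the example itself finishes cleanly):

```r
ran <- character(0)
tryCatch({
  ran <- c(ran, "first")                 # executed on "source"
  stop("end of execution requested")     # execution stops here
  ran <- c(ran, "second")                # never reached
}, error = function(e) message("stopped: ", conditionMessage(e)))
print(ran)                               # only "first"
```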

3 Basics of the language

Note the official documentation provided at the R-project website.

3.1 Overview on language elements

To work with R, we primarily have to learn about the basic building blocks of the language. The table below provides an overview.

Language element Relevance
Variables Used to store data under a name (a.k.a. identifier).
Operators Allow, e.g., mathematical and logical operations on variables.
Functions Transform input information in desired output. Can be very simple but also do very complex tasks. Can be ready-made or self-designed. Functions are the key to productive work.
Flow control structures Allow for iteration and branching. Mostly needed for programming of low-level algorithms.
Comments Allow basic documentation of the code. Essential, if you give your code to someone else or look at your own scripts later.

The names of variables and functions need to follow a convention. Basically, identifiers should start with a letter. The initial letter is followed by more letters, digits, underscores (_), or periods (.). Names are case sensitive!

Generally, you want to use short names on the one hand (less typing). At the same time, you want to use names that let you and others easily recognize the content of variables or the meaning of functions. People tend to use just x as a name for variables of limited lifetime.
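For illustration of the naming rules (the identifiers are made up):

```r
water_temp <- 20.5       # valid identifier
water.temp2 <- 21.0      # periods and digits are fine after the initial letter
exists("Water_temp")     # FALSE, because names are case sensitive
```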

3.2 Variables: Types, shapes, and subsetting

3.2.1 Overview

Variables store data. First of all, we should distinguish between data types.

Type Variable contains Example
numeric Numbers 1e+8
integer Integer numbers 100
logical Binary information TRUE, FALSE
character Character strings “Lake Constance”


In addition, R variables can contain some special values. E.g. NA indicates missing values, Inf indicates an infinite number, and NULL is an indicator for “empty”.
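A few lines to see these special values in action:

```r
x <- c(2.5, NA, 0)     # a vector with a missing value
is.na(x)               # FALSE TRUE FALSE
1/0                    # Inf
is.null(NULL)          # TRUE
```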

We need to further distinguish between:

  • scalar variables which hold a single value of a particular type
  • vectors and arrays which can hold multiple values, all of the same type
  • lists, which combine multiple variables of different types in a single compound variable

3.2.2 Scalar variables

We start with some assignment operations to scalar variables. Using <- instead of = for assignments is recommended.

x <- 3.1415          # numeric
x <- "E. coli"       # character (demonstrates overwriting of previous value and type)
x <- TRUE            # logical

Utility functions to inspect the contents of simple variables include:

print(x)            # print value
typeof(x)           # returns the data type
is.character(x)     # allows checking for a data type (is.numeric, is.character, ...)

3.2.3 Vectors

Besides scalar variables, we often deal with vectors. Understanding the use of vectors is crucial because they play a central role in tabular data. A table column is essentially a vector. Here are some common vector constructor statements:

x <- seq(from=0, to=2, by=0.25)       # numeric sequence
x <- 1:5                              # integer sequence
x <- rep(0, times=3)                  # replication
x <- c("Athens", "Paris", "Rome")     # vector of strings
x <- runif(5)                         # 5 random numbers in range 0...1

R allows the individual elements of a vector to be named. This is very useful because it makes access to individual elements easier and safer. Besides the examples below, you may consider the convenience function setNames() from the default package stats.

x <- c(red=255, green=0, blue=128)    # using the constructor

x <- c(255, 0, 128)                   # or by assigning names later which is
names(x) <- c("red","green","blue")   # also the key to renaming

To inspect the contents of a vector-valued variable, the following functions are often used.

length(x)                             # returns the number of elements
names(x)                              # to retrieve element names
print(head(x))                        # prints just the first few elements
print(tail(x))                        # as the name says
unique(x)                             # collapse to different values only
length(unique(x))                     # nested call; how many different values?

When working with vectors, we must have a means to access the values of individual elements or ranges of elements. This is known as “subsetting”. R allows for subsetting in three ways as demonstrated below. All variants use brackets [] as the subsetting operator.

Subsetting by position (using the element index) is simple and fast:

x[1]                    # 1st element
x[1:2]                  # using a vector of positions

If vector elements are named, we can subset by name. This is often safer than using element indices. Reason: If vector elements are added and/or dropped, element positions change while element names still remain valid.

x["red"]                # single element
x[c("blue","red")]      # using a vector of names

Finally, we can use a logical mask to subset a vector. Only elements where the mask vector is TRUE will be retrieved.

mask <- x > 0           # a logical mask vector
x[mask]                 # and its use

x[x > 0]                # same as above
x[which(x > 0)]         # still the same

3.2.4 Matrices

So far we looked at one-dimensional vectors. In mathematics we often need matrices as well. The latter can be regarded as vectors with two dimensions. Many of the principles we learnt for vectors equally apply to matrices.

A matrix can be constructed from a vector by breaking it at regular intervals into rows and columns.

v <- 1:6
x <- matrix(v, nrow=2, byrow=FALSE)
x <- matrix(v, ncol=3, byrow=TRUE)

Another common way of matrix construction is to glue several vectors together, each representing a column. It is possible and often useful to name the columns.

x <- cbind(
  temp= c(0, 10, 20),           # 1st column
  rate= c(0, 0.2, 0.4)          # 2nd column
)         

Similarly, we can concatenate vectors row-wise to obtain a matrix.

x <- rbind(
  tap=   c(NO3= 0.5, NH4=  0, O2=12),    # 1st row
  river= c(NO3=   5, NH4=  1, O2=10),
  WWTP=  c(NO3=  10, NH4= 10, O2= 3)
)

Like for vectors, there are utility functions to query the dimensions and to retrieve or set the names of rows and columns.

ncol(x)            # number of columns
nrow(x)
dim(x)             # returns the lengths along both dimensions
colnames(x)        # get or set column names
rownames(x)

Because of the two dimensions, we need to specify both rows and columns to retrieve the elements of a matrix. Like in vector subsetting, we can either specify the desired elements by position, by name (if named), or using logical masks. Again, we use brackets [] for subsetting, but now with two dimensions. Rows are considered as the first dimension, columns go second.

x[1, ]                 # 1st row (vector)
x[, 1]                 # 1st column (vector)
x[1, 1]                # top left element (scalar)
x[1:2, 2:ncol(x)]      # sub-matrix
x[x[,"O2"] > 5, "O2"]  # subset by name / logical mask

Note: If a matrix is subset to yield just a single row or column, R will return a vector by default (and not a matrix with just one column or row). While this is sometimes desired, it often leads to issues in R scripts as demonstrated below.

x <- matrix((1:9)/10, ncol=3)   # 3 x 3 matrix
z <- x[,1:2]                    # select 2 columns
dim(z)                          # still a matrix
z <- x[,1]                      # select just one column
dim(z)                          # not a matrix anymore
z <- x[,1,drop=FALSE]           # like above, but request dimension to be kept
dim(z)                          # still a matrix

Below are examples for very common operations on matrices.

x <- matrix((1:9)/10, ncol=3)   # 3 x 3 matrix

rowSums(x)                      # computes row sums
apply(x, 1, sum)                # does the same in a more generic way
apply(x, 1, max)                # here we changed the summary function

colSums(x)                      # equivalent examples for columns
apply(x, 2, sum)                
apply(x, 2, max)                

3.2.5 Plain lists

As opposed to vectors and matrices, a list can store multiple values of different data types. Because lists can be nested, they can be used to represent arbitrarily complex data structures.

x <- list(              # creates a list
  speciesID=1,
  name="E. coli",
  rodShaped=TRUE
)

The operator to access elements of a list is the dollar sign $. However, like for named vectors, brackets [] are useful for subsetting; note that single brackets return a (sub-)list while double brackets [[]] extract the element itself.

x$speciesID
x["speciesID"]
x[["speciesID"]]
x[[1]]

3.2.6 Data frames (tabular data)

Most of the time, we want to work with tabular data. Traditionally, R has a special type for storing such data known as data.frame. In the statement below, we construct a data frame with information on chemical elements.

x <- data.frame(
  element= c("C", "Si", "N", "P"),
  group=   c(4, 4, 5, 5),
  mass=    c(12.01, 28.09, 14.01, 30.97)
)

The example illustrates that, generally, each table column can be of a different data type. At the same time, within a column, the data type is consistent. Hence, a table (and thus a data frame) is actually a list of vectors, all being of identical length.
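This claim can be checked directly (the data frame is re-created here so that the snippet is self-contained):

```r
x <- data.frame(                       # same table as above
  element= c("C", "Si", "N", "P"),
  group=   c(4, 4, 5, 5),
  mass=    c(12.01, 28.09, 14.01, 30.97)
)
is.list(x)             # TRUE: a data frame is a list ...
sapply(x, length)      # ... of column vectors, all of length 4 here
sapply(x, typeof)      # each column can have its own type
```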

At the same time, tables consist of rows and columns just like matrices and, consequently, many of the operations for matrices are equally applicable to data frames:

ncol(x)       # also 'nrow(x)'
colnames(x)   # also 'rownames(x)'
x[ ,3]        # last column, access by index
x[ ,ncol(x)]  #   as above
x[ ,"mass"]   # last column, access by name

The difference to a matrix becomes obvious when we retrieve information that involves multiple columns (and thus potentially different data types).

x[1, ]           # first row; still a data frame (i.e. a list), not a vector!
unlist(x[1,])    # we enforce to obtain a vector; what is the data type now?

Because data frames are lists behind the scenes, the usual techniques for lists are perfectly applicable as well.

x$element     # extract column using list syntax
names(x)      # same as 'colnames(x)'

There is much more to learn about data frames and the work with tabular data in general, namely about import, export, reshaping, merging. A dedicated section on that topic follows a little later.

3.2.7 Understanding the contents of large and complex variables

Many objects we deal with are just too large and/or too complex to be reasonably output using print(). Below are some suggestions that work in practice.

print(head(x))      # print just the first or last few elements; works well ...
print(tail(x))      # ... for vectors, matrices, and data frames

str(x)              # display the structure; very useful for large lists

lapply(x, typeof)   # query the type for all elements of a list; very handy for
                    # understanding the data types of table columns

3.3 Mathematical operations

Simple math operations use the common operators: + - * / ^.

Important: If either of the operands is a vector, the result will also be a vector. If both operands are vectors (of equal length), the operation is executed on corresponding elements.

x <- 1:10
x * 2     # each element multiplied with scalar
x * x     # operands of same length: works element-wise

Math operations can yield invalid results. There are dedicated functions to identify those cases, for example:

is.finite(1 / 0)
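A few more checks along the same lines:

```r
is.nan(0/0)          # TRUE: 0/0 yields NaN ("not a number")
is.infinite(-1/0)    # TRUE
is.na(NA + 1)        # TRUE: computing with NA yields NA again
```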

For multiplying matrices, there are two cases whose results are totally different.

# this is just a standard multiplication of corresponding elements
values <- matrix(1:6, ncol=3)
weights <- matrix(runif(6), ncol=3)
values * weights
          [,1]     [,2]      [,3]
[1,] 0.3455791 1.558814 0.4336293
[2,] 0.7035881 3.898402 2.4582019
# this is a matrix multiplication in the mathematical sense using a designated operator
x <- matrix(1:6, ncol=3)
y <- matrix(1:6, ncol=2, byrow=TRUE)
x %*% y
     [,1] [,2]
[1,]   35   44
[2,]   44   56

3.4 Comparisons and logical operators

R provides a set of commonly used operators for comparisons like >, <, >=, <=. It is tempting and often possible to use == and != to test for equality and inequality. However, comparisons sometimes need careful consideration:

  • Be careful with comparisons involving floating point numbers. While some numbers can be represented by a computer exactly, others can not.

  • Comparing variables that differ in type (e.g. numeric vs. integer) is bad style at least and may give unexpected results at worst.

  • Using the designated function identical() for equality testing prevents many of the potential issues.

x <- runif(10)
x <= 0.5             # comparison for each element
!(x > 0.5)           # negation, same as 'x <= 0.5'

identical(2, 10/5)   # a safe test for exact identity
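The floating point caveat mentioned above can be demonstrated in two lines; all.equal() compares with a tolerance and is often the better choice for numeric data:

```r
0.1 + 0.2 == 0.3                     # FALSE due to floating point representation
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: comparison with a tolerance
```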

There are further operators to formulate logical AND as well as OR statements. Like mathematical operators, they compare corresponding elements if both operands are vectors.

x <- c(TRUE, FALSE)
y <- c(FALSE, TRUE)
x & y                 # element wise AND
x | y                 # element wise OR

Quite often you want to know whether any or all values in a logical vector are TRUE. Then, these functions come in handy:

all(x)                  # are all elements TRUE?
any(x)                  # is at least one element TRUE?

3.5 String operations

Doing operations on character strings is a common task. Quite often, we want to concatenate strings using paste() or truncate strings using substr().

x <- data.frame(
  city = c("Vienna", "Paris", "Oslo"),
  country= c("Austria", "France", "Norway")
)

# first letter of each country name
substr(x[,"country"], 1, 1)

# concatenate corresponding elements
paste(x[,"city"], x[,"country"], sep=" is the capital of ")

# glue together everything
paste(x[,"city"], x[,"country"], sep=" is the capital of ", collapse=", ")

Another example involving numbers and some formatting:

x <- sample(x=1:10, size=3)

paste("The mean of",paste(x, collapse=", "),"is",signif(mean(x),3))

There are many further operations on strings and one can do very useful magic if one knows how to use regular expressions. However, this is a complex expert-level topic. If you are interested, I recommend looking at the help pages of grepl() and gsub().

# find molecules containing carbon
x <- c("H2O", "CO2", "NaCl", "CH4", "C2H2", "CrO")
grepl(x, pattern="C[^a-z]")       

3.6 Control structures

3.6.1 Branching

In a script, you often want to execute statements based on a condition. The outcome of testing for a condition may be binary like in the following example.

x <- runif(1)
if (x > 0.5) {         # logical statement
  print("x > 1/2")     # executed if TRUE
}                      # { } defines a block of statements

But one can also test for more than a single alternative.

x <- 2 * runif(1) - 1  # a random number in range [-1,1]
if (x > 0) {
  print("positive")
} else if (x < 0) {    # explicit alternative
  print("negative")        
} else {               # if all explicit tests failed (i.e. returned FALSE)
  print("zero")        # very unlikely to be printed ever
}

3.6.2 Iteration

Iteration allows for repeated evaluation of code blocks and the corresponding language constructs are typically called “loops”. Models to predict system dynamics, for example, typically contain a time loop to iterate over hours or days of the forecasting period. In numerical computations, iteration is also necessary to gradually approach a solution of adequate precision.

The simplest loop uses the for statement to advance a counter or iterate over the elements of a vector, for example.

# example of a "time loop"
time_step <- 24
number_of_days <- 14
for (i in 1:number_of_days) {
  now <- (i-1) * time_step
  print(now)
  # the computation of results for current time interval would go here
}

# classical iteration over the elements of a vector
x <- 1:8
for (i in seq_along(x)) {   # using an integer as the iteration variable
  print(x[i])
}

R also provides other types of loops, namely through the while statement, but they are used less often.
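For completeness, here is a small sketch of a while loop, which runs as long as its condition evaluates to TRUE:

```r
# how many times can we halve 1 before falling below 0.01?
x <- 1
n <- 0
while (x >= 0.01) {    # the condition is checked before every pass
  x <- x / 2
  n <- n + 1
}
print(n)               # 7
```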

3.7 Functions

So far, we have already used many of the built-in functions of R’s base package. To be really productive, you need to know how to write custom functions yourself. Only by using functions can you write high-quality scripts which are transparent, i.e. easy to understand. Functions make essential parts of your code re-usable because, once implemented, a function can be applied in many scripts. Writing transparent and reusable code should be your goal.

Functions always consist of two parts:

  • the function interface specifies the name of the function and, within parentheses, a list of formal arguments (if needed). The latter provide a means to pass values into the function when it is called.

  • the function body is the part where input data are actually processed. It is enclosed in curly braces {}. The very last statement of the function body determines the return value.

Let’s write a function that multiplies two numeric vectors. For better readability, we use more line breaks than usual.

multiply <- function (             # argument list starts at opening "("
  x, y                             # allow for two arguments                                 
) {                                # function body starts at opening "{"
  stopifnot(is.numeric(x))         # sanity checks on arguments
  stopifnot(is.numeric(y))
  x * y                            # final statement defines the return value
}

Now that the function has been defined, we can call it. Calling a function means passing actual values to its formal arguments in order to retrieve the return value.

x <- multiply(1:4, 2:5)

In the call above, the actual arguments were passed to the formal arguments by position. As things get more complicated, passing arguments by name is a safer alternative. When names are used, arguments can be passed in any convenient order.

x <- multiply(x=1:4, y=2:5)
x <- multiply(y=2:5, x=1:4)

In good programming style, statements in the function body exclusively operate on the values that were passed as arguments. However, this is not really enforced by the R language and it is thus possible to write quite unsafe function code that relies on the existence of global variables. Clear recommendation: Don’t do that. Such functions are not reusable and problems created by their use are hard to trace.

dont <- function (x) {  # R will try hard to find a value for variable 'a'
  x * a                 # somewhere in a parent environment
}
a <- 2
dont(3)                 # works
rm(a)
dont(3)                 # no longer works

4 R topics of primary relevance

4.1 Using add-on packages

A lot of additional functionality is provided through add-on packages. There are several thousand such packages and most of them are held in a central repository. Installing packages from that repository is straightforward:

install.packages("somePackage")     # not meant to be run; package name is fake

R packages are also developed and stored in other places like, e.g., on GitHub. With minor effort, it is possible to install packages directly from those external repositories (e.g. using the install_github() function from the devtools package). However, you should be aware of the risk associated with installing software from arbitrary sources. In case of doubt, don’t install!

In order to use functionality from an installed package, you can load the package like so:

require("somePackage")              # package must be installed for this to work

However, loading is not always necessary. In fact, you can often just specify the name of the (installed) package containing the function of interest as demonstrated below. This is good style anyway, as it clearly indicates “this particular function is provided by an add-on package”.

somePackage::someFunction()         # calls the function without loading package

4.2 Import of tabular data

4.2.1 Plain text vs. spreadsheets

Tabular data is what we work with most of the time. If you think of tables stored on a computer, you probably think of spreadsheets. Of course, R can read spreadsheet data in common native formats (e.g. “.ods” files in case of LibreOffice or “.xlsx”, if you use what most people use). Consider looking at packages like readODS or readxl to import spreadsheets directly.

Nevertheless, it is a good idea to be familiar with the one, most versatile function to import tabular data called read.table. It is made to process a wide range of tables stored as delimited text. The latter term refers to tables stored as plain text where the columns are indicated by a particular character (the delimiter). The most commonly used delimiters you will encounter are the semicolon and the tab character (a stretchable white space, generally encoded as \t). Below is an example of delimited text.

Order;Species;OptimumTemperature
Enterobacterales;Escherichia coli;40
Enterobacterales;Klebsiella pneumoniae;37
Pseudomonadales;Pseudomonas putida;30

Here are some reasons to prefer delimited text files over native spreadsheets:

  • Can be opened on any computer system with any text editor.

  • Plain text files are very unlikely to be infected by computer viruses.

  • Can be produced by and imported into any spreadsheet software and many other programs (e.g. geographical information systems or data bases).

  • If the tab character is used as the delimiter (which is what I recommend), you can transfer data between text files and spreadsheets using just copy and paste.

The main limitations of delimited text in comparison to spreadsheets: you cannot store formatting, and you can only store values (not the formulas behind them). But since you want to implement your formulas in R anyway, who cares?

4.2.2 A solution for most situations

Here is an example of using read.table to import the bacteria data set displayed above. Besides the name/path of the file, we typically want to specify the delimiter (via the sep argument) at least. Moreover, we want to make sure that the first line of the file is interpreted as column names (rather than data). This is what we achieve with header=TRUE.

x <- read.table(file="bacteria.txt", sep=";", header=TRUE)

Beginners often struggle to provide read.table with a proper file name/path, with the consequence that R cannot find the data set. Section 5.1 addresses typical issues and solutions.

The read.table function allows for many more arguments to accommodate special situations. For example, you can directly import data from a web page. Look at the respective help page, if needed. Below, I illustrate just one special use of read.table: the case where you want to store a small tabular data set in the R script itself.

myText <- "
longName;shortName
Escherichia coli;E. coli
Klebsiella pneumoniae;K. pneu.
Pseudomonas putida;P. putida
"
myTable <- read.table(sep=";", header=TRUE, text=myText)

4.3 Table layouts: Long vs. wide

It is possible to store a particular data set in tables of different layout. Most typically, one can choose between a long table with only a few columns (“long format”) and a table with fewer rows but many columns (“wide format”). As an example, consider a small data set of water quality variables measured at the inlet and outlet of a reservoir.

This piece of R code would read the data in wide format:

wide <- read.table(header=T, sep="", text='
site       O2   Temperature   Chlorophyll
influent  8.9          15.5           4.0
effluent  8.7          19.3          40.0
')

while this imports the data in long format:

long <- read.table(header=T, sep="", text='
site         variable  value
influent           O2    8.9
influent  Temperature   15.5
influent  Chlorophyll    4.0
effluent           O2    8.7
effluent  Temperature   19.3
effluent  Chlorophyll   40.0
')

So which of the formats is better? Well, it depends! I have collected a few criteria that may help with the decision in Table 1. In case of doubt, you probably want to choose the long format. This is almost always the case if data were measured with multiple levels of replication (e.g. in space and time). For instance, in this example, the wide format would be difficult to use if measurements were taken at different dates too. If one stuck to the wide format, it would be necessary to introduce a lot of new columns (e.g. “O2_Jan01”, “O2_Jan02”, …) with specially constructed names. In the long format, one would simply add a “date” column - that’s it.

Table 1: Pros and cons of the wide and long format for storage of tabular data.
Criterion Wide format Long format
Each value is connected to more than one ID variable (e.g. date, time, replicate) Difficult, needs convention for composite column names Perfect
I need to type in data manually Suitable May be less convenient
My data set has a lot missing values Waste of storage space Perfect
My data come from a data base or I am building one Not recommended Perfect

At some point, you will need to convert data between the two table layouts. This is easily done with R. The main difficulty is that, at the time of this writing, there are many alternative functions for this particular job (Table 2) and the ones shipped with base R are surprisingly ugly to use. My current recommendation is to use the data.table package.

Table 2: Some options to transform between wide and long format.
Package Wide to long Long to wide Comments
none reshape reshape Puzzling documentation, difficult to remember proper use
data.table melt dcast Needs just this package without further dependencies. Fast to install and update.
reshape2 melt dcast Deprecated, use data.table instead.
tidyr pivot_longer pivot_wider This package depends on many other heavy packages. If you need tidyr anyway, go with these functions.

Finally, here is a quick demonstration of how to perform the transformations using dcast and melt from the data.table package. Check the respective help pages to learn about details and further options.

library("data.table")
wide2 <- data.table::dcast(data=as.data.table(long),   # long --> wide
  formula=site ~ variable, value.var="value")
long2 <- data.table::melt(data=as.data.table(wide),    # wide --> long
  id.vars="site")

4.4 Merging tables

Contents to be added later. I generally use and recommend the merge function to perform a table join based on a common column. Nowadays, many other packages provide the same capabilities under different names.
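Until that section exists, here is a minimal sketch of merge using made-up example tables:

```r
a <- data.frame(id=c(1, 2, 3), temp=c(10, 12, 9))   # made-up tables sharing
b <- data.frame(id=c(2, 3, 4), flow=c(5, 7, 6))     # the common column 'id'

merge(a, b, by="id")                # inner join: only ids 2 and 3 remain
merge(a, b, by="id", all=TRUE)      # full outer join: NA where data are missing
```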

4.5 Operations on table slices

Contents to be added later. In the meantime, have a look at the functions tapply, aggregate, or by unless you work with pipes in R.
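To give a first impression, here is a small sketch with made-up data; both calls compute a group-wise mean:

```r
x <- data.frame(
  site=  c("influent", "influent", "effluent", "effluent"),
  value= c(1, 3, 10, 20)
)
tapply(x$value, x$site, mean)                # named vector of group means
aggregate(value ~ site, data=x, FUN=mean)    # same result as a data frame
```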

4.6 Plotting

4.6.1 General options for plotting in R

Visualizing data is probably one of the main reasons to use R. Many add-on packages specialized in visualization have been developed. ggplot is nowadays the most widely used one and if you search the web, people will likely suggest ggplot-based solutions. Such add-on packages provide high-level functions that offer quick solutions to standard plotting tasks. However, I feel that the difficulties start as soon as you want to customize layouts and styles.

It is good to know that all essential plotting functions are built into R’s base and graphics packages, both of which are installed and loaded by default. These two packages provide all the low-level building blocks to create and customize plots, from default scatter plots to arbitrarily complex graphics. In order to apply those functions, you need to learn how they work. However, once you have become familiar with base and graphics, you will be able to visualize whatever you like in the way you want. I personally stick to that approach whenever possible. Consequently, this section is targeted at important functions in base and graphics.

4.6.2 The default interface for x-y plotting

Consider the little time series of pH and dissolved oxygen in a lake.

quality <- "
hour   pH    O2
   7  7.0   8.5
   9  7.2   9.2
  11  7.9  10.1
  13  8.2  10.8
  15  8.4  11.4
"
x <- read.table(sep="", header=TRUE, text=quality)

Let’s start with plotting the dynamics of pH. Following the usual convention, we put the time on the x-axis. Here is what I would typically use:

plot(x[,"hour"], x[,"pH"])

Note that exactly the same could be achieved using either the list-based syntax for data frames, or the with function for simplified access to the columns, or even the formula interface of plot as illustrated below.

plot(x$hour, x$pH)            # does the same as the line above
with(x, plot(hour, pH))       # another possibility
plot(pH ~ hour, data=x)       # yet another possibility

If you feel that the dots should be connected by a line, try one of these:

plot(x[,"hour"], x[,"pH"], type="l")      # just lines
plot(x[,"hour"], x[,"pH"], type="b")      # both, points ("p") and lines ("l")

It is important to note that, in all of the above examples, the axes were scaled automatically to fit the data fed into plot.
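If the automatic scaling is not what you want, the xlim and ylim arguments of plot let you set the axis ranges explicitly. A minimal sketch, re-creating the pH part of the data from above:

```r
quality <- "
hour   pH
   7  7.0
   9  7.2
  11  7.9
  13  8.2
  15  8.4
"
x <- read.table(sep="", header=TRUE, text=quality)

# widen both axes beyond the range of the data
plot(x[,"hour"], x[,"pH"], type="b", xlim=c(6, 16), ylim=c(6, 9))
```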

4.6.3 Multiple data sets in the same plot

Typically, you want to combine multiple data sets in a single plot. Sticking to the water quality example, we would now like to plot the dynamics of pH and oxygen together to illustrate a possible correlation. Here is what you can do:

  1. Create the plot with the range of the y-axis being chosen to fit all data sets. This is achieved by issuing a plot statement with the appropriate value being passed to the ylim argument.

  2. Add further data sets by secondary plotting functions like lines and points. These secondary functions rely on a previous call to plot and will not work in stand-alone mode.

  3. Finally, some kind of legend is required to identify the individual data sets.

This would lead to code like this:

yrange <- range(x[,c("pH","O2")])                   # min/max of data columns
plot(x[,"hour"], x[,"pH"], ylim=yrange, type="l",   # 1st data set; using ylim
  xlab="time", ylab="value")
lines(x[,"hour"], x[,"O2"], col="red")              # 2nd data set
legend("topleft", lty=1, col=c("black", "red"), legend=c("pH","O2"))

While the code works fine, it is not as elegant as it could be. Below is a more generic version that adds a separate line for each variable contained in the data set. Note that it starts by creating an empty plot by passing type="n" to plot. As opposed to the code above, this one is clearly more reusable.

dataCols <- names(x)[names(x) != "hour"]          # names of columns with data
yrange <- range(x[,dataCols])                     # min/max of data
plot(range(x[,"hour"]), yrange, type="n",         # empty plot; properly scaled
  xlab="time", ylab="value")
for (dc in dataCols) {                            # add all data series
  lines(x[,"hour"], x[,dc], col=match(dc, dataCols))
}
legend("topleft", lty=1, col=1:length(dataCols), legend=dataCols)

Luckily, the two variables in our sample data had a quite similar range of values and a single y-axis was therefore sufficient. In many real-world situations, this is not the case and you may want a second axis on the right hand side. Here is a possible solution illustrated on the same data set as before:

par(mar=c(5, 5, 1, 5))                 # need to create space for 2nd axis
plot(x[,"hour"], x[,"pH"], type="l",   # 1st data set
  xlab="time", ylab="pH")
par(new=TRUE)                          # allows another plot on top
plot(x[,"hour"], x[,"O2"], type="b",   # 2nd data set, no axis yet
  axes=FALSE, ann=FALSE)
axis(side=4)                           # axis on the right
mtext(side=4, "O2", line=2.5)          # annotation of right axis
legend("topleft", bty="n", lty=1,      # bty="n" drops the ugly legend box
  pch=c(NA, 1), legend=c("pH", "O2"))

So far, you have seen some of the building blocks but there are many more. You could look, for example, at secondary plot commands like rect, polygon, or symbols to add boxes, polygons, or circles to plots initialized by a call to plot.
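To give an impression, the following sketch draws a filled box and a triangle onto an otherwise empty plot (all coordinates invented):

```r
plot(c(0, 10), c(0, 10), type="n", xlab="x", ylab="y")   # empty, scaled plot
rect(xleft=1, ybottom=1, xright=4, ytop=3, col="grey")   # a filled box
polygon(x=c(5, 7, 9), y=c(5, 9, 5), border="red")        # a triangle
```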

4.6.4 Selected high-level plotting functions

Barplots can simply be produced with the barplot function. See the respective help page for possibilities to feed data into the function. Here is an application to the water quality data.

barplot(x[,"pH"], names.arg=x[,"hour"], ylab="pH")   # using a vector as input

For numerical data, one often wants to visually inspect the distribution. Then, the hist function is a simple means to plot the respective histogram. The code below illustrates the application to both uniformly and normally distributed random data. It also demonstrates a common way of putting two plots side by side.

par(mfrow=c(1,2))                            # layout with 2 columns
hist(runif(100), main="")                    # 1st histogram, title suppressed
hist(rnorm(n=100, mean=0, sd=1), main="")    # 2nd histogram

par(mfrow=c(1,1))                            # reset layout

Another function I often use in data analysis is boxplot to illustrate distributions based on medians and quantiles. You’ll probably learn about this function in a separate statistics course.
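A minimal sketch with invented random data; boxplot understands the same formula interface as plot:

```r
set.seed(1)                                    # reproducible random numbers
x <- data.frame(group=rep(c("A", "B"), each=50),
  value=c(rnorm(50, mean=0), rnorm(50, mean=2)))

boxplot(value ~ group, data=x, ylab="value")   # one box per group
```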

4.7 Statistical modeling and testing

This is what R was originally developed for. These topics are omitted from the document as they are dealt with in a separate statistics course.

5 Hints for practical work

5.1 Organizing files and directories

5.1.1 Naming files and directories

The set of characters allowed in file and folder names is limited. Typically, you should only use letters, digits, and the underscore. While white space is allowed nowadays, you can make life easier by not using white space in file and folder names.

Good file names are short but long enough to easily recognize the file contents. The same applies to directory names (a.k.a. folder names).

Good example                  Bad counterpart              Why is it bad?
waterLevel.txt                data.txt                     No clue what is in the file.
waterLevel_rev2023-10-11.txt  waterLevel_finalVersion.txt  What you think is the final version often receives future updates.

Consequently, names are usually built by concatenating words. To improve readability, one can put underscores (_) between the words. Alternatively, the initial letter of all but the first word is capitalized. The good and bad examples above illustrate both styles.

5.1.2 Understanding file paths and the working directory

The file system is organized like the above-ground part of a tree. Folders represent branches, files are the leaves of the tree. The basis of the tree is the root, and the term “root” is also used to denote the basis of a file system. On Mac and Linux, there is a single root represented by just the forward slash (/). Windows can grow multiple trees in parallel, and each of them starts with a drive letter (for example c:\).

Consider the following examples of a file path on Mac/Linux and Windows, respectively. They represent so-called absolute paths because the whole route through the branches of the tree, starting from the root, is specified.

/home/david/myData/waterLevel.txt
c:\Users\david\myData\waterLevel.txt

By contrast, relative paths do not begin at the root. Instead they start at some branch in the middle of the tree, for example:

myData/waterLevel.txt
myData\waterLevel.txt

Say you want to import a file into R (using, e.g., read.table) based on its relative path. How can R know where the file is? In order to make it work, you must set the working directory of R. This is the base directory that R will assume when it is confronted with relative paths. In the example, the proper working directory would be /home/david or c:\Users\david, respectively. So this would work on Mac/Linux

setwd("/home/david")
read.table(file="myData/waterLevel.txt", header=TRUE)

and on Windows.

setwd("c:/Users/david")        # ! NOTE THAT I CHANGED "\" INTO "/"
read.table(file="myData/waterLevel.txt", header=TRUE)

If you want to query rather than set R’s working directory, use getwd().

5.1.3 Dealing with Windows file paths in R

In the example code just above, I changed all delimiters in the Windows file path from backslashes to forward slashes. This is necessary because characters after a backslash have a special meaning in R strings and would be misinterpreted. For example, \noodle would be interpreted as a newline character followed by the string oodle.
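If you prefer to keep the backslashes, each one must be escaped by doubling it. A quick check with an invented path:

```r
# forward slashes and doubled backslashes denote the same path;
# each escaped "\\" counts as a single character in the string
p1 <- "c:/Users/david/myData/waterLevel.txt"
p2 <- "c:\\Users\\david\\myData\\waterLevel.txt"
print(nchar(p1) == nchar(p2))   # -> TRUE
```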

5.1.4 A recommendation for smaller projects

If you understand file paths, the meaning of the working directory, and the use of setwd(), you have the basic knowledge to make data import and export work. However, experienced programmers tend to avoid both absolute paths and setwd(). Instead, they arrange for the working directory to be identical to the directory that contains the script file they want to execute. What?

Consider the following situation. Here, the R script and the data file are in different folders but the two branches are quite close to each other (thinking in trees).

/ --- home
        |
        + --- someone
        |
        + --- myname
                |
                + --- something
                |
                + --- systemAnalysis
                         |
                         + --- residenceTime
                                     |
                                     + --- inputs
                                     |        |
                                     |        + --- lakeData.txt    <== data
                                     |
                                     + --- outputs
                                     |       
                                     + --- analyze.r                <== script

In that situation, the import statement in the file analyze.r would ideally read like so:

x <- read.table(file="inputs/lakeData.txt", header=TRUE)

Note that we neither used an absolute path nor did we set the working directory through setwd() explicitly. To make it work, you want to click the following sequence of buttons in the menu of Rstudio while the script analyze.r is your active document (cursor somewhere in there): “Session -> Set working directory -> To source file location”. With this, you set R’s working directory to the folder containing analyze.r, which is residenceTime. Starting from there, R will perfectly find the data in the sub-folder inputs mentioned in the read.table statement.

Was this more confusing than helpful?

Probably. But you may want to read this section again later. The approach will prove useful once you move the entire folder myname to a different place (e.g. if you migrate to a new computer). The same applies if you want to share a copy of the systemAnalysis folder with someone. The script would still run without any adjustments.

5.2 Generic skeleton of a typical R script

For most scripts of low to medium complexity, one can use the same basic outline. Here is what I have found useful in many applications. Note that this section of code is not meant for execution because the mentioned file names and packages are freely invented.

# Start with a brief comment of what the script does; it may also be good
# to list issues that still need to be solved.

# Initial statement to make R forget everything it may have had in memory
# from earlier computations. It allows you to be sure that the script only
# uses data and functions defined within or explicitly loaded into the script.
rm(list=ls())

# Next, I load packages, if any. Loading packages at the top of the script
# makes it easier to recognize such dependencies on add-on code.
library("fancyPackage")

# Then, I define parameters (constants) used anywhere below which may need
# adjustment later. It is a good idea to do it here, at the top of the script.
# Then you know that nothing needs to be adjusted anywhere below. The names of
# input and output files are typical examples of such parameters.
myInputFile <- "myInputFile.txt"
myOutputFile <- "myOutputFile.txt"

# Next I implement any self-designed functions.
cylinderVolume <- function (radius, length) {
  3.1415 * radius^2 * length
}

# Next I import any external data the script should process. In most cases,
# I import plain text with the given arguments. Tab-separated text is what you
# get automatically if you paste the contents of a spreadsheet into a text file.
x <- read.table(file=myInputFile, header=TRUE, sep="\t")

# After data import, you typically want to check whether you actually got what
# you expected. This can be done either visually or by check functions that halt
# further execution in case of problems.
print(head(x))
stopifnot(all(c("radius", "length") %in% names(x)))

# Next follows the main part of the script. Usually, the imported data are
# processed with the help of functions, among which are the ones that were
# self-designed above. Most typically, this part of the script produces plots
# and/or creates summary tables for subsequent human interpretation.
x <- cbind(x, volume=cylinderVolume(radius=x$radius, length=x$length))

# Finally, the script may spend some statements on the export of any newly
# generated information. In the case of tabular data, a statement like the
# one below usually does the job.
write.table(x, file=myOutputFile, sep="\t", col.names=TRUE, row.names=FALSE)

5.3 Splitting code across multiple files

Contents to be added later.

5.4 Alternative styles of R programming

Different styles exist but they won’t be a major topic of this document. You’ll learn about other approaches to data processing (using pipe operators) and plotting (using the ggplot package) in a different course.