Chapter 2 Getting started with R

2.1 RStudio Interface and Data

2.1.1 Download and Install RStudio

This course is based on the statistical software R. R is easier to use within the development environment RStudio, which works on Windows, macOS, and other operating systems.

It is possible to download R and a free version of RStudio Desktop from their official websites.

You might also use a free online version of RStudio by registering for the RStudio Cloud free plan. However, the free plan gives you just 15 hours per month. Our lessons take 4.5 hours per month, and since you also need to practice, the best choice is to install RStudio and R on your computer.

Now we are going to see how to get started with RStudio Desktop.

First, download and install a free version of RStudio Desktop and open the software.

2.1.2 Create an RStudio Project and Import Data

When starting a data analysis project with RStudio, we create a new dedicated environment where we will keep all the scripts (files containing the code to perform the analysis), data sets, and outputs of the analysis (such as plots and tables). This dedicated work-space is simply called a project.

To create a new project with RStudio, follow these steps:

  • click on File (on the top left);
  • then, click on New Project;
  • select New Directory, and New Project;
  • choose a folder for the project, and give a name to your project. You can use the name Time-Series-Analysis-With-R.

In this way, a new folder for the project will be created inside the folder chosen in the previous step. In this folder, you will find a .Rproj file with the name you assigned to your project. To work on this project, you just need to open the .Rproj file.

2.1.3 Create a Script

Once the project has been created, we can open a new script and save it.

A script is a file containing code. We can create a first script named basic-r-syntax, where you will test the basic code we are going to see. The script will be saved with the extension .R.

You can open, change, and save the file every time you work on it. Saving your code is important: otherwise, you would have to write the same code again every time you work on the project!

Create and save a script

Update a script and run code

2.1.4 The RStudio User Interface

The interface of RStudio is organized in four main quadrants:

  • The top-left quadrant is the editor. Here you can create or open a script and compose the R commands.
  • The bottom-left quadrant is the R Console, where the code is executed and the output is produced. To run commands, send the code from the editor to the console by highlighting it and clicking the Run button or pressing Ctrl-Enter (Cmd-Enter on macOS). It is also possible to type and run commands directly in the console (in this case, the code will not be saved).
  • The top-right quadrant shows the R workspace, which holds the data and other objects you have created in the current R session.
  • The bottom-right quadrant is a window for graphics output, where you can visualize your plots. In the default layout, it also has tabs for the Files pane (where you can navigate files and folders and find, for instance, the data sets you want to import), the R packages, and the R Help facility.

2.1.5 Load and Save Data

To load data into R, you can click on the Files tab (in the default layout, it is in the bottom-right quadrant), navigate your files and folders, and, once you have found your data set file, click on it and follow the semi-automated import procedure.

Import Data

Alternatively, you can import a data set by using a function. For instance, to import a csv file, one of the most common formats for data sets, you can use the function read.csv. The main argument of this function is the path of the file you want to import. When specifying the file path, keep in mind that you are working within a specific environment: your working directory is the project folder (you can check the current working directory by running the command getwd()). Thus, to indicate the path of the data set, you can write a dot followed by a slash (./), followed by the path of the data set inside the working directory. In the case below, the data set is saved in a folder named data inside the working directory. The name of the data set is fake-news-stories-over-time-20210111144200 and its extension is .csv. Therefore, the code to import the file is as follows:

fake_news <- read.csv("./data/fake-news-stories-over-time-20210111144200.csv")
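The path logic described above can be sketched as follows. To keep the example self-contained, it does not use the course data: it builds a small, hypothetical file named example.csv inside a temporary folder that stands in for the project folder. The folder name, file name, and column names are assumptions made only for illustration.

```r
# Use a temporary folder as a stand-in for the project folder (illustration only)
project_dir <- file.path(tempdir(), "demo-project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Move into the stand-in project folder, remembering the previous working directory
old_wd <- setwd(project_dir)

# Create a tiny csv file so that there is something to import
writeLines(c("date,count", "2021-01-01,3", "2021-01-02,5"), "./data/example.csv")

getwd()                              # the folder that "./" refers to
file.exists("./data/example.csv")    # TRUE if the relative path is correct

example_data <- read.csv("./data/example.csv")
example_data

# Go back to the previous working directory
setwd(old_wd)
```

Checking the path with file.exists before importing is a quick way to tell a wrong path apart from a problem with the file itself.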

To save data there are a few options. Generally, if you want to save a data set, you can opt for the .csv or the .rds format. The .rds format is only readable by R, while the .csv is a “universal” format (you can read it with Excel, for instance).

To save a file as .csv, you can use the function write.csv. The main arguments of this function are the object to be saved and the file argument, which contains the path to the folder where the file will be saved followed by the name we want to assign to the file.

write.csv(fake_news, file = "./data/fake_news.csv")
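One detail of write.csv that may be worth knowing: by default it also writes the row names as an additional first column. The sketch below illustrates this with a small, hypothetical data.frame and temporary files (both are assumptions made for illustration, not part of the course data).

```r
# A small hypothetical data.frame used only for this illustration
small_data <- data.frame(day = c("Monday", "Tuesday"), count = c(3, 5))

# Temporary files stand in for files in the "data" folder of the project
file_default <- tempfile(fileext = ".csv")
file_no_rownames <- tempfile(fileext = ".csv")

# By default write.csv also writes the row names (1, 2, ...) as a first column
write.csv(small_data, file = file_default)
names(read.csv(file_default))        # an extra column, named "X", appears

# row.names = FALSE writes only the columns of the data.frame
write.csv(small_data, file = file_no_rownames, row.names = FALSE)
names(read.csv(file_no_rownames))
```

If you plan to read the file back into R, setting row.names = FALSE avoids the extra column.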

To save a .rds file, the procedure is similar, but the function to use is saveRDS. To read a .rds file, the appropriate function is readRDS.

saveRDS(fake_news, file = "./data/fake_news.rds")

fake_news <- readRDS("./data/fake_news.rds")   # read a .rds file

In the code above, you can notice a hash mark (#) followed by some text. It is a comment. Comments are textual content used to describe the code, in order to make it easier to understand and reuse. Comments are written after the hash mark because R ignores any text that follows it on the same line: you can read the comments, but R does not treat them as code.
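A practical difference between the two formats is that a .rds file keeps the R data types of the columns, while a .csv file stores everything as plain text. The following is a minimal sketch, assuming a small, hypothetical data.frame and temporary files created only for this illustration.

```r
# Hypothetical data.frame with a factor and a date column (illustration only)
typed_data <- data.frame(
  level = factor(c("low", "high", "medium"), levels = c("low", "medium", "high")),
  day   = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03"))
)

csv_file <- tempfile(fileext = ".csv")   # temporary stand-ins for project files
rds_file <- tempfile(fileext = ".rds")

write.csv(typed_data, file = csv_file, row.names = FALSE)
saveRDS(typed_data, file = rds_file)

# stringsAsFactors = FALSE keeps text columns as characters in every R version
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_file)

class(from_csv$level)   # the factor comes back as plain text
class(from_csv$day)     # the date comes back as plain text
class(from_rds$level)   # the factor is preserved
class(from_rds$day)     # the date is preserved
```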

2.1.6 Create new Folders

It is good practice to create, in the main folder of the project, sub-folders dedicated to the different types of files used in the project, such as a folder “data” for the data sets.

To create a new folder, you can go to the Files window in the RStudio interface, click New Folder, and give it a name.
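Folders can also be created with code. The sketch below uses the base R functions dir.create and dir.exists; a temporary folder (an assumption made for illustration) stands in for the main folder of the project.

```r
# A temporary folder stands in for the main folder of the project
project_dir <- file.path(tempdir(), "folder-demo")
dir.create(project_dir, showWarnings = FALSE)

# Create a sub-folder named "data" only if it does not exist yet
data_dir <- file.path(project_dir, "data")
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

dir.exists(data_dir)      # TRUE once the folder is there
list.files(project_dir)   # the content of the main folder
```

Inside a real project, where the working directory is the project folder, the equivalent call would simply be dir.create("data").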

2.2 Basic R

2.2.1 Objects

An object is an R entity composed of a name and a value.

The arrow (<-) sign is used to create objects and assign a value to an object (or to change or “update” its previous value).

Example: create an object with the name “object_consisting_of_a_number” and a value equal to 2:

object_consisting_of_a_number <- 2

Enter the name of the object in the console and run the command: the value assigned to the object will be printed.

object_consisting_of_a_number
## [1] 2

An object can be used wherever its value could be used. For instance, an object with a numerical value can be used to perform arithmetical operations.

object_consisting_of_a_number * 10
## [1] 20

The value of an object can be transformed:

object_consisting_of_a_number <- object_consisting_of_a_number * 10

object_consisting_of_a_number
## [1] 20

An object can also represent a function.

Example: create an object for the sum (addition) function:

function_sum <- function(x, y){
  result <- x + y
  return(result)
}

The function can now be applied to two numerical values:

function_sum(5, 2)
## [1] 7

Actually, we don’t need this function, since mathematical functions are already implemented in R.

sum(5, 2)
## [1] 7
5 + 7
## [1] 12
2 * 3
## [1] 6
3^2
## [1] 9
sqrt(9)
## [1] 3

The value of an object can be a number, a function, but also a vector. Vectors are sequences of values.

vector_of_numbers <- c(1,2,3,4,5,6,7,8,9,10) 
vector_of_numbers
##  [1]  1  2  3  4  5  6  7  8  9 10

A vector of numbers can be the argument of mathematical operations.

vector_of_numbers * 2
##  [1]  2  4  6  8 10 12 14 16 18 20
vector_of_numbers + 3
##  [1]  4  5  6  7  8  9 10 11 12 13
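A few more operations on vectors, shown as a short sketch. The vector is re-created here so that the example can be run on its own; the functions used (length, sum, mean, and the square brackets) are part of base R.

```r
vector_of_numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Functions that summarise the whole vector in a single value
length(vector_of_numbers)     # number of elements: 10
sum(vector_of_numbers)        # 55
mean(vector_of_numbers)       # 5.5

# Square brackets select elements by position
vector_of_numbers[3]          # third element: 3
vector_of_numbers[c(1, 10)]   # first and last elements: 1 10

# Operations between two vectors of the same length work element by element
vector_of_numbers + vector_of_numbers
```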

Other types of R objects are the matrix, the list, and the data.frame.

A matrix is a table composed of rows and columns whose values all share the same type (in the example below, they are all numbers).

a_matrix <- matrix(data = 1:50, nrow = 10, ncol = 5)

a_matrix
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1   11   21   31   41
##  [2,]    2   12   22   32   42
##  [3,]    3   13   23   33   43
##  [4,]    4   14   24   34   44
##  [5,]    5   15   25   35   45
##  [6,]    6   16   26   36   46
##  [7,]    7   17   27   37   47
##  [8,]    8   18   28   38   48
##  [9,]    9   19   29   39   49
## [10,]   10   20   30   40   50
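The elements of a matrix can be selected with square brackets, writing the row first and the column second. A minimal sketch, re-creating the matrix above so that the example is self-contained:

```r
a_matrix <- matrix(data = 1:50, nrow = 10, ncol = 5)

dim(a_matrix)        # number of rows and columns: 10 5
a_matrix[2, 3]       # row 2, column 3: 22
a_matrix[1, ]        # the whole first row: 1 11 21 31 41
a_matrix[, 5]        # the whole fifth column (41 to 50)
colSums(a_matrix)    # the sum of each column
```

Leaving the row or the column position empty, as in a_matrix[1, ], selects the whole row or column.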

A list is a collection of other objects, which can be of different types. For instance, this list includes a numerical value, a vector of numbers, and a matrix.

a_list <- list(object_consisting_of_a_number, vector_of_numbers, a_matrix)

a_list
## [[1]]
## [1] 20
## 
## [[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[3]]
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1   11   21   31   41
##  [2,]    2   12   22   32   42
##  [3,]    3   13   23   33   43
##  [4,]    4   14   24   34   44
##  [5,]    5   15   25   35   45
##  [6,]    6   16   26   36   46
##  [7,]    7   17   27   37   47
##  [8,]    8   18   28   38   48
##  [9,]    9   19   29   39   49
## [10,]   10   20   30   40   50
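The elements of a list can be accessed by position, with double square brackets, or by name, if the elements have names. The sketch below uses a small list whose element names are chosen only for this illustration.

```r
# A small list with named elements (the names are chosen for this illustration)
a_named_list <- list(a_number = 20,
                     some_numbers = c(1, 2, 3),
                     some_text = "hello")

a_named_list[[2]]            # second element, selected by position
a_named_list$some_text       # element selected by name
names(a_named_list)          # the names of all elements
length(a_named_list)         # number of elements: 3

# Elements can be added or replaced with the arrow sign
a_named_list$a_new_element <- TRUE
length(a_named_list)         # now 4
```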

A data.frame is like a matrix that can contain numbers but also other types of data, such as characters (a textual type of data), or factors (unordered categorical variables, such as gender, or ordered categories, such as low, medium, high).

Data sets are usually stored in a data.frame. For instance, if you import a csv or an Excel file into R, the corresponding R object is a data.frame.

# this is an object (vector) consisting of a series of numerical values
numerical_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
numerical_vector
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14
# this is another object (vector) consisting of a series of categorical values
categorical_vector <- c("Monday", "Tuesday", "Monday", "Tuesday", "Monday", "Wednesday","Thursday", "Wednesday", "Thursday", "Saturday", "Sunday", "Friday", "Saturday", "Sunday")
categorical_vector
##  [1] "Monday"    "Tuesday"   "Monday"    "Tuesday"   "Monday"    "Wednesday" "Thursday"  "Wednesday"
##  [9] "Thursday"  "Saturday"  "Sunday"    "Friday"    "Saturday"  "Sunday"
# this is an object consisting of a data.frame, created combining vectors through the function "data.frame"
a_dataframe <- data.frame("first_variable" = numerical_vector,
                          "second_variable" = categorical_vector)
a_dataframe
##    first_variable second_variable
## 1               1          Monday
## 2               2         Tuesday
## 3               3          Monday
## 4               4         Tuesday
## 5               5          Monday
## 6               6       Wednesday
## 7               7        Thursday
## 8               8       Wednesday
## 9               9        Thursday
## 10             10        Saturday
## 11             11          Sunday
## 12             12          Friday
## 13             13        Saturday
## 14             14          Sunday

To access a specific column of a data.frame, you can use the name of the data.frame, the dollar symbol $, and the name of the column.

a_dataframe$first_variable
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14
a_dataframe$second_variable
##  [1] "Monday"    "Tuesday"   "Monday"    "Tuesday"   "Monday"    "Wednesday" "Thursday"  "Wednesday"
##  [9] "Thursday"  "Saturday"  "Sunday"    "Friday"    "Saturday"  "Sunday"

It is possible to add columns to a data.frame by writing:

  • the name of the data.frame
  • the dollar sign
  • a name for the new column
  • the arrow sign <-
  • a vector of values to be stored in the new column (its length must be equal to the length of the other columns of the data.frame)

a_dataframe$a_new_variable <- c(12, 261, 45, 29, 54, 234, 45, 42, 6, 267, 87, 3, 12, 9)
a_dataframe
##    first_variable second_variable a_new_variable
## 1               1          Monday             12
## 2               2         Tuesday            261
## 3               3          Monday             45
## 4               4         Tuesday             29
## 5               5          Monday             54
## 6               6       Wednesday            234
## 7               7        Thursday             45
## 8               8       Wednesday             42
## 9               9        Thursday              6
## 10             10        Saturday            267
## 11             11          Sunday             87
## 12             12          Friday              3
## 13             13        Saturday             12
## 14             14          Sunday              9

It is possible to visualize the first few rows of a data.frame by using the function head.

head(a_dataframe)
##   first_variable second_variable a_new_variable
## 1              1          Monday             12
## 2              2         Tuesday            261
## 3              3          Monday             45
## 4              4         Tuesday             29
## 5              5          Monday             54
## 6              6       Wednesday            234
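Besides head, a few other base R functions are handy for a first look at a data.frame. The following sketch uses a shortened, hypothetical version of the data.frame above, re-created here so that the example can be run on its own.

```r
# A shortened version of the data.frame above (illustration only)
a_dataframe <- data.frame(
  first_variable  = 1:6,
  second_variable = c("Monday", "Tuesday", "Monday", "Tuesday", "Monday", "Wednesday"),
  stringsAsFactors = FALSE
)

head(a_dataframe, n = 3)   # only the first three rows
tail(a_dataframe, n = 2)   # the last two rows
nrow(a_dataframe)          # number of rows: 6
ncol(a_dataframe)          # number of columns: 2
names(a_dataframe)         # column names
str(a_dataframe)           # compact overview: type and first values of each column

# Square brackets select rows and columns: here row 2 of "second_variable"
a_dataframe[2, "second_variable"]
```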

Exercise: visualize the first rows of a data.frame and access its columns

2.2.2 Functions

A function is a coded operation that applies to an object (e.g.: a number, a textual feature etc.) to transform it based on specific rules. A function has a name (the name of the function) and some arguments. Among the arguments of a function there is always an object or a value, for instance a numerical value, which is the content the function is applied to, and other possible arguments (either mandatory or optional).

Functions are operations applied to objects that give a certain output. E.g.: the arithmetical operation “addition” is a function that applies to two or more numbers to give, as its output, their sum. The arguments of the “sum” function are the numbers that are added together.

The name of the function is written before the parentheses, and the arguments of the function are written inside the parentheses:

sum(5, 3)
## [1] 8

Arguments of functions can be numbers but also textual features. For instance, the function paste creates a string composed of the strings that it takes as arguments.

paste("the", "cat", "is", "at", "home")
## [1] "the cat is at home"

In R you can sometimes find a “nested” syntax, which can be confusing. The best practice is to keep things as simple as possible.

# This comment, written after the hash mark, describes what is going on here:
# two "paste" functions are nested (improperly, because nesting makes the code
# more complicated than necessary) to show how functions can be nested.
# It would have been better to use the "paste" function just once!
paste(paste("the", "cat", "is", "at", "home"), "and", "sleeps", "on", "the", "sofa")
## [1] "the cat is at home and sleeps on the sofa"

To sum up, functions manipulate and transform objects. Data wrangling, data visualization, as well as data analysis, are performed through functions.
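The difference between mandatory and optional arguments can be sketched as follows. The function add_and_scale and its argument multiplier are hypothetical names invented for this illustration; sep is an existing optional argument of paste.

```r
# Optional arguments are usually passed by name: sep sets the separator of paste
paste("the", "cat", sep = "-")               # "the-cat"

# A hypothetical function with one mandatory and one optional argument
add_and_scale <- function(x, multiplier = 1) {
  result <- sum(x) * multiplier
  return(result)
}

add_and_scale(c(1, 2, 3))                    # optional argument left at its default: 6
add_and_scale(c(1, 2, 3), multiplier = 10)   # optional argument set by name: 60
```

An optional argument has a default value in the function definition, so it can be omitted when the function is called.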

2.2.3 Data Types

Variables can have different R formats, such as:

  • double: numbers that include decimals (0.1, 5.676, 121.67). This format is appropriate for continuous variables;
  • integer: whole numbers such as 1, 2, 3, 10, 400. This format is suitable for count data;
  • factors: for categorical variables. Factors can be ordered (e.g.: level of agreement: “high”, “medium”, “low”), or not (e.g.: hair colors “blond”, “dark brown”, “brown”);
  • characters: textual labels;
  • logicals: the format of logical values (i.e.: TRUE and FALSE);
  • dates: used to represent days;
  • POSIX: a class of R formats used to represent dates and times.

Figure 2.1: R data formats. Tables from Gaubatz, K. T. (2014). [A Survivor’s Guide to R: An Introduction for the Uninitiated and the Unnerved](https://us.sagepub.com/en-us/nam/a-survivors-guide-to-r/book242607). SAGE Publications.
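The format of a value can be checked with the base R functions class and typeof. The sketch below creates one illustrative value for each of the formats listed above; the example values and object names are chosen only for illustration.

```r
class(5.676)                       # "numeric" (stored as double)
typeof(5.676)                      # "double"
class(10L)                         # "integer" (the L suffix marks an integer)
class("a textual label")           # "character"
class(TRUE)                        # "logical"

# Factors: unordered and ordered categories
hair <- factor(c("blond", "brown", "blond"))
agreement <- factor(c("low", "high", "medium"),
                    levels = c("low", "medium", "high"), ordered = TRUE)
levels(hair)                       # "blond" "brown"
agreement[1] < agreement[2]        # TRUE: "low" comes before "high"

class(as.Date("2021-01-11"))       # "Date"
class(as.POSIXct("2021-01-11 14:42:00", tz = "UTC"))  # "POSIXct" "POSIXt"
```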

It is better to specify the appropriate type of data when importing a data set. In the example below, the data formats are specified by using the import process of RStudio.

Notice that data of type “date” require users to provide additional information about the format of the dates. Indeed, dates can be written in many different ways, and to read dates in R it is necessary to specify their structure. In the example, dates are in the format Year-Month-Day, which is represented in R as “%Y-%m-%d” (further details will be provided in another section of the book).

Import data and specify data types
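Data types can also be specified with code rather than through the RStudio import dialog. The following is a minimal sketch using base R: as.Date with its format argument, and read.csv with its colClasses argument. The csv content, column names, and temporary file are hypothetical and serve only as an illustration.

```r
# The same day written in two different ways
as.Date("2021-01-11", format = "%Y-%m-%d")   # Year-Month-Day
as.Date("11/01/2021", format = "%d/%m/%Y")   # Day/Month/Year

# A tiny hypothetical csv file, written to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("date,count,topic",
             "2021-01-10,3,politics",
             "2021-01-11,5,health"), csv_file)

# colClasses sets the type of each column while importing
typed_import <- read.csv(csv_file,
                         colClasses = c("Date", "integer", "character"))
str(typed_import)
```

With colClasses, the "Date" class expects dates written as Year-Month-Day; dates written in another way can be imported as character and then converted with as.Date and the appropriate format.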

2.2.4 Exercise

  • Upload the data set “election news small”, using the appropriate data format;
  • Open the script “basic-r-script” and perform the following operations:
    • Check the first few rows of the data set;
    • Access the single columns;
    • Save the data frame with the name “election_news_small_test” in the folder “data” by using the function “write.csv” (to review the procedure, go to the section “Load and Save Data” of this book);
    • Comment the code (the comments have to be written after the hash sign #);
    • Save the script.