Chapter 3 Basic Data Wrangling with Tidyverse
Data wrangling is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.
Another definition is as follows: Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that the data contain the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.
How to “manipulate” data sets in R:
- use basic R functions;
- employ specific libraries such as tidyverse. Tidyverse is an R library composed of functions that allow users to perform basic and advanced data science operations. https://www.tidyverse.org.
In R, a library (or “package”) is a coherent collection of functions, usually created for specific purposes.
To work with the tidyverse library, it is necessary to install it first, by using the following command: install.packages(“tidyverse”).
After having installed tidyverse (or any other library), it is necessary to load it, so as we can work with its functions in the current R session:
# to load a library used the command library(NAME-OF-THE-LIBRARY)
library(tidyverse)
Besides using the function install.packages(NAME-OF-THE-LIBRARY) by using a line of code, it is also possible to use the RStudio interface.
3.1 The Pipe Operator %>%
Tidyverse has a peculiar syntax that makes use of the so-called pipe operator %>%, like in the following example:
%>%
a_dataframe group_by(second_variable) %>%
summarize(mean = mean(a_new_variable))
## # A tibble: 7 × 2
## second_variable mean
## <chr> <dbl>
## 1 Friday 3
## 2 Monday 37
## 3 Saturday 140.
## 4 Sunday 48
## 5 Thursday 25.5
## 6 Tuesday 145
## 7 Wednesday 138
To manipulate data sets we can rely on the functions included in dplyr: a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges, such as mutate, rename, summarize.
library(readr)
<- read_csv("data/tweets_covid_small.csv",
tweets col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
retweet_count = col_integer()))
head(tweets)
## # A tibble: 6 × 4
## created_at screen_name source retweet_count
## <dttm> <chr> <chr> <int>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone 0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone 0
## 3 2021-03-24 08:53:52 AlexS1595 Twitter for iPhone 3
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App 0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App 4
## 6 2021-03-24 08:53:51 GGrahambute Twitter for iPad 96
3.2 Mutate
The function mutate adds new variables to a data.frame or overwrites existing variables.
<- tweets %>%
tweets mutate(log_retweet_count = log(retweet_count))
head(tweets)
## # A tibble: 6 × 5
## created_at screen_name source retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone 0 -Inf
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone 0 -Inf
## 3 2021-03-24 08:53:52 AlexS1595 Twitter for iPhone 3 1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App 0 -Inf
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App 4 1.39
## 6 2021-03-24 08:53:51 GGrahambute Twitter for iPad 96 4.56
3.3 Rename
Rename is a function to change the name of columns (sometimes it could be useful).
<- tweets %>%
tweets # rename (new_name = old_name)
rename(device = source)
head(tweets)
## # A tibble: 6 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone 0 -Inf
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone 0 -Inf
## 3 2021-03-24 08:53:52 AlexS1595 Twitter for iPhone 3 1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App 0 -Inf
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App 4 1.39
## 6 2021-03-24 08:53:51 GGrahambute Twitter for iPad 96 4.56
The previous two steps can be performed at the same time, by concatenating the operations through the pipe %>% operator.
# load again the data set
library(readr)
<- read_csv("data/tweets_covid_small.csv",
tweets col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
retweet_count = col_integer()))
<- tweets %>%
tweets mutate(log_retweet_count = log(retweet_count+1)) %>%
rename(device = source)
head(tweets)
## # A tibble: 6 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone 0 0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone 0 0
## 3 2021-03-24 08:53:52 AlexS1595 Twitter for iPhone 3 1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App 0 0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App 4 1.61
## 6 2021-03-24 08:53:51 GGrahambute Twitter for iPad 96 4.57
To check the data format of the variables stored in the data.frame, it can be used the command str():
str(tweets)
## tibble [100 × 5] (S3: tbl_df/tbl/data.frame)
## $ created_at : POSIXct[1:100], format: "2021-03-24 08:53:52" "2021-03-24 08:53:25" "2021-03-24 08:53:52" ...
## $ screen_name : chr [1:100] "DoYourThingUK" "DoYourThingUK" "AlexS1595" "MakesworthAcc" ...
## $ device : chr [1:100] "Twitter for iPhone" "Twitter for iPhone" "Twitter for iPhone" "Twitter Web App" ...
## $ retweet_count : int [1:100] 0 0 3 0 4 96 0 1 0 3 ...
## $ log_retweet_count: num [1:100] 0 0 1.39 0 1.61 ...
Sometimes variables are stored in the data.frame in the wrong format (see the paragraph “data type”), so we want to convert them into a new format. For this purpose we can use, again, the function mutate, along with other functions such as.integer, as.numeric, as.character, as.factors, or as.logical, as.Date, or as.POSIXct() based on the desired data format (it is possible and advisable to upload the data by paying attention to the type of data. If you upload the data in the right format, you can skip this step).
%>%
tweets mutate(device = as.character(device)) %>%
head()
## # A tibble: 6 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone 0 0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone 0 0
## 3 2021-03-24 08:53:52 AlexS1595 Twitter for iPhone 3 1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App 0 0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App 4 1.61
## 6 2021-03-24 08:53:51 GGrahambute Twitter for iPad 96 4.57
3.4 Summarize and group_by
To aggregate data and calculate synthetic values (for instance, the average number of tweets by day), it can be used the function group_by (to aggregate data, for instance by day), and summarize, to calculate the summary values.
<- tweets %>%
tweets_summary group_by(screen_name) %>%
summarize(average_retweets = mean(retweet_count))
head(tweets_summary)
## # A tibble: 6 × 2
## screen_name average_retweets
## <chr> <dbl>
## 1 2EXvoZ6nublpw1F 164
## 2 AdilHaiderMD 80
## 3 AlexS1595 3
## 4 Andecave 20
## 5 anshunandanpra4 2
## 6 ApKido 150
It is also possible to create more than one summary variables at once.
<- tweets %>%
tweets_summary group_by(screen_name) %>%
summarize(average_retweets = mean(retweet_count),
average_log_retweets = mean(log_retweet_count))
head(tweets_summary)
## # A tibble: 6 × 3
## screen_name average_retweets average_log_retweets
## <chr> <dbl> <dbl>
## 1 2EXvoZ6nublpw1F 164 5.11
## 2 AdilHaiderMD 80 4.39
## 3 AlexS1595 3 1.39
## 4 Andecave 20 3.04
## 5 anshunandanpra4 2 1.10
## 6 ApKido 150 5.02
3.4.1 Count occurrences
A useful operation to perform when summarizing data, is to count the occurrences of a certain variable. For instance, to count the number of tweets sent by each user, it can be used the function n() inside the summarize function.
<- tweets %>%
tweets_summary group_by(screen_name) %>%
summarize(average_retweets = mean(retweet_count),
average_log_retweets = mean(log_retweet_count),
number_of_tweets = n())
head(tweets_summary)
## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 2EXvoZ6nublpw1F 164 5.11 1
## 2 AdilHaiderMD 80 4.39 1
## 3 AlexS1595 3 1.39 1
## 4 Andecave 20 3.04 1
## 5 anshunandanpra4 2 1.10 1
## 6 ApKido 150 5.02 1
3.5 Arrange
To explore a data set it can be useful to sort the data (e.g.: from the lowest to the highest value of a variable). With tidyverse, we can order a data.frame by using the function arrange.
To sort the data from the highest to the lowest value (descending order) the minus sign (or the “desc” function) has to be added.
%>%
tweets_summary arrange(-number_of_tweets) %>%
head()
## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 iprdhzb 0.667 0.462 3
## 2 benphillips76 3.5 1.45 2
## 3 DoYourThingUK 0 0 2
## 4 MakesworthAcc 2 0.805 2
## 5 viralvideovlogs 3.5 1.45 2
## 6 2EXvoZ6nublpw1F 164 5.11 1
%>%
tweets_summary arrange(desc(average_retweets)) %>%
head()
## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 Oliver_Miguel1 1988 7.60 1
## 2 Lil_3arbiii 1627 7.40 1
## 3 Kittyhawk681 1285 7.16 1
## 4 JulesFox12 1091 7.00 1
## 5 rosaesaa26 983 6.89 1
## 6 lewisabzueta 822 6.71 1
Without the minus sign (or the “desc” command), data are sorted from the lowest to the highest value.
%>%
tweets_summary arrange(number_of_tweets) %>%
head()
## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 2EXvoZ6nublpw1F 164 5.11 1
## 2 AdilHaiderMD 80 4.39 1
## 3 AlexS1595 3 1.39 1
## 4 Andecave 20 3.04 1
## 5 anshunandanpra4 2 1.10 1
## 6 ApKido 150 5.02 1
3.6 Filter
The function filter keeps only the cases (the “rows”) we want to focus on. The arguments of this function are the conditions that have to be fulfilled to filter the data: a) the name of the column that we want to filter, b) the values to be kept.
%>%
tweets filter(retweet_count >= 500) %>%
arrange(-retweet_count)
## # A tibble: 10 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitter for Android 1988 7.60
## 2 2021-03-24 08:53:48 Lil_3arbiii Twitter for iPhone 1627 7.40
## 3 2021-03-24 08:53:48 Kittyhawk681 Twitter Web App 1285 7.16
## 4 2021-03-24 08:53:37 JulesFox12 Twitter for Android 1091 7.00
## 5 2021-03-24 08:53:42 rosaesaa26 Twitter for Android 983 6.89
## 6 2021-03-24 08:53:42 lewisabzueta Twitter for Android 822 6.71
## 7 2021-03-24 08:53:42 jonvthvn08 Twitter for iPhone 768 6.65
## 8 2021-03-24 08:53:34 florent61647053 Twitter for Android 768 6.65
## 9 2021-03-24 08:53:37 Ritu89903967 Twitter for Android 709 6.57
## 10 2021-03-24 08:53:33 Hurica3 Twitter for iPhone 575 6.36
In the examples below, notice the use of a double equal sign ==, and also of the quotation marks to indicate the modalities of a categorical variable.
%>%
tweets filter(retweet_count == 1988)
## # A tibble: 1 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitter for Android 1988 7.60
%>%
tweets filter(device == "Twitter for Android")
## # A tibble: 33 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:49 marcin_lukawski Twitter for Android 1 0.693
## 2 2021-03-24 08:53:49 LebodyRanya Twitter for Android 0 0
## 3 2021-03-24 08:53:47 anshunandanpra4 Twitter for Android 2 1.10
## 4 2021-03-24 08:53:44 insoumise007 Twitter for Android 5 1.79
## 5 2021-03-24 08:53:43 Metamorfopsies Twitter for Android 0 0
## 6 2021-03-24 08:53:43 keepsmiling_130 Twitter for Android 164 5.11
## 7 2021-03-24 08:53:43 lovebresil01 Twitter for Android 81 4.41
## 8 2021-03-24 08:53:42 LightHealing Twitter for Android 1 0.693
## 9 2021-03-24 08:53:42 lewisabzueta Twitter for Android 822 6.71
## 10 2021-03-24 08:53:42 rosaesaa26 Twitter for Android 983 6.89
## # … with 23 more rows
It is also possible to use several conditions at the same time.
%>%
tweets filter(device == "Twitter for Android",
> 200) %>%
retweet_count arrange(-retweet_count)
## # A tibble: 8 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitter for Android 1988 7.60
## 2 2021-03-24 08:53:37 JulesFox12 Twitter for Android 1091 7.00
## 3 2021-03-24 08:53:42 rosaesaa26 Twitter for Android 983 6.89
## 4 2021-03-24 08:53:42 lewisabzueta Twitter for Android 822 6.71
## 5 2021-03-24 08:53:34 florent61647053 Twitter for Android 768 6.65
## 6 2021-03-24 08:53:37 Ritu89903967 Twitter for Android 709 6.57
## 7 2021-03-24 08:53:27 aspeaker66 Twitter for Android 331 5.81
## 8 2021-03-24 08:53:42 JamesAn26254230 Twitter for Android 201 5.31
3.7 Select
Select is used to keep just some of the columns of the original data.frame. For instance, we can apply the function in order to keep just the column “device” and “retweet_count”.
%>%
tweets select(device, retweet_count) %>%
head()
## # A tibble: 6 × 2
## device retweet_count
## <chr> <int>
## 1 Twitter for iPhone 0
## 2 Twitter for iPhone 0
## 3 Twitter for iPhone 3
## 4 Twitter Web App 0
## 5 Twitter Web App 4
## 6 Twitter for iPad 96
3.8 Exercise
Here are some exercises to consolidate the fundamental R skills learned during these first lessons:
Download the csv file tweets_vienna_small.csv in this folder and put it into the project folder “data”
Upload the data set in R, setting the appropriate formats for the variables
Create a new script named “data-manipulation” with your surname and name as follows: “YOUR SURNAME-YOUR NAME-data-manipulation”, and save it
In the script, write the code to perform the following operations:
load the library “Tidyverse”
show the first rows of the data frame by using the function head
create a new data frame “tweets_vienna_small_updated” by updating the dataframe “tweets_vienna_small” by using the function mutate to create a new column “log_friends_count” whose values are the log of the values in the column “friends_count” (you don’t need to add 1 to the values in the column “friends_count”)
save the updated dataframe “tweets_vienna_small_updated”, by using the following code to save a csv file (please change YOUR SURNAME-YOUR NAME with your actual surname and name): write.csv(tweets_vienna_small_updated, file = “./data/YOUR SURNAME-YOUR NAME-tweets_vienna_small_updated.csv”, row.names=F) (we add row.names=F to avoid saving the number that indexes each row)
create a new data frame named “summary_tweets_vienna_small” aggregating the data by “screen_name” (using the function group_by) and then summarizing the data (by using the function summarize) as follows:
in a column named “average_favorite_count”, calculate the average of “favorite_count” by “screen_name” (that is, by user)
in a column named “average_retweet_count”, calculate the average of “retweet_count” (by user, obviously, since the data are already aggregated by user’ name)
in a column named “number_of_tweets”, calculate the number of tweets published by each users (by using the function n())
save the “summary_tweets_vienna_small” in the data folder of the project, in csv format, and with the name “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small.csv” (remember to specify row.names=F and to change YOUR SURNAME-YOUR NAME with your actual surname and name)
create a new data frame object called “summary_tweets_vienna_small_filtered”, where you will save the data.frame summary_tweets_vienna_small, after having filtered the rows with average_retweet_count higher than 10 (by using the function filter), and after having selected the column “screen_name” and “average_retweet_count” (so, you should end up with a data frame with just two columns, “screen_name” and “average_retweet_count”, and the rows with “average_retweet_count” higher than 10)
save the data frame “summary_tweets_vienna_small_filtered” with the name “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small_filtered.csv” in the folder data (remember to specify row.names=F and to change YOUR SURNAME-YOUR NAME with your actual surname and name).
Save the script “YOUR SURNAME-YOUR NAME-data-manipulation” with all the code you have used to perform these analysis. Write a comment in the script (using the hash mark #) if you are not able to do something.
Upload the script “YOUR SURNAME-YOUR NAME-data-manipulation.r” and the files “YOUR SURNAME-YOUR NAME-tweets_vienna_small_updated.csv”, “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small.csv”, and the file “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small_filtered.csv” on Moodle, in the folder “HomeWork-1”. The deadline is Sunday 11 April.