Chapter3 Basic Data Wrangling with Tidyverse
Data wrangling is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.
Another definition is as follows: Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that the data contain the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.
How to “manipulate” data sets in R:
- use basic R functions;
- employ specific libraries such as tidyverse. Tidyverse is an R library composed of functions that allow users to perform basic and advanced data science operations. https://www.tidyverse.org.
In R, a library (or “package”) is a coherent collection of functions, usually created for specific purposes.
To work with the tidyverse library, it is necessary to install it first, by using the following command: install.packages(“tidyverse”).
After having installed tidyverse (or any other library), it is necessary to load it, so as we can work with its functions in the current R session:
## ── Attaching core tidyverse packages ────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Besides using the function install.packages(NAME-OF-THE-LIBRARY) by using a line of code, it is also possible to use the RStudio interface.
Click here to watch “Install and Load Libraries”
3.1 The Pipe Operator %>%
Tidyverse has a peculiar syntax that makes use of the so-called pipe operator %>%, like in the following example:
## # A tibble: 7 × 2
## second_variable mean
## <chr> <dbl>
## 1 Friday 3
## 2 Monday 37
## 3 Saturday 140.
## 4 Sunday 48
## 5 Thursday 25.5
## 6 Tuesday 145
## 7 Wednesday 138
To manipulate data sets we can rely on the functions included in dplyr: a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges, such as mutate, rename, summarize.
library(readr)
tweets <- read_csv("data/tweets_covid_small.csv",
col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
retweet_count = col_integer()))
head(tweets)## # A tibble: 6 × 4
## created_at screen_name source retweet_count
## <dttm> <chr> <chr> <int>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone 0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone 0
## 3 2021-03-24 08:53:52 AlexS1595 Twitter for iPhone 3
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App 0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App 4
## 6 2021-03-24 08:53:51 GGrahambute Twitter for iPad 96
Click here to watch “Import Data and Specify Data Types (Dates and Times)”
3.2 Mutate
The function mutate adds new variables to a data.frame or overwrites existing variables.
## # A tibble: 6 × 5
## created_at screen_name source retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f… 0 -Inf
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f… 0 -Inf
## 3 2021-03-24 08:53:52 AlexS1595 Twitter f… 3 1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W… 0 -Inf
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W… 4 1.39
## 6 2021-03-24 08:53:51 GGrahambute Twitter f… 96 4.56
3.3 Rename
rename is a function to change the name of columns (sometimes it can be useful).
## # A tibble: 6 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f… 0 -Inf
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f… 0 -Inf
## 3 2021-03-24 08:53:52 AlexS1595 Twitter f… 3 1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W… 0 -Inf
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W… 4 1.39
## 6 2021-03-24 08:53:51 GGrahambute Twitter f… 96 4.56
The previous two steps can be performed at the same time, by concatenating the operations through the pipe %>% operator.
# load again the data set
library(readr)
tweets <- read_csv("data/tweets_covid_small.csv",
col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
retweet_count = col_integer()))
tweets <- tweets %>%
mutate(log_retweet_count = log(retweet_count+1)) %>%
rename(device = source)
head(tweets)## # A tibble: 6 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f… 0 0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f… 0 0
## 3 2021-03-24 08:53:52 AlexS1595 Twitter f… 3 1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W… 0 0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W… 4 1.61
## 6 2021-03-24 08:53:51 GGrahambute Twitter f… 96 4.57
To check the data format of the variables stored in the data.frame, you can use the command str().
str(tweets)## tibble [100 × 5] (S3: tbl_df/tbl/data.frame)
## $ created_at : POSIXct[1:100], format: "2021-03-24 08:53:52" "2021-03-24 08:53:25" ...
## $ screen_name : chr [1:100] "DoYourThingUK" "DoYourThingUK" "AlexS1595" "MakesworthAcc" ...
## $ device : chr [1:100] "Twitter for iPhone" "Twitter for iPhone" "Twitter for iPhone" "Twitter Web App" ...
## $ retweet_count : int [1:100] 0 0 3 0 4 96 0 1 0 3 ...
## $ log_retweet_count: num [1:100] 0 0 1.39 0 1.61 ...
Sometimes variables are stored in the data.frame in the wrong format (see the paragraph “data type”), so we may want to convert them to a new format. For this purpose, we can use the function mutate along with other functions such as as.integer, as.numeric, as.character, as.factor, as.logical, as.Date, or as.POSIXct() depending on the desired data format (it is possible and advisable to upload the data by paying attention to the type of data. If you upload the data in the correct format, you can skip this step).
tweets %>%
mutate(device = as.character(device)) %>%
head()## # A tibble: 6 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f… 0 0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f… 0 0
## 3 2021-03-24 08:53:52 AlexS1595 Twitter f… 3 1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W… 0 0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W… 4 1.61
## 6 2021-03-24 08:53:51 GGrahambute Twitter f… 96 4.57
3.4 Summarize and group_by
To aggregate data and calculate summary values (for instance, the average number of tweets by day), you can use the function group_by (to aggregate data, for instance by day) and summarize to calculate the summary values.
tweets_summary <- tweets %>%
group_by(screen_name) %>%
summarize(average_retweets = mean(retweet_count))
head(tweets_summary)## # A tibble: 6 × 2
## screen_name average_retweets
## <chr> <dbl>
## 1 2EXvoZ6nublpw1F 164
## 2 AdilHaiderMD 80
## 3 AlexS1595 3
## 4 Andecave 20
## 5 ApKido 150
## 6 BBVA_Trader 0
It is also possible to create more than one summary variables at once.
tweets_summary <- tweets %>%
group_by(screen_name) %>%
summarize(average_retweets = mean(retweet_count),
average_log_retweets = mean(log_retweet_count))
head(tweets_summary)## # A tibble: 6 × 3
## screen_name average_retweets average_log_retweets
## <chr> <dbl> <dbl>
## 1 2EXvoZ6nublpw1F 164 5.11
## 2 AdilHaiderMD 80 4.39
## 3 AlexS1595 3 1.39
## 4 Andecave 20 3.04
## 5 ApKido 150 5.02
## 6 BBVA_Trader 0 0
3.4.1 Count occurrences
A useful operation to perform when summarizing data is to count the occurrences of a certain variable. For instance, to count the number of tweets sent by each user, you can use the function n() inside the summarize function.
tweets_summary <- tweets %>%
group_by(screen_name) %>%
summarize(average_retweets = mean(retweet_count),
average_log_retweets = mean(log_retweet_count),
number_of_tweets = n())
head(tweets_summary)## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 2EXvoZ6nublpw1F 164 5.11 1
## 2 AdilHaiderMD 80 4.39 1
## 3 AlexS1595 3 1.39 1
## 4 Andecave 20 3.04 1
## 5 ApKido 150 5.02 1
## 6 BBVA_Trader 0 0 1
3.5 Arrange
To explore a data set, it can be useful to sort the data (e.g., from the lowest to the highest value of a variable). With tidyverse, we can order a data.frame by using the function arrange.
To sort the data from the highest to the lowest value (descending order), the minus sign (or the desc function) has to be added.
## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 iprdhzb 0.667 0.462 3
## 2 DoYourThingUK 0 0 2
## 3 MakesworthAcc 2 0.805 2
## 4 benphillips76 3.5 1.45 2
## 5 viralvideovlogs 3.5 1.45 2
## 6 2EXvoZ6nublpw1F 164 5.11 1
## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 Oliver_Miguel1 1988 7.60 1
## 2 Lil_3arbiii 1627 7.40 1
## 3 Kittyhawk681 1285 7.16 1
## 4 JulesFox12 1091 7.00 1
## 5 rosaesaa26 983 6.89 1
## 6 lewisabzueta 822 6.71 1
Without the minus sign (or the “desc” command), data are sorted from the lowest to the highest value.
## # A tibble: 6 × 4
## screen_name average_retweets average_log_retweets number_of_tweets
## <chr> <dbl> <dbl> <int>
## 1 2EXvoZ6nublpw1F 164 5.11 1
## 2 AdilHaiderMD 80 4.39 1
## 3 AlexS1595 3 1.39 1
## 4 Andecave 20 3.04 1
## 5 ApKido 150 5.02 1
## 6 BBVA_Trader 0 0 1
3.6 Filter
The function filter keeps only the cases (the “rows”) we want to focus on. The arguments of this function are the conditions that must be fulfilled to filter the data: a) the name of the column that we want to filter, and b) the values to be kept.
## # A tibble: 10 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitte… 1988 7.60
## 2 2021-03-24 08:53:48 Lil_3arbiii Twitte… 1627 7.40
## 3 2021-03-24 08:53:48 Kittyhawk681 Twitte… 1285 7.16
## 4 2021-03-24 08:53:37 JulesFox12 Twitte… 1091 7.00
## 5 2021-03-24 08:53:42 rosaesaa26 Twitte… 983 6.89
## 6 2021-03-24 08:53:42 lewisabzueta Twitte… 822 6.71
## 7 2021-03-24 08:53:42 jonvthvn08 Twitte… 768 6.65
## 8 2021-03-24 08:53:34 florent61647053 Twitte… 768 6.65
## 9 2021-03-24 08:53:37 Ritu89903967 Twitte… 709 6.57
## 10 2021-03-24 08:53:33 Hurica3 Twitte… 575 6.36
In the examples below, notice the use of a double equal sign ==, and also of the quotation marks to indicate the modalities of a categorical variable.
## # A tibble: 1 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitter … 1988 7.60
## # A tibble: 33 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:49 marcin_lukawski Twitte… 1 0.693
## 2 2021-03-24 08:53:49 LebodyRanya Twitte… 0 0
## 3 2021-03-24 08:53:47 anshunandanpra4 Twitte… 2 1.10
## 4 2021-03-24 08:53:44 insoumise007 Twitte… 5 1.79
## 5 2021-03-24 08:53:43 Metamorfopsies Twitte… 0 0
## 6 2021-03-24 08:53:43 keepsmiling_130 Twitte… 164 5.11
## 7 2021-03-24 08:53:43 lovebresil01 Twitte… 81 4.41
## 8 2021-03-24 08:53:42 LightHealing Twitte… 1 0.693
## 9 2021-03-24 08:53:42 lewisabzueta Twitte… 822 6.71
## 10 2021-03-24 08:53:42 rosaesaa26 Twitte… 983 6.89
## # ℹ 23 more rows
It is also possible to use several conditions at the same time.
## # A tibble: 8 × 5
## created_at screen_name device retweet_count log_retweet_count
## <dttm> <chr> <chr> <int> <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitter… 1988 7.60
## 2 2021-03-24 08:53:37 JulesFox12 Twitter… 1091 7.00
## 3 2021-03-24 08:53:42 rosaesaa26 Twitter… 983 6.89
## 4 2021-03-24 08:53:42 lewisabzueta Twitter… 822 6.71
## 5 2021-03-24 08:53:34 florent61647053 Twitter… 768 6.65
## 6 2021-03-24 08:53:37 Ritu89903967 Twitter… 709 6.57
## 7 2021-03-24 08:53:27 aspeaker66 Twitter… 331 5.81
## 8 2021-03-24 08:53:42 JamesAn26254230 Twitter… 201 5.31
3.7 Select
select is used to keep only some of the columns of the original data.frame. For instance, we can apply the function to keep just the columns device and retweet_count.
## # A tibble: 6 × 2
## device retweet_count
## <chr> <int>
## 1 Twitter for iPhone 0
## 2 Twitter for iPhone 0
## 3 Twitter for iPhone 3
## 4 Twitter Web App 0
## 5 Twitter Web App 4
## 6 Twitter for iPad 96