Chapter3 Basic Data Wrangling with Tidyverse

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.

Another definition is as follows: Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that the data contain the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.

How to “manipulate” data sets in R:

  • use basic R functions;
  • employ specific libraries such as tidyverse. Tidyverse is an R library composed of functions that allow users to perform basic and advanced data science operations. https://www.tidyverse.org.

In R, a library (or “package”) is a coherent collection of functions, usually created for specific purposes.

To work with the tidyverse library, it is necessary to install it first, by using the following command: install.packages(“tidyverse”).

After having installed tidyverse (or any other library), it is necessary to load it, so as we can work with its functions in the current R session:

# to load a library used the command library(NAME-OF-THE-LIBRARY)
library(tidyverse)
## ── Attaching core tidyverse packages ────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Besides using the function install.packages(NAME-OF-THE-LIBRARY) by using a line of code, it is also possible to use the RStudio interface.

Click here to watch “Install and Load Libraries”

3.1 The Pipe Operator %>%

Tidyverse has a peculiar syntax that makes use of the so-called pipe operator %>%, like in the following example:

a_dataframe %>%
  group_by(second_variable) %>%
  summarize(mean = mean(a_new_variable))
## # A tibble: 7 × 2
##   second_variable  mean
##   <chr>           <dbl>
## 1 Friday            3  
## 2 Monday           37  
## 3 Saturday        140. 
## 4 Sunday           48  
## 5 Thursday         25.5
## 6 Tuesday         145  
## 7 Wednesday       138

To manipulate data sets we can rely on the functions included in dplyr: a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges, such as mutate, rename, summarize.

library(readr)
tweets <- read_csv("data/tweets_covid_small.csv", 
    col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
        retweet_count = col_integer()))
head(tweets)
## # A tibble: 6 × 4
##   created_at          screen_name   source             retweet_count
##   <dttm>              <chr>         <chr>                      <int>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone             0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone             0
## 3 2021-03-24 08:53:52 AlexS1595     Twitter for iPhone             3
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App                0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App                4
## 6 2021-03-24 08:53:51 GGrahambute   Twitter for iPad              96

Click here to watch “Import Data and Specify Data Types (Dates and Times)”

3.2 Mutate

The function mutate adds new variables to a data.frame or overwrites existing variables.

tweets <- tweets %>%
  mutate(log_retweet_count = log(retweet_count))
  
head(tweets)
## # A tibble: 6 × 5
##   created_at          screen_name   source     retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>              <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f…             0           -Inf   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f…             0           -Inf   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter f…             3              1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W…             0           -Inf   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W…             4              1.39
## 6 2021-03-24 08:53:51 GGrahambute   Twitter f…            96              4.56

3.3 Rename

rename is a function to change the name of columns (sometimes it can be useful).

tweets <- tweets %>%
  # rename (new_name = old_name)
  rename(device = source)

head(tweets)
## # A tibble: 6 × 5
##   created_at          screen_name   device     retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>              <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f…             0           -Inf   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f…             0           -Inf   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter f…             3              1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W…             0           -Inf   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W…             4              1.39
## 6 2021-03-24 08:53:51 GGrahambute   Twitter f…            96              4.56

The previous two steps can be performed at the same time, by concatenating the operations through the pipe %>% operator.

# load again the data set
library(readr)
tweets <- read_csv("data/tweets_covid_small.csv", 
    col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
        retweet_count = col_integer()))

tweets <- tweets %>%
  mutate(log_retweet_count = log(retweet_count+1)) %>%
  rename(device = source)

head(tweets)
## # A tibble: 6 × 5
##   created_at          screen_name   device     retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>              <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f…             0              0   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f…             0              0   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter f…             3              1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W…             0              0   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W…             4              1.61
## 6 2021-03-24 08:53:51 GGrahambute   Twitter f…            96              4.57

To check the data format of the variables stored in the data.frame, you can use the command str().

str(tweets)
## tibble [100 × 5] (S3: tbl_df/tbl/data.frame)
##  $ created_at       : POSIXct[1:100], format: "2021-03-24 08:53:52" "2021-03-24 08:53:25" ...
##  $ screen_name      : chr [1:100] "DoYourThingUK" "DoYourThingUK" "AlexS1595" "MakesworthAcc" ...
##  $ device           : chr [1:100] "Twitter for iPhone" "Twitter for iPhone" "Twitter for iPhone" "Twitter Web App" ...
##  $ retweet_count    : int [1:100] 0 0 3 0 4 96 0 1 0 3 ...
##  $ log_retweet_count: num [1:100] 0 0 1.39 0 1.61 ...

Sometimes variables are stored in the data.frame in the wrong format (see the paragraph “data type”), so we may want to convert them to a new format. For this purpose, we can use the function mutate along with other functions such as as.integer, as.numeric, as.character, as.factor, as.logical, as.Date, or as.POSIXct() depending on the desired data format (it is possible and advisable to upload the data by paying attention to the type of data. If you upload the data in the correct format, you can skip this step).

tweets %>%
  mutate(device = as.character(device)) %>%
  head()
## # A tibble: 6 × 5
##   created_at          screen_name   device     retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>              <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter f…             0              0   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter f…             0              0   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter f…             3              1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter W…             0              0   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter W…             4              1.61
## 6 2021-03-24 08:53:51 GGrahambute   Twitter f…            96              4.57

3.4 Summarize and group_by

To aggregate data and calculate summary values (for instance, the average number of tweets by day), you can use the function group_by (to aggregate data, for instance by day) and summarize to calculate the summary values.

tweets_summary <- tweets %>%
  group_by(screen_name) %>%
  summarize(average_retweets = mean(retweet_count))

head(tweets_summary)
## # A tibble: 6 × 2
##   screen_name     average_retweets
##   <chr>                      <dbl>
## 1 2EXvoZ6nublpw1F              164
## 2 AdilHaiderMD                  80
## 3 AlexS1595                      3
## 4 Andecave                      20
## 5 ApKido                       150
## 6 BBVA_Trader                    0

It is also possible to create more than one summary variables at once.

tweets_summary <- tweets %>%
  group_by(screen_name) %>%
  summarize(average_retweets = mean(retweet_count),
            average_log_retweets = mean(log_retweet_count))

head(tweets_summary)
## # A tibble: 6 × 3
##   screen_name     average_retweets average_log_retweets
##   <chr>                      <dbl>                <dbl>
## 1 2EXvoZ6nublpw1F              164                 5.11
## 2 AdilHaiderMD                  80                 4.39
## 3 AlexS1595                      3                 1.39
## 4 Andecave                      20                 3.04
## 5 ApKido                       150                 5.02
## 6 BBVA_Trader                    0                 0

3.4.1 Count occurrences

A useful operation to perform when summarizing data is to count the occurrences of a certain variable. For instance, to count the number of tweets sent by each user, you can use the function n() inside the summarize function.

tweets_summary <- tweets %>%
  group_by(screen_name) %>%
  summarize(average_retweets = mean(retweet_count),
            average_log_retweets = mean(log_retweet_count),
            number_of_tweets = n())

head(tweets_summary)
## # A tibble: 6 × 4
##   screen_name     average_retweets average_log_retweets number_of_tweets
##   <chr>                      <dbl>                <dbl>            <int>
## 1 2EXvoZ6nublpw1F              164                 5.11                1
## 2 AdilHaiderMD                  80                 4.39                1
## 3 AlexS1595                      3                 1.39                1
## 4 Andecave                      20                 3.04                1
## 5 ApKido                       150                 5.02                1
## 6 BBVA_Trader                    0                 0                   1

3.5 Arrange

To explore a data set, it can be useful to sort the data (e.g., from the lowest to the highest value of a variable). With tidyverse, we can order a data.frame by using the function arrange.

To sort the data from the highest to the lowest value (descending order), the minus sign (or the desc function) has to be added.

tweets_summary  %>%
  arrange(-number_of_tweets) %>%
  head()
## # A tibble: 6 × 4
##   screen_name     average_retweets average_log_retweets number_of_tweets
##   <chr>                      <dbl>                <dbl>            <int>
## 1 iprdhzb                    0.667                0.462                3
## 2 DoYourThingUK              0                    0                    2
## 3 MakesworthAcc              2                    0.805                2
## 4 benphillips76              3.5                  1.45                 2
## 5 viralvideovlogs            3.5                  1.45                 2
## 6 2EXvoZ6nublpw1F          164                    5.11                 1
tweets_summary  %>%
  arrange(desc(average_retweets)) %>%
  head()
## # A tibble: 6 × 4
##   screen_name    average_retweets average_log_retweets number_of_tweets
##   <chr>                     <dbl>                <dbl>            <int>
## 1 Oliver_Miguel1             1988                 7.60                1
## 2 Lil_3arbiii                1627                 7.40                1
## 3 Kittyhawk681               1285                 7.16                1
## 4 JulesFox12                 1091                 7.00                1
## 5 rosaesaa26                  983                 6.89                1
## 6 lewisabzueta                822                 6.71                1

Without the minus sign (or the “desc” command), data are sorted from the lowest to the highest value.

tweets_summary  %>%
  arrange(number_of_tweets) %>%
  head()
## # A tibble: 6 × 4
##   screen_name     average_retweets average_log_retweets number_of_tweets
##   <chr>                      <dbl>                <dbl>            <int>
## 1 2EXvoZ6nublpw1F              164                 5.11                1
## 2 AdilHaiderMD                  80                 4.39                1
## 3 AlexS1595                      3                 1.39                1
## 4 Andecave                      20                 3.04                1
## 5 ApKido                       150                 5.02                1
## 6 BBVA_Trader                    0                 0                   1

3.6 Filter

The function filter keeps only the cases (the “rows”) we want to focus on. The arguments of this function are the conditions that must be fulfilled to filter the data: a) the name of the column that we want to filter, and b) the values to be kept.

tweets %>%
  filter(retweet_count >= 500) %>%
  arrange(-retweet_count)
## # A tibble: 10 × 5
##    created_at          screen_name     device  retweet_count log_retweet_count
##    <dttm>              <chr>           <chr>           <int>             <dbl>
##  1 2021-03-24 08:53:31 Oliver_Miguel1  Twitte…          1988              7.60
##  2 2021-03-24 08:53:48 Lil_3arbiii     Twitte…          1627              7.40
##  3 2021-03-24 08:53:48 Kittyhawk681    Twitte…          1285              7.16
##  4 2021-03-24 08:53:37 JulesFox12      Twitte…          1091              7.00
##  5 2021-03-24 08:53:42 rosaesaa26      Twitte…           983              6.89
##  6 2021-03-24 08:53:42 lewisabzueta    Twitte…           822              6.71
##  7 2021-03-24 08:53:42 jonvthvn08      Twitte…           768              6.65
##  8 2021-03-24 08:53:34 florent61647053 Twitte…           768              6.65
##  9 2021-03-24 08:53:37 Ritu89903967    Twitte…           709              6.57
## 10 2021-03-24 08:53:33 Hurica3         Twitte…           575              6.36

In the examples below, notice the use of a double equal sign ==, and also of the quotation marks to indicate the modalities of a categorical variable.

tweets %>%
  filter(retweet_count == 1988) 
## # A tibble: 1 × 5
##   created_at          screen_name    device    retweet_count log_retweet_count
##   <dttm>              <chr>          <chr>             <int>             <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitter …          1988              7.60
tweets %>%
  filter(device == "Twitter for Android")
## # A tibble: 33 × 5
##    created_at          screen_name     device  retweet_count log_retweet_count
##    <dttm>              <chr>           <chr>           <int>             <dbl>
##  1 2021-03-24 08:53:49 marcin_lukawski Twitte…             1             0.693
##  2 2021-03-24 08:53:49 LebodyRanya     Twitte…             0             0    
##  3 2021-03-24 08:53:47 anshunandanpra4 Twitte…             2             1.10 
##  4 2021-03-24 08:53:44 insoumise007    Twitte…             5             1.79 
##  5 2021-03-24 08:53:43 Metamorfopsies  Twitte…             0             0    
##  6 2021-03-24 08:53:43 keepsmiling_130 Twitte…           164             5.11 
##  7 2021-03-24 08:53:43 lovebresil01    Twitte…            81             4.41 
##  8 2021-03-24 08:53:42 LightHealing    Twitte…             1             0.693
##  9 2021-03-24 08:53:42 lewisabzueta    Twitte…           822             6.71 
## 10 2021-03-24 08:53:42 rosaesaa26      Twitte…           983             6.89 
## # ℹ 23 more rows

It is also possible to use several conditions at the same time.

tweets %>%
  filter(device == "Twitter for Android",
         retweet_count > 200) %>%
  arrange(-retweet_count)
## # A tibble: 8 × 5
##   created_at          screen_name     device   retweet_count log_retweet_count
##   <dttm>              <chr>           <chr>            <int>             <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1  Twitter…          1988              7.60
## 2 2021-03-24 08:53:37 JulesFox12      Twitter…          1091              7.00
## 3 2021-03-24 08:53:42 rosaesaa26      Twitter…           983              6.89
## 4 2021-03-24 08:53:42 lewisabzueta    Twitter…           822              6.71
## 5 2021-03-24 08:53:34 florent61647053 Twitter…           768              6.65
## 6 2021-03-24 08:53:37 Ritu89903967    Twitter…           709              6.57
## 7 2021-03-24 08:53:27 aspeaker66      Twitter…           331              5.81
## 8 2021-03-24 08:53:42 JamesAn26254230 Twitter…           201              5.31

3.7 Select

select is used to keep only some of the columns of the original data.frame. For instance, we can apply the function to keep just the columns device and retweet_count.

tweets %>%
  dplyr::select(device, retweet_count) %>%
  head()
## # A tibble: 6 × 2
##   device             retweet_count
##   <chr>                      <int>
## 1 Twitter for iPhone             0
## 2 Twitter for iPhone             0
## 3 Twitter for iPhone             3
## 4 Twitter Web App                0
## 5 Twitter Web App                4
## 6 Twitter for iPad              96