Chapter 3 Basic Data Wrangling with Tidyverse

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.

Another definition is as follows: Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that the data contain the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.

How to “manipulate” data sets in R:

  • use basic R functions;
  • employ specific libraries such as tidyverse. Tidyverse is an R library composed of functions that allow users to perform basic and advanced data science operations. https://www.tidyverse.org.

In R, a library (or “package”) is a coherent collection of functions, usually created for specific purposes.

To work with the tidyverse library, it is necessary to install it first, by using the following command: install.packages(“tidyverse”).

After having installed tidyverse (or any other library), it is necessary to load it, so as we can work with its functions in the current R session:

# to load a library used the command library(NAME-OF-THE-LIBRARY)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Besides using the function install.packages(NAME-OF-THE-LIBRARY) by using a line of code, it is also possible to use the RStudio interface.

Install and Load a Library

3.1 The Pipe Operator %>%

Tidyverse has a peculiar syntax that makes use of the so-called pipe operator %>%, like in the following example:

a_dataframe %>%
  group_by(second_variable) %>%
  summarize(mean = mean(a_new_variable))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
##   second_variable  mean
##   <fct>           <dbl>
## 1 Friday            3  
## 2 Monday           37  
## 3 Saturday        140. 
## 4 Sunday           48  
## 5 Thursday         25.5
## 6 Tuesday         145  
## 7 Wednesday       138

To manipulate data sets we can rely on the functions included in dplyr: a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges, such as mutate, rename, summarize.

library(readr)
tweets <- read_csv("data/tweets_covid_small.csv", 
    col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
        retweet_count = col_integer()))
head(tweets)
## # A tibble: 6 x 4
##   created_at          screen_name   source             retweet_count
##   <dttm>              <chr>         <chr>                      <int>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone             0
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone             0
## 3 2021-03-24 08:53:52 AlexS1595     Twitter for iPhone             3
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App                0
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App                4
## 6 2021-03-24 08:53:51 GGrahambute   Twitter for iPad              96

Import data and specify data types (dates and times)

3.2 Mutate

The function mutate adds new variables to a data.frame or overwrites existing variables.

tweets <- tweets %>%
  mutate(log_retweet_count = log(retweet_count))
  
head(tweets)
## # A tibble: 6 x 5
##   created_at          screen_name   source             retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>                      <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone             0           -Inf   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone             0           -Inf   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter for iPhone             3              1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App                0           -Inf   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App                4              1.39
## 6 2021-03-24 08:53:51 GGrahambute   Twitter for iPad              96              4.56

3.3 Rename

Rename is a function to change the name of columns (sometimes it could be useful).

tweets <- tweets %>%
  # rename (new_name = old_name)
  rename(device = source)

head(tweets)
## # A tibble: 6 x 5
##   created_at          screen_name   device             retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>                      <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone             0           -Inf   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone             0           -Inf   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter for iPhone             3              1.10
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App                0           -Inf   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App                4              1.39
## 6 2021-03-24 08:53:51 GGrahambute   Twitter for iPad              96              4.56

The previous two steps can be performed at the same time, by concatenating the operations through the pipe %>% operator.

# load again the data set
library(readr)
tweets <- read_csv("data/tweets_covid_small.csv", 
    col_types = cols(created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
        retweet_count = col_integer()))

tweets <- tweets %>%
  mutate(log_retweet_count = log(retweet_count+1)) %>%
  rename(device = source)

head(tweets)
## # A tibble: 6 x 5
##   created_at          screen_name   device             retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>                      <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone             0              0   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone             0              0   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter for iPhone             3              1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App                0              0   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App                4              1.61
## 6 2021-03-24 08:53:51 GGrahambute   Twitter for iPad              96              4.57

To check the data format of the variables stored in the data.frame, it can be used the command str():

str(tweets)
## tibble [100 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ created_at       : POSIXct[1:100], format: "2021-03-24 08:53:52" "2021-03-24 08:53:25" "2021-03-24 08:53:52" ...
##  $ screen_name      : chr [1:100] "DoYourThingUK" "DoYourThingUK" "AlexS1595" "MakesworthAcc" ...
##  $ device           : chr [1:100] "Twitter for iPhone" "Twitter for iPhone" "Twitter for iPhone" "Twitter Web App" ...
##  $ retweet_count    : int [1:100] 0 0 3 0 4 96 0 1 0 3 ...
##  $ log_retweet_count: num [1:100] 0 0 1.39 0 1.61 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   created_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
##   ..   screen_name = col_character(),
##   ..   source = col_character(),
##   ..   retweet_count = col_integer()
##   .. )

Sometimes variables are stored in the data.frame in the wrong format (see the paragraph “data type”), so we want to convert them into a new format. For this purpose we can use, again, the function mutate, along with other functions such as.integer, as.numeric, as.character, as.factors, or as.logical, as.Date, or as.POSIXct() based on the desired data format (it is possible and advisable to upload the data by paying attention to the type of data. If you upload the data in the right format, you can skip this step).

tweets %>%
  mutate(device = as.character(device)) %>%
  head()
## # A tibble: 6 x 5
##   created_at          screen_name   device             retweet_count log_retweet_count
##   <dttm>              <chr>         <chr>                      <int>             <dbl>
## 1 2021-03-24 08:53:52 DoYourThingUK Twitter for iPhone             0              0   
## 2 2021-03-24 08:53:25 DoYourThingUK Twitter for iPhone             0              0   
## 3 2021-03-24 08:53:52 AlexS1595     Twitter for iPhone             3              1.39
## 4 2021-03-24 08:53:51 MakesworthAcc Twitter Web App                0              0   
## 5 2021-03-24 08:53:39 MakesworthAcc Twitter Web App                4              1.61
## 6 2021-03-24 08:53:51 GGrahambute   Twitter for iPad              96              4.57

3.4 Summarize and group_by

To aggregate data and calculate synthetic values (for instance, the average number of tweets by day), it can be used the function group_by (to aggregate data, for instance by day), and summarize, to calculate the summary values.

tweets_summary <- tweets %>%
  group_by(screen_name) %>%
  summarize(average_retweets = mean(retweet_count))

head(tweets_summary)
## # A tibble: 6 x 2
##   screen_name     average_retweets
##   <chr>                      <dbl>
## 1 2EXvoZ6nublpw1F              164
## 2 AdilHaiderMD                  80
## 3 AlexS1595                      3
## 4 Andecave                      20
## 5 anshunandanpra4                2
## 6 ApKido                       150

It is also possible to create more than one summary variables at once.

tweets_summary <- tweets %>%
  group_by(screen_name) %>%
  summarize(average_retweets = mean(retweet_count),
            average_log_retweets = mean(log_retweet_count))

head(tweets_summary)
## # A tibble: 6 x 3
##   screen_name     average_retweets average_log_retweets
##   <chr>                      <dbl>                <dbl>
## 1 2EXvoZ6nublpw1F              164                 5.11
## 2 AdilHaiderMD                  80                 4.39
## 3 AlexS1595                      3                 1.39
## 4 Andecave                      20                 3.04
## 5 anshunandanpra4                2                 1.10
## 6 ApKido                       150                 5.02

3.4.1 Count occurrences

A useful operation to perform when summarizing data, is to count the occurrences of a certain variable. For instance, to count the number of tweets sent by each user, it can be used the function n() inside the summarize function.

tweets_summary <- tweets %>%
  group_by(screen_name) %>%
  summarize(average_retweets = mean(retweet_count),
            average_log_retweets = mean(log_retweet_count),
            number_of_tweets = n())

head(tweets_summary)
## # A tibble: 6 x 4
##   screen_name     average_retweets average_log_retweets number_of_tweets
##   <chr>                      <dbl>                <dbl>            <int>
## 1 2EXvoZ6nublpw1F              164                 5.11                1
## 2 AdilHaiderMD                  80                 4.39                1
## 3 AlexS1595                      3                 1.39                1
## 4 Andecave                      20                 3.04                1
## 5 anshunandanpra4                2                 1.10                1
## 6 ApKido                       150                 5.02                1

3.5 Arrange

To explore a data set it can be useful to sort the data (e.g.: from the lowest to the highest value of a variable). With tidyverse, we can order a data.frame by using the function arrange.

To sort the data from the highest to the lowest value (descending order) the minus sign (or the “desc” function) has to be added.

tweets_summary  %>%
  arrange(-number_of_tweets) %>%
  head()
## # A tibble: 6 x 4
##   screen_name     average_retweets average_log_retweets number_of_tweets
##   <chr>                      <dbl>                <dbl>            <int>
## 1 iprdhzb                    0.667                0.462                3
## 2 benphillips76              3.5                  1.45                 2
## 3 DoYourThingUK              0                    0                    2
## 4 MakesworthAcc              2                    0.805                2
## 5 viralvideovlogs            3.5                  1.45                 2
## 6 2EXvoZ6nublpw1F          164                    5.11                 1
tweets_summary  %>%
  arrange(desc(average_retweets)) %>%
  head()
## # A tibble: 6 x 4
##   screen_name    average_retweets average_log_retweets number_of_tweets
##   <chr>                     <dbl>                <dbl>            <int>
## 1 Oliver_Miguel1             1988                 7.60                1
## 2 Lil_3arbiii                1627                 7.40                1
## 3 Kittyhawk681               1285                 7.16                1
## 4 JulesFox12                 1091                 7.00                1
## 5 rosaesaa26                  983                 6.89                1
## 6 lewisabzueta                822                 6.71                1

Without the minus sign (or the “desc” command), data are sorted from the lowest to the highest value.

tweets_summary  %>%
  arrange(number_of_tweets) %>%
  head()
## # A tibble: 6 x 4
##   screen_name     average_retweets average_log_retweets number_of_tweets
##   <chr>                      <dbl>                <dbl>            <int>
## 1 2EXvoZ6nublpw1F              164                 5.11                1
## 2 AdilHaiderMD                  80                 4.39                1
## 3 AlexS1595                      3                 1.39                1
## 4 Andecave                      20                 3.04                1
## 5 anshunandanpra4                2                 1.10                1
## 6 ApKido                       150                 5.02                1

3.6 Filter

The function filter keeps only the cases (the “rows”) we want to focus on. The arguments of this function are the conditions that have to be fulfilled to filter the data: a) the name of the column that we want to filter, b) the values to be kept.

tweets %>%
  filter(retweet_count >= 500) %>%
  arrange(-retweet_count)
## # A tibble: 10 x 5
##    created_at          screen_name     device              retweet_count log_retweet_count
##    <dttm>              <chr>           <chr>                       <int>             <dbl>
##  1 2021-03-24 08:53:31 Oliver_Miguel1  Twitter for Android          1988              7.60
##  2 2021-03-24 08:53:48 Lil_3arbiii     Twitter for iPhone           1627              7.40
##  3 2021-03-24 08:53:48 Kittyhawk681    Twitter Web App              1285              7.16
##  4 2021-03-24 08:53:37 JulesFox12      Twitter for Android          1091              7.00
##  5 2021-03-24 08:53:42 rosaesaa26      Twitter for Android           983              6.89
##  6 2021-03-24 08:53:42 lewisabzueta    Twitter for Android           822              6.71
##  7 2021-03-24 08:53:42 jonvthvn08      Twitter for iPhone            768              6.65
##  8 2021-03-24 08:53:34 florent61647053 Twitter for Android           768              6.65
##  9 2021-03-24 08:53:37 Ritu89903967    Twitter for Android           709              6.57
## 10 2021-03-24 08:53:33 Hurica3         Twitter for iPhone            575              6.36

In the examples below, notice the use of a double equal sign ==, and also of the quotation marks to indicate the modalities of a categorical variable.

tweets %>%
  filter(retweet_count == 1988) 
## # A tibble: 1 x 5
##   created_at          screen_name    device              retweet_count log_retweet_count
##   <dttm>              <chr>          <chr>                       <int>             <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1 Twitter for Android          1988              7.60
tweets %>%
  filter(device == "Twitter for Android")
## # A tibble: 33 x 5
##    created_at          screen_name     device              retweet_count log_retweet_count
##    <dttm>              <chr>           <chr>                       <int>             <dbl>
##  1 2021-03-24 08:53:49 marcin_lukawski Twitter for Android             1             0.693
##  2 2021-03-24 08:53:49 LebodyRanya     Twitter for Android             0             0    
##  3 2021-03-24 08:53:47 anshunandanpra4 Twitter for Android             2             1.10 
##  4 2021-03-24 08:53:44 insoumise007    Twitter for Android             5             1.79 
##  5 2021-03-24 08:53:43 Metamorfopsies  Twitter for Android             0             0    
##  6 2021-03-24 08:53:43 keepsmiling_130 Twitter for Android           164             5.11 
##  7 2021-03-24 08:53:43 lovebresil01    Twitter for Android            81             4.41 
##  8 2021-03-24 08:53:42 LightHealing    Twitter for Android             1             0.693
##  9 2021-03-24 08:53:42 lewisabzueta    Twitter for Android           822             6.71 
## 10 2021-03-24 08:53:42 rosaesaa26      Twitter for Android           983             6.89 
## # … with 23 more rows

It is also possible to use several conditions at the same time.

tweets %>%
  filter(device == "Twitter for Android",
         retweet_count > 200) %>%
  arrange(-retweet_count)
## # A tibble: 8 x 5
##   created_at          screen_name     device              retweet_count log_retweet_count
##   <dttm>              <chr>           <chr>                       <int>             <dbl>
## 1 2021-03-24 08:53:31 Oliver_Miguel1  Twitter for Android          1988              7.60
## 2 2021-03-24 08:53:37 JulesFox12      Twitter for Android          1091              7.00
## 3 2021-03-24 08:53:42 rosaesaa26      Twitter for Android           983              6.89
## 4 2021-03-24 08:53:42 lewisabzueta    Twitter for Android           822              6.71
## 5 2021-03-24 08:53:34 florent61647053 Twitter for Android           768              6.65
## 6 2021-03-24 08:53:37 Ritu89903967    Twitter for Android           709              6.57
## 7 2021-03-24 08:53:27 aspeaker66      Twitter for Android           331              5.81
## 8 2021-03-24 08:53:42 JamesAn26254230 Twitter for Android           201              5.31

3.7 Select

Select is used to keep just some of the columns of the original data.frame. For instance, we can apply the function in order to keep just the column “device” and “retweet_count”.

tweets %>%
  select(device, retweet_count) %>%
  head()
## # A tibble: 6 x 2
##   device             retweet_count
##   <chr>                      <int>
## 1 Twitter for iPhone             0
## 2 Twitter for iPhone             0
## 3 Twitter for iPhone             3
## 4 Twitter Web App                0
## 5 Twitter Web App                4
## 6 Twitter for iPad              96

3.8 Exercise

Here are some exercises to consolidate the fundamental R skills learned during these first lessons:

  • Download the csv file tweets_vienna_small.csv in this folder and put it into the project folder “data”

  • Upload the data set in R, setting the appropriate formats for the variables

  • Create a new script named “data-manipulation” with your surname and name as follows: “YOUR SURNAME-YOUR NAME-data-manipulation”, and save it

  • In the script, write the code to perform the following operations:

    • load the library “Tidyverse”

    • show the first rows of the data frame by using the function head

    • create a new data frame “tweets_vienna_small_updated” by updating the dataframe “tweets_vienna_small” by using the function mutate to create a new column “log_friends_count” whose values are the log of the values in the column “friends_count” (you don’t need to add 1 to the values in the column “friends_count”)

    • save the updated dataframe “tweets_vienna_small_updated”, by using the following code to save a csv file (please change YOUR SURNAME-YOUR NAME with your actual surname and name): write.csv(tweets_vienna_small_updated, file = “./data/YOUR SURNAME-YOUR NAME-tweets_vienna_small_updated.csv”, row.names=F) (we add row.names=F to avoid saving the number that indexes each row)

    • create a new data frame named “summary_tweets_vienna_small” aggregating the data by “screen_name” (using the function group_by) and then summarizing the data (by using the function summarize) as follows:

      • in a column named “average_favorite_count”, calculate the average of “favorite_count” by “screen_name” (that is, by user)

      • in a column named “average_retweet_count”, calculate the average of “retweet_count” (by user, obviously, since the data are already aggregated by user’ name)

      • in a column named “number_of_tweets”, calculate the number of tweets published by each users (by using the function n())

      • save the “summary_tweets_vienna_small” in the data folder of the project, in csv format, and with the name “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small.csv” (remember to specify row.names=F and to change YOUR SURNAME-YOUR NAME with your actual surname and name)

      • create a new data frame object called “summary_tweets_vienna_small_filtered”, where you will save the data.frame summary_tweets_vienna_small, after having filtered the rows with average_retweet_count higher than 10 (by using the function filter), and after having selected the column “screen_name” and “average_retweet_count” (so, you should end up with a data frame with just two columns, “screen_name” and “average_retweet_count”, and the rows with “average_retweet_count” higher than 10)

      • save the data frame “summary_tweets_vienna_small_filtered” with the name “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small_filtered.csv” in the folder data (remember to specify row.names=F and to change YOUR SURNAME-YOUR NAME with your actual surname and name).

Save the script “YOUR SURNAME-YOUR NAME-data-manipulation” with all the code you have used to perform these analysis. Write a comment in the script (using the hash mark #) if you are not able to do something.

Upload the script “YOUR SURNAME-YOUR NAME-data-manipulation.r” and the files “YOUR SURNAME-YOUR NAME-tweets_vienna_small_updated.csv”, “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small.csv”, and the file “YOUR SURNAME-YOUR NAME-summary_tweets_vienna_small_filtered.csv” on Moodle, in the folder “HomeWork-1”. The deadline is Sunday 11 April.