# Chapter 10 Intervention Analysis

In this chapter we are going to learn about *intervention analysis* (sometimes also called *interrupted time-series analysis*) and see how to conduct an intervention analysis.

**Intervention analysis** is typically conducted with the Box & Jenkins ARIMA framework and traditionally uses a method introduced by Box and Tiao (1975)^{8}, who provided a framework for assessing *the effect of an intervention on a time series* under study.

As summarized by Box and Tiao: *Given a known intervention, is there evidence that change in the series of the kind expected actually occurred, and, if so, what can be said of the nature and magnitude of the change?* In other words, *intervention analysis estimates the effect of an external or exogenous intervention on a time series*. To conduct such an analysis, it is necessary to know the date of the intervention.

Intervention analysis is a “quasi-experimental” design and an interesting approach to test whether *exogenous shocks*, such as the introduction of a new policy, *impact a time series process in a significant way*, that is, **by changing the mean function or trend** of the series.

Behind intervention analysis there is the *causal hypothesis* that observations *after* a **treatment** (the **“intervention”**) have a different level or slope from those before the intervention/interruption.

Besides *intervention* or *interrupted time-series analysis* with ARIMA, the analysis can also be conducted with the *segmented regression* method. However, as with traditional regression models applied to time series data, this approach does not take into account the autocorrelated structure of the series. Other methods include more complex computational approaches.

## 10.1 Types of intervention

There are different types of interventions. For instance, an intervention can have an abrupt impact determining a permanent or temporary change, a sudden and short-lived change due to an event, or a more gradual yet permanent change.
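These patterns can be encoded as dummy variables. A minimal base-R sketch, using a hypothetical series of 100 observations with an intervention at time 71 (both values are illustrative assumptions):

```r
# Hypothetical setting: n = 100 observations, intervention at time t0 = 71
n  <- 100
t0 <- 71
# Step: abrupt, permanent change (0 before the intervention, 1 afterwards)
step  <- as.numeric(seq_len(n) >= t0)
# Pulse: sudden, short-lived change (1 only at the time of the intervention)
pulse <- as.numeric(seq_len(n) == t0)
# Ramp: gradual yet permanent change (increases linearly from the intervention on)
ramp  <- pmax(seq_len(n) - t0 + 1, 0)
```

These are exactly the kinds of variables used as external regressors in the ARIMA example below.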

## 10.2 Intervention analysis with ARIMA

To exemplify an intervention analysis, we are going to reproduce the example in the paper “Interrupted time series analysis using autoregressive integrated moving average (ARIMA) models: a guide for evaluating large-scale health interventions”.

The data to run the analysis can be downloaded here.

The example evaluates the impact of a health policy intervention: an Australian policy that restricted the conditions under which a particular medicine (quetiapine) could be subsidised. The same methodological process can be applied to evaluate any intervention in any context.

The case study is described as follows:

(…) due to growing concerns about inappropriate prescribing, after January 1, 2014 new prescriptions for this tablet strength could not include refills. Our primary outcome was the number of monthly dispensings of 25 mg quetiapine, of which we had 48 months of observations (January 2011 to December 2014).

Thus, data comprises 48 months of observations, and the date of the intervention is January 1, 2014.

There is also seasonality in the process:

In Australia, medicine dispensing claims have significant yearly seasonality. Medicines are subsidised for citizens and eligible residents through the Pharmaceutical Benefits Scheme (PBS), with people paying an out-of-pocket co-payment towards the cost of their medicines, while the remainder is subsidised. If a person’s (or family’s) total out-of-pocket costs reach the “Safety Net threshold” for the calendar year, they are eligible for a reduced co-payment for the remainder of that year. Thus, there is an incentive for people reaching their Safety Net to refill their medicines more frequently towards the end of the year. Hence, we see an increase in prescriptions at the end of the year, followed by a decrease in January.

The researchers hypothesize the nature of the intervention as follows (see the picture below):

(…) due to the nature of the intervention we postulated there would be an immediate drop in dispensings post-intervention (step change), as well as a change in slope (ramp). Thus, we included variables representing both types of impacts in our model. For both impacts, h = 0 and r = 0.

In the sentence above, *h* describes when the effect happens, while *r* represents its decay pattern (see the picture below).

First we load the data, convert it to a time series format, and visualize the time series along with a vertical line representing the date of the intervention.

```
# Load data
quet <- read.csv(file = "./data/12874_2021_1235_MOESM1_ESM.csv")
# Convert data to time series object
quet.ts <- ts(quet[,2], frequency=12, start=c(2011, 1))
# Plot data to visualize time series
plot.ts(quet.ts, ylim=c(0, 40000), col = "blue", xlab = "Month", ylab = "Dispensings")
# Add vertical line indicating date of intervention (January 1, 2014)
abline(v=2014, col = "gray", lty = "dashed", lwd=2)
```

Next, we have to create the dummy variables representing our intervention.
This can be tricky in R. In this case, the authors convert the time index of the *ts* object into a more human-readable format through the *as.yearmon* function (this is a *zoo* function, and you can use it by loading the *xts* library).

```
library(xts)
# Create variable representing step change and view
step <- as.numeric(as.yearmon(time(quet.ts)) >= "Jan 2014")
step
```

`## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1`

The above vector is a dummy variable for the intervention: it is equal to zero before the date of the intervention, and to one from that date onward.

Next, in this specific case, we also want to create a variable representing a constant, increasing change, capturing a growing effect of the intervention over time.
Also in this case the creation of the variable can be a little tricky. We create two vectors by using the *rep* and the *seq* functions, and concatenate them by using the *c* function.

The arguments of the *rep* function are two integers, *x* and *times* (*rep(x, times)*): the function creates a vector that repeats (“rep”) the value *x* the number of times specified by *times*. We have 36 months before the intervention, and we assign them the value zero.

```
# 36 months before the intervention, all set to zero
rep(0, 36)
```

`## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0`

Instead, we use the *seq* function to create a vector with increasing values. This part of the variable represents a gradual increase after the intervention (we have 12 months of data after the intervention). The function *seq* takes the three arguments *from*, *to*, and *by*: *from* and *to* are the starting and ending values of the sequence, and *by* is its increment. In our case we create a sequence of values that increases from 1 to 12 by 1.

```
# 12 months after the intervention, increasing by 1 each month
seq(1, 12, 1)
```

`## [1] 1 2 3 4 5 6 7 8 9 10 11 12`

To create the variable we need, we concatenate both vectors with the *c* function, as follows:

```
# Create variable representing ramp (change in slope) and view
ramp <- c(rep(0, 36), seq(1, 12, 1))
ramp
```

```
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [34] 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12
```

We search for an appropriate ARIMA model for the data by using the *auto.arima* function (*forecast* package). We include the variables we have created as external regressors.

```
library(forecast)
# Use automated algorithm to identify parameters
model1 <- auto.arima(quet.ts, xreg = cbind(step, ramp), stepwise=FALSE)
# Check residuals
checkresiduals(model1)
```

```
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(2,1,0)(0,1,1)[12] errors
## Q* = 9.5692, df = 5, p-value = 0.0884
##
## Model df: 5. Total lags used: 10
```

The resulting model is an *ARIMA(2,1,0)(0,1,1)[12]*.

```
## Series: quet.ts
## Regression with ARIMA(2,1,0)(0,1,1)[12] errors
##
## Coefficients:
## ar1 ar2 sma1 step ramp
## -0.873 -0.6731 -0.6069 -3284.7792 -1396.6523
## s.e. 0.124 0.1259 0.3872 602.3347 106.6327
##
## sigma^2 estimated as 648828: log likelihood=-284.45
## AIC=580.89 AICc=583.89 BIC=590.23
```

We now use the model specification identified by *auto.arima* to fit the same ARIMA model to the data without the intervention variables, using only the observations up to the date of the intervention (that is, up to December 2013). To do that, we use the *window* function to restrict the data we consider, indicating December 2013 as the end of our series.

```
# To forecast the counterfactual, model data excluding post-intervention time period
model2 <- Arima(window(quet.ts, end = c(2013, 12)), order = c(2, 1, 0),
                seasonal = list(order = c(0, 1, 1), period = 12))
```

Next, we forecast the 12 months we didn’t include (from January 2014 until the end of the observation period, December 2014) by using the *forecast* function (library *forecast*). The logic behind this operation is to see what would have happened to the series in the absence of the intervention. In other words, we use the prediction as a *counterfactual* in order to describe a possible effect of the intervention on the series, by determining how the observed values diverge from this forecast.

```
# Forecast 12 months post-intervention and convert to time series object
fc <- forecast(model2, h = 12)
# Convert the average forecast (fc$mean) into a time series object
fc.ts <- ts(as.numeric(fc$mean), start=c(2014, 1), frequency = 12)
# Combine the observed and the forecast data
quet.ts.2 <- ts.union(quet.ts, fc.ts)
quet.ts.2
```

```
## quet.ts fc.ts
## Jan 2011 16831 NA
## Feb 2011 17234 NA
## Mar 2011 20546 NA
## Apr 2011 19226 NA
## May 2011 21136 NA
## Jun 2011 20939 NA
## Jul 2011 21103 NA
## Aug 2011 22897 NA
## Sep 2011 22162 NA
## Oct 2011 22184 NA
## Nov 2011 23108 NA
## Dec 2011 25967 NA
## Jan 2012 20123 NA
## Feb 2012 21715 NA
## Mar 2012 24497 NA
## Apr 2012 21720 NA
## May 2012 25053 NA
## Jun 2012 23915 NA
## Jul 2012 24972 NA
## Aug 2012 26183 NA
## Sep 2012 24163 NA
## Oct 2012 26172 NA
## Nov 2012 26642 NA
## Dec 2012 29086 NA
## Jan 2013 24002 NA
## Feb 2013 24190 NA
## Mar 2013 26052 NA
## Apr 2013 26707 NA
## May 2013 29077 NA
## Jun 2013 26927 NA
## Jul 2013 30300 NA
## Aug 2013 29854 NA
## Sep 2013 28824 NA
## Oct 2013 31519 NA
## Nov 2013 32084 NA
## Dec 2013 33160 NA
## Jan 2014 24827 29127.50
## Feb 2014 23285 29671.28
## Mar 2014 23884 31156.37
## Apr 2014 21921 31339.65
## May 2014 22715 33843.48
## Jun 2014 19919 31809.61
## Jul 2014 20560 34498.50
## Aug 2014 18961 34774.18
## Sep 2014 18780 33302.09
## Oct 2014 17998 35641.85
## Nov 2014 16624 36184.57
## Dec 2014 18450 37792.03
```

By plotting the data, we can visualize the predicted values in the absence of the intervention (red dashed line) as well as the observed values (blue line). It seems that the health policy considerably impacted the analyzed prescriptions.

```
# Plot
plot.ts(quet.ts.2, plot.type = "single",
        col = c('blue', 'red'), xlab = "Month", ylab = "Dispensings",
        lty = c("solid", "dashed"), ylim = c(0, 40000))
abline(v = 2014, lty = "dashed", col = "gray")
```

Coming back to our initial ARIMA model including the intervention variables, we can quantify the impact of the policy by computing the significance of the coefficients with the *coeftest* function (*lmtest* library) and their confidence intervals with the *confint* function.
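The output below can be produced with calls along these lines (a sketch; *model1* is the model fitted above with *auto.arima*, and *confint* is the generic confidence-interval function from base R):

```r
library(lmtest)
# Model summary: coefficients, sigma^2, log likelihood, AIC/AICc/BIC
summary(model1)
# z tests of the coefficients
coeftest(model1)
# 95% confidence intervals for the coefficients
confint(model1)
```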

```
## Series: quet.ts
## Regression with ARIMA(2,1,0)(0,1,1)[12] errors
##
## Coefficients:
## ar1 ar2 sma1 step ramp
## -0.873 -0.6731 -0.6069 -3284.7792 -1396.6523
## s.e. 0.124 0.1259 0.3872 602.3347 106.6327
##
## sigma^2 estimated as 648828: log likelihood=-284.45
## AIC=580.89 AICc=583.89 BIC=590.23
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.87301 0.12396 -7.0428 1.885e-12 ***
## ar2 -0.67314 0.12587 -5.3480 8.893e-08 ***
## sma1 -0.60694 0.38722 -1.5674 0.117
## step -3284.77921 602.33466 -5.4534 4.941e-08 ***
## ramp -1396.65226 106.63266 -13.0978 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
## 2.5 % 97.5 %
## ar1 -1.1159662 -0.6300574
## ar2 -0.9198342 -0.4264429
## sma1 -1.3658734 0.1519965
## step -4465.3334555 -2104.2249712
## ramp -1605.6484348 -1187.6560814
```

The estimated step change was −3285 dispensings (95% CI −4465 to −2104), while the estimated change in slope was −1397 dispensings per month (95% CI −1606 to −1188). (The figure, ed.) shows the values predicted by our ARIMA model in the absence of the intervention (counterfactual) compared with the observed values. This means that the change in subsidy for 25 mg quetiapine in January 2014 was associated with an immediate, sustained decrease of 3285 dispensings, with a further decrease of 1397 dispensings every month. In other words, there were 4682 (3285 + 1397) fewer dispensings in January 2014 than predicted had the subsidy changes not been implemented. In February 2014, there were 6079 fewer dispensings (3285 + 2*1397). Importantly, our findings should only be considered valid for the duration of the study period (i.e. until December 2014).
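The arithmetic in the passage above can be checked directly from the rounded coefficient estimates:

```r
# Rounded estimates from the model above
step_effect <- -3285   # immediate level change (step)
ramp_effect <- -1397   # additional change per month (ramp)
# Total effect t months after the intervention (t = 1 is January 2014)
total_effect <- function(t) step_effect + t * ramp_effect
total_effect(1)  # -4682 fewer dispensings in January 2014
total_effect(2)  # -6079 fewer dispensings in February 2014
```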

## 10.3 CausalImpact

While ARIMA modeling is the classic choice for intervention models, more complex computational approaches have been developed. We can consider the Bayesian approach implemented in the *CausalImpact* package, developed at *Google*, to estimate causal impacts in a quasi-experimental framework (you can find here a video presentation)^{9}.

We use the simulated dataset from the Google tutorial on the package, creating two time series \(y\) and \(x1\) of length 100, and simulating an abrupt intervention at time 71 that determines a permanent increment of 10 points in the \(y\) series.

```
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10
dat <- ts.intersect(y, x1)
```

Then, it is necessary to specify the pre-intervention and post-intervention period. In the pre-intervention period no impact is expected.

The function *CausalImpact* uses the values of the original time series \(y\) in the pre-intervention period, together with predictors correlated with \(y\) (in this case \(x1\)), to forecast the values that \(y\) would have had without the intervention (the *counterfactual*).

To obtain valid results from the analysis, it is necessary to forecast the \(y\) values accurately, which requires a proper model of the \(y\) series (based on the series itself and its predictors). The differences between the expected (forecasted) \(y\) values without intervention and the actual \(y\) values following the intervention are then used to estimate the impact of the intervention.
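Assuming the *CausalImpact* package is installed, the workflow for the simulated data can be sketched as in the Google tutorial (*dat* is the simulated series created above):

```r
library(CausalImpact)
# The intervention occurs at time 71
pre.period  <- c(1, 70)
post.period <- c(71, 100)
# y is modeled on x1 over the pre-period; the model is then used to
# forecast the counterfactual over the post-period
impact <- CausalImpact(dat, pre.period, post.period)
plot(impact)
```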

By using the function *plot* on the resulting model, three plots are visualized:

The first panel shows the data and a counterfactual prediction for the post-treatment period. The second panel shows the difference between observed data and counterfactual predictions. This is the pointwise causal effect, as estimated by the model. The third panel adds up the pointwise contributions from the second panel, resulting in a plot of the cumulative effect of the intervention. (…) the above inferences depend critically on the assumption that the covariates were not themselves affected by the intervention. The model also assumes that the relationship between covariates and treated time series, as established during the pre-period, remains stable throughout the post-period.

Besides plotting the results, it is possible to create a summary of the model, and by adding the argument “report” inside the *summary* function, a convenient explanation of the results is printed.

```
## Posterior inference {CausalImpact}
##
## Average Cumulative
## Actual 117 3511
## Prediction (s.d.) 107 (0.37) 3196 (10.96)
## 95% CI [106, 107] [3175, 3217]
##
## Absolute effect (s.d.) 11 (0.37) 315 (10.96)
## 95% CI [9.8, 11] [294.0, 336]
##
## Relative effect (s.d.) 9.9% (0.34%) 9.9% (0.34%)
## 95% CI [9.2%, 11%] [9.2%, 11%]
##
## Posterior tail-area probability p: 0.00101
## Posterior prob. of a causal effect: 99.8993%
##
## For more details, type: summary(impact, "report")
```

```
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 117.05. By contrast, in the absence of an intervention, we would have expected an average response of 106.54. The 95% interval of this counterfactual prediction is [105.83, 107.25]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 10.51 with a 95% interval of [9.80, 11.21]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 3.51K. By contrast, had the intervention not taken place, we would have expected a sum of 3.20K. The 95% interval of this prediction is [3.18K, 3.22K].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +10%. The 95% interval of this percentage is [+9%, +11%].
##
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (10.51) to the original goal of the underlying intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.
```

The authors of the package underline the importance of the statistical assumptions for getting valid results, and, regarding possible strategies to ascertain that the assumptions are met, they offer the following advice:

Here are a few ways of getting started. First of all, it is critical to reason why the covariates that are included in the model (this was x1 in the example) were not themselves affected by the intervention. Sometimes it helps to plot all covariates and do a visual sanity check. Next, it is a good idea to examine how well the outcome data y can be predicted before the beginning of the intervention. This can be done by running CausalImpact() on an imaginary intervention. Then check how well the model predicted the data following this imaginary intervention. We would expect not to find a significant effect, i.e., counterfactual estimates and actual data should agree reasonably closely. Finally, when presenting or writing up results, be sure to list the above assumptions explicitly, including the priors in model.args, and discuss them with your audience.
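For instance, the suggested placebo check could be sketched as follows (a hypothetical example using the simulated *dat* series from above: we keep only the true pre-intervention data and pretend an intervention occurred at time 51):

```r
library(CausalImpact)
# Placebo check: restrict the data to the true pre-period (times 1..70)
# and place an imaginary intervention at time 51
placebo <- CausalImpact(dat[1:70, ], c(1, 50), c(51, 70))
summary(placebo)
# If the model predicts y well, no significant effect should be found here
```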

We can now try to use the library with the data from the previous example (the library requires data in the same format, thus some pre-processing is needed).

```
quet$month <- as.Date(quet$month, format="%d-%b-%y")
quet_xts <- xts(quet$dispensings, order.by = quet$month)
pre.period <- c(as.Date("2011-01-01"), as.Date("2013-12-01"))
post.period <- c(as.Date("2014-01-01"), as.Date("2014-12-01"))
```

We fit an automated model (without specifying a particular form for the intervention).
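A sketch of the call, using the *quet_xts* series and the periods defined above:

```r
library(CausalImpact)
# Fit an automated model: no covariates are supplied, so the counterfactual
# forecast is based on the pre-intervention outcome series alone
impact <- CausalImpact(quet_xts, pre.period, post.period)
summary(impact)
summary(impact, "report")
```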


### 10.3.1 Exercise

Let’s try to analyze a real-world case study in the field of online communication.

We hypothesize that the uncertainty and anxiety caused by the pandemic increased people’s need for information. Social media are an important information source today, and the pages of the Health Ministries can be considered trustworthy information sources. At the same time, Facebook has favored the circulation of posts from verified health and government organizations. It is therefore possible that the pages of the Ministries of Health have increased their follower base during the pandemic.

Thus, let’s try to assess the impact of the pandemic on the growth in “page likes” of the Facebook pages of some Ministries of Health. We’ll use the *CausalImpact* library: it is the easiest solution, since we just need to indicate the intervention period, and it does not (necessarily) require other variables.

Download here the dataset of the Facebook page of the Italian Ministry of Health and save it in your *data* folder.

Some background information: on March 9, 2020, the government of Italy under Prime Minister Giuseppe Conte imposed a national lockdown. On March 11, the World Health Organization (WHO) declared COVID-19 a pandemic. For the sake of simplicity, we can use the WHO declaration as the date of our “intervention” (more fine-grained analyses are possible as well).

Load the data:

Let’s use the *xts* format, since in this case it is easier to use (in particular, it’s easier to work with the date format).

```
# dates in format "date"
MinisteroSalute$beginning_of_interval <- as.Date(MinisteroSalute$beginning_of_interval)
# xts time series
MinisteroSalute <- xts(MinisteroSalute$page_likes,
                       order.by = MinisteroSalute$beginning_of_interval)
```

Let’s take a look at the first and last dates of the series:

`## [1] "2018-12-30"`

`## [1] "2020-12-31"`

The pre-intervention and post-intervention periods:

```
date_pre <- c(as.Date("2018-12-30"), as.Date("2020-03-10"))
date_post <- c(as.Date("2020-03-11"), as.Date("2020-12-31"))
```

Fit the model:
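A sketch of the call, assuming the *MinisteroSalute* series and the dates defined above:

```r
library(CausalImpact)
# Estimate the impact of the pandemic on the page-likes series
impact <- CausalImpact(MinisteroSalute, date_pre, date_post)
summary(impact)
```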

```
## Posterior inference {CausalImpact}
##
## Average Cumulative
## Actual 5.5e+05 1.6e+08
## Prediction (s.d.) 1.8e+05 (13743) 5.2e+07 (4068069)
## 95% CI [1.5e+05, 2e+05] [4.5e+07, 6e+07]
##
## Absolute effect (s.d.) 3.8e+05 (13743) 1.1e+08 (4068069)
## 95% CI [3.5e+05, 4.0e+05] [1.0e+08, 1.2e+08]
##
## Relative effect (s.d.) 214% (7.8%) 214% (7.8%)
## 95% CI [199%, 228%] [199%, 228%]
##
## Posterior tail-area probability p: 0.00107
## Posterior prob. of a causal effect: 99.89293%
##
## For more details, type: summary(impact, "report")
```

```
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 554.03K. By contrast, in the absence of an intervention, we would have expected an average response of 176.52K. The 95% interval of this counterfactual prediction is [151.27K, 202.82K]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 377.51K with a 95% interval of [351.21K, 402.75K]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 163.99M. By contrast, had the intervention not taken place, we would have expected a sum of 52.25M. The 95% interval of this prediction is [44.78M, 60.03M].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +214%. The 95% interval of this percentage is [+199%, +228%].
##
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (377.51K) to the original goal of the underlying intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.
```

We can also repeat the same analysis with the data from the Facebook page of the Austrian Ministry of Health (“Bundesministerium für Soziales, Gesundheit, Pflege und Konsumentenschutz - Eingang Sozialministerium”): Download here the data, save it in your *data* folder, and perform the intervention analysis.

The chart below is from Markus Pollak, Nikolaus Kowarz and Julia Partheymüller (2020): Chronology of the Corona Crisis in Austria - Part 1: Background, the way to the lockdown, the acute phase and economic consequences.

Box, G. E., & Tiao, G. C. (1975). Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association, 70(349), 70-79.↩︎

CausalImpact 1.2.1, Brodersen et al., Annals of Applied Statistics (2015). http://google.github.io/CausalImpact/↩︎