---
title: "An R package for doing things with events"
output: rmarkdown::html_vignette
bibliography: events.bib
vignette: >
  %\VignetteIndexEntry{An R package for doing things with events}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


**Note [28.12.2021]**
This package was written before the use of dplyr and tidyr was 
widespread. Functions from those packages supersede almost everything here.


# Introduction

The `events` package takes political event data in the form generated by KEDS 
[@Schrodtetal1994; @Gerneretal1994].  For this vignette we use the
Reuters-derived event chronology from the collapse of Yugoslavia, focusing 
on Serbian and Bosnian interactions in the period in 1991 and 1995.
The events in this event data are coded according to the WEIS event scheme [@WEIS].
```{r setup}
library(events)
```
In the following sections we perform a typical set of data manipulations; we load 
and clean a set of event data, restrict it to actors and period of interest, apply
a scale to the raw events, aggregate to make a time series and plot the results.
The package does not current contain function for the analysis of event data 
because once the data is finally in a regular time series format, other packages 
can be used to analyse it.  The package provides the link between raw output from
an event data extraction system such as KEDS/TABARI and a set of regularly spaced
time series.

## Data Loading and Cleaning

A version of the Balkans data is built into the package.  Here we load and summarise it.
```{r read.data}
data("balkans.weis")
summary(balkans.weis)
```
An event data set can be constructed from text file event data output using the
`read_keds` function.  And event data set is essentially a data frame with 
column names `date`, `source`, `target`, and `code`.
```{r show.data}
head(balkans.weis)
```
Subsequent columns of event label, shown above, and matching phrase from the original
text, not shown above, are optional.

Duplicated stories are a common type of information extraction error.  We can prefilter 
the events by removing all instances of the same pair of actors experiencing the same event
on the same date using the on-a-day filter
```{r one.a.day}
dd1 <- one_a_day(balkans.weis)
```
This can also be applied as part of the `read_keds` function. 

## Actor Filtering

In the next step we filter out actors whose interactions are not of interest.  
A complete list of actors is given by `actors` function
```{r actors}
head(actors(dd1))
```
The functions `sources` and `targets` list actor codes in the corresponding roles, and `codes` lists all the codes that are used.

We will focus on actors identified in the data as Serbia 'SER' and the Serbian military 'SERMIL', and Bosnia 'BOS' and the Bosnian military `BOSMIL'
```{r filter.actors}
dd2 <- filter_actors(dd1, fun = spotter("SER", "SERMIL", "BOS", "BOSMIL"))
```
The `filter_actors` function takes two arguments, an event data set and a filter function, 
and returns a filtered
event data set.  The filter may be any function that returns TRUE for things that are of 
interest and FALSE otherwise.  Here we have used a convenience function `spotter`, which creates a function that returns TRUE for any exact matches of its arguments.  

The function takes an optional `which` argument which can be used to specify that the filtering should apply to 'source', 'target' or 'both', which is the default.

We would like to treat the Serbian and Bosnian actors identified in the previous step as equivalent and refer to then for convenience as 'ser' and 'bos' respectively.  We do this by aggregating actor codes:
```{r aggregate.actors}
actor.agg <- list(ser = c("SER", "SERMIL"), bos = c("BOS", "BOSMIL"))
dd3 <- map_actors(dd2, fun = actor.agg)
```
Here we specify the mapping from new to old actor codes as a list and pass it to the mapping function.  We could also have written a function that for any object returned its new name, in the same style as the filter function in the previous section.  For example
```
actor.aggregator <- function(oldname){ 
    newname <- NA
    if (oldname %in% c("SER", "SERMIL")) newname <- "ser"
    if (oldname %in% c("BOS", "BOSMIL")) newname <- "bos"
    return(newname)
}
```
would work, but it's rather longwinded.

## Temporal Restriction

We will focus on the period between January 1991 and December 1995
```{r filter.time}
dd4 <- filter_time(dd3, start = "1991-01-01", end = "1995-12-30")
```
The optional `start` and `end` parameters may be anything that can be converted into a `Date` object.

The new data set is considerably smaller than before
```{r summary}
summary(dd4)
```

## Scaling

Scales are mappings from event codes to real numbers.  You can create your own event code by constructing 
a headerless csv file with event codes in the first column and numbers in the second column, and reading it in with the `make_scale` command.  This is a thin wrapper around the `read.csv` function.  

Here we will use the extended Goldstein scale bundled with the package [@Goldstein1992] 
(These codes are taken from Phil Schrodt's pages at Parus Analytics).  
This maps WEIS event codes onto a number representing level of conflict or cooperation.
```{r intro.scaling}
data("weis.goldstein.scale")
summary(weis.goldstein.scale)
```
When we apply the scale to an event data set a column is added with the same name as the scale
```{r add.scale}
dd5 <- add_eventscale(dd4, weis.goldstein.scale)
head(dd5)
```

### Score Aggregation

The final step is to aggregate quantities of interest into a regular time series for each directed pair of actors.  Here we construct a typical dyad set using the summed scored event counts per week:
```{r aggregate.scores}
dyads <- make_dyads(dd5, scale = "goldstein", unit = "week", monday = TRUE, 
		fun = sum, missing.data = 0)
```
We are asserting here that weekly counts should start on a monday, that they should be summed rather than e.g. averaged, and that weeks with no events observed should be given score zero.  Note that this is only an example; these are not necessarily sensible setting for actual applications.  

Alternative aggregation units are 'day', 'month', 'quarter', and 'year'.  The 
`fun` parameter should be any function that will transform a numerical vector into a scalar.

The output of `make_dyads` is a list of directed dyad time series.  All combinations of actors are constructed, so it is a good idea to filter and aggregate actors before calling the function.  The naming scheme for the dyads is concatenation with a period: `dyads$ser.bos` is the temporally aggregated sequence of summed scores with 
the 'ser' actor as source and 'bos' the target, `dyads$bos.ser` is the reverse direction, and  `dyads$ser.ser` is the activities internal to the 'ser' actor.  
```{r dyads}
tail(dyads$ser.bos)
```
The directed dyad can be treated like a regular time series:
```{r serbos, fig.width = 6, fig.height = 5}
with(dyads$ser.bos, 
     plot(goldstein ~ date, type = "l", lwd = 2))
```

There are a few gaps in this series.  This is because the scale does not cover all the events that occur in the event data.  We can investigate this further with
```{r gaps}
scale_coverage(weis.goldstein.scale, dd5)
```

### Count Aggregation

If `scale` is NULL a sequence then directed dyadic event count streams are created instead of scaled scores.  This will generate an event count for each distinct event code and each temporal unit.  Sometimes it is helpful to aggregate code before constructing these count streams.    Here we aggregate them into four categories: verbal and material cooperation, and verbal and material conflict
```{r event.aggregation}
evts <- codes(dd4)
event.agg <- list(
    coop.verb = grep("02.|03.|04.|05.|08.|09.|10.", evts, value = TRUE),
    coop.mat = grep("01.|06.|07.", evts, value = TRUE),
    conf.verb = grep("11.|12.|13.|14.|15.|16.|17.", evts, value = TRUE),
    conf.mat = grep("18.|19.|20.|21.|22.", evts, value = TRUE)
)
dc1 <- map_codes(dd4, fun = event.agg)
```
Like the other aggregation function, `map_codes` function in the final line takes a list or a function to map old event codes to new ones.  We start by using the `codes` function to list all the event codes that are used in the data.  WEIS is a two level scheme that by convention indicates the upper level code category in first two digits and subcategory in remaining digits.  Here, we use `grep` to identify all the codes in "01", "06", and
"07" at any level and assign them to a new material cooperation category `mat.coop`.
```{r make.dyads}
dyad.counts <- make_dyads(dc1, scale = NULL, unit = "week", monday = TRUE, 
		fun = sum, missing.data = 0)
tail(dyad.counts$ser.bos)
```

## References