---
title: "Introduction to Austin"
author: "Will Lowe"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to Austin}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

# Getting Data In

Austin works with any two-dimensional matrix-like object for which `is.wfm` returns TRUE. An object is a wfm when it is indexable in two dimensions, has a complete set of row and column names, and has the dimension names 'docs' and 'words'. Whether these are on rows or columns does not matter. The function `wfm` will construct a suitable object from any row- and column-labelled matrix-like object, such as a data.frame or matrix.

Austin offers the helper functions `as.worddoc` and `as.docword` to extract the raw count data the appropriate way up, that is, as a matrix where words are rows and documents are columns, or where documents are rows and words are columns. The functions `docs` and `words` return vectors of document names and words respectively, and `getdocs` extracts particular sets of documents from a wfm by name or index.

The function `trim` can be used to remove low-frequency words and words that occur in only a few documents. It can also be used to sample randomly from the set of words, which can be helpful for speeding up analyses and checking robustness to vocabulary choice.

## Importing CSV Data

If you already have word frequency data in a file in comma-separated value (.csv) format you can read it in using

```{r,eval=FALSE}
data <- wfm('mydata.csv')
```

This function assumes that the word labels are in the first column and the document names are in the first row, pretty much as [JFreq](http://www.conjugateprior.org/software/jfreq) would offer it to you by default, that is, with words as rows and documents as columns. If your words are columns instead, add a `word.margin=2` argument.

## Counting Words from Inside R

Assuming Java is installed on your system, you can count words in text files and generate an appropriate `wfm` object in one step using the `jwordfreq` function (although you will probably have more control over the process using JFreq itself).

# Scaling with Wordfish

Austin implements the one-dimensional text scaling model Wordfish (Slapin and Proksch, 2008). When document positions are treated as random variables the model is known as Rhetorical Ideal Points (Monroe and Maeda, 2004), which is formally equivalent to an Item Response Theory model and closely related to generalized Latent Trait models with a Poisson link (e.g. Moustaki and Knott, 2000). Austin implements a version of Wordfish with faster convergence, analytic or bootstrapped standard errors, and integration into R's usual model functions: `summary`, `coef`, `predict`, etc.

This model class has two equivalent parameterizations. In the first, word counts are Poisson with means conditional on document positions `theta`, word positions `beta`, document-specific offsets `alpha`, and word-specific offsets `psi`. In the Austin implementation the parameters are estimated by conditional maximum likelihood with a regularization constraint on the `beta`s that is interpretable as a shared zero-mean prior with standard deviation `sigma`. Alternatively, conditioning on each document's length gives a multinomial parameterization in terms of `theta` as before and logits of word rates, using the first word as the baseline. This is the form of the model reported by Austin and used for prediction.
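As a compact restatement of these two forms, using the parameter names above and writing $y_{ij}$ for the count of word $j$ in document $i$, the Poisson parameterization is

$$
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i ,
$$

and conditioning on the document length $n_i = \sum_j y_{ij}$ removes `alpha`, giving the multinomial form in which the log odds of word $j$ relative to the first word are $(\psi_j - \psi_1) + (\beta_j - \beta_1)\,\theta_i$.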
Austin treats the first parameterization as a computational convenience to make estimation more efficient. The `coef` function takes a `form` argument if you need to see the other parameterization.

We start by loading the package

```{r}
library('austin')
```

and generating an (unrealistically small) set of test data according to the assumptions above

```{r}
dd <- sim.wordfish(docs=10, vocab=12)
```

The resulting object is of class `sim.wordfish` and contains the generating parameters (in the first parameterization). The two elements of interest are the vector of document positions

```{r}
dd$theta
```

and the generated data `Y`

```{r}
as.worddoc(dd$Y)
```

where `Y` is an object of class `wfm`. To scale these data we use the `wordfish` function

```{r}
wf <- wordfish(dd$Y)
```

The model is by default globally identified by requiring that `theta[10] > theta[1]`. This will be true for all simulated data (with ten or more documents). For real data more suitable values may be set using the `dir` argument.

Estimated document positions can be summarized using

```{r}
summary(wf)
```

To examine the word-specific parameters use

```{r}
coef(wf)
```

Estimated document positions and 95% confidence intervals can also be plotted. (For more than a few tens of words the confidence intervals will probably be 'implausibly' small. They are nevertheless asymptotically correct given the model assumptions; it is those assumptions you might doubt.) Any unnamed second argument to the plot function is taken as a vector of true document positions; these are then plotted over the estimates, as in the figure below.

```{r, echo=FALSE, fig.width=5, fig.height=5}
plot(wf, dd$theta)
```

Positions for new documents can also be estimated. Here we generate predictions and confidence intervals for documents 4 and 5 of the original data set

```{r}
predict(wf, newdata=getdocs(dd$Y, c(4,5)), se.fit=TRUE, interval='conf')
```

# Scaling with Wordscores

Wordscores (Laver et al. 2003) is a method for scaling texts that is closely related both to correspondence analysis, since it implements an incomplete reciprocal averaging algorithm, and to quadratic ordination, as an approximation to an unfolding model (Lowe 2008, 2013). Austin refers to the algorithm described in Laver et al. (2003) as 'classic' Wordscores to distinguish it from versions closer to correspondence analysis. A classic Wordscores analysis has several distinguishing features.

## Classic Wordscores

Classic Wordscores estimates scores for words ('wordscores') using only word frequency information from documents with known positions ('reference' documents). There is therefore no iterative estimation process, since document positions are observed. Documents with unknown positions ('virgin' documents) are treated as out of sample. Positions for out-of-sample documents are estimated by averaging the scores of the words they contain and rescaling the result in an ad hoc fashion that has generated some discussion and various alternatives. The method also offers standard errors for the out-of-sample documents. (These are probably incorrect, partly because the probability model from which they would have to be derived is unclear, and partly because they can be quite implausible in some applications.)
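As a compact restatement of this scoring procedure (see Laver et al. 2003 and Lowe 2008 for details), write $F_{wr}$ for the relative frequency of word $w$ in reference document $r$, which has known position $a_r$. The wordscore of $w$ and the raw score of a virgin document $v$ are then

$$
s_w = \sum_r \frac{F_{wr}}{\sum_{r'} F_{wr'}}\, a_r,
\qquad
\hat{a}_v = \sum_w F_{wv}\, s_w,
$$

after which the ad hoc rescaling mentioned above is applied.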
To replicate the example analysis in Laver et al. (2003) we begin by loading the test data

```{r}
data(lbg)
```

and taking a look at the word counts we have to work with

```{r}
as.docword(lbg)
```

before fitting a classic Wordscores model to them. Assume we know the positions of documents R1 through R5 and wish to scale V1. We first separate the reference documents from the virgin document:

```{r}
ref <- getdocs(lbg, 1:5)
vir <- getdocs(lbg, 'V1')
```

then fit the model using the reference documents

```{r}
ws <- classic.wordscores(ref, scores=seq(-1.5, 1.5, by=0.75))
```

We can summarize the results with

```{r}
summary(ws)
```

The summary presents details about the reference documents. If we want to see the wordscores that were generated we look at the model's coefficients

```{r}
coef(ws)
```

which can also be plotted. We can now get a position for the virgin document

```{r}
predict(ws, newdata=vir)
```

When more than one document is to be predicted, an ad hoc procedure is applied by default to rescale the predicted positions to the same variance as the reference scores. This may or may not be what you want.

## Correspondence Analysis

Wordscores approximates correspondence analysis, which is defined for more than one dimension. To explore this approach to document scaling you may find the `ca` or `anacor` packages useful. A rather limited subset of correspondence analysis is implemented by the MASS package's `corresp` function.

# References

Laver, Michael, and John Garry. 2000. “Estimating Policy Positions from Political Texts.” American Journal of Political Science 44(3): 619–34.

Laver, Michael, Kenneth Benoit, and John Garry. 2003. “Extracting Policy Positions from Political Texts Using Words as Data.” American Political Science Review 97(2): 311–31.

Lowe, Will. 2008. “Understanding Wordscores.” Political Analysis 16(4): 356–71.

Lowe, Will. 2013. “There’s (Basically) Only One Way to Do It.” Paper presented at APSA 2013, Chicago IL. Available at [SSRN](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2318543).

Monroe, Burt L., and Ko Maeda. 2004. “Talk’s Cheap: Text-Based Estimation of Rhetorical Ideal-Points.”

Moustaki, Irini, and Martin Knott. 2000. “Generalized Latent Trait Models.” Psychometrika 65(3): 391–411.

Slapin, Jonathan B., and Sven-Oliver Proksch. 2008. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52(3): 705–22.