Introduction to Austin

Getting Data In

Austin works with any two-dimensional matrix-like object for which is.wfm returns TRUE. An object is a wfm when it is indexable in two dimensions, has a complete set of row and column names, and has the dimension names ‘docs’ and ‘words’. Whether these are on rows or columns does not matter. The function wfm will construct a suitable object from any row- and column-labeled matrix-like object, such as a data.frame or matrix.

Austin offers the helper functions as.worddoc and as.docword to extract the raw count data in the appropriate orientation: as a matrix with words as rows and documents as columns, or with documents as rows and words as columns, respectively. The functions docs and words return vectors of document names and words respectively. getdocs extracts particular sets of documents from a wfm by name or index.
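For example, here is a small sketch of building a wfm from a labeled matrix and inspecting it with these helpers (the word.margin argument, also used when reading CSV data below, indicates which margin holds the words):

mat <- matrix(c(1, 0, 2,
                3, 1, 0),
              nrow=2, byrow=TRUE,
              dimnames=list(words=c('alpha', 'beta'),
                            docs=c('doc1', 'doc2', 'doc3')))
y <- wfm(mat, word.margin=1)  # words are on rows here

docs(y)             # document names
words(y)            # vocabulary
as.docword(y)       # documents as rows, words as columns
getdocs(y, 'doc2')  # extract a document by name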

The function trim can be used to remove low frequency words and words that occur in only a few documents. It can also be used to sample randomly from the set of words. This can be helpful to speed up analyses and check robustness to vocabulary choice.
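A sketch of typical trimming calls follows; the argument names min.count, min.doc and sample are assumptions about the interface, so check ?trim before relying on them:

## y is any wfm object, for example one built with wfm() or jwordfreq()
y.trimmed <- trim(y, min.count=5, min.doc=5)  # drop very rare words
y.sampled <- trim(y, sample=200)              # random subset of 200 words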

Importing CSV Data

If you already have word frequency data in a file in comma-separated value (.csv) format you can read it in using

data <- wfm('mydata.csv')

This function assumes that the word labels are in the first column and the document names are in the first row, much as JFreq would offer them by default; that is, words are rows and documents are columns. If words are columns instead, add a word.margin=2 argument.
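For example, if the documents label the rows of the file and the words label the columns, the call becomes:

data <- wfm('mydata.csv', word.margin=2)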

Counting Words from Inside R

Assuming Java is installed on your system, you can count words in text files and generate an appropriate wfm object in one step using the jwordfreq function (although you will probably have more control over the process using JFreq).
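A minimal sketch, assuming jwordfreq accepts a folder (or a vector of file names) of plain text documents; the folder name here is hypothetical and the exact arguments are documented in ?jwordfreq:

counts <- jwordfreq('speeches/')  # count words in each text file in the folder
is.wfm(counts)                    # should be TRUE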

Scaling with Wordfish

Austin implements the one-dimensional text scaling model Wordfish (Slapin and Proksch, 2008). When document positions are treated as random variables the model is known as Rhetorical Ideal Points (Monroe and Maeda, 2004), which is formally equivalent to an Item Response Theory model and closely related to generalized latent trait models with a Poisson link (e.g. Moustaki and Knott, 2000).

Austin implements a version of Wordfish with faster convergence, analytic or bootstrapped standard errors, and integration with R’s usual model functions: summary, coef, predict, etc.

This model class has two equivalent parameterizations. In the first, word counts are Poisson processes with means conditional on document positions theta, word positions beta, document-specific offsets alpha and word-specific offsets psi.
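In conventional notation (the symbols Y and lambda are ours), writing Y_{ij} for the count of word j in document i, this first parameterization can be written as

\[
  Y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}),
  \qquad
  \log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i .
\]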

In the Austin implementation the parameters are estimated by conditional maximum likelihood, with a regularization constraint on the betas that is interpretable as a shared zero-mean prior with standard deviation sigma.

Alternatively, conditioning on each document’s length gives a multinomial parameterization in terms of theta as before and logits of word rates, with the first word as the baseline. This is the form of the model reported by Austin and used for prediction.
Austin treats the first parameterization as a computational convenience that makes estimation more efficient. The coef function takes a form argument if you need to see the other parameterization.

We start by loading the package

library('austin')

and generating an (unrealistically small) set of test data according to the assumptions above

dd <- sim.wordfish(docs=10, vocab=12)

The resulting object is of class sim.wordfish and contains the generating parameters (in the form of the first parameterization). The two elements of interest are the vector of document positions

dd$theta
##  [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446  0.1651446
##  [7]  0.4954337  0.8257228  1.1560120  1.4863011

and the generated data Y

as.worddoc(dd$Y)
##      docs
## words D01 D02 D03 D04 D05 D06 D07 D08 D09 D10
##   W01  34  31  34  24  20  15  17  14  12  10
##   W02  34  29  28  35  20  26  17  11   8   7
##   W03  46  28  30  35  16  25  18  22  14  14
##   W04   5  12  13  18  16  22  31  30  41  25
##   W05   9  11  19  17  24  18  31  28  34  45
##   W06  11   9  15  21  19  12  20  30  36  39
##   W07 100 114  79  76  79  75  52  37  31  10
##   W08 100  91  97  74  76  60  54  50  21  28
##   W09  95  83  82  71  74  61  42  37  23  19
##   W10  13  36  35  46  45  68  51  83  90 103
##   W11  26  28  31  35  59  50  79  81 102  97
##   W12  27  28  37  48  52  68  88  77  88 103

where Y is an object of class wfm.

To scale this data we use the wordfish function

wf <- wordfish(dd$Y)

The model is by default globally identified by requiring that theta[10] > theta[1]. This will hold for all data simulated as above with at least 10 documents. For real data more suitable values may be set using the dir argument.
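For example, the default identification can be stated explicitly, or replaced by a substantively chosen pair of documents; the exact form of the dir argument (a pair of document indices) is an assumption here, so see ?wordfish:

wf <- wordfish(dd$Y, dir=c(1, 10))  # require theta[1] < theta[10]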

Estimated document positions can be summarized using

summary(wf)
## Call:
##  wordfish(wfm = dd$Y)
## 
## Document Positions:
##     Estimate Std. Error   Lower    Upper
## D01 -1.41777    0.11293 -1.6391 -1.19643
## D02 -1.06654    0.10232 -1.2671 -0.86600
## D03 -0.78640    0.09612 -0.9748 -0.59801
## D04 -0.48076    0.09150 -0.6601 -0.30143
## D05 -0.24835    0.08943 -0.4236 -0.07307
## D06 -0.06219    0.08865 -0.2359  0.11156
## D07  0.43925    0.09040  0.2621  0.61644
## D08  0.70621    0.09367  0.5226  0.88979
## D09  1.31602    0.10757  1.1052  1.52686
## D10  1.59784    0.11730  1.3679  1.82775

To examine the word-specific parameters use

coef(wf)
## $words
##           beta      psi
## W01 -0.3459182 2.981312
## W02 -0.4151621 2.973547
## W03 -0.2700622 3.168263
## W04  0.6007661 2.986491
## W05  0.6092431 3.085643
## W06  0.6176637 2.974974
## W07 -0.4498973 4.069899
## W08 -0.3608094 4.102484
## W09 -0.4141838 3.978310
## W10  0.6343688 3.957057
## W11  0.6462525 3.983051
## W12  0.5843343 4.054843
## 
## $docs
##           alpha
## D01  0.00000000
## D02  0.05165051
## D03  0.08462643
## D04  0.10118919
## D05  0.09897796
## D06  0.08750664
## D07  0.01285390
## D08 -0.05216542
## D09 -0.25905866
## D10 -0.37771057
## 
## attr(,"class")
## [1] "coef.wordfish" "list"

Estimated document positions and 95% confidence intervals can also be graphed using the plot method. (For more than a few tens of words the confidence intervals will probably be ‘implausibly’ small. They are nevertheless asymptotically correct given the model assumptions; it is those assumptions you might doubt.) Any unnamed second argument to the plot function is taken as a vector of true document positions, which are then plotted over the estimates.
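For example, to plot the estimated positions and overlay the true values used in the simulation:

plot(wf, dd$theta)  # unnamed second argument: true document positions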

Positions for new documents can also be estimated. Here we generate predictions and confidence intervals for documents D04 and D05 from the original data set

predict(wf, newdata=getdocs(dd$Y, c(4,5)), se.fit=TRUE, interval='conf')
##            fit     se.fit       lwr         upr
## D04 -0.4807625 0.09149940 -0.660098 -0.30142698
## D05 -0.2483462 0.08943009 -0.423626 -0.07306647

Scaling with Wordscores

Wordscores (Laver et al. 2003) is a method for scaling texts that is closely related both to correspondence analysis, since it implements an incomplete reciprocal averaging algorithm, and to quadratic ordination, as an approximation to an unfolding model (Lowe 2008, 2013).

Austin refers to the algorithm described in Laver et al. 2003 as ‘classic’ Wordscores to distinguish it from versions closer to correspondence analysis. A classic Wordscores analysis has several distinguishing features.

Classic Wordscores

Classic Wordscores estimates scores for words (‘wordscores’) using only word frequency information from documents with known positions (‘reference’ documents). There is therefore no iterative estimation process, since document positions are observed. Documents with unknown positions (‘virgin’ documents) are treated as out of sample.

Positions for out-of-sample documents are estimated by averaging the scores of the words they contain and then re-scaling the result in an ad-hoc fashion that has generated some discussion and various alternatives. The method also offers standard errors for the out-of-sample documents. (These are probably incorrect, partly because the probability model from which they would have to be derived is unclear, and partly because they can be quite implausible in some applications.)
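Writing F_{wr} for the relative frequency of word w in reference document r, A_r for that document's assigned position, and F_{wv} for the relative frequency of word w in a virgin document v, the classic procedure amounts to

\[
  S_w = \sum_r P(r \mid w)\, A_r,
  \qquad
  P(r \mid w) = \frac{F_{wr}}{\sum_{r'} F_{wr'}},
  \qquad
  \hat{\theta}_v = \sum_w F_{wv}\, S_w,
\]

where the final sum runs over the scorable words, i.e. those that also occur in the reference documents; the re-scaling and standard errors mentioned above are then applied to the raw score \hat{\theta}_v.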

To replicate the example analysis in Laver et al. we begin by loading the test data

data(lbg)

and take a look at the word counts we have to work with

as.docword(lbg)
##     words
## docs A B  C  D  E  F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T
##   R1 2 3 10 22 45 78 115 146 158 146 115  78  45  22  10   3   2   0   0   0
##   R2 0 0  0  0  0  2   3  10  22  45  78 115 146 158 146 115  78  45  22  10
##   R3 0 0  0  0  0  0   0   0   0   0   2   3  10  22  45  78 115 146 158 146
##   R4 0 0  0  0  0  0   0   0   0   0   0   0   0   0   0   2   3  10  22  45
##   R5 0 0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   V1 0 0  0  0  0  0   0   2   3  10  22  45  78 115 146 158 146 115  78  45
##     words
## docs   U   V   W   X   Y   Z  ZA  ZB  ZC  ZD  ZE ZF ZG ZH ZI ZJ ZK
##   R1   0   0   0   0   0   0   0   0   0   0   0  0  0  0  0  0  0
##   R2   3   2   0   0   0   0   0   0   0   0   0  0  0  0  0  0  0
##   R3 115  78  45  22  10   3   2   0   0   0   0  0  0  0  0  0  0
##   R4  78 115 146 158 146 115  78  45  22  10   3  2  0  0  0  0  0
##   R5   2   3  10  22  45  78 115 146 158 146 115 78 45 22 10  3  2
##   V1  22  10   3   2   0   0   0   0   0   0   0  0  0  0  0  0  0

and then fit a classic Wordscores model to them. Assume we know the positions of documents R1 through R5 and wish to scale V1.

We first separate the reference documents from the virgin document:

ref <- getdocs(lbg, 1:5)
vir <- getdocs(lbg, 'V1') 

then fit the model using the reference documents

ws <- classic.wordscores(ref, scores=seq(-1.5, 1.5, by=0.75))

We can summarise the results

summary(ws)
## Call:
##  classic.wordscores(wfm = ref, scores = seq(-1.5, 1.5, by = 0.75))
## 
## Reference Document Statistics:
## 
##    Total Min Max Mean Median Score
## R1  1000   0 158   27      0 -1.50
## R2  1000   0 158   27      0 -0.75
## R3  1000   0 158   27      0  0.00
## R4  1000   0 158   27      0  0.75
## R5  1000   0 158   27      0  1.50

The summary presents details about the reference documents. If we want to see the wordscores that were generated, we look at the model’s coefficients

coef(ws)
##         Score
## A  -1.5000000
## B  -1.5000000
## C  -1.5000000
## D  -1.5000000
## E  -1.5000000
## F  -1.4812500
## G  -1.4809322
## H  -1.4519231
## I  -1.4083333
## J  -1.3232984
## K  -1.1846154
## L  -1.0369898
## M  -0.8805970
## N  -0.7500000
## O  -0.6194030
## P  -0.4507576
## Q  -0.2992424
## R  -0.1305970
## S   0.0000000
## T   0.1305970
## U   0.2992424
## V   0.4507576
## W   0.6194030
## X   0.7500000
## Y   0.8805970
## Z   1.0369898
## ZA  1.1846154
## ZB  1.3232984
## ZC  1.4083333
## ZD  1.4519231
## ZE  1.4809322
## ZF  1.4812500
## ZG  1.5000000
## ZH  1.5000000
## ZI  1.5000000
## ZJ  1.5000000
## ZK  1.5000000

which can also be plotted.
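Assuming the fitted object has a plot method, as the sentence above implies, this is simply:

plot(ws)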

We can now get a position for the virgin document

predict(ws, newdata=vir)
## 37 of 37 words (100%) are scorable
## 
##     Score Std. Err. Rescaled  Lower  Upper
## V1 -0.448    0.0119   -0.448 -0.459 -0.437
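As a check on the averaging step described earlier, the raw score for V1 can be reproduced by hand. This sketch assumes that coef(ws) is a data frame with words as row names and a single Score column, as the printout above suggests:

p <- as.docword(vir)[1, ]             # word counts for V1
p <- p / sum(p)                       # relative word frequencies
sum(p * coef(ws)[names(p), 'Score'])  # should be close to -0.448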

When more than one document is to be predicted, an ad-hoc procedure is applied by default to the predicted positions to rescale them to the same variance as the reference scores. This may or may not be what you want.

Correspondence Analysis

Wordscores approximates correspondence analysis, which is defined for more than one dimension. To explore this approach to document scaling you may find the ca or anacor packages useful. A rather limited subset of correspondence analysis is implemented by the MASS package’s corresp function.
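For instance, a basic correspondence analysis of the same counts could be run as follows (assuming the ca package is installed):

library(ca)
fit <- ca(as.docword(lbg))  # documents as rows, words as columns
summary(fit)                # row coordinates play the role of document positions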

References

Laver, Michael, Kenneth Benoit, and John Garry. 2003. “Extracting Policy Positions from Political Texts Using Words as Data.” American Political Science Review 97(2): 311–31.

Lowe, Will. 2008. “Understanding Wordscores.” Political Analysis 16(4): 356–71.

Lowe, Will. 2013. “There’s (basically) Only One Way to Do It.” Paper presented at APSA 2013, Chicago IL. Available at SSRN.

Monroe, Burt L., and Ko Maeda. 2004. “Talk’s Cheap: Text-Based Estimation of Rhetorical Ideal-Points.”

Moustaki, Irini, and Martin Knott. 2000. “Generalized Latent Trait Models.” Psychometrika 65(3): 391–411.

Slapin, Jonathan B., and Sven-Oliver Proksch. 2008. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52(3): 705–22.