---
title: "Using Your Own Items and Vectors"
author: "Will Lowe"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using Your Own Vectors}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

To run the statistics you'll need your own items and your own vectors.  The process
has two parts

1. Acquire a list of words and associate them with a condition
2. Extract vectors for each word

## Associate words with conditions

To see the format required for items (words) take a look at the item information 
from the first WEAT study
```{r, eval = FALSE}
weat1 <- cbn_get_items(type = "WEAT")
```
the top of this data frame looks like 
```{r eval = FALSE}
  Study Condition     Word   Role
1 WEAT1   Flowers    aster target
2 WEAT1   Flowers   clover target
3 WEAT1   Flowers hyacinth target
4 WEAT1   Flowers marigold target
5 WEAT1   Flowers    poppy target
6 WEAT1   Flowers   azalea target
```
Here, the study is called `WEAT1`, the conditions are `Flowers`, `Insects`, 
the `target` roles, and `Pleasant` and `Unpleasant`, the `attribute` roles. 
The helper function `cbn_make_items` can be helpful for creating this structure 
for your own words.  Naturally the words you have vectors for should match the 
words you have item information for. 

## Extract vectors for each word

The `cbn` package bundles all the vectors you will need to replicate the 
paper analyses using the GloVe 840B 300-dimensional Common Crawl data. 
If, however, you want to work with different items you'll need to point the package
at your own file of word vectors.  The process is:

1. Download and unpack a text file of word vectors
2. Point the package at the file of word vectors
3. Extract vectors for your choice of words
4. Analyze Your Vectors

In the following I'll assume that you 
still want to use the Common Crawl, but these instructions will work for 
any word vectors that arrive in the same file format.  That format is
essentially
```
sausage 0.1234 -0.5555 1.4149
```
i.e. word, space, float, space, float, space float ... newline. 
This is what the code will assume when attempting to read things in.

### Download and unpack a text file of word vectors

If you want to use the GloVe Common Crawl data, then go to it's homepage
and download one of the files under 'Download pre-trained word vectors',
e.g. http://nlp.stanford.edu/data/glove.840B.300d.zip

When download is complete, unzip the file.  This should create
a roughly 5G file called `glove.840B.300d.txt`. I'll assume you
downloaded it to `~/Documents`.

### Point the package at the file of word vectors

Load the package and assign this location
```{r, eval = FALSE}
library(cbn)

cbn_set_vectorfile_location("~/Documents/glove.840B.300d.txt")
```
You can retrieve this location using `cbn_get_vectorfile_location()`.
If you change your prefered vectors, just call it again with a new location.
If you'd like this location to be remembered across R session add 
`persist = TRUE` to the function call.

### Extract vectors for your choice of words

To get a matrix of vectors for your words
```{r, eval = FALSE}
words <- c("Hugh", "Pugh", "Barney", "McGrew")
mat <- cbn_extract_word_vectors(words)
```
By default there is no reporting, but for a couple of hundred words this
function should return in around a minute for the 840B Common Crawl vectors.

If you want to watch progress, set `verbose` to TRUE.  A second argument
`report_every` controls how often a progress dot appears. It defaults to 
100000 (lines).

`mat` is a matrix with as many rows as `words` and as many column as the 
length of the vectors.  Ifyou are using the vectors above that will be 300.
The matrix has `words` as rownames and no column names.  In the event that
one of your words is not found in the vector file, the corresponding row of 
`mat` is filled with NAs.

## Analyze your vectors

With item information and corresponding vectors for your own words 
(or your own vectors) you can now use all the statistical functions 
described in the other vignettes.