---
title: "Using Your Own Items and Vectors"
author: "Will Lowe"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using Your Own Vectors}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

To run the statistics you'll need your own items and your own vectors. The process has two parts:

1. Acquire a list of words and associate them with a condition
2. Extract vectors for each word

## Associate words with conditions

To see the format required for items (words), take a look at the item information from the first WEAT study

```{r, eval = FALSE}
weat1 <- cbn_get_items(type = "WEAT")
```

The top of this data frame looks like

```{r eval = FALSE}
  Study Condition     Word   Role
1 WEAT1   Flowers    aster target
2 WEAT1   Flowers   clover target
3 WEAT1   Flowers hyacinth target
4 WEAT1   Flowers marigold target
5 WEAT1   Flowers    poppy target
6 WEAT1   Flowers   azalea target
```

Here, the study is called `WEAT1`. The conditions are `Flowers` and `Insects`, which have the `target` role, and `Pleasant` and `Unpleasant`, which have the `attribute` role.

The helper function `cbn_make_items` can help you create this structure for your own words.

Naturally, the words you have vectors for should match the words you have item information for.

## Extract vectors for each word

The `cbn` package bundles all the vectors you will need to replicate the paper analyses using the GloVe 840B 300-dimensional Common Crawl data. If, however, you want to work with different items, you'll need to point the package at your own file of word vectors. The process is:

1. Download and unpack a text file of word vectors
2. Point the package at the file of word vectors
3. Extract vectors for your choice of words
4. Analyze your vectors

In the following I'll assume that you still want to use the Common Crawl, but these instructions will work for any word vectors that arrive in the same file format.
That format is essentially

```
sausage 0.1234 -0.5555 1.4149
```

i.e. word, space, float, space, float, space, float ... newline. This is what the code will assume when attempting to read things in.

### Download and unpack a text file of word vectors

If you want to use the GloVe Common Crawl data, then go to its homepage and download one of the files under 'Download pre-trained word vectors', e.g. http://nlp.stanford.edu/data/glove.840B.300d.zip

When the download is complete, unzip the file. This should create a roughly 5 GB file called `glove.840B.300d.txt`. I'll assume you downloaded it to `~/Documents`.

### Point the package at the file of word vectors

Load the package and assign this location

```{r, eval = FALSE}
library(cbn)
cbn_set_vectorfile_location("~/Documents/glove.840B.300d.txt")
```

You can retrieve this location using `cbn_get_vectorfile_location()`. If you change your preferred vectors, just call `cbn_set_vectorfile_location` again with a new location. If you'd like this location to be remembered across R sessions, add `persist = TRUE` to the function call.

### Extract vectors for your choice of words

To get a matrix of vectors for your words

```{r, eval = FALSE}
words <- c("Hugh", "Pugh", "Barney", "McGrew")
mat <- cbn_extract_word_vectors(words)
```

By default there is no reporting, but for a couple of hundred words this function should return in around a minute for the 840B Common Crawl vectors. If you want to watch progress, set `verbose` to `TRUE`. A second argument, `report_every`, controls how often a progress dot appears. It defaults to 100000 (lines).

`mat` is a matrix with as many rows as `words` and as many columns as the length of the vectors. If you are using the vectors above, that will be 300. The matrix has `words` as row names and no column names.

In the event that one of your words is not found in the vector file, the corresponding row of `mat` is filled with NAs.

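
Because missing words come back as all-NA rows, it can be useful to list them before going further. The following is a minimal base R sketch, not part of the package: the matrix is a small made-up stand-in for the output of `cbn_extract_word_vectors()`, with three columns instead of 300 and with "Pugh" assumed to be absent from the vector file.

```r
# Stand-in for the result of cbn_extract_word_vectors(); the values are made up
# and the vectors have length 3 rather than 300.
words <- c("Hugh", "Pugh", "Barney", "McGrew")
mat <- matrix(c(0.1, 0.2, 0.3,
                NA,  NA,  NA,
                0.4, 0.5, 0.6,
                0.7, 0.8, 0.9),
              nrow = length(words), byrow = TRUE,
              dimnames = list(words, NULL))

# A word counts as missing when every entry in its row is NA.
missing_rows <- rowSums(is.na(mat)) == ncol(mat)
missing_words <- rownames(mat)[missing_rows]

# Keep only the rows for words that were found.
mat_found <- mat[!missing_rows, , drop = FALSE]
```

Here `missing_words` holds the words to chase up or drop, and `mat_found` keeps its matrix shape even if only one word remains, thanks to `drop = FALSE`.
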

## Analyze your vectors

With item information and corresponding vectors for your own words (or your own vectors) you can now use all the statistical functions described in the other vignettes.
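
Before running the statistics, it may be worth confirming that the item words and the matrix row names line up. The sketch below uses only base R and makes several assumptions: the items data frame is built by hand with the four columns shown earlier (the `cbn_make_items` helper is not used because its arguments are not described here), the study name and words are chosen purely for illustration, and the small matrix stands in for real vectors.

```r
# Hand-built item information with the columns Study, Condition, Word, Role.
# The study name and word choices are illustrative only.
items <- data.frame(
  Study = "MyStudy",
  Condition = c("Flowers", "Flowers", "Insects", "Insects"),
  Word = c("aster", "clover", "ant", "flea"),
  Role = "target",
  stringsAsFactors = FALSE
)

# Stand-in vector matrix: made-up values, rows in a different order from the
# items, and no row at all for "flea".
mat <- matrix(seq_len(9) / 10, nrow = 3, byrow = TRUE,
              dimnames = list(c("clover", "aster", "ant"), NULL))

# Item words that have no row in the matrix.
unmatched <- setdiff(items$Word, rownames(mat))

# Keep the items that do have vectors, then reorder the matrix rows to match.
items_ok <- items[items$Word %in% rownames(mat), ]
mat_ordered <- mat[items_ok$Word, , drop = FALSE]
```

After this, row `i` of `mat_ordered` is the vector for `items_ok$Word[i]`, and `unmatched` lists any words that still need vectors.
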