To run the statistics you’ll need your own items and your own vectors. The process has two parts: preparing the item information and obtaining the word vectors.
To see the format required for items (words), take a look at the item information from the first WEAT study.
The top of this data frame looks like:
  Study Condition     Word   Role
1 WEAT1   Flowers    aster target
2 WEAT1   Flowers   clover target
3 WEAT1   Flowers hyacinth target
4 WEAT1   Flowers marigold target
5 WEAT1   Flowers    poppy target
6 WEAT1   Flowers   azalea target
Here, the study is called WEAT1, the conditions are Flowers and Insects (the target roles) and Pleasant and Unpleasant (the attribute roles). The helper function cbn_make_items can be helpful for creating this structure for your own words. Naturally, the words you have vectors for should match the words you have item information for.
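As a rough illustration of the structure described above, a data frame with the same four columns can be assembled by hand in base R. The study name MyStudy and the short word lists are made up for this sketch; cbn_make_items is the package's own helper for this job and is not called here.

```r
# Hypothetical word lists: two target conditions and two attribute conditions
conditions <- list(
  Flowers    = c("aster", "clover"),
  Insects    = c("ant", "flea"),
  Pleasant   = c("love", "peace"),
  Unpleasant = c("filth", "death")
)

# Role played by each condition
roles <- c(Flowers = "target", Insects = "target",
           Pleasant = "attribute", Unpleasant = "attribute")

# One row per word, with the same columns as the WEAT1 item information
items <- data.frame(
  Study     = "MyStudy",
  Condition = rep(names(conditions), lengths(conditions)),
  Word      = unlist(conditions, use.names = FALSE),
  Role      = rep(unname(roles[names(conditions)]), lengths(conditions)),
  stringsAsFactors = FALSE
)

head(items)
```

The Word column is what needs to line up with the words you later extract vectors for.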
The cbn package bundles all the vectors you will need to
replicate the paper analyses using the GloVe 840B 300-dimensional Common
Crawl data. If, however, you want to work with different items you’ll
need to point the package at your own file of word vectors. The process
is:
In the following, I’ll assume that you still want to use the Common Crawl, but these instructions will work for any word vectors that arrive in the same file format. That format is essentially:
sausage 0.1234 -0.5555 1.4149
i.e. word, space, float, space, float, space, float, …, newline. This is what the code will assume when attempting to read things in.
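To make the format concrete, here is a minimal base R sketch that writes a tiny file in this layout and parses it back. The helper parse_vector_line and the numbers in the second line are invented for illustration; this is not how the package itself reads the file.

```r
# Write a tiny two-line file in the assumed format (second line is invented)
tmp <- tempfile(fileext = ".txt")
writeLines(c("sausage 0.1234 -0.5555 1.4149",
             "mustard 0.5 0.25 -1"), tmp)

# Split a line on single spaces: the first field is the word,
# the remaining fields are the vector elements
parse_vector_line <- function(line) {
  fields <- strsplit(line, " ", fixed = TRUE)[[1]]
  list(word = fields[1], vector = as.numeric(fields[-1]))
}

parsed <- lapply(readLines(tmp), parse_vector_line)
parsed[[1]]$word    # "sausage"
parsed[[1]]$vector  # 0.1234 -0.5555 1.4149
```

A file that uses tabs, a header line, or a different field order would need converting to this layout first.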
If you want to use the GloVe Common Crawl data, then go to its homepage and download one of the files under ‘Download pre-trained word vectors’, e.g. http://nlp.stanford.edu/data/glove.840B.300d.zip
When the download is complete, unzip the file. This should create a roughly 5 GB file called glove.840B.300d.txt. I’ll assume you downloaded it to ~/Documents.
Load the package and assign this location
You can retrieve this location using cbn_get_vectorfile_location(). If you change your preferred vectors, just call the location-setting function again with a new location. If you’d like this location to be remembered across R sessions, add persist = TRUE to the function call.
To get a matrix of vectors for your words
By default there is no reporting, but for a couple of hundred words this function should return in around a minute for the 840B Common Crawl vectors.
If you want to watch progress, set verbose to TRUE. A second argument, report_every, controls how often a progress dot appears. It defaults to 100000 (lines).
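The kind of line-by-line scan that these arguments control can be sketched in base R as follows. The function extract_vectors_sketch is a hypothetical stand-in written for this example, not the package's implementation; only the argument names verbose and report_every are taken from the text, and the toy file contents are invented.

```r
# Sketch only: a hypothetical stand-in, not the cbn implementation
extract_vectors_sketch <- function(words, vectorfile,
                                   verbose = FALSE, report_every = 100000) {
  words <- unique(words)
  found <- list()
  con <- file(vectorfile, open = "r")
  on.exit(close(con))
  n <- 0
  while (length(line <- readLines(con, n = 1)) > 0) {
    n <- n + 1
    # Print a progress dot every report_every lines
    if (verbose && n %% report_every == 0) cat(".")
    # Compare only the text before the first space with the wanted words
    word <- sub(" .*$", "", line)
    if (word %in% words) {
      fields <- strsplit(line, " ", fixed = TRUE)[[1]]
      found[[word]] <- as.numeric(fields[-1])
    }
    # Stop reading once every requested word has been found
    if (length(found) == length(words)) break
  }
  if (verbose) cat("\n")
  # Rows start out as NA, so words missing from the file stay NA
  dims <- if (length(found) > 0) length(found[[1]]) else 0
  mat <- matrix(NA_real_, nrow = length(words), ncol = dims,
                dimnames = list(words, NULL))
  for (w in names(found)) mat[w, ] <- found[[w]]
  mat
}

# Toy vector file with invented values; "poppy" is deliberately absent
tmp <- tempfile(fileext = ".txt")
writeLines(c("aster 0.1 0.2 0.3", "clover -0.4 0.5 0.6"), tmp)
mat <- extract_vectors_sketch(c("aster", "poppy"), tmp,
                              verbose = TRUE, report_every = 1)
```

Reading one line at a time keeps memory use small for a multi-gigabyte file, at the cost of speed, which is why a progress indicator is useful.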
mat is a matrix with as many rows as words and as many columns as the length of the vectors. If you are using the vectors above that will be 300. The matrix has words as rownames and no column names. In the event that one of your words is not found in the vector file, the corresponding row of mat is filled with NAs.
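Because missing words show up as rows of NAs, it is worth checking for them before running any statistics. A small base R sketch, using a toy stand-in for mat with invented values:

```r
# Toy stand-in for mat: two words found, one missing (values are invented)
mat <- rbind(aster  = c(0.1, 0.2, 0.3),
             poppy  = c(NA, NA, NA),
             clover = c(-0.4, 0.5, 0.6))

# Rows made up entirely of NA correspond to words absent from the vector file
missing_words <- rownames(mat)[rowSums(is.na(mat)) == ncol(mat)]
missing_words  # "poppy"

# Keep only the words that were found
mat_found <- mat[!rownames(mat) %in% missing_words, , drop = FALSE]
```

If any words are dropped this way, the same words should also be removed from the item information so that the two stay matched.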
With item information and corresponding vectors for your own words (or your own vectors) you can now use all the statistical functions described in the other vignettes.