---
title: "Alternative Statistics"
author: "Will Lowe"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    fig_width: 7
    fig_height: 7 
vignette: >
  %\VignetteIndexEntry{Alternative Statistics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This time we'll use the simple
bootstrapping techniques in `weat_boot` and `wefat_boot`.

First we load up the package and arrange the graphics
```{r}
library(cbn)

library(ggplot2)
theme_set(theme_minimal())
```

## Replicating WEAT

Are flowers more pleasant than insects?

Grab the items from the first WEAT study and the vectors corresponding to 
those words
```{r}
its <- cbn_get_items("WEAT", 1)
head(its)
its_vecs <- cbn_get_item_vectors("WEAT", 1)
dim(its_vecs)
```

Now get a bootstrapped difference of differences of cosines
```{r}
res <- weat_boot(its, its_vecs, 
                 x_name = "Flowers", y_name = "Insects",
                 a_name = "Pleasant", b_name = "Unpleasant",
                 se.calc = "quantile")
res
```
Apparently flowers are more pleasant than insects.

## Replicating WEFAT

Same as before to get items and vectors
```{r}
its <- cbn_get_items("WEFAT", 1)
its_vecs <- cbn_get_item_vectors("WEFAT", 1)
```
Now to get bootstrapped differences of cosines. Note that there is no `y_name` this
time and we will get a statistic for each `x_name`.
```{r}
res <- wefat_boot(its, its_vecs, x_name = "Careers",
                  a_name = "MaleAttributes", b_name = "FemaleAttributes",
                  se.calc = "quantile")
head(res)
```

This is a bit hard to interpret, so we'll make a picture

```{r}
ggplot(res, aes(x = median, y = 1:nrow(res))) +
  geom_point(col = "grey") +
  geom_point(aes(x = diff)) +
  geom_errorbarh(aes(xmin = lwr, xmax = upr), height = 0) +
  geom_text(aes(x = upr, label = Careers), hjust = "left", nudge_x = 0.005) +
  xlim(-0.25, 0.25) +
  ylab("Careers") +
  xlab("Cosine difference (male - female)")
```

This will work quite generally for WEFATs, but remember to mention the 
right condition name in `geom_text`.

I don't have the male / female proportions for different jobs, so we can't 
compare them right now.

## The Gender of Androgenous Names

First we get the vector differences
```{r}
its <- cbn_get_items("WEFAT", 2)
its_vecs <- cbn_get_item_vectors("WEFAT", 2)
res <- wefat_boot(its, its_vecs, x_name = "AndrogynousNames",
                  a_name = "MaleAttributes", b_name = "FemaleAttributes",
                  se.calc = "quantile")
```
Then we find the gender proportions for each name from the census.  
This is most easily done using the `gender` package, which queries the US 
Social Security Administration to get the proportion of stated males and females 
with any particular first name.  

This data is bundled with the package, so we'll join this to `res`
```{r}
data(cbn_gender_name_stats)
head(cbn_gender_name_stats)

res <- merge(res, cbn_gender_name_stats, 
             by.x = "AndrogynousNames", by.y = "name")
```
If you want to do it yourself, e.g. to look at gender over different time 
periods, or use a different gender source, then 
```{r, eval = FALSE}
library(gender)

names <- c("Hugh", "Pugh", "Barney")
gender_name_stats <- gender(names)
```
and replace `cbn_gender_name_stats` with `gender_name_stats`.

To see how they relate, we'll plot proportion male with the `diff` column of
`res`
```{r}
ggplot(res, aes(x = proportion_male, y = diff)) +
  geom_hline(yintercept = 0, alpha = 0.5, color = "grey") + 
  geom_point() +
  geom_text(aes(label = AndrogynousNames), hjust = "left", nudge_x = 0.005) +
  xlim(0, 1) +
  xlab("Population proportion male") +
  ylab("Cosine difference (male - female)")
```

The correlation is
```{r}
cor.test(res$diff,res$proportion_male)
```