Gender Encoding
This R package bundles a C library for categorising the gender of first names, thanks to Michael Jörg (available here). The library is extremely fast and flexible, covers all European languages and a host of others, and categorises gender very accurately. Here’s a test run on a very large data set of names from the English-speaking world:
u <- "https://github.com/hadley/data-baby-names/raw/master/baby-names.csv"
if (!file.exists ("baby-names.csv"))
    chk <- download.file (u, "baby-names.csv")
n <- read.csv ("baby-names.csv", stringsAsFactors = FALSE)
format (nrow (n), big.mark = ",")
#> [1] "258,000"
st <- system.time (x <- get_gender (n$name))
st
#>    user  system elapsed
#>   0.102   0.081   0.183
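The categories returned by get_gender() are distributed as follows:

Var1 | Freq |
---|---|
IS_FEMALE | 103059 |
IS_MALE | 95751 |
IS_MOSTLY_FEMALE | 15919 |
IS_MOSTLY_MALE | 17290 |
IS_UNISEX_NAME | 11296 |
NAME_NOT_FOUND | 14685 |

The sexes recorded in the “baby-names” data are evenly split:

Var1 | Freq |
---|---|
boy | 129000 |
girl | 129000 |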
Categorising 258,000 names took only 0.183 seconds, or around 1.4 million names per second. The following code measures how often the library’s categories agree with the recorded sex, noting that many names are of course unisex, whereas the “baby-names” data are direct records of individual names and sex.
x$gender [x$gender == "IS_MALE"] <- "boy"
x$gender [x$gender == "IS_MOSTLY_MALE"] <- "boy"
x$gender [x$gender == "IS_FEMALE"] <- "girl"
x$gender [x$gender == "IS_MOSTLY_FEMALE"] <- "girl"
index_right <- which (x$gender == n$sex)
message (format (length (index_right), big.mark = ","), " / ",
         format (nrow (x), big.mark = ","),
         " of names correctly classified = ",
         formatC (100 * length (index_right) / nrow (x),
                  format = "f", digits = 1), "%")
#> 217,630 / 258,000 of names correctly classified = 84.4%
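The same recoding can also be written with a named lookup vector, which keeps the mapping in one place. The following is only a sketch on made-up inputs: the vectors gender and sex below are toy stand-ins for x$gender and n$sex, not values from the data.

```r
# Toy inputs (hypothetical): library categories and recorded sexes
gender <- c ("IS_MALE", "IS_MOSTLY_FEMALE", "IS_UNISEX_NAME",
             "IS_FEMALE", "NAME_NOT_FOUND", "IS_MOSTLY_MALE")
sex <- c ("boy", "girl", "girl", "girl", "boy", "girl")

# Named vector mapping library categories onto the two recorded sexes
lookup <- c (IS_MALE = "boy", IS_MOSTLY_MALE = "boy",
             IS_FEMALE = "girl", IS_MOSTLY_FEMALE = "girl")

# Categories absent from the lookup (unisex, not found) become NA
recoded <- unname (lookup [gender])

# NA never matches a recorded sex, so those names count as misclassified
n_right <- sum (recoded == sex, na.rm = TRUE)
pct_right <- 100 * n_right / length (sex)
```

As in the code above, unisex and unmatched names are counted as misclassified, so the percentage is a conservative measure of agreement.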
Because the baby name records are structured over time and include many repeats of the same names, we can try to create “mostly girl/boy” categories based on the relative proportions of each sex recorded for each name.
library (dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
categorise_sex <- function (sex, size) {
    # define relative proportions:
    # if > rel_props [2], then category is singular
    # else if > rel_props [1], then category is "mostly" singular,
    # else category is unisex
    rel_props <- c (4, 1000)
    if (length (size) == 1)
        return (sex)
    bi <- which (sex == "boy")
    gi <- which (sex == "girl")
    if (size [bi] > (size [gi] * rel_props [2]))
        return ("boy")
    else if (size [gi] > (size [bi] * rel_props [2]))
        return ("girl")
    else if (size [bi] > (size [gi] * rel_props [1]))
        return ("mostly boy")
    else if (size [gi] > (size [bi] * rel_props [1]))
        return ("mostly girl")
    else
        return ("unisex")
}
n2 <- n |>
    group_by (name, sex) |>
    summarise (size = n ()) |>
    group_by (name) |>
    summarise (category = categorise_sex (sex, size))
#> `summarise()` has grouped output by 'name'. You can override using the
#> `.groups` argument.
Var1 | Freq |
---|---|
boy | 2764 |
girl | 3345 |
mostly boy | 147 |
mostly girl | 224 |
unisex | 302 |
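To make the effect of the two thresholds concrete, here is a small sketch that applies the same comparisons to a few made-up pairs of record counts. The helper ratio_category() is hypothetical and not part of the package; it takes the two counts directly, with zero standing for a sex that never appears for a name.

```r
# Hypothetical helper: classify a name from its numbers of boy and girl
# records, using the same two thresholds as categorise_sex() above
ratio_category <- function (n_boy, n_girl, rel_props = c (4, 1000)) {
    if (n_boy > n_girl * rel_props [2]) {
        "boy"
    } else if (n_girl > n_boy * rel_props [2]) {
        "girl"
    } else if (n_boy > n_girl * rel_props [1]) {
        "mostly boy"
    } else if (n_girl > n_boy * rel_props [1]) {
        "mostly girl"
    } else {
        "unisex"
    }
}

# Made-up counts, for illustration only
only_boys <- ratio_category (50, 0)    # no girl records at all
rare_girl <- ratio_category (5000, 3)  # more than 1000 boys per girl
mostly_b <- ratio_category (90, 10)    # more than 4 boys per girl
mostly_g <- ratio_category (10, 90)    # more than 4 girls per boy
mixed <- ratio_category (30, 20)       # neither sex exceeds 4 to 1
```

With these thresholds, a name is only “unisex” when neither sex outnumbers the other by more than 4 to 1, and only unambiguously “boy” or “girl” when one sex outnumbers the other by more than 1000 to 1 or the other sex never appears.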
The above values for relative proportions were selected to give good agreement with the observed overall distribution of categories as determined by the internal library. These two more refined data sets can then be compared:
n2$gender <- get_gender (n2$name)$gender
n2$gender [n2$gender == "IS_FEMALE"] <- "girl"
n2$gender [n2$gender == "IS_MALE"] <- "boy"
n2$gender [n2$gender == "IS_MOSTLY_FEMALE"] <- "mostly girl"
n2$gender [n2$gender == "IS_MOSTLY_MALE"] <- "mostly boy"
n2$gender [n2$gender == "IS_UNISEX_NAME"] <- "unisex"
Some names are simply not found, so we’ll remove those from the comparison before calculating final statistics.
n2 <- n2 [which (n2$gender != "NAME_NOT_FOUND"), ]
knitr::kable (with (n2, table (category, gender)))
category | boy | girl | mostly boy | mostly girl | unisex |
---|---|---|---|---|---|
boy | 1643 | 19 | 92 | 15 | 61 |
girl | 19 | 2221 | 15 | 89 | 66 |
mostly boy | 90 | 3 | 36 | 4 | 8 |
mostly girl | 0 | 173 | 3 | 30 | 7 |
unisex | 27 | 44 | 65 | 79 | 66 |
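One way to summarise such a cross-tabulation is the share of names that fall on the diagonal, where both categorisations agree exactly. The sketch below assumes that definition of agreement and re-enters the counts from the table by hand, so that it runs without the package or the data; the object names tab and exact are illustrative only.

```r
categories <- c ("boy", "girl", "mostly boy", "mostly girl", "unisex")

# Counts copied from the table above: rows are the categories derived
# from the data, columns are the categories returned by the library
tab <- matrix (c (1643,   19, 92, 15, 61,
                    19, 2221, 15, 89, 66,
                    90,    3, 36,  4,  8,
                     0,  173,  3, 30,  7,
                    27,   44, 65, 79, 66),
               nrow = 5, byrow = TRUE,
               dimnames = list (category = categories, gender = categories))

# Exact agreement: names on the diagonal as a share of all names
exact <- sum (diag (tab)) / sum (tab)
formatC (100 * exact, format = "f", digits = 1)
```

For the counts shown, 3,996 of 4,875 names lie on the diagonal, or roughly 82%. A more lenient measure that merges the “mostly” categories into their singular counterparts would give a higher figure.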
The accuracy in that case is