Code associated with the collection and organization of this name-gender association data can be found on [GitHub](https://github.com/ianvanbuskirk/nbgc). The Python package that uses this data to support name-based gender classification in scientific research is also available [online](https://github.com/ianvanbuskirk/nomquamgender).

In addition to raw and cleaned versions of the individual data sources (source\_data.zip), we offer three data resources, each capturing a different level of granularity.

The first resource (fine-grained\_name-gender\_data.csv) retains the full complexity of naming practices, making it possible to study, and make use of, how names change over time and across countries. In this resource each row represents a single datapoint in one of two possible formats. The first format is for *count data*: the empirical count of people with a particular name and gendered label ("f" for female/girl/woman and "m" for male/boy/man) in a particular source dataset. The second format is for *estimate data*: labels or scores assigned directly to names rather than counts of individuals, with values ranging from 0 (gendered male) to 100 (gendered female). Where necessary, we convert qualitative labels like "mostly-male" and "mostly-female" to numeric values. In both formats we record the name, source, country, year, data type (e.g. birth count, score), and the count or estimate itself. Together, these two formats give the finest level of granularity: empirical count data from a source is presented whenever it is available, and a source's pre-processed estimates are used only in its absence.

Alongside the primary fields discussed above, we include two additional pieces of information with each datapoint. First, a simplified, ASCII-encoded version of each name accompanies the original, UTF-8-encoded name.
Romanization, the dropping of diacritics, and the clipping of multipart names make it possible to compare and aggregate data across a greater number of sources. Further, in downstream applications these simplified names enable smoother matching between query and reference data. It is important to be aware that using simplified names means that, in certain cases, gendering information is lost. Researchers with this concern can use the unprocessed UTF-8-encoded names provided.

Second, an adjustment factor is computed to account for how different sources sample individuals from the general population. For example, Wikidata and the database of Olympic athletes both contain around 3 times more individuals labeled male (amodo males) than individuals labeled female (amodo females) in their samples. This would greatly bias estimates of how each name has been gendered unless the particular study population was sampled in a similar way. The adjustment rests on two assumptions: (1) each source should contain an equal number of males and females in its sample, to approximately reflect the composition of the general population, and (2) the probability of an individual being sampled by a source is independent of their name, conditional on their gendered label (i.e. a source does not have a bias towards including people with particular names). The correction that follows from these assumptions is to weight each observation of either males or females in a source by the degree to which that group is over- or underrepresented (a kind of post-stratification). We choose to weight the observations of females, as females are more frequently underrepresented. The computed weight for adjustment is provided alongside each datapoint.

The second resource (source-aggregated\_name-gender\_associations.json) is a reduction of the first and allows one to easily select a subset of sources to use for analysis and to build new classification models.
It takes the form of a dictionary with three levels. The top-level keys are names, giving each name its own dictionary of data, the keys of which are the individual sources that have data on the name in question. For each source there is a dictionary with the keys "m" and "f". For count data, these hold the adjusted "male/boy/man" and "female/girl/woman" counts for a specific name, aggregated over the countries and years contained in that dataset. For estimate data, they hold the total weight assigned to the gendered-male and gendered-female estimates.

The third resource (averaged\_name-gender\_estimates.json) is a reduction of the second and provides a summative account of how each name is gendered, taking the form of a dictionary with two levels. Here, each name is associated with three values: the number of sources that had data on this name, the adjusted counts aggregated across sources, and an estimate of how strongly the name is gendered female, ranging from 0 (gendered male) to 1 (gendered female). The estimate is formed by computing, for each source, the fraction of the counts/weights that were labeled "f" and then averaging these source-specific estimates.

Some additional data necessary for replicating the results shown in the paper introducing these data resources is also provided. Finally, more information on the collection of source data is provided [elsewhere](https://www.notion.so/ianvb/Source-Data-Documentation-c4b87d97fe444ae9b87b2ca8344b96ac).
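
To make the chain of reductions concrete, the following is a minimal sketch of how count data could flow from fine-grained rows to an averaged estimate. It is an illustration under stated assumptions, not the authors' code: the row layout `(name, source, label, count)`, the function names, the output keys (`n_sources`, `p_f`), and the toy sources "A" and "B" are all hypothetical. In particular, the female weight is assumed here to be the ratio of male to female totals within a source, which is one reading of the equal-representation assumption described above; the weights shipped with the data may be computed differently.

```python
from collections import defaultdict

def female_weights(rows):
    """Assumed weight per source: total "m" count divided by total "f" count,
    so that the weighted "f" total matches the "m" total in that source."""
    totals = defaultdict(lambda: {"m": 0.0, "f": 0.0})
    for _name, source, label, count in rows:
        totals[source][label] += count
    return {s: t["m"] / t["f"] for s, t in totals.items() if t["f"] > 0}

def aggregate_by_source(rows, weights):
    """Build a name -> source -> {"m", "f"} dictionary of adjusted counts."""
    agg = {}
    for name, source, label, count in rows:
        # Only "f" observations are reweighted; "m" observations keep weight 1.
        w = weights.get(source, 1.0) if label == "f" else 1.0
        cell = agg.setdefault(name, {}).setdefault(source, {"m": 0.0, "f": 0.0})
        cell[label] += count * w
    return agg

def average_estimates(agg):
    """Average the per-source fraction labeled "f" for each name."""
    out = {}
    for name, sources in agg.items():
        cells = [c for c in sources.values() if c["m"] + c["f"] > 0]
        if not cells:
            continue
        fractions = [c["f"] / (c["m"] + c["f"]) for c in cells]
        out[name] = {
            "n_sources": len(cells),                 # hypothetical key name
            "m": sum(c["m"] for c in cells),
            "f": sum(c["f"] for c in cells),
            "p_f": sum(fractions) / len(fractions),  # hypothetical key name
        }
    return out

# Toy data: source "A" samples three times as many males as females.
rows = [
    ("alex", "A", "m", 60), ("alex", "A", "f", 20),
    ("sam", "A", "m", 240), ("sam", "A", "f", 80),
    ("alex", "B", "m", 30), ("alex", "B", "f", 10),
    ("maria", "B", "f", 20),
]

weights = female_weights(rows)
agg = aggregate_by_source(rows, weights)
estimates = average_estimates(agg)
print(estimates["alex"])
```

Note that the final estimate averages per-source fractions rather than pooling counts, so a very large source does not dominate a small one. Estimate data (scores rather than counts) would need its own handling and is left out of this sketch.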