The initial name set is taken from the tables on pages 487-516 of "Das große Vornamenlexikon" (Duden, 2007).
These tables contain about 7000 unique names, from which a preselection of about 3000 names is compiled.
Names are checked against two collections of German words, accessible through the "Leipzig Wortschatz" project (Biemann, Heyer, Quasthoff, & Richter, 2007), to ensure correct spelling and to select common names. The collections used are "deu\_newscrawl\_2011" and "deu-na\_newscrawl\_2012".
Furthermore, names are encoded using the "Kölner Phonetik" (Postel, 1969) to detect similar-sounding names. Of all names with the same Kölner Phonetik encoding, only the most common one according to the collections from the Leipzig Wortschatz is kept.
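
The encoding step can be sketched as follows. This is a minimal illustration of the commonly published Kölner Phonetik rules, not the script from the "code" subcomponent; the function names `koelner_phonetik` and `most_common_per_code` and the example counts are hypothetical.

```python
import unicodedata


def koelner_phonetik(name: str) -> str:
    """Encode a name with the Kölner Phonetik rules (illustrative sketch)."""
    # Upper-case, decompose accented letters, and keep only A-Z.
    decomposed = unicodedata.normalize("NFKD", name.upper())
    letters = [c for c in decomposed if "A" <= c <= "Z"]

    digits = []
    for i, ch in enumerate(letters):
        prev = letters[i - 1] if i > 0 else None
        nxt = letters[i + 1] if i + 1 < len(letters) else None
        if ch in "AEIJOUY":
            code = "0"
        elif ch == "H":
            code = ""  # H is ignored
        elif ch == "B":
            code = "1"
        elif ch == "P":
            code = "3" if nxt == "H" else "1"
        elif ch in "DT":
            code = "8" if nxt in set("CSZ") else "2"
        elif ch in "FVW":
            code = "3"
        elif ch in "GKQ":
            code = "4"
        elif ch == "C":
            if i == 0:
                code = "4" if nxt in set("AHKLOQRUX") else "8"
            elif prev in set("SZ"):
                code = "8"
            else:
                code = "4" if nxt in set("AHKOQUX") else "8"
        elif ch == "X":
            code = "8" if prev in set("CKQ") else "48"
        elif ch == "L":
            code = "5"
        elif ch in "MN":
            code = "6"
        elif ch == "R":
            code = "7"
        else:  # S, Z
            code = "8"
        digits.append(code)

    raw = "".join(digits)
    # Collapse runs of identical digits, then drop every "0" except a leading one.
    collapsed = "".join(d for k, d in enumerate(raw) if k == 0 or d != raw[k - 1])
    return collapsed[:1] + collapsed[1:].replace("0", "")


def most_common_per_code(counts: dict) -> dict:
    """Map each encoding to its most frequent name (counts: name -> occurrences)."""
    best = {}
    for name, n in counts.items():
        code = koelner_phonetik(name)
        if code not in best or n > counts[best[code]]:
            best[code] = name
    return best


# Invented counts, for illustration only.
example_counts = {"Meier": 10, "Meyer": 50, "Mayer": 30, "Anna": 5}
selected = most_common_per_code(example_counts)
```

Here "Meier", "Meyer" and "Mayer" share one encoding, so only the most frequent of the three would survive this step.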
All names that match at least one of the following criteria are removed from the initial list:
1. The name is ambiguous in terms of gender, i.e. it appears in both the table of male names and the table of female names.
2. The name differs from another name in the same table only in diacritics. Of all names that have the same form after removal of all diacritics, only the one with the highest number of occurrences in the collections is kept.
3. The name does not occur in either of the two collections.
4. There is a similar-sounding and similarly spelled name in the table for the same gender. Similar-sounding names are detected with the "Kölner Phonetik": names with the same encoding form a cluster. Within each cluster, spelling is compared by calculating the Jaro-Winkler similarity (Winkler, 1990) between each pair of names. A name is removed if the same cluster contains another name with a Jaro-Winkler similarity of more than 0.8 and a higher number of occurrences in the collections.
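
Criterion 4 can be illustrated with a small, self-contained sketch. It implements the standard Jaro-Winkler definition (prefix scale 0.1, prefix length capped at 4) and is not the code from the "code" subcomponent; `reject_similar`, the lower-casing before comparison, and the example counts are assumptions made for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    # Find matching characters within the sliding window.
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler similarity: Jaro score boosted by a shared prefix (max 4)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


def reject_similar(cluster, counts, threshold=0.8):
    """Return the names in one sound cluster that have a more frequent,
    similarly spelled neighbour (similarity strictly above the threshold)."""
    rejected = set()
    for name in cluster:
        for other in cluster:
            if (other != name
                    and counts[other] > counts[name]
                    and jaro_winkler(name.lower(), other.lower()) > threshold):
                rejected.add(name)
                break
    return rejected


# Invented counts for one hypothetical cluster, for illustration only.
cluster_counts = {"Meier": 10, "Meyer": 50, "Mayer": 30}
removed = reject_similar(list(cluster_counts), cluster_counts)
```

In this sketch, names with equal counts never reject each other, because the comparison requires a strictly higher count.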
Python scripts for filtering can be found in the "code" subcomponent.
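
As a rough illustration of criteria 1 to 3 (this is not the code from the "code" subcomponent), the checks reduce to set and dictionary operations. The function names, the order in which the criteria are applied, and the example data below are assumptions.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Return the base form of a name with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def filter_names(names, other_gender_names, counts):
    """Apply criteria 1 to 3 to one gender's name list.

    names: names from one table
    other_gender_names: names from the other table
    counts: name -> number of occurrences in the collections
    """
    other = set(other_gender_names)

    # Criterion 1: drop names that also appear in the other gender's table.
    kept = [n for n in names if n not in other]

    # Criterion 2: among names sharing a base form, keep the most frequent one.
    best = {}
    for n in kept:
        base = strip_diacritics(n)
        if base not in best or counts.get(n, 0) > counts.get(best[base], 0):
            best[base] = n
    kept = [n for n in kept if best[strip_diacritics(n)] == n]

    # Criterion 3: drop names that never occur in the collections.
    return [n for n in kept if counts.get(n, 0) > 0]


# Invented data, for illustration only.
male = ["René", "Rene", "Kim", "Jürgen", "Xaverl"]
female = ["Kim", "Anna"]
occurrences = {"René": 40, "Rene": 12, "Jürgen": 90, "Kim": 7}
result = filter_names(male, female, occurrences)
```

With these example values, "Kim" is dropped as ambiguous, "Rene" loses to the more frequent "René", and "Xaverl" is dropped because it has no occurrences.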
## List of Files
### Name Lists
#### Male Names
* **Names_male_Duden_2007.csv**: Original set of names scanned by Sandra Werner from "Das große Vornamenlexikon" (Duden, 2007), Table "Männliche Vornamen".
* **Names_male_Duden_2007.incorpus.csv**: Original list of male names filtered for names that appeared at least once in any of the two collections.
* **Names_male_Duden_2007.soundcode.csv**: Original list of male names filtered for names that were similar in sound and spelling.
* **Names_male_Duden_2007.spelling.csv**: Original list of male names filtered for names which only differed in diacritics.
* **Names_male_Duden_2007.unambiguous.csv**: Original list of male names filtered for names that were ambiguous (i.e. appeared in both tables).
* **Names_male_selected.csv**: List of male names after filtering. Names are sorted by the number of occurrences in the collections.
#### Female Names
* **Names_female_Duden_2007.csv**: Original set of names scanned by Sandra Werner from "Das große Vornamenlexikon" (Duden, 2007), Table "Weibliche Vornamen".
* **Names_female_Duden_2007.incorpus.csv**: Original list of female names filtered for names that appeared at least once in any of the two collections.
* **Names_female_Duden_2007.soundcode.csv**: Original list of female names filtered for names that were similar in sound and spelling.
* **Names_female_Duden_2007.spelling.csv**: Original list of female names filtered for names which only differed in diacritics.
* **Names_female_Duden_2007.unambiguous.csv**: Original list of female names filtered for names that were ambiguous (i.e. appeared in both tables).
* **Names_female_selected.csv**: List of female names after filtering. Names are sorted by the number of occurrences in the collections.
#### Lists used for filtering
* **Ambiguous_names.csv**: List of names that appear in both the table of male names and the table of female names.
* **Names_male_rejected.csv**: List of all rejected male names with the reason why the name was filtered.
* **Names_female_rejected.csv**: List of all rejected female names with the reason why the name was filtered.
#### Features Derived from Nameset and Collection
* **Names_male_features.csv**: Occurrence counts gathered from the collections, the base form (no diacritics), and the Kölner Phonetik encoding for each male name.
* **Names_female_features.csv**: Occurrence counts gathered from the collections, the base form (no diacritics), and the Kölner Phonetik encoding for each female name.
## Changelog
* 01-23-2017: Manual correction of initial nameset. All names with 0 occurrences in the collections were checked manually and corrected.
* 01-24-2017: Duplicates removed.
* 01-24-2017: All steps of filtering added as individual files.
## References
* Duden (2007). *Das große Vornamenlexikon*, Brockhaus AG, Mannheim.
* Postel, H. J. (1969). *Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse.* IBM-Nachrichten, 19, 925–931.
* Biemann, C., Heyer, G., Quasthoff, U., & Richter, M. (2007). *The Leipzig Corpora Collection - Monolingual corpora of standard size.* Proceedings of Corpus Linguistics 2007.
* Winkler, W. E. (1990). *String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.* Proceedings of the Section on Survey Research Methods, American Statistical Association, 354–359.