**EXAMPLE OF GENDER STUDIES USING CHINESE DYNASTIC HISTORIES**
The corpus of the 24 Chinese Dynastic Histories was used to conduct a study of gender semantics. The results are described in the article: Zinin, S. and Xu, Y. (2020) Corpus of Chinese dynastic histories: Gender analysis over two millennia. In Proceedings of the 12th International Conference on Language Resources and Evaluation, Marseille, France. Below are the main points of the article.
1) The CDH project makes the corpora of the 24 dynastic histories freely available for the first time, under a CC license. The body of the corpus is neither new nor developed by the authors. The 24 histories were the first online digital classics project, developed by the Ancient Chinese Sinica corpus project. However, that corpus is available online only for searches, as a concordance, and cannot be used for experiments. There are currently many versions of the 24 histories online, but none of them offers a ready-to-use free corpus. Wikisource provides its texts, including the 24 histories, under a CC license. However, its version of the text, as presented on the site, is not convenient for experiments. Moreover, Wikisource allows online editing of its texts. This means that even if there was a simple text form at the beginning of the project, after many online changes its online form exists as an artifact sui generis. A certain amount of work was required to clean up and combine the Wikisource texts into a stand-alone UTF-8 text version of the 24 histories that can be used for experiments as well as for further development. For example, researchers could create POS-tagged versions of these histories without asking permission (although such versions should also be covered by the CC license). The authors consider this a contribution to the field of Chinese corpus linguistics.
2) Along with creating an open-source corpus of the 24 histories, the CDH project conducts a semantic study of gender-specific terms in Classical (Literary) Chinese. Classical Chinese is a genderless language; therefore, to study gender semantics, one needs a list of gender-specific terms, such as nan (man) and nv (woman), with which to explore the corpus. Farris was the pioneer in creating lists of gender-specific terms for modern Chinese, but this had never been done for Classical Chinese. The list was tailored to the corpus of the 24 histories, so it cannot be considered a full list for all of Classical Chinese. In addition, the text of the 24 histories currently lacks word segmentation. Therefore, the list was created by identifying all potential gender-specific terms in the list of unique bigrams and trigrams (with translations provided by the CC-CEDICT dictionary). The current list was built on the basis of seven "core histories", representing the critical points in the development of the Chinese language (while Classical Chinese should not have been directly affected by this development, there could be some correlation). Only terms that were found in (almost) all of these histories were entered on the list. This is an important point. There are many other gender-specific terms, e.g., court titles, that are idiosyncratic to specific histories. Chinese court titles are very fluid and were invented ad hoc, so many of them did not enter this list, although they could be prominent in some histories (and disappear in others). When more histories were added, some terms on the list turned out to be missing from them, but these terms were not excluded. (It should be kept in mind that, even though target terms on this list may be absent from some intermediate history, they should always be present at the critical points over the whole historical period.) Therefore, this list is a compromise. Besides, it is built on a corpus of formal historical texts.
Nonetheless, the authors consider this list a contribution to the field that could serve as a basis for further development.
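The candidate-extraction step in point 2 can be illustrated with a minimal sketch. It is not the authors' code: the function names, the `min_share` parameter, and the toy strings are assumptions made for illustration. It collects character bigrams and trigrams from unsegmented text and keeps those that occur in at least a given share of the histories. Selecting the gender-specific terms among the candidates is still a manual step, and a real run would first have to deal with punctuation, which this sketch ignores.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of an unsegmented string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def shared_ngrams(histories, sizes=(2, 3), min_share=1.0):
    """Collect n-grams that occur in at least `min_share` of the histories.

    `histories` maps a history name to its raw text. `min_share` is a
    hypothetical knob standing in for the "(almost) all" criterion.
    """
    doc_freq = Counter()
    for text in histories.values():
        seen = set()
        for n in sizes:
            seen.update(char_ngrams(text, n))
        doc_freq.update(seen)  # count each n-gram once per history
    threshold = min_share * len(histories)
    return {gram for gram, count in doc_freq.items() if count >= threshold}

# Toy stand-ins for the core histories (invented strings, not corpus text).
toy_histories = {
    "A": "皇后生太子",
    "B": "立皇后王氏",
    "C": "皇后崩",
}
candidates = shared_ngrams(toy_histories)
```

Lowering `min_share` below 1.0 would admit terms that are missing from a few histories, which is one way to read the "(almost) all" criterion.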
3) The resulting list contained 81 male terms and 31 female terms. The terms usually belong to the areas of kinship (family terms), social status (aristocratic titles), and officialdom (official titles and professional terms). The study found that there are many more male gender-specific terms in the dynastic histories than female ones. That is, the semantic field of male gender-specific terms in Classical Chinese is more developed and articulated than the female field. This is perhaps not surprising, considering that the traditional society was male-dominated and the dynastic histories were written from that point of view. The authors did not conduct comparative studies (i.e., whether this is the norm for other traditional societies). It should also be understood that in Chinese texts, because the language is genderless, gender-specific terms can be avoided altogether, even when gender-specific characters are described. Often, metaphorical language is employed to describe people, e.g., "green clothes" for maidservants.
4) Once the list of gender-specific terms has been created, it becomes possible to study gender semantics in the 24 histories. This means, among other things, extracting lists of terms that are associated with gender-specific terms. In a very "naive" stereotypical approach, considering that this is a traditional historical corpus, one might expect, e.g., that "man" would be associated with such terms as "power", "money", "arms", "war", "ruling", etc., while "woman" would be associated with "marriage", "children", "family", "birth", etc.
5) The study of gender semantics was conducted on so-called "focus corpora". That is, the study did not use the whole corpus (directly) but a cut-out part of it, consisting of the text fragments that include the studied term or terms. Using focus corpora increases the semantic potential, or semantic saturation, of the text: the focus corpus is supposed to have a higher level of semanticity for the target word. The size of the fragment is defined by the study methodology. Most often it is a 3/3 or 5/5 window (three or five words to the left and right of the target). More rarely, it is a sentence. In this study, two types of window were used: sentence and paragraph. The average size of a sentence in this corpus is about 19 characters, which could correspond to 30-40 words in English translation, due to the compactness of Classical Chinese. The authors presume that a sentence is a semantically complete utterance, and that a paragraph contains even more valuable information for a distributional semantics study.
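A focus corpus as described in point 5 could be built along the following lines. This is a hedged sketch, not the project's implementation: the function names are hypothetical, sentences are assumed to end in 。, ！ or ？, and targets are matched as plain substrings, which can over-match in unsegmented text.

```python
import re

# Assumption: the edition is punctuated and sentences end in one of these marks.
SENTENCE = re.compile(r"[^。！？]+[。！？]?")

def split_sentences(paragraph):
    """Split a paragraph into sentences, keeping the final punctuation mark."""
    return [s for s in SENTENCE.findall(paragraph) if s.strip()]

def focus_corpus(paragraphs, targets, window="sentence"):
    """Keep only the units (sentences or paragraphs) containing a target term."""
    units = []
    for paragraph in paragraphs:
        parts = split_sentences(paragraph) if window == "sentence" else [paragraph]
        units.extend(u for u in parts if any(t in u for t in targets))
    return units

# Invented toy paragraphs, not corpus text.
toy_paragraphs = ["帝崩。皇后立太子。", "大將軍出征。"]
by_sentence = focus_corpus(toy_paragraphs, ["皇后"], window="sentence")
by_paragraph = focus_corpus(toy_paragraphs, ["皇后"], window="paragraph")
```

With the paragraph window the first paragraph is kept whole, so the focus corpus also carries the preceding sentence about the emperor, which is the extra context the paragraph window is meant to capture.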
6) The first approach to the task was a simple character distribution study. Lists of context terms (i.e., terms that occur together with the target terms in a bag-of-words the size of a sentence or a paragraph) were created. The initial approach was to look at the distribution of context terms in relation to the group of all male terms and the group of all female terms, over all periods. That is, the focus corpus was created for all terms on the gender-specific list. A special context-target table (the "synoptic table") was created, in which only the fact of co-occurrence was recorded, not its frequency or temporal distribution. The table demonstrated that most context terms of female terms are at the same time context terms of male terms, but not vice versa.
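The binary comparison behind the synoptic table in point 6 can be sketched as follows. The names and the toy sentences are assumptions for illustration; the sketch records only whether a character co-occurs with any target of a group, not how often, and it leaves the target terms themselves out of the context sets.

```python
PUNCT = set("，。、；：！？「」『』 \n")

def context_sets(units, targets_by_group):
    """Return, per group, the set of characters co-occurring with any of its targets."""
    # Strip longer targets first so a short term does not split a longer one.
    all_targets = sorted(
        (t for terms in targets_by_group.values() for t in terms),
        key=len,
        reverse=True,
    )
    contexts = {group: set() for group in targets_by_group}
    for unit in units:
        rest = unit
        for target in all_targets:
            rest = rest.replace(target, "")
        chars = {ch for ch in rest if ch not in PUNCT}
        for group, terms in targets_by_group.items():
            if any(t in unit for t in terms):
                contexts[group] |= chars  # presence only, no frequencies
    return contexts

# Invented toy sentences, not corpus text.
units = ["生男女各一人。", "男年十五從軍。", "女年十五。"]
ctx = context_sets(units, {"male": ["男"], "female": ["女"]})
shared = ctx["male"] & ctx["female"]
male_only = ctx["male"] - ctx["female"]
female_only = ctx["female"] - ctx["male"]
```

In this toy case every female context character is also a male context character, while the male group has contexts of its own, which mirrors the asymmetry reported for the real table.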
7) This established another aspect of the domination of male gender-specific terms in the dynastic histories. Not only are there more male terms, but they are also related to more context terms. In a way, this is simply a result of the co-distribution of male and female terms. Most paragraphs in which a female term is present also contain a male term (the picture for sentences is similar). However, there are many paragraphs that contain a male term but no female term. Therefore, the distribution of context terms simply reflects the co-distribution of male and female terms, with male terms dominating. It could be interpreted as male actors having more activities that are not shared with female actors, which looks trivial for a dynastic history. Put differently, it is hard to find a considerable stretch of dynastic history text that is devoted only to female activities (female and male are understood here as the presence of one of the selected gender-specific terms).
8) The results of the study demonstrate that the "naive" approach to the study of semantic gender does not work immediately for a distributional study of the Chinese dynastic histories. The paradigm of semantic gender study needs to be elaborated. One needs to resolve questions such as: what should be expected from a distributional study of complementary semantic concepts, e.g., male/female?
9) One interesting outcome of creating the synoptic table is the discovery that not all terms in it are equal. It is easy to see that some context terms are found in the windows of practically all gender terms. They could be called "star context terms", as they are connected to many target terms, like a busy node in a network diagram. Other (and most) context terms are related to only a few gender-specific terms, or even to just one. The same is true for gender-specific terms: some of them are connected to almost all context terms, while others are connected to just one or two. The former could be called "star target terms". So the study makes it possible to discover the semantically most significant gender-specific terms and their significant context terms, i.e., to establish a hierarchy of terms.
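One way to make the notion of "star" terms in point 9 concrete is to count connections in the target-context table, as in a bipartite graph. The sketch below is an assumption about how this might be done; the 0.9 cut-off, the function names, and the toy table are illustrative and not taken from the study.

```python
from collections import Counter

def degrees(table):
    """Count connections for targets and for context terms.

    `table` maps each target term to the set of its context terms.
    """
    target_degree = {target: len(contexts) for target, contexts in table.items()}
    context_degree = Counter(ch for contexts in table.values() for ch in contexts)
    return target_degree, context_degree

def stars(degree, total, min_share=0.9):
    """Return the terms connected to at least `min_share` of `total` partners."""
    return {term for term, d in degree.items() if d >= min_share * total}

# Invented toy table, not study data.
toy_table = {
    "男": {"年", "生", "軍"},
    "女": {"年", "生"},
    "后": {"年"},
}
target_degree, context_degree = degrees(toy_table)
star_contexts = stars(context_degree, total=len(toy_table))
star_targets = stars(target_degree, total=len(context_degree))
```

Sorting either degree mapping in descending order would give the kind of term hierarchy the point describes.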
10) Although it is hard to find single-gendered female sections of text in the dynastic histories, it is possible to study the semantics of single gender-specific terms with other computational linguistics methods, e.g., topic analysis and keyword analysis. Focus corpora are used here again, this time for single targets. For topic analysis (e.g., with LDA), the focus corpus for a single term (e.g., nan (man) or nv (woman)) can be used on its own; for keyword analysis, such a focus corpus is contrasted with the rest of the dynastic histories corpus, from which the focus corpus has been removed.
11) Topic analysis of focus corpora is often used to extract the meanings of the target word. Although the extracted words are not a direct definition of the meaning or meanings of the target, they can be considered semantically close to it and pertinent to its meaning. Topic analysis of the focus corpora for nan and nv was conducted, and it produced sets of words (characters) that demonstrate some semantic contrast. This could be a promising way to study the semantics of gender terms.
12) Keyword analysis is supposed to extract terms that describe the "aboutness", or content, of a text when it is contrasted with a larger corpus. In this study, a focus corpus for a target term is contrasted with the rest of the dynastic histories. The keywords extracted in this way are supposed to express the aboutness of the focus corpus (which is similar to, but not the same as, topicality). The study discovered that the keywords extracted for, e.g., nan contain more male gender-specific terms than the keywords for nv. In a way, this is similar to what the "naive" approach described above would expect. It may even be possible to see the keywords for the target terms as semantic "signatures" of the terms.
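The contrast in point 12 can be sketched with the log-likelihood (G2) keyness statistic, which is widely used in corpus linguistics. The summary does not say which statistic the study used, so this choice is an assumption, as are the function names and the toy data. Characters stand in for words, and punctuation is assumed to have been removed beforehand.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 for a term seen `a` times in a focus corpus of size `c`
    and `b` times in a reference corpus of size `d`."""
    expected_focus = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / expected_focus)
    if b:
        g2 += b * math.log(b / expected_ref)
    return 2 * g2

def keywords(focus_units, reference_units, top=10):
    """Rank characters that are over-represented in the focus corpus."""
    focus = Counter(ch for unit in focus_units for ch in unit)
    ref = Counter(ch for unit in reference_units for ch in unit)
    c, d = sum(focus.values()), sum(ref.values())
    scored = []
    for term, a in focus.items():
        b = ref.get(term, 0)
        if a / c > b / d:  # keep only positive keywords
            scored.append((log_likelihood(a, b, c, d), term))
    return [term for _, term in sorted(scored, reverse=True)[:top]]

# Invented toy data: the focus corpus for a target versus the rest of the text.
focus_units = ["男耕", "男戰"]
reference_units = ["女織", "民耕"]
top_terms = keywords(focus_units, reference_units)
```

In a real run the target term itself will rank first by construction, as 男 does here, so it is usually dropped before the remaining keywords are read as a "signature".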
13) The study also conducted an elementary diachronic distributional analysis. For target terms, the (relative) frequency of context terms (for sentences and paragraphs) was measured, and an attempt was made to establish a trend over time by fitting a linear regression line. If the line was clearly going up, the pair (context term and target) was considered to be growing, and vice versa. If there was no definite slope, the pair was considered neutral. With the current arrangement (in which each history was treated as a point on the time axis), most pairs came out neutral. Only a small part of the set of pairs showed definite positive or negative growth. (But that could be a result of the framework arrangement; the histories should have been lumped together into larger periods.) On the other hand, no news is good news. The lack of distributional change in the context terms for genders (at least as it was counted here) is perhaps one of the few (if any) real longitudinal tests of the stability of Literary Chinese.
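The trend test in point 13 can be sketched with an ordinary least-squares slope over the sequence of histories. The summary does not state how a "definite" slope was decided, so the tolerance `tol`, the function names, and the toy frequencies below are assumptions for illustration only.

```python
def slope(values):
    """Least-squares slope of `values` against their positions 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    return sxy / sxx

def classify_trend(values, tol=1e-3):
    """Label a target-context pair as growing, declining, or neutral.

    `tol` is a hypothetical cut-off below which the slope counts as flat.
    """
    s = slope(values)
    if s > tol:
        return "growing"
    if s < -tol:
        return "declining"
    return "neutral"

# Invented relative frequencies of one context term near one target,
# one value per history in chronological order.
rising = classify_trend([0.01, 0.02, 0.03, 0.04])
flat = classify_trend([0.02, 0.02, 0.02])
falling = classify_trend([0.05, 0.03, 0.01])
```

Grouping histories into larger periods, as the point suggests, would amount to averaging the values within each period before fitting the line.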