This dataset contains the raw frequency counts (classical_chinese_learners_vocabularies_raw_frequencies.zip) used in the article Thoughts on “Reliable” Learner’s Vocabularies for Classical and Literary Chinese. The counts cover three corpora:

Corpus I – Michael Loewe’s (1993) Early Chinese Texts
Corpus II – Official Histories (zhengshi 正史)
Corpus III – Six Novels (xiaoshuo 小說), as defined in Hsia 1968

The download includes one folder per corpus, structured as follows:

xx_corpus.csv > list of texts and sources / versions used, token and type counts
xx_freq_1-1.csv > unigram / character frequencies and counts
xx_freq_1-4.csv > 1- to 4-character word frequencies and counts ("words" according to Hanyu da cidian 漢語大詞典 (Luo 1986–1994))
xx_freq_2-4.csv > 2- to 4-character words

Additionally, pca_zhengshi_vs_loewe_vs_xiaoshuo.html is an interactive version of the Principal Component Analysis (PCA) presented in the article; texts from the three corpora are represented using the 1,000 most frequent 1–4 character combinations from the dataset.
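
The frequency files can be read with any CSV reader. The following is a minimal Python sketch, not part of the dataset, that loads one file and returns its highest-count rows. The column layout is not documented in this description, so the delimiter, the encoding, and the column name passed as count_column are assumptions to check against the actual header row; the file path in the usage part is likewise a hypothetical placeholder.

```python
import csv
from pathlib import Path


def load_frequency_table(path, delimiter=",", encoding="utf-8"):
    """Read one frequency CSV into a list of dictionaries keyed by the header row.

    The delimiter and encoding defaults are assumptions; adjust them if the
    files use a different separator or character encoding.
    """
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def top_entries(rows, count_column, n=10):
    """Return the n rows with the largest integer value in count_column.

    count_column must be a column name that exists in the file header.
    """
    return sorted(rows, key=lambda row: int(row[count_column]), reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical path: replace "xx" with the prefix used in the downloaded folder.
    example_path = Path("xx_freq_1-1.csv")
    if example_path.exists():
        table = load_frequency_table(example_path)
        # Print the header names first to see which column holds the counts.
        print(list(table[0].keys()) if table else "empty file")
```

Printing the header names first, as in the usage part, is the safer way to find out which column holds the counts before calling top_entries.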