main.py: script to train both the LDA and word2vec models
main.ipynb: Jupyter Notebook containing all the analyses reported in the paper
pos.ipynb: clustering based on part-of-speech frequencies
corpora: contains the original data for Czech, English, and Dutch poetry in JSON (the proprietary German and Russian corpora are not included)
    {   <= each item in the following lists corresponds to a particular poem and holds:
        'words': []     <= list of lemmata found in the poem
        'pos_tags': []  <= their POS tags (Positional Morphological Tags for Czech, MyStem for Russian, TreeTagger tagsets for the other corpora)
        'meters': [[]]  <= list of meters found in the poem
        'years': []     <= year the poem was published (for English, the year the author was born)
        'n_words': []   <= number of words
        'n_lines': []   <= number of lines
        'authors': []   <= author of the poem
        'titles': []    <= title of the poem
        'schemes': []   <= line-ending schemes
    }
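Because every key in a corpus file holds a list with one entry per poem, the lists can be read in parallel. The following is a minimal sketch, assuming the top-level JSON object is a dict with exactly the keys listed above. The helper name `iter_poems`, the path mentioned in the comment, and all toy values are illustrative and not part of the repository.

```python
# Keys from the corpus description; each is assumed to map to a list
# holding one entry per poem, in the same order.
CORPUS_KEYS = ["words", "pos_tags", "meters", "years", "n_words",
               "n_lines", "authors", "titles", "schemes"]


def iter_poems(corpus):
    """Yield one record (dict) per poem from a dict of parallel lists."""
    n_poems = len(corpus["titles"])
    for key in CORPUS_KEYS:
        if len(corpus[key]) != n_poems:
            raise ValueError(
                f"{key!r} has {len(corpus[key])} items, expected {n_poems}"
            )
    for i in range(n_poems):
        yield {key: corpus[key][i] for key in CORPUS_KEYS}


# Toy stand-in with invented values. With a real file one would first do
# something like json.load(open("corpora/<file>.json", encoding="utf-8"))
# (hypothetical path; the actual file names are not listed here).
toy_corpus = {
    "words": [["rose", "night"], ["sea", "wind", "sail"]],
    "pos_tags": [["TAG1", "TAG1"], ["TAG1", "TAG1", "TAG1"]],  # placeholder tags
    "meters": [["iamb"], ["trochee"]],
    "years": [1850, 1900],
    "n_words": [2, 3],
    "n_lines": [1, 1],
    "authors": ["Author A", "Author B"],
    "titles": ["Poem A", "Poem B"],
    "schemes": ["placeholder", "placeholder"],  # real scheme format not shown here
}

poems = list(iter_poems(toy_corpus))
print(poems[1]["titles"], poems[1]["words"])
```

The length check is a design choice: it fails early if the lists are not aligned, which would otherwise silently pair a poem with the wrong metadata.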
dicts: contains Gensim dictionary files for all 5 corpora
fig: contains all resulting figures
json > metadata: contains all metadata on the poems in each corpus
    {   <= each item in the following lists corresponds to a particular poem and holds:
        'meters': [[]]  <= list of meters found in the poem
        'years': []     <= year the poem was published (for English, the year the author was born)
        'n_words': []   <= number of words
        'n_lines': []   <= number of lines
        'authors': []   <= author of the poem
        'titles': []    <= title of the poem
    }
json > topics: contains the topic probabilities for each poem
    [   <= each item corresponds to a particular poem and comprises a 100-dimensional dict
        { 'topic title': its probability in the poem }
    ]
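As an illustration of how the per-poem topic dicts might be used, here is a minimal sketch that ranks the topics of each poem by probability. It assumes the file content is a list of `{topic title: probability}` dicts as described above. The function name `top_topics`, the topic titles, and the probabilities are invented for the example, and the toy dicts have three topics rather than 100.

```python
def top_topics(poem_topics, k=3):
    """Return the k most probable (topic title, probability) pairs for one poem."""
    ranked = sorted(poem_topics.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy stand-in: two poems with invented topic titles and probabilities.
toy_topics = [
    {"topic_a": 0.7, "topic_b": 0.2, "topic_c": 0.1},
    {"topic_a": 0.05, "topic_b": 0.15, "topic_c": 0.8},
]

# Dominant topic of each poem, in corpus order.
dominant = [top_topics(poem, k=1)[0][0] for poem in toy_topics]
print(dominant)
```

Since the topic list is described as following poem order, the index of an entry in `dominant` can presumably be matched against the same index in the metadata lists.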
json > pos: contains the relative POS frequencies for each poem
    [   <= each item corresponds to a particular poem
        { 'POS': its frequency }
    ]
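For frequency-based analyses such as clustering, the per-poem POS dicts usually need to be aligned into rows of equal length. The sketch below shows one way to do that, under the assumption that a tag missing from a poem's dict means a frequency of zero; this is not stated in the description above. The function name `pos_matrix` and the tag labels are hypothetical, and this is not necessarily how pos.ipynb prepares its input.

```python
def pos_matrix(pos_freqs):
    """Align a list of {'POS': frequency} dicts into (tags, rows).

    tags is the sorted list of all tags seen in any poem; rows holds one
    list of frequencies per poem, in the column order given by tags.
    """
    tags = sorted({tag for poem in pos_freqs for tag in poem})
    # Assumption: a tag absent from a poem's dict has frequency 0.0.
    rows = [[poem.get(tag, 0.0) for tag in tags] for poem in pos_freqs]
    return tags, rows


# Toy stand-in with invented tag labels and frequencies.
toy_pos = [
    {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2},
    {"NOUN": 0.6, "VERB": 0.4},
]

tags, rows = pos_matrix(toy_pos)
print(tags)
print(rows)
```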
json > w2v: contains a mapping from lemmata to their neighbours in the word2vec models
models: contains the pretrained LDA and word2vec models (Gensim)
regression: contains the data and code to produce S5_table and S6_fig