1.1. Working with a Corpus

Corpus. A selection of texts and attendant metadata.

Selection implies intention: the texts in a corpus are selected for some purpose, which informs our analytic approach.

It is crucial to have some way to associate individual text files with their corresponding metadata. This might be via a URI, an ID, or some other mechanism. Sometimes you can even put metadata (e.g. publication date) into the filename of the text file.
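As a minimal sketch of the filename approach, suppose (purely as an assumption for illustration) that each text file is named `<doc_id>_<year>.txt`. The function name `metadata_from_filename` and the field names are hypothetical; any convention works as long as it is applied consistently.

```python
from pathlib import Path


def metadata_from_filename(path):
    """Recover metadata from a hypothetical '<doc_id>_<year>.txt' filename."""
    stem = Path(path).stem  # filename without directory or extension
    # Split on the LAST underscore, so the ID itself may contain underscores.
    doc_id, year = stem.rsplit("_", 1)
    return {"doc_id": doc_id, "year": int(year)}


record = metadata_from_filename("texts/smith-embryology_1997.txt")
print(record)
```

For anything beyond one or two fields, a separate metadata table keyed by ID is usually easier to maintain than long filenames.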

Word count. The number of tokens (occurrences) of a given word in a text.

Document count. The number of documents in which a word occurs.

Word frequency. The proportion of observed tokens that are a particular word. In other words, the word count divided by the total number of tokens (e.g. in a single document).
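The three quantities above (word count, document count, word frequency) can be sketched in a few lines. The tokenizer here is a deliberately naive assumption (lowercase, split on whitespace), and the function names are illustrative rather than part of any particular library.

```python
from collections import Counter


def tokenize(text):
    # Naive assumption: lowercase the text and split on whitespace.
    return text.lower().split()


def word_counts(text):
    """Word count: number of tokens of each word in one text."""
    return Counter(tokenize(text))


def document_counts(documents):
    """Document count: number of documents in which each word occurs."""
    counts = Counter()
    for text in documents:
        # A set ensures each word is counted at most once per document.
        counts.update(set(tokenize(text)))
    return counts


def word_frequencies(text):
    """Word frequency: word count divided by the total number of tokens."""
    counts = word_counts(text)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


docs = ["the embryo and the cell", "the cell divides"]
print(word_counts(docs[0])["the"])       # 2 tokens of "the" in the first text
print(document_counts(docs)["embryo"])   # "embryo" occurs in 1 document
print(word_frequencies(docs[0])["the"])  # 2 of 5 tokens
```

Note that the word count and the document count answer different questions: a word can have a high count in one text and still occur in very few documents.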

Conditional frequency. The distribution of word frequencies conditional on some context, e.g. the frequency of words over time. We usually write this as f(w|c), the frequency of word w within context c.

Word probability. The probability of a word, which we will usually write p(w), is the probability that we will encounter a token of word w when we retrieve a single token from the corpus. In most cases we will assume that word frequency is the best estimator of the probability of a word.

Conditional probability. Like the conditional frequency, this is the probability of encountering a particular word w given that we are sampling from a specific context c. We write this p(w|c). For example, we might want to know the probability of encountering the word "embryo" in texts published in 1997. We would write this as p(w="embryo"|c=1997). Most of the time, the best estimator of p(w|c) is f(w|c).
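To make the estimator concrete, the sketch below computes f(w|c) by pooling all tokens that share a context and dividing each word count by the pooled total, then reads off the result as an estimate of p(w|c). The toy corpus, the whitespace tokenizer, and the name `conditional_frequencies` are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def conditional_frequencies(documents):
    """Compute f(w|c) from an iterable of (context, text) pairs."""
    counts = defaultdict(Counter)
    for context, text in documents:
        # Pool tokens from every text that shares the same context.
        counts[context].update(text.lower().split())
    freqs = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        freqs[context] = {word: n / total for word, n in counter.items()}
    return freqs


# Toy corpus: the context c is the publication year.
corpus = [
    (1997, "the embryo divides"),
    (1997, "the embryo grows"),
    (1998, "the cell"),
]
f = conditional_frequencies(corpus)

# Estimate of p(w="embryo"|c=1997): 2 of the 6 tokens from 1997.
print(f[1997]["embryo"])
# A word never observed in a context gets an estimate of zero.
print(f[1998].get("embryo", 0.0))
```

The zero for 1998 shows one limit of using raw frequency as the estimator: a word that was never observed in a context is assigned probability zero, which may or may not be acceptable depending on the analysis.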