1.0 Reading Texts
You can view the IPython notebook for this module on GitHub.
String. A sequence of characters. A character can be a letter or a digit, but also punctuation, a whitespace character (space, tab, newline), or some other symbol.
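To make the definition concrete, here is a minimal Python sketch (the text is an arbitrary example, not taken from the notebook):

```python
# A string is a sequence of characters: letters, digits,
# punctuation, whitespace, and anything else.
text = "Call me Ishmael. Some years ago, never mind how long,\nI went to sea."

print(len(text))      # number of characters, including spaces and the newline
print(text[0])        # 'C': individual characters are accessed by index
print(text.split())   # splitting on whitespace gives a rough list of tokens
```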
Normalization. Transforming tokens into a standardized representation.
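One common normalization is case-folding. A minimal sketch (the token list is made up for illustration):

```python
tokens = ["The", "U.S.A.", "Flag", "FLAG"]

# Map every token to lowercase so that "Flag" and "FLAG"
# are treated as the same word type.
normalized = [t.lower() for t in tokens]
print(normalized)   # ['the', 'u.s.a.', 'flag', 'flag']
```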
Stemming. Removing affixes from tokens so that only the "root" or "stem" word remains.
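The notebook may use a different stemmer; a minimal sketch with NLTK's Porter stemmer (the word list is made up):

```python
from nltk.stem import PorterStemmer   # requires the nltk package

stemmer = PorterStemmer()
for word in ["running", "ran", "runs", "studies", "studying"]:
    # The stemmer strips affixes; the resulting stem need not be a real word.
    print(word, "->", stemmer.stem(word))
```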
Lemmatization. Converting a token into its lexicon headword. The IPython notebooks demonstrate a lemmatizer based on the WordNet lexicon.
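A minimal sketch using NLTK's WordNetLemmatizer (assuming that is the WordNet-based lemmatizer the notebooks demonstrate):

```python
from nltk.stem import WordNetLemmatizer  # requires nltk plus the WordNet data
                                         # (nltk.download('wordnet') on first use)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("geese"))             # looked up as a noun by default
print(lemmatizer.lemmatize("running", pos="v"))  # part-of-speech hint: verb
```

Unlike a stem, the lemma is always a real headword from the lexicon.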
Filtering. Removing unwanted tokens, for example (a combined sketch follows this list):
- Punctuation;
- Numbers;
- Extremely (in)frequent words;
- "Junk" words; e.g. pronouns.
Stoplist. A list of words that should be removed/ignored.
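A stoplist can be as simple as a hand-written set of words; a minimal sketch using the English stoplist shipped with NLTK (whether the notebook uses this particular list is an assumption):

```python
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

stoplist = set(stopwords.words("english"))
tokens = ["i", "saw", "the", "whale", "and", "it", "saw", "me"]

# Keep only tokens that are not on the stoplist.
content_words = [t for t in tokens if t not in stoplist]
print(content_words)   # e.g. ['saw', 'whale', 'saw']
```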