1.0 Reading Texts
You can view the IPython notebook for this module on GitHub.
String. A sequence of characters. A character can be a letter or a digit, but also punctuation, a whitespace character (space, tab, newline), or some other symbol.
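To make the definition concrete, here is a minimal Python sketch (the text is an arbitrary example, not taken from the notebook):

```python
# A string is a sequence of characters: letters, digits,
# punctuation, whitespace, and anything else.
text = "Call me Ishmael. Some years ago, never mind how long,\nI went to sea."

print(len(text))      # number of characters, including spaces and the newline
print(text[0])        # 'C': individual characters are accessed by index
print(text.split())   # splitting on whitespace gives a rough list of tokens
```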
Normalization. Transforming tokens into a standardized representation.
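One common normalization is case-folding. A minimal sketch (the token list is made up for illustration):

```python
tokens = ["The", "U.S.A.", "Flag", "FLAG"]

# Map every token to lowercase so that "Flag" and "FLAG"
# are treated as the same word type.
normalized = [t.lower() for t in tokens]
print(normalized)   # ['the', 'u.s.a.', 'flag', 'flag']
```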
Stemming. Removing affixes from tokens so that only the "root" or "stem" word remains.
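The notebook may use a different stemmer; a minimal sketch with NLTK's Porter stemmer (the word list is made up):

```python
from nltk.stem import PorterStemmer   # requires the nltk package

stemmer = PorterStemmer()
for word in ["running", "ran", "runs", "studies", "studying"]:
    # The stemmer strips affixes; the resulting stem need not be a real word.
    print(word, "->", stemmer.stem(word))
```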
Lemmatization. Converting a token into its lexicon headword. The IPython notebooks demonstrate a lemmatizer based on the WordNet lexicon.
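A minimal sketch using NLTK's WordNetLemmatizer (assuming that is the WordNet-based lemmatizer the notebooks demonstrate):

```python
from nltk.stem import WordNetLemmatizer  # requires nltk plus the WordNet data
                                         # (nltk.download('wordnet') on first use)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("geese"))             # looked up as a noun by default
print(lemmatizer.lemmatize("running", pos="v"))  # part-of-speech hint: verb
```

Unlike a stem, the lemma is always a real headword from the lexicon.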
Filtering. Removing unwanted tokens, for example (a combined sketch follows this list):
- Punctuation;
- Numbers;
- Extremely (in)frequent words;
- "Junk" words; e.g. pronouns.
Stoplist. A list of words that should be removed/ignored.
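A stoplist can be as simple as a hand-written set of words; a minimal sketch using the English stoplist shipped with NLTK (whether the notebook uses this particular list is an assumption):

```python
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

stoplist = set(stopwords.words("english"))
tokens = ["i", "saw", "the", "whale", "and", "it", "saw", "me"]

# Keep only tokens that are not on the stoplist.
content_words = [t for t in tokens if t not in stoplist]
print(content_words)   # e.g. ['saw', 'whale', 'saw']
```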