1.0 Reading Texts

You can view the IPython notebook for this module on GitHub.


String. A sequence of characters. A character can be a letter or a digit, but also a punctuation mark, a whitespace character (space, tab, or newline), or another, less common symbol.
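
A brief illustration in plain Python; the example text is made up, not taken from the module's notebook:

```python
# A string is a sequence of characters: letters, digits, punctuation,
# and whitespace (including the newline "\n") all count.
text = "Call me Ishmael.\nIt was 1851!"
print(len(text))        # number of characters, spaces and newline included
print(list(text[:7]))   # ['C', 'a', 'l', 'l', ' ', 'm', 'e']
```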

Normalization. Transforming tokens into a standardized representation.
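
One common normalization step is lowercasing, so that "The" and "the" map to the same form; the tokens below are only an illustration:

```python
# Normalize by lowercasing: "The", "THE", and "the" become one type.
tokens = ["The", "cat", "sat", "on", "THE", "mat"]
normalized = [token.lower() for token in tokens]
print(normalized)   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```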

Stemming. Removing affixes from tokens so that only the "root" or "stem" word remains.
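
A minimal sketch using NLTK's PorterStemmer (an assumption here; the module's notebook may use a different stemmer). Note that a stem need not be a real word:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "happily", "cats"]:
    print(word, "->", stemmer.stem(word))
# running -> run, flies -> fli, happily -> happili, cats -> cat
```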

Lemmatization. Converting a token into its lexicon headword. In the IPython notebooks, a lemmatizer based on the WordNet lexicon is demonstrated.
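
A minimal sketch with NLTK's WordNetLemmatizer, one WordNet-based lemmatizer (whether it is the exact one shown in the notebook is an assumption; the WordNet data must be downloaded first, e.g. with nltk.download("wordnet")):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))              # mouse
print(lemmatizer.lemmatize("running", pos="v"))  # run (the part of speech helps)
```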

Filtering. Removing unwanted tokens, such as the following (see the sketch after this list).

  • Punctuation;
  • Numbers;
  • Extremely (in)frequent words;
  • "Junk" words; e.g. pronouns.

Stoplist. A list of words that should be removed/ignored.
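
A sketch using NLTK's English stopword list (an assumption; any word list would do, and the stopwords corpus must be downloaded first, e.g. with nltk.download("stopwords")):

```python
from nltk.corpus import stopwords

stoplist = set(stopwords.words("english"))
tokens = ["the", "cat", "sat", "on", "the", "mat"]
kept = [t for t in tokens if t not in stoplist]
print(kept)   # ['cat', 'sat', 'mat']
```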