1.0 Reading Texts
You can view the IPython notebook for this module on GitHub.
String. A sequence of characters. A character can be a letter or a digit, but also a punctuation mark, a whitespace character (space, tab, newline), or any other symbol.
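As a small illustration of this definition, the sketch below treats a Python string as a sequence and sorts its characters into the categories named above. The sample text and variable names are made up for the example.

```python
# A string is a sequence of characters; whitespace and punctuation count too.
text = "Chapter 1.\tLoomings\n"

# Sort every character into one category using built-in str methods.
letters = [c for c in text if c.isalpha()]
digits = [c for c in text if c.isdigit()]
whitespace = [c for c in text if c.isspace()]
other = [c for c in text if not (c.isalnum() or c.isspace())]

# The four categories together account for every character in the string.
accounted_for = len(letters) + len(digits) + len(whitespace) + len(other)
```

Here the tab and the newline are characters in their own right, which is why they have to be handled explicitly when a text is split into tokens.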
Normalization. Transforming tokens into a standardized representation.
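What counts as a "standardized representation" depends on the project. The sketch below shows one possible convention, assuming that case, surrounding whitespace, and Unicode compatibility forms should not distinguish tokens. The function name `normalize` is hypothetical.

```python
import unicodedata

def normalize(token):
    """Return a standardized form of a token (one possible convention)."""
    # NFKC folds compatibility characters, e.g. the ligature U+FB01 becomes "fi".
    token = unicodedata.normalize("NFKC", token)
    # casefold() is a more aggressive lower() intended for caseless matching.
    return token.casefold().strip()

tokens = ["The", "THE", "the ", "\ufb01sh"]
normalized = [normalize(t) for t in tokens]
```

After normalization the first three tokens are identical, so they would be counted as the same word.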
Stemming. Removing affixes from tokens so that only the "root" or "stem" word remains.
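To make the idea concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm or any other published stemmer; the suffix list and the `min_stem` threshold are assumptions chosen for the example.

```python
# Hypothetical suffix list, checked in this order; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

stems = [crude_stem(t) for t in ["walking", "walked", "walks", "walk", "sing"]]
```

The `min_stem` guard keeps "sing" from being reduced to "s". In practice an established implementation, such as the `PorterStemmer` class in NLTK, would be a better choice than hand-written rules.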
Lemmatization. Converting a token into its headword (lemma) in a lexicon. The IPython notebooks demonstrate a lemmatizer based on the WordNet lexicon.
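The sketch below illustrates lemmatization as a lexicon lookup. It does not reproduce the WordNet-based lemmatizer from the notebooks: `LEXICON` is a hypothetical four-entry stand-in for a real lexicon, and the part-of-speech codes are an assumption of the example.

```python
# Hypothetical miniature lexicon mapping (word form, part of speech) to a headword.
# A real lemmatizer would consult a full lexicon such as WordNet instead.
LEXICON = {
    ("mice", "n"): "mouse",
    ("geese", "n"): "goose",
    ("went", "v"): "go",
    ("better", "a"): "good",
}

def lemmatize(token, pos="n"):
    """Return the headword for token, or token itself if it is not in the lexicon."""
    return LEXICON.get((token.lower(), pos), token)

lemmas = [
    lemmatize("mice"),
    lemmatize("went", pos="v"),
    lemmatize("better", pos="a"),
    lemmatize("table"),
]
```

Unlike suffix stripping, a lookup can relate irregular forms such as "mice" and "mouse", but the result depends on knowing the part of speech and on the coverage of the lexicon.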
Filtering. Removing unwanted tokens, such as:
- Punctuation;
- Numbers;
- Extremely frequent or extremely infrequent words;
- "Junk" words, e.g. pronouns.
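The filtering criteria listed above can be sketched as a single pass over a token list. The function name, the thresholds `min_count` and `max_fraction`, and the sample tokens are all assumptions for illustration; suitable cut-offs depend on the corpus.

```python
import string
from collections import Counter

def filter_tokens(tokens, min_count=2, max_fraction=0.5, junk=frozenset()):
    """Drop punctuation, numbers, very rare or very common tokens, and junk words."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for token in tokens:
        if all(ch in string.punctuation for ch in token):
            continue  # punctuation only (also drops empty tokens)
        if token.isdigit():
            continue  # numbers
        if counts[token] < min_count:
            continue  # extremely infrequent
        if counts[token] / total > max_fraction:
            continue  # extremely frequent
        if token in junk:
            continue  # caller-supplied "junk" words
        kept.append(token)
    return kept

tokens = ["the", "whale", ",", "the", "ship", ",", "the", "whale",
          "in", "1851", "he", "he", "the", "sea", "the", "ship"]
kept = filter_tokens(tokens, min_count=2, max_fraction=0.25, junk={"he"})
```

With these settings only "whale" and "ship" survive: "the" is too frequent, "in" and "sea" are too rare, and "he" is listed as junk.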
Stoplist. A list of words that should be removed/ignored.
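A stoplist can be applied with a simple membership test. The words in `STOPLIST` below are a hypothetical sample, and published stoplists are usually much longer; matching case-insensitively is also an assumption of this sketch.

```python
# Hypothetical stoplist for illustration only.
STOPLIST = frozenset({"a", "an", "and", "in", "of", "the", "to"})

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop tokens found in the stoplist, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in stoplist]

content = remove_stopwords(["The", "history", "of", "the", "whale"])
```

Storing the stoplist as a set rather than a list keeps each lookup fast, which matters when the same test runs over every token in a corpus.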