The Leuphana Intensive: Notes and Resources

Methods in Digital and Computational Humanities

May 11–18, 2016 | Leuphana University | Lüneburg, Germany

Erick Peirson — erick.peirson@asu.edu  | https://asu.academia.edu/ErickPeirson | https://github.com/erickpeirson

IPython notebooks and sample datasets: https://github.com/diging/methods

SageMath: http://cloud.sagemath.com

Modules

  • 0.0. Metadata
  • 1.0. Reading Texts
  • 1.1. Working with a Corpus
  • 1.2. Change and difference
    • 1.2.1. Linear model with OLS
    • 1.2.2. Linear model - Bayesian approach
    • 1.2.3. Comparing word use between years
    • 1.2.4. Comparing word use between corpora
  • 1.3. Feature selection
    • 1.3.1. Keywords
    • 1.3.2. N-grams
    • 1.3.3. Entities
  • 1.4. Collocations
  • 1.6. Latent Semantic Analysis
  • 1.7. Topic modeling
    • 1.7.1. Latent Dirichlet Allocation
    • 1.7.2. Hierarchical Dirichlet Process
    • 1.7.3. LDA with Tethne and MALLET
  • 1.8. Skipgram Model
  • 2.0. Co-author graphs
  • 2.1. Co-citation graphs

By and large, what has made the very notion of computational humanities thinkable is the digitization of texts. This includes the conversion of print texts into digital form as well as the digital production of texts—the vast majority of texts produced in the 21st century are “born digital,” even if they are subsequently circulated in print form.

This course is an entry point into interrogating digital texts. By design, it focuses mostly on method and goes light on theory. We make the minimal assumption that texts are emissions of historical and cultural processes. How we theorize the relationship between texts and those underlying processes depends very much on the discipline in which we work. My hope is that as we grapple with these analytic methods, we can reflect on how to incorporate them into our own disciplinary frameworks. Translating qualitative theories from humanities disciplines into quantitative theories and models remains a major problem area for digital humanities.

In this course we will focus on ways of identifying theoretically relevant features in (or around) texts using computational methods, and on using those features as the basis for quantitative analysis. Given the limited time available, you will not leave the course an expert in any of these methods. Instead, my goal is to give you a thorough enough introduction to a range of methods that, as you develop your research project, you will know roughly where to start.

Programming Required

It is extremely challenging to design a course that introduces truly useful computational methods while avoiding advanced programming techniques. The fact of the matter is that rigorous computational analysis of texts requires some degree of computer programming. Luckily, some programming languages are abstract enough (i.e., close enough to the semantics of everyday language) that the determined scholar can get up and running fairly quickly. Interpreted languages like Python, Ruby, and R are fairly easy to learn, and each has a rich ecosystem of packages and extensions for quantitative analysis, including text analysis.
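To give a sense of how little code a first pass can take, here is a minimal sketch that uses only Python's standard library to count the most frequent words in a plain-text file. The filename my_text.txt is just a placeholder; point it at any text you have on hand.

    from collections import Counter
    import re

    # Read a plain-text file and normalize to lowercase.
    with open("my_text.txt", encoding="utf-8") as f:
        text = f.read().lower()

    # Crude tokenization: any run of alphabetic characters counts as a word.
    tokens = re.findall(r"[a-z]+", text)

    # Print the ten most frequent words with their counts.
    for word, count in Counter(tokens).most_common(10):
        print(word, count)

This is obviously not a rigorous analysis, but it illustrates how quickly you can move from a raw text file to a quantitative summary.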

My strategy in designing this course was therefore to provide an introduction to quantitative text analysis in Python, without requiring you to know any Python ahead of time. To do this, I have created a series of “IPython Notebooks” — these are interactive notebooks that run in your web browser, pre-loaded with blocks of code surrounded by expository text. You can run the analyses in these notebooks without changing more than one or two lines of code, using text collections that I will provide. If you want to use your own text collections, or tweak the methods, you can alter the code to your heart’s content.
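The notebooks themselves live in the GitHub repository above. Just to illustrate the "change one or two lines" pattern, a hypothetical cell might look like the sketch below, which uses NLTK's PlaintextCorpusReader to load a folder of plain-text files; the only line you would typically edit is the path to your own text collection.

    # Hypothetical example of a notebook cell -- not taken verbatim from the course notebooks.
    from nltk.corpus import PlaintextCorpusReader

    # Change this path to point at your own folder of .txt files.
    corpus_root = "/path/to/your/texts"

    # Treat every .txt file in that folder as one document.
    corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")

    # Everything below can stay as-is: count documents and word tokens.
    print("Documents:", len(corpus.fileids()))
    print("Word tokens:", len(corpus.words()))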

Here are a couple of good places to start learning Python:

Format

This course will mix short lectures with hands-on coding exercises. The course is divided into bite-size "modules". For each module, we will start with 15-20 minutes of lecture on the core concepts of the module. We will then (usually) work through some code samples together, with further exposition on Python coding techniques. You will then have time to play with the code, either by tinkering with the parameters or applying it to other datasets.

We will use the SageMath cloud platform for the computational exercises. This provides a standardized, stable environment for running code samples.

Cool Papers

Tanya E. Clement. 2008. "'A thing not beginning and not ending': using digital tools to distant-read Gertrude Stein's The Making of Americans." Literary and Linguistic Computing 23(3): 361–381. https://llc.oxfordjournals.org/content/23/3/361.full