
Accession

The original dataset was generated from digitized annual reports from the Marine Biological Laboratory (available here). Project researchers manually transcribed information from course attendee lists in those annual reports into loosely structured spreadsheets (available here). Erick then aggregated those spreadsheets, performed rudimentary normalization and matching against people, institutions, locations, and courses, and ingested the semi-cleaned data into a relational database (schema below). That database now sits behind the current curation platform and API.

Schema

One of the main concerns in modeling this dataset was that the original data were only loosely structured, and no attempt had been made to disambiguate person or institution names against a stable authority system. The primary objective of the data model was therefore to represent that ambiguity adequately while also making it possible to progressively disambiguate records in the future.
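To make this concrete, the following is a minimal sketch (in Python, using entirely hypothetical names; the actual schema, tables, and API fields may differ) of how a transcribed person record might carry candidate authority matches alongside a validation state, so that disambiguation can proceed progressively without discarding the original transcription.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ValidationState(Enum):
    # Hypothetical states; the platform's actual validation vocabulary may differ.
    UNVALIDATED = "unvalidated"   # raw transcription, subject to change or deletion
    CANDIDATE = "candidate"       # tentatively matched, awaiting curator review
    VALIDATED = "validated"       # confirmed by a curator against an authority record

@dataclass
class AuthorityMatch:
    # A possible link from a transcribed name to an external authority record.
    authority_uri: str            # identifier in some authority system (illustrative)
    confidence: float             # heuristic match score between 0.0 and 1.0

@dataclass
class PersonRecord:
    # A person as transcribed from an annual report, with ambiguity preserved.
    transcribed_name: str
    validation_state: ValidationState = ValidationState.UNVALIDATED
    candidate_matches: list[AuthorityMatch] = field(default_factory=list)
    resolved_match: Optional[AuthorityMatch] = None  # set only once a curator confirms a match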

Caution!

Because of the ambiguities in the original dataset, great care should be exercised in interpreting the data provided by this platform. Unless explicitly indicated, all records should be treated as non-validated and subject to change or deletion. Details about validation and disambiguation can be found below.

A second concern was how to model course series. The dataset primarily describes individual course events (e.g. who attended, and in what capacity), while many of the anticipated use cases conceptualize courses as recurrent events or series. This series-instance duality is modeled directly in the data and should be borne in mind when developing on top of the API.
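As an illustration of that duality, here is a minimal sketch (again in Python, with hypothetical names not drawn from the actual schema or API) of the distinction between a recurring course series and a single offering of it, along with the kind of series-level query clients typically need.

from dataclasses import dataclass, field

@dataclass
class CourseSeries:
    # A recurring course offered across many years.
    series_id: str
    name: str

@dataclass
class CourseInstance:
    # A single offering of a course in a specific year, with its own attendee list.
    instance_id: str
    series_id: str                        # link back to the recurring series
    year: int
    attendee_ids: list[str] = field(default_factory=list)

def instances_for_series(series_id: str, instances: list[CourseInstance]) -> list[CourseInstance]:
    # Gather every offering of a series: the typical series-level view of the data.
    return [ci for ci in instances if ci.series_id == series_id]

Note that attendance is recorded at the level of individual course events; series-level views are aggregations over those instances.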

