Data History & Architecture

Accession

The original dataset was generated from digitized annual reports from the Marine Biological Laboratory (available here). Project researchers manually transcribed information from course attendee lists in those annual reports into loosely structured spreadsheets (available here). Erick then aggregated those spreadsheets, performed some rudimentary normalization and matching against people, institutions, locations, and courses, and ingested the semi-cleaned data into a relational database (schema below). That database now sits behind the current curation platform and API.

Schema

One of the main concerns in modeling this dataset is that the original data were only loosely structured, and no attempt had been made to disambiguate person or institution names against a stable authority system. Thus the primary objective of the data model was to adequately represent that ambiguity while also making it possible to progressively disambiguate records in the future. 

Caution!

Because of the ambiguities in the original dataset, great care should be exercised in interpreting the data provided by this platform. Unless explicitly indicated, all records should be treated as non-validated and subject to change or deletion. Details about validation and disambiguation can be found below.

A second concern was how to model course series. The dataset primarily describes individual course events (e.g. who were the attendees, in what capacity). At the same time, many of the anticipated use-cases conceptualize courses as recurrent events or series. This series-instance duality is directly modeled in the data, and should be borne in mind when developing on top of the API.

Core Data

The following classes comprise the core of the MBL course and investigator dataset. This part of the schema captures the central concepts of the historical data.

Person

Person records represent people who attended courses at the MBL. Unless explicitly indicated via an authority relation to a KnownPerson record, such records are ambiguous: we do not know which actual person the record refers to, and the record should be treated accordingly.

KnownPerson

A KnownPerson record represents an entry in the Conceptpower authority service that can be used as a reliable identifier for the referent of a Person record.
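The relationship between Person and KnownPerson can be illustrated with a minimal sketch. This is plain Python rather than the platform's actual model code; the field names `authority` and `authority_uri`, the helper `is_ambiguous`, and the sample names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownPerson:
    # Hypothetical field: an identifier issued by the authority service.
    authority_uri: str


@dataclass
class Person:
    name: str
    # Hypothetical name for the authority relation. None means the record
    # has not been disambiguated against the authority service.
    authority: Optional[KnownPerson] = None

    def is_ambiguous(self) -> bool:
        """A Person with no authority relation is treated as ambiguous."""
        return self.authority is None


# Made-up records, for illustration only.
unresolved = Person(name="J. Smith")
resolved = Person(name="Jane Smith", authority=KnownPerson("example:person/1"))
```

Under these assumptions, two Person records that point at the same KnownPerson can be treated as the same individual, while records without an authority relation cannot safely be grouped by name alone.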

CourseGroup

A course group represents a series or set of related courses, often spanning long durations of time. For example, the Embryology course group has been held annually since the late 19th century. Course records (below) are associated with CourseGroup records by way of PartOf records.

Course

A course record represents a single course event that occurred in a particular year. A course usually (but not necessarily) has the same name as the course group to which it belongs, plus the year in which it occurred. Persons are associated directly with course records by way of Attendance records. Course records are associated with CourseGroup records (above) by way of PartOf records.
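The series-instance duality can be sketched as follows. This is an illustrative plain-Python model, not the platform's schema: the field names, the `role` attribute, the helper `attendees_of_group`, and all sample data (years, names) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CourseGroup:
    name: str


@dataclass(frozen=True)
class Course:
    name: str
    year: int


@dataclass(frozen=True)
class PartOf:
    # Links a single course event to the series it belongs to.
    course: Course
    course_group: CourseGroup


@dataclass(frozen=True)
class Attendance:
    # Attendance points at the course event, never at the series.
    person_name: str
    course: Course
    role: str  # hypothetical field for the capacity in which someone attended


def attendees_of_group(group, part_ofs, attendances):
    """Return sorted (year, person_name) pairs for every course in a group."""
    courses = {p.course for p in part_ofs if p.course_group == group}
    return sorted(
        (a.course.year, a.person_name) for a in attendances if a.course in courses
    )


# Made-up sample data.
embryology = CourseGroup("Embryology")
physiology = CourseGroup("Physiology")
e1920 = Course("Embryology 1920", 1920)
e1921 = Course("Embryology 1921", 1921)
p1920 = Course("Physiology 1920", 1920)
part_ofs = [PartOf(e1920, embryology), PartOf(e1921, embryology), PartOf(p1920, physiology)]
attendances = [
    Attendance("Person B", e1921, "instructor"),
    Attendance("Person A", e1920, "student"),
    Attendance("Person C", p1920, "student"),
]
result = attendees_of_group(embryology, part_ofs, attendances)
```

The point of the sketch is that a question about a series ("who attended Embryology?") always has to be answered by going through PartOf to the individual course events first.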

Location

A Location record can represent any geospatial concept, such as an address, city, region, or country. 

Localization

A Localization record indicates that a Person was associated with a particular Location in a given year (e.g. they listed that location as their address).

KnownLocation

A KnownLocation record represents an entry in an authority service or controlled vocabulary (such as OpenStreetMap or GeoNames) that can be used as a reliable identifier for the referent of a Location record. This may include a Geo ID, which provides a specific geographic point.

TODO

This model needs some work.

Institution

An Institution record represents a legal body with which a course attendee or investigator was affiliated. Unless explicitly indicated via an authority relation to a KnownInstitution record, such records are ambiguous: we are not aware of the actual institution to which this record refers, and the record should be treated accordingly.

KnownInstitution

A KnownInstitution record represents an entry in the Conceptpower authority service that can be used as a reliable identifier for the referent of an Institution record.

Affiliation

An Affiliation record indicates that a Person was associated with an Institution in a particular year. It also includes information about the nature of that affiliation.

Meaning of "validated" field

All of the core data models, including the relational fields, include a validated field. This field should be used to indicate that a curator has reviewed a record for accuracy and completeness. If the value of validated is null or False, then the record should be interpreted with caution.
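A client consuming these records might apply the rule above roughly as follows. This sketch assumes records arrive as dictionaries with a `validated` key; the actual response shape of the API is not described here, and the helper name `is_validated` and the sample records are hypothetical.

```python
def is_validated(record: dict) -> bool:
    """Treat a record as validated only when the flag is exactly True.

    A missing key, None (null), and False are all treated as non-validated.
    """
    return record.get("validated") is True


# Made-up records covering each case.
records = [
    {"name": "A", "validated": True},
    {"name": "B", "validated": False},
    {"name": "C", "validated": None},
    {"name": "D"},
]
trusted = [r["name"] for r in records if is_validated(r)]
needs_caution = [r["name"] for r in records if not is_validated(r)]
```

Comparing with `is True` rather than relying on truthiness keeps null and missing values on the cautious side.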

Curatorial Data

Changes to the core data are recorded explicitly in the database in two ways: a Historical* class for each of the core models, and specific event models that record larger-scale changes.

Historical* Records

Each of the core data models has a corresponding Historical* shadow model, provided by the Django Simple History application. See the docs for that application for details. In short, every time fields are updated on a model instance, a new historical record is created that captures the change. This allows us to view the state of a model instance at any point in its change history.
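The shadow-record idea can be illustrated in plain Python. This is not the Django Simple History API and not how that application is implemented; the class `TrackedRecord` and its methods are hypothetical, and the sketch assumes updates are applied in chronological order.

```python
from datetime import datetime


class TrackedRecord:
    """Keeps a snapshot of its fields every time they change."""

    def __init__(self, created: datetime, **fields):
        self._fields = dict(fields)
        # Each history entry is (timestamp, copy of the fields at that time).
        self._history = [(created, dict(fields))]

    def update(self, when: datetime, **changes):
        """Apply changes and append a new snapshot to the history."""
        self._fields.update(changes)
        self._history.append((when, dict(self._fields)))

    def as_of(self, when: datetime):
        """Return the latest snapshot at or before `when`, or None."""
        snapshot = None
        for timestamp, state in self._history:
            if timestamp <= when:
                snapshot = state
        return snapshot


# Made-up example: a name is corrected some time after creation.
record = TrackedRecord(datetime(2016, 1, 1), name="Embriology")
record.update(datetime(2016, 6, 1), name="Embryology")
before_fix = record.as_of(datetime(2016, 3, 1))
after_fix = record.as_of(datetime(2016, 12, 1))
```

Because every snapshot is retained, an earlier state can be inspected without undoing later edits.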

Event Records

Given the ambiguity of the original data, curators may merge or split entity records (people, institutions, etc.), and also reassign relations (attendances, localizations, affiliations, etc.). These actions are recorded by MergeEvent, SplitEvent, and AlterRelationEvent records.
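A merge might be modeled along the following lines. This is a sketch under stated assumptions, not the platform's implementation: the fields on `MergeEvent`, the `hidden` flag, the function `merge_people`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    id: int
    name: str
    hidden: bool = False  # hypothetical flag: merged-away records are hidden


@dataclass
class Attendance:
    person_id: int
    course: str


@dataclass
class MergeEvent:
    # Hypothetical fields: which records were merged, and into which target.
    source_ids: List[int]
    target_id: int


def merge_people(sources, target, attendances, log):
    """Repoint relations at the target, hide the sources, and log the event."""
    source_ids = {s.id for s in sources}
    for attendance in attendances:
        if attendance.person_id in source_ids:
            attendance.person_id = target.id
    for source in sources:
        source.hidden = True
    event = MergeEvent(source_ids=sorted(source_ids), target_id=target.id)
    log.append(event)
    return event


# Made-up example: two records judged to refer to the same person.
p1 = Person(1, "J. Smith")
p2 = Person(2, "Jane Smith")
attendances = [Attendance(1, "Embryology 1920"), Attendance(2, "Embryology 1921")]
event_log = []
merge_people([p1], p2, attendances, event_log)
```

Logging the source and target identifiers is what would make it possible, in principle, to trace a merged record back to its original components.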

Persistence

Despite some cautionary language in the user interface, data are never actually deleted from the underlying database; rather, they are hidden. That does not mean, however, that restoring accidentally "deleted" data is trivial. Caution should be exercised when curating data.

