System Architecture and Technical Summary

Pipeline, architecture, and software


  • Django 2.2.20 (Python framework)
  • Python 3.10
  • PostgreSQL 10 (relational database)
  • Elasticsearch 6.6 (index)
  • Nginx, Gunicorn (web server)
  • Celery/Redis (task queueing)
  • Ubuntu 22.04 (operating system)
  • Front-end: JavaScript, MapLibre GL, Leaflet, jQuery, Bootstrap 4, D3
  • The following description is for the original WHG system. MEHDIE adds some functionality, described at the end of this list.
  • WHG has two data stores: a relational database (db) and a high-speed index (idx).
  • Interfaces to this data include a graphical web application (GUI) and APIs.
  • Contributed data in Linked Places or LP-TSV format is uploaded by registered users to the database (-> db) using GUI screens.
  • Once uploaded, datasets are managed in a set of GUI screens, where they can be browsed and reconciled against Wikidata and Getty TGN.
  • Reconciliation entails initiating a task managed by Celery/Redis and reviewing prospective matches returned.
  • Confirming matches to Wikidata and/or TGN augments the contributed dataset by adding new place_link and, if desired, place_geom records. NOTE: The original contribution can always be retrieved in its original state, i.e. without the records generated by the reconciliation review step.
  • Once an uploaded dataset has been reconciled and as many place_link records as possible have been generated for it, it can be accessioned to the WHG index (idx <- db). That step will be initiated by WHG staff, but results will be reviewed by the dataset owner and designated collaborators.
  • Accessioning to the WHG index is another reconciliation process, so there are again two steps: initiating the task and reviewing results. In this case, however, only some results need review. Incoming records that share a link to an external gazetteer (e.g. tgn, geonames, wikidata) with a record already in our index are queued separately and can be added automatically, associating each with that match and any other similarly linked "siblings."
  • Incoming records that don't share one or more links to existing index items become new "seed" records in the index, referred to internally as "parents."
  • MEHDIE allows matching two datasets to each other. This is accomplished using a set of Google Cloud Run services.
  • The graph visualization is accomplished using the D3RDF framework.
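
The accessioning rule described above (incoming records that share an external gazetteer link with an indexed record are attached automatically, while the rest become new "parents") can be illustrated with a minimal Python sketch. This is not the WHG implementation: the IndexRecord class, the accession function, and the link identifiers are hypothetical, and the real index lives in Elasticsearch rather than an in-memory list. The sketch also assumes that a parent created earlier in the same batch can be matched by later incoming records, which the description above does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRecord:
    """Hypothetical stand-in for a parent document in the index (not the WHG schema)."""
    place_id: int
    links: set[str]
    children: list[int] = field(default_factory=list)


def accession(incoming, index):
    """Split incoming records into auto-added siblings and new parents.

    incoming: list of (place_id, set of external link identifiers)
    index:    list of IndexRecord parents; mutated in place
    Returns (auto_added, new_parents) as lists of place_ids.
    """
    auto_added, new_parents = [], []
    for place_id, links in incoming:
        # Look for the first parent sharing at least one external link.
        match = next((rec for rec in index if rec.links & links), None)
        if match is not None:
            # Shared link: attach as a child (sibling of any other children).
            match.children.append(place_id)
            auto_added.append(place_id)
        else:
            # No shared link: the record seeds a new parent in the index.
            index.append(IndexRecord(place_id, set(links)))
            new_parents.append(place_id)
    return auto_added, new_parents


# Placeholder identifiers, for illustration only.
index = [IndexRecord(1, {"tgn:100", "wd:Q1"})]
incoming = [(101, {"wd:Q1"}), (102, {"gn:200"}), (103, {"gn:200"})]
auto_added, new_parents = accession(incoming, index)
print(auto_added, new_parents)
```

In this sketch only the records that end up in new_parents would need human review; the auto-added ones correspond to the separately queued records mentioned in the list.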