Abstract

This presentation is a personal reflection on a US National Archives project, Founders Online, which enabled 180,000 documents from six early Presidential collections to be transcribed and annotated. Topics discussed will include: some history; a lingua franca of markup; how the markup was done; a system that understands markup; and some lessons learnt.

(I prefer markup for noun forms and mark up for verb forms.)

Some history

I’m just a suburban boy who was born and grew up here in Perth. I got started with document markup while doing research which eventually produced a doctoral dissertation titled The Ancient Witnesses of the Epistle to the Hebrews. This involved transcribing and marking up about thirty New Testament manuscripts. The language was Greek; the transcriptions were made from photographs, microfilms, plates in books, and sometimes the manuscripts themselves; the markup system was one invented by Peter Robinson, another Australian, while he was at Oxford. Here’s an example of a manuscript image and the corresponding markup used for my dissertation:

New Testament Papyrus P46

<ch 1><v 1> {inscription: PROS EBRAIOUS} |f 21r|[c4]|p MA|[/c4]|l 2| POLUMERWS KAI POLUTROPWS
PALAI O [ns]QS[/ns] LALHSAS TOIS PATRASIN [c1][st]HMWN[/st][/c1] EN
TOIS PROFHTAISÉ <ch 1><v 2> EP ESCATOU TWN HME=
RWN TOUTWN ELALHSEN HMEIN EN
U[di]I[/di]WÉ ON EQHKEN KLHRONOMON PANT[fn]W[/fn]
DI OU EPOIHSEN TOUS AIWNASÉ <ch 1><v 3> OS WN
APAUGASMA THS DOXHS KAI CARA=
KTHR THS UPOSTASEWS AUTOU FERWN TE
TA PANTA TW RHMATI THS DUNAMEWS
DI AUTOU KAQARISMON TWN AMARTIWN
POIHSAMENOS EKAQISEN EN DEXIA THS
MEGALWSUNHS EN UYHLOIS <ch 1><v 4> TOSOUTWN
KRITTWN GENOMENOS AGGELWN OS=
W DIAFORWTERON PAR AUTOUS KEKLH=
RONOMHKEN ONOMA <ch 1><v 5> TINI GAR EIPEN
POTE TWN AGGELWNÉ [ns]UIS[/ns] MOU EI SU
EGW SHMERON GEGENNHKA SE KAI [ut]P[/ut]ALIN
[rt]E[/rt][ut]G[/ut]W ESOMAI AUTW EIS PATERA KAI AU=
[rt]TOS E[/rt]STAI MOI EIS [ns]UN[/ns]É <ch 1><v 6> OTA[ut]N[/ut] DE P[rt]A[/rt]LIN
[rt]AGAG[/rt][ut]H[/ut] TON PRW[ut]T[/ut]OTOKON EIS TH[ut]N[/ut] OIKOU=
[rt]MENHN[/rt] LEGEI K[rt]A[/rt]I PROSKUNHSATWSAN
[rt]AUTW PANTE[/rt][ut]S[/ut] A[ut]G[/ut][rt]G[/rt]ELOI [ns][ut]QU[/ut][/ns]É <ch 1><v 7> [ut]K[/ut]AI PR[ut]O[/ut]S MEN
[rt]TOUS AGGELOUS LEGEI O POIWN TOUS AGGE[/rt]=
[rt]LOUS AUTOU [ns]PNATA[/ns] KAI TOUS LEITOURGOUS AUTOU[/rt]

There’s nothing new about markup. People have been marking up for as long as they’ve been writing. If you look at the manuscript image you can see προσ εβραιουσ marked with pairs of lines to show it’s a title. Lines also mark sacred names and a suspended final nu. There is a correction (ημων) as well. The ancient scribes also did annotation: examples here include a page number (MA = 41) and stichometric count (possibly to calculate payment due to the copyist).

A lingua franca of markup

About the time I was marking up manuscripts using Peter Robinson’s system, others were busy developing the Text Encoding Initiative, or TEI for short. This system used something called the Standard Generalized Markup Language (SGML) and came with an extensive set of guidelines that specified a set of tags and how they should be used. Here’s a quote from the introduction to the 1994 edition of the Guidelines for Electronic Text Encoding and Interchange, edited by Michael Sperberg-McQueen and Lou Burnard:

The Guidelines apply to texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content. They treat both continuous materials (“running text”) and discontinuous materials such as dictionaries and linguistic corpora. Though principally directed to the needs of the scholarly research community, the Guidelines are not restricted to esoteric academic applications. They should also be useful for librarians who maintain and document electronic materials, as well as for publishers and others creating or distributing electronic texts. Although they focus on problems of representing in electronic form texts which already exist in traditional media, these Guidelines should also be useful for the creation of electronic texts.

While not as famous as another application of SGML that came out about the same time, namely HTML, the TEI (now defined in XML) has become the lingua franca of documentary markup. If you have ever wondered how to encode things so that others will be able to use them, the answer is simple: use the TEI, Luke.

How the markup was done

After finishing the PhD, I got the chance to work with an American friend on a project to digitise fifty years’ worth of fifty theological journals. The project used TEI (of course), sending the journals to India to be double-keyed (i.e. typed in twice by different individuals). As you know, markup is more than transcription: typing in the text is only the first step; after that, whatever you want to mark up (e.g. paragraphs, titles, names, dates, references) has to be tagged. For example (not proper TEI):

<para>Julius Henry "Groucho" Marx (<date type="from">October 2, 1890</date> – <date type="to">August 19, 1977</date>) was an American comedian, writer, stage, film, radio, and television star. He was known as a master of quick wit and is widely considered one of America's greatest and most gifted comedians.<note><title>Billboard Magazine</title> <date type="publication">May 4, 1974</date> <page>pg 35</page>: <quote>Groucho Marx was the best comedian this country ever produced</quote> – <attribution>Woody Allen</attribution></note></para>
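
For contrast, here is a sketch of how similar content might be tagged in actual TEI. The element names come from the TEI P5 Guidelines, but the particular attribute choices are only one option among several, not a definitive encoding:

```xml
<p>Julius Henry "Groucho" Marx
  (<date when="1890-10-02">October 2, 1890</date> –
   <date when="1977-08-19">August 19, 1977</date>)
  was an American comedian, writer, stage, film, radio,
  and television star.
  <note>
    <bibl><title level="j">Billboard Magazine</title>,
      <date when="1974-05-04">May 4, 1974</date>,
      <biblScope unit="page">35</biblScope></bibl>:
    <quote>Groucho Marx was the best comedian this country
      ever produced</quote> – Woody Allen
  </note>
</p>
```

Note how the machine-readable values (the when attributes) sit alongside the original human-readable text, so nothing is lost from the source while everything becomes searchable.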

I think that it was Lou Burnard who said that markup is interpretation. Often there is little scope for variance – two people skilled in the art will always produce the same markup of a source text. However, that is not always the case. When there is ambiguity, whether in the source (e.g. through wear and tear) or encoding instructions (e.g. the Guidelines suggest more than one way to do it), markup becomes subjective. The best you can do in such circumstances is to choose what you think is the best markup and then to be consistent when handling similar cases.

Starting in 2005, I was fortunate enough to work on digital publications at Rotunda, which is the digital imprint of the University of Virginia Press. The first job was to realise John Bryant’s vision for a digital edition of Melville’s Typee.

Melville’s Typee

(I’ve got an alternative reading here. What do you think?)

Next was a digital edition of the Journal of Emily Shore.

Journal of Emily Shore, January 1839

Journal of Emily Shore, June 1839 (final entry, aged nineteen)

What a prodigy. What a loss.

Then we started on a really big job: the digital edition of the Papers of George Washington. The print edition has 69 volumes – everything from or to him that the editors have managed to find. Editorial notes on the characters and events mentioned in each piece of correspondence are an extremely valuable bonus.

The Papers of George Washington (printed edition)

A set of volumes was handed to a company that sent it to India to be double-keyed then marked up in TEI XML. I had nothing to do with the markup specification, which says exactly how to mark up every feature encountered in the printed volumes – titles, names, page numbers, dates, notes, cross-references, and a thousand other things. That hard work was done by my boss, David Sewell, who got it right.

A system that understands markup

One great thing about working at UVa Press was access to XML database software called MarkLogic. Transcriptions marked up with TEI are fed into the database, and an XML-aware language called XQuery is used to write web applications for serving up the content in whichever way you please.

(: Build the main search page: a div holding the faceted-search
   controls alongside the formatted results. :)
declare function v:search(
  $map as map:map  (: request parameters gathered into a map :)
) as item()* {
  <div id="main" role="main" class="clearfix">{
    v:search-widgets($map),  (: the facet controls :)
    v:search-content($map)   (: the matching documents :)
  }</div>
};

An advantage of using TEI is that tools built for one corpus will work just as well for another provided the markup is consistent across corpora. As a result, we were able to use the same system to publish digital editions of The Adams Papers (39 vols), The Papers of Alexander Hamilton (27 vols), The Papers of James Madison (39 vols), and The Papers of Thomas Jefferson (55 vols).

These editions are only available by subscription.

Founders Online

There was a desire to make the founding era correspondence freely available, so in 2010 the US National Archives entered into an agreement with the University of Virginia Press to produce Founders Online.

Founders Online

I was the main programmer, though David Sewell made many major contributions. The look and feel was developed by the Ivy Group. The result is a web interface to a database that currently (2018) holds 180,000 documents from six founding era collections: the papers of George Washington, John Adams, Thomas Jefferson, Benjamin Franklin, Alexander Hamilton, and James Madison.

A particular strength is its faceted search, which lets users limit keyword queries by author, recipient, date range, and other constraints. Facets are made possible by metadata attached to each document: if XML elements across all documents specify authors, recipients, from dates, to dates, and so on, then faceted searches based on those fields are possible.
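
As a hedged sketch (not the production code), an author facet in MarkLogic XQuery might be computed along these lines. It assumes an element range index on a hypothetical <author> element; the element and search terms here are illustrative:

```xquery
xquery version "1.0-ml";

(: Count matching documents per author value, most frequent
   first, to populate the "author" facet in the sidebar. :)
let $query := cts:word-query("whiskey insurrection")
for $author in cts:element-values(
  xs:QName("author"), (), ("frequency-order"), $query)
return
  <facet-value count="{cts:frequency($author)}">{$author}</facet-value>
```

Because the values come from a range index rather than from scanning documents, facet counts stay fast even across the whole corpus.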

By mid-2012 we had a working system ready for release. Or so we thought.

The National Archives were wary because they had just been through a bad experience: when the 1940 US census data was released, unanticipated interest overloaded their servers. We were therefore given an extension of time to come up with a strategy that would enable the platform to cope with very large loads.

At the outset, queries were coming back in less than a second, which was our original design goal. However, that was when only one person was using the system. We needed to maintain that level of performance when thousands of people were hitting the site with different queries at the same time. David and I came up with a couple of strategies that worked very well in the end. Both involved caching: one in the MarkLogic layer and the other in the Nginx web server layer, so that repeated queries could be answered without recomputing anything.
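
To sketch the Nginx-layer half (the directives are real Nginx ones, but the paths, zone name, cache lifetimes, and backend port are illustrative, not our actual configuration):

```nginx
# Cache rendered responses on disk so repeated queries
# never reach the XQuery application layer.
proxy_cache_path /var/cache/nginx levels=1:2
                 keys_zone=founders:10m max_size=1g inactive=60m;

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8003;  # app server port (illustrative)
        proxy_cache founders;
        proxy_cache_key $request_uri;      # same query => same cached page
        proxy_cache_valid 200 10m;         # keep good responses for ten minutes
    }
}
```

The key observation that makes this work is that the corpus changes rarely, so a page rendered for one user can safely be served to the next.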

We needed to test the system and hired IBM to do so. They spun up a lot of query generators at a lot of different locations to simulate the kinds and dynamics of loads that might be expected. Happily, the system managed to cope, continuing to deliver subsecond performance with thousands of concurrent, diverse queries.

The system finally went live in mid-2013. It survived the initial load and continues to work well, with the occasional peak happening when someone famous or a popular web site posts a link to something in Founders Online.

A load peak (July 2018)

Lessons learnt

They say that the second system is the most dangerous system you will ever design. (This, and other gems of wisdom, can be found in The Mythical Man-Month by Frederick P. Brooks Jr.) I’m glad that doesn’t apply to Founders Online, though we did have the luxury of working from the ground up on the new design. I was already a fan of TEI XML, XQuery, and MarkLogic but became a fan of HTML5 and jQuery libraries while writing the viewing code. Caching is a great strategy for handling massive loads but can only be used in a system whose content does not change all the time. It’s great to have enlightened bosses, good tools, time, and creative freedom to make a thing of beauty.

A few examples