Archivists and record managers are gravitating to the Encoded Archival Description (EAD) for encoding finding aids. Having information encoded in a common format across organisations has many benefits of its own, but another significant advantage of storing information in an SGML(/XML) format is the relative ease with which it can be exchanged, extracted, searched and formatted for viewing.
Last year I was involved in the design and implementation of the AustLit project, working on site in the ADFA library, which has a significant special collection of material related to Australian Literature. One of the contributors to AustLit was the Lu Rees Archives at the University of Canberra, which maintains Australia's largest collection of material associated with children's literature.
Both ADFA Special Collections and Lu Rees were keen to convert their material to a sustainable long-term format. In ADFA Specials' case, the finding aids were being maintained as Word documents. Lu Rees's material was stored in a FileMaker Pro database, with finding aids produced on demand by cutting and pasting relevant information from this database into Word documents.
Whilst the Lu Rees material catalogued in UC's library was incorporated into AustLit, most of the archival material did not fit into the data structures planned for AustLit (as AustLit was not designed to be a manuscripts or special materials archive). So it was incumbent on the AustLit project team to provide a migration path for the Lu Rees data, and EAD seemed the obvious choice.
The Lu Rees Archives had previously maintained their collection data in a FileMaker Pro database running on a stand-alone Macintosh. Data was stored in 32 separate tables, one per type of material (Short Stories, Slides, Transcripts of Talks, Artwork, Awards, Autobiographical Notes, Obituaries and so on), each containing a few specific fields. A master Author table listed all authors, with a unique identifier that was used to refer to a specific author in the other tables.
The first step in the conversion was to extract the data from the FileMaker Pro database as easily as possible, and the path of least resistance was to export the data to a tab-delimited format.
Because the volume of exported data was not large (about 2MB), the obvious approach was to write a Java program to read all 32 exported files into memory and, for each author, use the unique author identifier to build a data structure of all the material related to that author across the 32 tables.
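In outline, the grouping step looked something like the following sketch. It assumes the author identifier is the first field of each exported row and that the exports live in a directory called "exports"; both are illustrative assumptions, not the real layout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch: read each tab-delimited export and collect every row under
// its author identifier (assumed here to be the first field).
public class AuthorGrouper {
    public static void main(String[] args) throws IOException {
        Map<String, List<String[]>> byAuthor = new HashMap<>();
        try (DirectoryStream<Path> exports =
                Files.newDirectoryStream(Paths.get("exports"), "*.txt")) {
            for (Path table : exports) {
                try (BufferedReader in = Files.newBufferedReader(table)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] fields = line.split("\t", -1);
                        // fields[0] is taken to be the author identifier
                        byAuthor.computeIfAbsent(fields[0], k -> new ArrayList<>())
                                .add(fields);
                    }
                }
            }
        }
        System.out.println(byAuthor.size() + " authors found");
    }
}
```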
Once this structure was built, the program performed a series of transformations on the data for each author.
Then, for each author, an EAD finding aid was generated by modifying a template file containing a standard but skeletal "Lu Rees" finding aid, with sections such as:
```xml
...
<eadheader audience="internal" langencoding="iso 639-2" findaidstatus="unverified-full-draft">
  <eadid type="SGML catalog">PUBLIC "-//Australian Defence Force Academy Library//TEXT
    (AU::ADFA::**ID**::**LONGTITLE**)//EN" "**FILESTUB**.sgml"</eadid>
  <filedesc>
    <titlestmt>
      <titleproper>Guide to the **LONGTITLE**</titleproper>
      <author>Prepared by Special Collections.</author>
    </titlestmt>
...
<archdesc level="collection" langmaterial="eng">
  <did>
    <head>Summary</head>
    <origination label="Creator"><persname>**TITLE**</persname></origination>
    <unittitle label="Title">**LONGTITLE**</unittitle>
    <unitdate label="Date Range">**DATERANGE**</unitdate>
    <unitid label="Reference Number">**ID**</unitid>
    <physdesc label="Extent">**EXTENT**</physdesc>
    <repository label="Repository">
      <corpname>Australian Defence Force Academy Library</corpname>
    </repository>
  </did>
  <scopecontent>
    <head>Scope and Content</head>
    <p>**NOTE**</p>
    <organization>
      <head>Organization</head>
      <p>This collection has not yet been arranged into series.</p>
    </organization>
  </scopecontent>
```
The "**xx**" values where substituted with the relevant information for the author, and finally the series/item etc information generated above was appended to the template and the EAD document closed.
The ADFA Special Collections group had started manually converting their finding aids from Word to EAD, but it quickly became apparent that this was a very large, labour-intensive and tedious task.
A chance remark I made to Dr Marie-Louise Ayres (who was the manager of both the AustLit project and ADFA Special Collections) about automatic conversion tools from Microsoft Word format to XHTML format triggered a mini-project to convert approximately 300 finding aids to EAD format.
The first step was to get the documents from Word into a format more amenable to programmatic manipulation, and since all the formatting in the AustLit project was based on XSLT operations on XML data, XML seemed like a good choice of markup syntax.
There are many tools available for converting Word to XML (including recent versions of Word itself!). We looked at Logictran's "r2net" converter and the MajiX converter. Both worked well and are extremely configurable, but "r2net" did a slightly better job "out of the box", and we had the 300 Word files converted to RTF and then to XHTML within an hour or two.
The ADFA Specials finding aids had been constructed as a set of tables, each consisting of five columns, optionally preceded by a heading describing the type of content in the table. Here's an example:
| BOX NO. | WALLET | CONS NO. | ACCESS | DESCRIPTION |
|---|---|---|---|---|
| 1 | 1 | 1 | CLSD | ... |
| 1 | 2 | 1 | CLSD | Blaiklock Lecture, 2/8/88 S.U. |
| 1 | 3 | 1 | CLSD | ... |
| 1 | 4 | 1 | CLSD | *Hottest Night of the Year* manuscript. |
Logictran's r2net created table/row/cell HTML elements and even preserved the "numbered list" nature of the contents of the DESCRIPTION cell in the first and third rows in this example, which was a great help in subsequent processing. It is also worth noting that one of the titles referenced ("Hottest Night of the Year") was made identifiable by being rendered in HTML italics, while another, "The Day of the Mothers", was entered into the Word document using Word's "smart quotes" (which I haven't shown here). Both representations facilitated the identification of titles in subsequent processing. However, the use of italics and "smart quotes" was not completely consistent, meaning that some titles could not be automatically identified.
Having the files in XHTML allowed us to apply an XSLT stylesheet to "break apart" the tables into XML structures which started to creep towards EAD semantics. The content of each row was formatted into elements named box, wallet, cons, access and descr.
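The stylesheet itself is not reproduced here, but for context, applying such a stylesheet programmatically is a few lines with Java's standard javax.xml.transform API. The file names in this sketch are placeholders.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch: apply the "break apart the tables" stylesheet to one
// converted finding aid.
public class ApplyStylesheet {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("tables-to-ead.xsl"));
        t.transform(new StreamSource("findingaid.html"),
                    new StreamResult("findingaid-step1.xml"));
    }
}
```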
The next two steps were done with simple Java programs.
First, the output from the XSLT transformation was processed to convert text enclosed in Word "smart quotes" into "titleRef" elements, on the assumption that the text between the smart quotes was indeed a title (as it almost always was).
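A minimal sketch of that step, assuming the curly-quote characters (U+201C and U+201D) survived the Word conversion, and operating on the serialised document for brevity:

```java
import java.util.regex.Pattern;

// Sketch: treat text between Word "smart quotes" as a title and wrap
// it in a titleRef element.
public class SmartQuoteTitles {
    private static final Pattern QUOTED =
            Pattern.compile("\u201C([^\u201C\u201D]+)\u201D");

    static String markTitles(String text) {
        return QUOTED.matcher(text).replaceAll("<titleRef>$1</titleRef>");
    }

    public static void main(String[] args) {
        System.out.println(markTitles(
                "\u201CThe Day of the Mothers\u201D typescript."));
        // prints: <titleRef>The Day of the Mothers</titleRef> typescript.
    }
}
```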
Then the XML document output by the preceding step was read and, using Java's Document Object Model (DOM) interface, element contents were examined to determine (well, guess at) a date range for the objects in the collection: four-digit numbers starting with 18 or 19 were assumed to be years. Punctuation was regularised, certain phrases were translated, stock phrases were inserted, and tags such as <br> were translated to their EAD equivalents. Finally, following the same procedure as used in generating the Lu Rees finding aids, a skeletal EAD header was read and merged with appropriate content extracted and derived from the finding aid, before being written to the final output file followed by the body of the finding aid.
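A sketch of the date-range guess, using the same heuristic (four-digit numbers starting with 18 or 19 are taken to be years); the input file name is a placeholder:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Sketch: scan the text of every element and record the earliest and
// latest four-digit "year" matching 18xx or 19xx. Nested text is
// visited more than once, which is harmless for a min/max scan.
public class DateRangeGuesser {
    private static final Pattern YEAR = Pattern.compile("\\b(18|19)\\d{2}\\b");

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("findingaid-step1.xml");
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Matcher m = YEAR.matcher(all.item(i).getTextContent());
            while (m.find()) {
                int year = Integer.parseInt(m.group());
                min = Math.min(min, year);
                max = Math.max(max, year);
            }
        }
        if (min <= max) System.out.println(min + " - " + max);
    }
}
```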
A major part of the AustLit project was data conversion: from spreadsheets, databases and local systems designed largely for printing and human consumption. The scale of this effort was greatly underestimated. On the surface the data looks reasonable: it is certainly human-readable, and "looks good" on the page. But each system contains subtle variations that do not detract at all from its readability by humans, yet wreak havoc with algorithms attempting to find patterns and consistency.
The EAD conversion effort provided "more of the same", but on a smaller scale (hundreds of finding aids rather than hundreds of thousands of records): material which looks great to the eye is often surprisingly intractable to an algorithm. Ambiguities that humans ignore create subtle problems for automated programs.
A simple example is quoted strings referencing things. To a human, a line such as:

> D.H. Lawrence's "Kangaroo" was written in 6 weeks.

clearly contains a reference to the title "Kangaroo", whereas with:

> Angus sailed home on the "Orient" later that year.

"Orient" probably refers to the name of a boat.
Similarly, there is only one (vague) date in this line:
> He lived at 1922 Bush Road until the late 1950's.
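In fact, the naive year heuristic sketched earlier happily finds two "years" in that sentence, one of which is a street number:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The 18xx/19xx heuristic matches both "1922" (a street number)
// and "1950" in this sentence.
public class AmbiguityDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\b(18|19)\\d{2}\\b")
                .matcher("He lived at 1922 Bush Road until the late 1950's.");
        while (m.find()) System.out.println(m.group()); // prints 1922, then 1950
    }
}
```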
There are many ways to produce visually indistinguishable output using Word. However, tools which convert Word documents to XML/XHTML will generate different markup for equivalent-looking output. Consequently, algorithms which attempt to infer semantics from the generated XML need to cope with many variations in the expected input. Because the viewable image of the document camouflages these differences, it is very difficult to estimate in advance the number of variations that will be encountered.
Some tips for producing more readily convertible documents:

- Be consistent. If entries are always recorded in "title place-of-publication: publisher, date" order, then a human will have little trouble if the colon is occasionally omitted or replaced with a semi-colon, or the order is slightly changed. But a program will!
This work would not have been possible without the expert EAD knowledge of Megan Williams (formerly of the ADFA Library, now with the National Library of Australia) and without the help of Marlene Meyers (formerly of the Lu Rees Archives, now with the National Archives of Australia).