by Lari Lampen

An old man looks back on his life, spent in the ultimately futile pursuit of knowledge. He was born, has lived and will soon die within an immense – perhaps infinite – library, a world made up of hexagonal arrays of bookshelves separated by tiny corridors. The most significant items found in this universal library, or library universe, are of course the books. The multitude of bookshelves holds an unfathomable number of them, filled mostly with incomprehensible sequences of letters that occasionally manage to spell a few words, much like the output of a million monkeys with typewriters. Librarians travel the endless corridors looking for one book, the catalogue of catalogues, which would reveal the locations of the meaningful books.

The setting is that of “The Library of Babel” (1941), arguably the most famous of the seminal short stories of Jorge Luis Borges, a parable on the difficulty of fishing for meaning in a virtually endless ocean of data. This library universe is practically devoid of meaning: while every possible book in every (alphabetic) language is included in it, the entirety of the library contributes nothing to anyone seeking useful information, simply because it is impossible to find anything.

The books in Borges’s vision of the universal library are not stored in any particular order; and while there are letters on the spine of each book, “these letters do not indicate or prefigure what the pages will say”. The crucial thing missing from this picture is not data, of which there is an abundance, but signposts, shelf labels, meaningful book titles or anything else describing the data contained in the endless profusion of books: in short, metadata.

The Max Planck Institute for Psycholinguistics hosts a substantial archive of language data, but it has its own unexplored corners, records that have only ever been accessed a handful of times, if even that. As the archive grows, finding relevant information becomes harder. Moreover, ours is but one of a number of repositories one needs to dig through when looking for data on, say, speakers of a particular language in a given area. Trawling through the different archives can be time-consuming and awkward, so it has been necessary to develop a method of sharing metadata between archives.

The mechanism by which this is achieved is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Proprietors of corpora become providers, serving metadata records over OAI-PMH; these records are then collected by harvesters to be processed further as required. The Language Archive (TLA) here at the MPI makes its records available as an OAI provider. In turn, we harvest metadata from around 60 other repositories of language data. These automated processes take place silently on servers, allowing end users to view information on harvested records alongside TLA-hosted records in a single tree structure.
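For readers curious about what a harvester actually does, here is a minimal sketch in Python. It is not the software running at the MPI, only an illustration under stated assumptions: the endpoint URL is a placeholder, the function names are invented for this example, and only the standard `ListRecords` request with the Dublin Core format (`oai_dc`) is shown. A provider may return a long list in several pages, handing out a resumption token that the harvester sends back to ask for the next page.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper for this sketch)."""
    if resumption_token:
        # A resumption token is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_data):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_data)
    identifiers = []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.find(OAI_NS + "identifier")
        if identifier is not None and identifier.text:
            identifiers.append(identifier.text.strip())
    token_el = root.find(OAI_NS + "ListRecords/" + OAI_NS + "resumptionToken")
    has_token = token_el is not None and token_el.text and token_el.text.strip()
    return identifiers, (token_el.text.strip() if has_token else None)


def fetch_url(url):
    """Download one response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, fetch=fetch_url):
    """Yield record identifiers, following resumption tokens page by page."""
    token = None
    while True:
        url = build_request_url(base_url, resumption_token=token)
        identifiers, token = parse_list_records(fetch(url))
        yield from identifiers
        if not token:
            break
```

Calling `harvest("https://archive.example.org/oai")` (a made-up address) would walk through every page the provider offers. A real harvester would also have to handle protocol errors, deleted records and the metadata payload itself; the sketch collects only the identifiers to keep the paging logic visible.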

At the moment, adoption of OAI-PMH is still in its infancy relative to the scale it is hoped to attain eventually, but the protocol is already helping to provide a uniform view into a number of language corpora, making it at least slightly easier to find what you want. It may not be the catalogue of catalogues, but it is a start.