I’m going to deviate slightly from my original intention, which was to blog about the plenary lectures only, to quickly write up a short synopsis of the lecture we had before yesterday afternoon’s workshop, which was “Markup and Why it Matters”, given by Lou Burnard. Lou Burnard pretty much wrote the book on TEI and his presence there summed up the attitude of the Digital Humanities @ Oxford Summer School to its speakers, which was: “Oh, you mean this world-renowned expert in his/her field? Oh yeah, we thought we’d ask them to pop by to talk to you”, which was a heady experience to say the least.
Mr Burnard (apologies, but I couldn’t find reference to an academic title, so henceforth I am going to refer to you as Lou) ran us through what markup is, and why it is important, by first asking the questions: what is a text? And, what is a document? And, what is XML markup and why is everyone using it?
The humanities, he said, are all about text: non-digital books, manuscripts and archival papers, as well as other, increasingly digital items – cultural manifestations such as sounds, images, blogs and tweets. The digital humanities are all about digital technologies and techniques for manipulating such manifestations in an integrated way. Markup (encoding or tagging) is one of the key components enabling such integration.
Texts are three-dimensional: a text has a physical presence with visual aspects which may be transferred more or less automatically from one physical instance to another (as when you look at your blog on a different computer, or take a picture of a book with your camera and take it home to read). A text also has linguistic and structural properties, which may be transcribed, translated, and transmitted with some human intervention. And it conveys information about the real world, which may be understood (or not), annotated, and even used to generate new texts. Good markup, Lou said, thus has to operate in all three dimensions.
Moving from there, he discussed the nature of an e-book as a surrogate for the appearance of a pre-existing non-digital document, a representation of that document’s linguistic and structural content, and a set of annotations explaining the context in which it was originally produced or the ideas it contains. Managing large numbers of such resources requires good descriptions, or metadata, which make intelligent, complex searching and analysis possible. This is obviously especially important now we’re in the realms of Big Data.
Increasingly we want to share and integrate (or mash up) these digital resources in new and unexpected ways at different levels of granularity, and for different applications.
When you digitise you have the thing itself, existing in the real world, and you have the thing itself in a digital form. What’s interesting is that once a digital document has been created you can do various analyses that produce new information and enrich the knowledge you have about the original artefact. That, fundamentally, is what markup is for: to enrich a document, complexify it; make it do more stuff. If this process is to be successful it needs human interaction, and we have to use the same conceptual model to do our markup and analysis.
The most appropriate model to use, says Lou, is TEI.
The TEI provides a well-established conceptual model of text to support conversion of existing data, creation of new datasets, and integration of existing datasets derived from a variety of sources. It’s open source. There’s a lovely little PowerPoint presentation here which sums it all up very succinctly. The fact of the matter is, the digital humanities have focused people’s minds on the nature of textual objects. This isn’t just for computer scientists or printers; it’s a question at the heart of the humanities: what are the important characteristics of a text?
Lou compared an original manuscript to a 19th century print version, and described how the use of font in each creates the illusion of importance. He warned us that if we rely solely on how print looks on the page we will be caught out.
All the images he showed us were the same text. They were about the same things, despite the fact that they don’t have the same words, spelling or layout: they make the same argument in the same language. But they are three different instantiations of the text. So, we need to ask ourselves – what is the essential part of a text? Is it the shape of letters and their layout, or the original form from which our digital copy derives? Or is it the stories we read into it, and the author’s intentions?
A document is something that exists in the world, which we can digitise. A text is an abstraction, created by or for a community of readers, which we can mark up – and by marking it up, we are carrying out an interpretative act.
Lou suggested that a text is more than a sequence of encoded glyphs or characters, or linguistic forms. It has a structure and communicative function, it has multiple possible readings and its meaning can be enriched by annotation. Markup makes these things explicit – and only that which is explicit can be reliably processed.
He went on to discuss the Babel effect, by which he meant that there are many possible readings of most texts. He used Moby Dick as an example. The chapter number and title, whilst being separate and in a different font, is not the same as the first word; it is not as important. Markup should represent that.
Essentially, we’re seeing in the markup a description of what the printer should do – it is procedural markup. It is telling us how to print our document. If you want to know whether Melville liked one-word titles, this will not help you. Markup is a way of making explicit what is important in a text, the distinctions identified. It is a way of identifying and naming the components. But you have to decide what’s important in the first place. This is where we have the (potentially quite controversial) interpretative act – for who is to say what the most important bit of a text is? And should we be separating form and content? Lou said this is a very old argument dating back to Aristotle, but in general descriptive encoding which focuses on the content rather than the form is preferred, if only because it makes it easier to reuse the same content for many purposes, and because content is somehow less disposable.
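To make the Melville question concrete, here is a minimal sketch of what descriptive encoding buys you. It assumes a heavily simplified, TEI-style fragment (no namespace, no header) that I have made up for illustration; the element names follow common TEI usage, but this is not the encoding Lou showed us. Because each chapter title is tagged as a title rather than as "centred bold text", a few lines of Python can pick the titles out and count the words in them. The function name `one_word_titles` is my own invention.

```python
import xml.etree.ElementTree as ET

# Simplified TEI-style fragment (assumption: no namespace or header).
# The tags say what each piece of text IS, not how it should be printed.
SAMPLE = """
<body>
  <div type="chapter" n="1">
    <head>Loomings</head>
    <p>Call me Ishmael.</p>
  </div>
  <div type="chapter" n="2">
    <head>The Carpet-Bag</head>
    <p>I stuffed a shirt or two into my old carpet-bag.</p>
  </div>
</body>
"""

def one_word_titles(xml_text):
    """Return the chapter titles that consist of a single word."""
    root = ET.fromstring(xml_text)
    # Select the <head> of every <div> explicitly marked as a chapter.
    heads = root.findall("./div[@type='chapter']/head")
    titles = [head.text.strip() for head in heads]
    return [title for title in titles if len(title.split()) == 1]

print(one_word_titles(SAMPLE))  # prints ['Loomings']
```

With purely procedural markup (font size, centring, bold) the same question would mean guessing which bits of big text happen to be titles.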
So what should we be tagging? The markup language should specify all the characters found and make explicit the structures.
XML is structured data represented as strings of text; it looks like HTML, except that it is extensible, it must be well-formed, and it can be validated. XML is application-, platform-, and vendor-independent; it empowers the content provider and facilitates data integration.
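As a rough illustration of "extensible" and "must be well-formed", the sketch below uses the parser in Python's standard library, which rejects anything that is not well-formed. Note the assumptions: the helper name `is_well_formed` and the sample strings are mine, and this checks well-formedness only. It does not validate against a schema, which needs a separate tool.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Extensible: tag names are ours to invent, as long as they nest properly.
print(is_well_formed("<whale><name>Moby Dick</name></whale>"))  # prints True

# HTML-style habits break the rules: <br> is never closed.
print(is_well_formed("<p>one<br>two</p>"))  # prints False

# Overlapping or unclosed tags are rejected too.
print(is_well_formed("<p>Hello <b>world</p>"))  # prints False
```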
Now for the real science. An XML document actually represents a tree composed of nodes. There is one root node, which contains all the others. XML is an international standard, and its characters must respect ISO 10646 (aka Unicode). A valid XML document is well-formed, and also conforms to some additional structural rules, which make up what we call a schema. TEI out of the box is designed to work with traditionally organised books and manuscripts.
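The tree idea is easier to see than to describe. In this small sketch (again with a made-up, simplified TEI-style fragment and a hypothetical helper called `outline`), the parser hands back the single root node, and walking down from it visits every other node at its depth in the tree.

```python
import xml.etree.ElementTree as ET

# Simplified fragment for illustration only (assumption: no namespace).
DOC = (
    "<text><body><div>"
    "<head>Loomings</head><p>Call me Ishmael.</p>"
    "</div></body></text>"
)

def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and all its descendants."""
    pairs = [(depth, element.tag)]
    for child in element:
        pairs.extend(outline(child, depth + 1))
    return pairs

root = ET.fromstring(DOC)  # the one root node that contains all the others
for depth, tag in outline(root):
    print("  " * depth + tag)
```

Run as written, this prints `text`, `body` and `div` at increasing indentation, followed by `head` and `p` side by side at the deepest level.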
So at this point in the proceedings we broke for lunch, and this was when the real panic began to set in, as I realised that I would actually be expected to convert a document into a searchable artefact. I wish I could say that my fears were dispelled and that I came away from the session determined to mark up all the digital images in the world: what actually happened was that I came away feeling a bit disillusioned. This was not the fault of the teachers themselves, who were informative and eager to help, but simply because I just couldn’t get my head around the actual coding itself. I didn’t understand what a <div> was, or why I was putting it in somewhere, nor indeed whether I should put my cursor inside a bracket or outside of it. I felt awkward, and because I felt awkward I didn’t just throw myself into the task at hand: I worried away and second-guessed myself. This is the problem with my academic career generally, and I must remind myself that I have nothing to fear but fear itself, etc etc.
Anyway, huge thanks to Lou for making it actually quite understandable to me in terms of theory – and I’ll keep beavering away at the practical!