Is this a 17th-century Twitter? Maybe. (Even before this scrap came to light, the promotional material for the play Brief Lives called Aubrey “the world’s oldest blogger.”) The scrap both does and doesn’t mirror a tweet, or a status update, or a Tumblr post, or anything on any social network. It has structural limits. It’s odd, jotted, and hasty. It is brimming with scribbled social information, meaningful only to those steeped in its world.
I’ve been rather slack in updating this blog recently, not least because I took two weeks off after my Oxford trip – couldn’t believe how exhausted I was after my week there! However, I will be catching up with myself over the course of this weekend and finishing off my overview of the lectures I caught over the course of that week. I hope you’re all well, and enjoying the summer now it’s finally arrived?
How can you tell if an ancient story is completely fictional or based on reality? One method, says a team of physicists, is to map out the social network of the characters and test whether it looks like a real social network. When they used that method for three ancient myths, they found that the characters have surprisingly realistic relationships.
Ancient stories are called myths for a reason. No one believes that Beowulf, the hero of the Anglo-Saxon epic, slew a talking monster named Grendel. Or that the Greek gods described in The Iliad actually appeared on Earth to intervene in the Trojan War. But historians and archaeologists agree that much of those ancient narratives was based on real people and events. The supernatural features of the narrative were then layered onto reality.
Ralph Kenna and Pádraig Mac Carron, physicists at Coventry University in the United Kingdom, wondered if reality leaves its mark on mythological narratives through the relationships between characters. So they built social network maps for three ancient texts. Along with Beowulf and The Iliad , they included an Irish epic, Táin Bó Cúailnge. The Irish epic’s origins are murky. Most scholars assume that it is completely fictional, but recent archaeological evidence suggests that it could be based in part on a real conflict in Ireland 3200 years ago.
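To make the idea of a character network a little more concrete, here is a minimal sketch in Python. It is not the physicists' actual method: the interaction list is toy data I have made up for illustration, and the choice of measures (how many connections a character has, and how tightly their acquaintances know one another) is an assumption about the kind of property one might compare against real social networks.

```python
from itertools import combinations

# Toy data, invented for illustration (not taken from the paper):
# each pair means "these two characters interact in the story".
interactions = [
    ("Beowulf", "Hrothgar"),
    ("Beowulf", "Wiglaf"),
    ("Beowulf", "Grendel"),
    ("Hrothgar", "Wealhtheow"),
    ("Beowulf", "Wealhtheow"),
]


def build_graph(edges):
    """Turn a list of pairs into an undirected graph: name -> set of neighbours."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def clustering(graph, node):
    """Fraction of a character's acquaintances who also know each other."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))


graph = build_graph(interactions)
for name in sorted(graph):
    print(name, len(graph[name]), round(clustering(graph, name), 2))
```

With real data the same two numbers would be computed for every character across the whole text, and it is the overall pattern, rather than any single value, that would be compared with networks of real people.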
The full paper is available online free for 30 days, though the site requires registration.
I’m going to deviate slightly from my original intention, which was to blog about the plenary lectures only, to quickly write up a short synopsis of the lecture we had before yesterday afternoon’s workshop, which was “Markup and Why it Matters”, given by Lou Burnard. Lou Burnard pretty much wrote the book on TEI and his presence there summed up the attitude of the Digital Humanities @ Oxford Summer School to its speakers, which was: “Oh, you mean this world-renowned expert in his/her field? Oh yeah, we thought we’d ask them to pop by to talk to you”, which was a heady experience to say the least.
Mr Burnard (apologies, but I couldn’t find reference to an academic title, so henceforth I am going to refer to you as Lou) ran us through what markup is, and why it is important, by first asking the questions: what is a text? And, what is a document? And, what is XML markup and why is everyone using it?
The humanities, he said, are all about text: non-digital books, manuscripts, archival papers, as well as other, increasingly digital items – cultural manifestations such as sounds, images, blogs and tweets. The digital humanities are all about digital technologies and techniques for manipulating such manifestations in an integrated way. Markup (encoding or tagging) is one of the key components enabling such integration.
Texts are three-dimensional: a text has a physical presence with visual aspects which may be transferred more or less automatically from one physical instance to another (as when one looks at one’s blog on a different computer, or takes a picture of a book on a camera and takes it home to read). A text also has linguistic and structural properties, which may be transcribed, translated, and transmitted with some human intervention. It conveys information about the real world, which may be understood (or not), annotated, and even used to generate new texts. Good markup, Lou said, thus has to operate in all three dimensions.
Moving from there, he discussed the nature of an e-book as a surrogate for the appearance of a pre-existing non-digital document, a representation of that document’s linguistic and structural content, and annotations explaining the context in which it was originally produced or the ideas it contains. Managing large numbers of such resources requires good descriptions, or metadata, which make possible intelligent, complex searching and analysis. This is obviously especially important now we’re in the realms of Big Data.
Increasingly we want to share and integrate (or mash up) these digital resources in new and unexpected ways at different levels of granularity, and for different applications.
When you digitise you have the thing itself, existing in the real world. You also have the thing itself, in a digital form. What’s interesting is that once a digital document has been created you can do various analyses that can produce new information and enrich the knowledge you have about the original artefact. That, fundamentally, is what markup is for: to enrich a document, complexify it; make it do more stuff. If this process is to be successful it needs human interaction, and we have to use the same conceptual modes to do our markup and analysis.
The most appropriate mode to use, says Lou, is TEI.
The TEI provides a well-established conceptual model of text to support conversion of existing data, creation of new datasets, integration of existing data sets derived from a variety of sources. It’s open source. There’s a lovely little PowerPoint presentation here which sums it all up very succinctly. The fact of the matter is, the digital humanities have focused people’s minds on the nature of textual objects. This isn’t just for computer scientists or printers; it’s a question at the heart of humanities: what are the important characteristics of a text?
Lou compared an original manuscript to a 19th century print version, and described how the use of font in each creates the illusion of importance. He warned us that if we rely solely on how print looks on the page we will be caught out.
All the images he showed us were the same text. They were about the same things, despite the fact that they don’t have the same words, spelling or layout: they make the same argument in the same language. But they are three different instantiations of the text. So, we need to ask ourselves – what is the essential part of a text? Is it the shape of letters and their layout, or the original form from which our digital copy derives? Or is it the stories we read into it, and the author’s intentions?
A document is something that exists in the world, which we can digitise. A text is an abstraction, created by or for a community of readers, which we can mark up – and by marking it up, we are carrying out an interpretative act.
Lou suggested that a text is more than a sequence of encoded glyphs or characters, or linguistic forms. It has a structure and communicative function, it has multiple possible readings and its meaning can be enriched by annotation. Markup makes these things explicit – and only that which is explicit can be reliably processed.
He went on to discuss the Babel effect – explaining that this meant there are many possible readings of most texts. He used Moby Dick as an example. The chapter number and title, whilst being separate and in a different font, is not the same as the first word; it is not as important. Markup should represent that.
Essentially, we’re seeing in the markup a description of what the printer should do – it is procedural markup. It is telling us how to print our document. If you want to know whether Melville liked one-word titles, this will not help you. Markup is a way of making explicit what is important in a text, the distinctions identified. It is a way of identifying and naming the components. But you have to decide what’s important in the first place. This is where we have the (potentially quite controversial) interpretative act – for who is to say what the most important bit of a text is? And should we be separating form and content? Lou said this is a very old argument dating back to Aristotle, but in general descriptive encoding which focuses on the content rather than the form is preferred, if only because it makes it easier to reuse the same content for many purposes, and because content is somehow less disposable.
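To see why descriptive encoding makes questions like the Melville one answerable, here is a minimal sketch in Python. The markup is a simplified, TEI-flavoured fragment that I have written for illustration (no namespace, not a complete or valid TEI document), and `one_word_titles` is a hypothetical helper name rather than part of any TEI tool.

```python
import xml.etree.ElementTree as ET

# Simplified, TEI-flavoured descriptive markup written for illustration.
# It says WHAT each piece is (a chapter, a heading, a paragraph),
# not how it should look when printed.
SAMPLE = """
<body>
  <div type="chapter" n="1">
    <head>Loomings</head>
    <p>Call me Ishmael.</p>
  </div>
  <div type="chapter" n="2">
    <head>The Carpet-Bag</head>
    <p>I stuffed a shirt or two into my old carpet-bag, tucked it under my arm, and started for Cape Horn and the Pacific.</p>
  </div>
</body>
"""


def one_word_titles(xml_text):
    """Return every chapter heading that consists of a single word."""
    root = ET.fromstring(xml_text.strip())
    titles = [head.text.strip() for head in root.iter("head")]
    return [title for title in titles if len(title.split()) == 1]


print(one_word_titles(SAMPLE))
```

Because the headings are tagged as headings, the question can be asked directly. Had the file only recorded "centred, larger font", there would be nothing reliable to search for.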
So what should we be tagging? The markup language should specify all the characters found and make explicit the structures.
XML is structured data represented as strings of text; it looks like HTML, except that it is extensible, it must be well-formed, and can be validated. XML is application, platform, and vendor-independent, and empowers the content provider and facilitates data integration.
Now for the real science. An XML doc actually represents a tree composed of nodes. There is one root node which contains all the others. XML is an international standard, and must respect ISO standard ISO 10646 (aka Unicode). A valid XML document is well-formed, and also conforms to some additional structural rules, which make up what we call a schema. TEI out of the box is designed to work with traditionally organised books and manuscripts.
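The tree idea and the well-formedness rule can be tried out with nothing more than Python's standard library. This is only a sketch: the element names are made up for illustration, and the standard-library parser checks well-formedness only, so validating against a schema would need a separate tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML (properly nested tags, a single root)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def describe_tree(xml_text):
    """Return the root node's name and the names of its direct children."""
    root = ET.fromstring(xml_text)
    return root.tag, [child.tag for child in root]


# Made-up element names, for illustration only.
good = "<text><front/><body><p>Hello</p></body></text>"
bad = "<text><body><p>Hello</body></text>"  # <p> is never closed
two_roots = "<a/><b/>"  # breaks the one-root-node rule

print(is_well_formed(good), is_well_formed(bad), is_well_formed(two_roots))
print(describe_tree(good))
```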
So at this point in the proceedings we broke for lunch, and this was when the real panic began to set in, as I realised that I would actually be expected to convert a document into a searchable artefact. I wish I could say that my fears were dispelled and that I came away from the session determined to mark up all the digital images in the world: what actually happened was that I came away feeling a bit disillusioned. This was not the fault of the teachers themselves, who were informative and eager to help, but simply because I just couldn’t get my head around the actual coding itself. I didn’t understand what a <div> was, or why I was putting it in somewhere, nor indeed whether I should put my cursor inside a bracket or outside of it. I felt awkward, and because I felt awkward I didn’t just throw myself into the task at hand: I worried away and second-guessed myself. This is the problem with my academic career generally, and I must remind myself that I have nothing to fear but fear itself, etc etc.
Anyway, huge thanks to Lou for making it actually quite understandable to me in terms of theory – and I’ll keep beavering away at the practical!
Wednesday morning’s plenary lecture was given by Professor David DeRoure, a Fellow of the British Computer Society and interim Director and Professor of e-Research at the Oxford e-Research Centre. He is also the National Strategic Director for the Digital Social Research project. Professor DeRoure epitomises what I believe the digital humanities are all about, in that he is unafraid to collaborate with multiple disciplines to ask new questions, and seek new answers.
I have to confess that his illustrious position makes me feel rather better about the fact that I struggled with some of the concepts he was explaining, but actually this ongoing struggle is the basis for my attending this week’s conference. Coming from a literature background I sometimes find it difficult to engage with techniques which owe more to the Social Sciences, or to IT, but the application of these research techniques and the language they use is something I need to engage with, and to become comfortable with. Ho hum, I digress.
Professor DeRoure began by asking us to consider the Web in a variety of different ways: as an infrastructure of research, as a source of data, as a subject of research, and as a web of scholarly discourse. He commented that the data deluge has moved away from being an issue only for social scientists and scientists generally, but it is the science community who have reacted to this emphasis on data-led research by announcing a paradigm shift from hypothesis-driven research (the Fourth Paradigm I mentioned in my previous blog post). In fact, Science magazine went as far as to announce the end of theory (which rather brings to mind the great Mark Twain quote “Rumours of my death are an exaggeration”).
Supporting this Big Data, as I understood it, are computers with sophisticated-enough technology to sort through the masses of data – or Big Compute, as it’s known. And as the Science magazine article explains, we need to have this kind of technology – we’re children of the Petabyte Age, and we need to adapt accordingly. The Web should be about co-evolution – society and technology working together.
The problem with Big Data is that the temptation is to work within a sub-set that concentrates on proving your own personal theories. But we simply can’t work in that way anymore. We are, in the words of Nick Cave, merely “a microscopic cog”. We need to realise we can’t work in isolation and we can’t ignore other data simply because it doesn’t say what we want it to. One of the ways which DeRoure suggests this can be avoided is with the use of linked data. Linked data enables us to discover more things; we need to realise that our questions are often similar to those being asked within other disciplines, and that linked data can broaden our areas of understanding.
“Wait a second, back up!”, I hear you wail. What’s linked data? There’s a possibility the librarians amongst you will have started to sit up and take notice, as the idea of linked data is closely related to concepts like controlled headings in library catalogues. Essentially, the idea of linked data is that information can become linked, and therefore more useful. It needs a standard format, which is “reachable and manageable by Semantic Web tools”, and tools to make either conversion or as-necessary conversion achievable (for further clarification, the Semantic Web is simply “a collaborative movement led by the World Wide Web Consortium (W3C) that promotes common formats for data on the World Wide Web”).
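For what it's worth, here is how I picture the linking, as a minimal sketch in Python. It is an illustration of the general idea (statements of the form subject, property, value, where a value can itself be an identifier that other statements describe), not of any particular Semantic Web tool. Every identifier and property name below is made up.

```python
# Hypothetical identifiers throughout: the example.org addresses and the
# property names ("title", "creator", "name") are invented for illustration.
triples = [
    ("http://example.org/work/moby-dick", "title", "Moby Dick"),
    ("http://example.org/work/moby-dick", "creator", "http://example.org/person/melville"),
    ("http://example.org/person/melville", "name", "Herman Melville"),
]


def objects(triples, subject, predicate):
    """All values recorded for one property of one thing."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def creator_names(triples, work):
    """Follow the link from a work to its creator(s), then look up their names."""
    names = []
    for person in objects(triples, work, "creator"):
        names.extend(objects(triples, person, "name"))
    return names


print(creator_names(triples, "http://example.org/work/moby-dick"))
```

The point is the shared identifier, much like a controlled heading: anyone else's data that uses the same identifier for the same person can be joined to this without further agreement.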
An example of a project which will be published as linked data is the Digital Music Collections (SALAMI). SALAMI will analyse 23,000 hours of digitised music in order to build a resource for musicologists, drawing on a range of music from the Internet Archive. Students will annotate the structure of songs based on what Professor DeRoure termed their “ground truth” – meaning, what those annotators say the structure of the song is at the time at which they’re annotating it.
In addition to the sheer scale of the information we’re receiving, the nature of that data is changing. Twitter has generated a lot of energy within the Social Sciences community as to how useful the data they collect from Twitter can be. Some areas of the field have rejected the data on the basis that it was not collected correctly, or collated properly, but other areas of the field have embraced this new rich area of data, and are establishing new methods to deal with it. Professor DeRoure suggested that it may very well be a case of whether you consider your data cup half full, or half empty.
Whichever way you look at it, the loop is closing – social theory is being used to describe the data. But the need to create intermediaries for this form of data remains. The Web Science Trust proposes the creation of a global community which looks not at how you represent your data, but how you describe it.
So, let’s take a breath. I need one, frankly, and I’m guessing you do too. I’m omitting great swathes of Professor DeRoure’s lecture and endeavouring to stick with the nuts and bolts of what I think he was saying, so you will have to bear with me whilst I process my thoughts. I think, at this juncture, what he is suggesting is that because of Big Data and the need to process large amounts and different kinds of data (such as the information one could glean from collating tweets, for example), we are more in need than ever of a linked, coherent system that communicates with itself and doesn’t come up against any barriers in the learning process. I’m thinking now of the Web as a maze, in which one suddenly finds oneself at a dead end simply because a program is hosted on a different system, for example. Obviously the concept of the Semantic Web and linked data are steps to enabling this open process.
We are links in that chain too – it’s not just the machinery of the computer. We are as much a part of it as the hard drive is. Our interaction with computers is changing. SOCIAM (the Theory & Practice of Social Machines) is a project proposed by the University of Southampton which will attempt to research the best means of supporting “purposeful” human interaction on the Web. This interaction, they claim, is:
“…characterised by a new kind of emergent, collective problem solving, in which we see (i) problems solved by a very large scale human participation via the Web (ii) access to, or the ability to generate, large amounts of relevant data using open data standards (iii) confidence in the quality of data and (iv) intuitive interfaces.”
The boundary between programmers and users has been dissolved by the Web, and our participation with it. This is mainly typified by social websites such as Facebook and Twitter. We are now merely a component of the Social Machine.
The picture here is of Ory Okolloh, the founder and executive director of Ushahidi: an example, cited by Professor DeRoure, of the social machine in action. Developed shortly after the Kenyan elections on the 27th December 2007, Ushahidi was created to map incidents of violence and peacekeeping in the country after the elections, based on reports submitted by mobile phone and via the Web. The incident was a catalyst for the website team to realise that there was a need for a platform which could be used worldwide in the same way. Ushahidi (Swahili for “testimony”), the social machine, was born.
An example of the way in which the website is used was given in The Spectator magazine in 2011:
“At 6:54 pm the first bomb went off at Zaveri Bazaar, a crowded marketplace in South Mumbai. In the next 12 minutes two more followed in different locations in the city…The attacks added to the confusion just as millions of people were returning home from work. With telephone lines jammed, many Mumbaikars turned to a familiar alternative: they posted their whereabouts, and sought those of their close ones, on social networks.
Facebook doubled up as a discussion forum…users on Twitter, meanwhile, exchanged important real-time updates. Moments after the explosions, a link to an editable Google Docs spreadsheet was circulated frantically on the microblogging site. It carried names, addresses and phone numbers of people offering their houses as a refuge to those left stranded. The document was created by Nitin Sagar, an IT engineer in Delhi, 1,200km (720 miles) away.”
Problems (of any description, be they the classification of galaxies or a bomb going off in a city centre) are solved by the scale of human participation on the Web and the timely mobilisation of people, technology and information resources. And those websites which refute the traditional idea of the “layperson [as] irrational, ignorant…even intellectually vacuous” are the ones which are the most successful: the ones who tell people what they’re about, and treat participants as collaborators, not as subjects. We are even coming to a stage where we consider human interaction with the machine as a sub-routine: a human-based computation, outsourcing certain steps to humans. Professor DeRoure cited Wikipedia as a good case in point – an interesting combination of automation and assistance rather than the replacement of the human.
And there are many dimensions to our social machines: the number of people, and of machines; the scale and variety of data – and how does one measure the success of a social machine? By the way it empowers groups, individuals, crowds? We are moving away from the idea of the Turing machine to one in which humans and machines are brought together seamlessly.
We are at Big Data/Big Compute right now. In fact, if I understand Professor DeRoure correctly, WE are Big Compute: “The users of a website, the website and the interactions between them, together form our fundamental notion of a ‘machine’”. Thus, we find ourselves on the edge of a new frontier. Technology alone won’t transform society, but people will, and the behaviour of machines over time will evolve because of their involvement with humans. In order to facilitate those changes we need to understand how to design social computations, provide seamless access to a web of data and consider how accountable and trusted the components should be. Ultimately, we are citizen-scientists and human-computer integrations.
I hope this has made some sense to you – please feel free to comment via my Twitter profile and let me know whether you think I’ve accurately assessed the tone of Professor DeRoure’s lecture, or whether I’m barking up entirely the wrong digital facsimile of a tree.
In his data mining practical session this afternoon, Glenn Roe identified a tool called Voyant, which is a blend of analysis and visualisation tools. It apparently works better on a smaller dataset, and I really want to try it out – but it keeps telling me I’m doing something wrong, and I don’t have the brainpower left tonight to work out what it is. Tomorrow I shall read Seth’s Voyant crib sheet, and hopefully enlightenment will follow.