This morning, I was privileged to attend the plenary lecture given by Dr Wolfram Horstmann, Associate Director of Digital Library Programmes and Information Technologies at the Bodleian Library. His talk was on research data, and more specifically where the humanities are within that discussion.
Within the sciences, data-driven research is referred to as the “fourth paradigm”. Before this, we had three main approaches to scientific research: experimental and theoretical research being the first two, with computer simulations of natural phenomena making up the third. The fourth paradigm is that of data-intensive science – the ability of modern instrumentation to generate data at rates 100 times, 1000 times, that of the devices they are replacing. The sciences have to deal with Big Data too – and they need Big Science to enable them to do so.
It is much more difficult, said Horstmann, for the humanities to be involved in the discussion, particularly as the debate is stymied by the idea that humanities research is threatened by demands for economic impact. But, should we be anxious? Really, suggests Horstmann, all we need to do is ensure we use the correct phraseology, and develop how we communicate, and we can then participate in the discussion. In fact, the current awareness of the importance of research data provides opportunities for the humanities to show their value. The challenge is to communicate what research data means for the humanities.
For instance, we need to state the obvious more clearly, and include text and images as research data, with our libraries identified as our research centres. The sciences look at matter: wetware, hardware, numbers. These things produce the data in their area. Humanities look at text and images. Horstmann asks, are we presenting these as data in the ongoing discussion? And if we aren’t, what needs to be done to put text and images as data within the debate?
When you think about it, we institutionalised our research facilities centuries ago. Other subject areas did it much later, with labs like CERN. However, transforming those physical research facilities into digital ones is laborious and expensive, and whilst texts have been digitised and put into a more structured format over the last two decades, their real potential has not yet been explored. The humanities have an advantage in this next phase because we have infrastructure already in place.
The Bodleian Library in Oxford is approaching a petabyte of texts and images (in comparison, CERN produces 12 petabytes a year). This is somewhere in the region of 2 million digitised images, with another million to come in the next few years, and not counting the information which will be digitised under the Oxford Google Books Project, or the projects which are carried out by other bodies and then maintained by the Bodleian in perpetuity. Horstmann acknowledged that the Bodleian is a special case, but showed that it can be compared with the technical infrastructure provided within some science departments.
Horstmann gave us several examples of what he called “highly structured, intellectually curated data”. One was Networking Early Modern Correspondence, an interdisciplinary research project which uses “a variety of methods to reconstruct and interpret the correspondence networks central to the revolutionary intellectual developments of the early modern period”. Another was What’s The Score?, a crowdsourcing exercise in which members of the public are asked to help describe the Bodleian’s digitised collection of scores from the 19th century. In two months, the online project (which was soft-launched) had completed 6% of its records.
As mentioned above, the Bodleian is also involved in a project called Google Books at the Bodleian. This project has evolved mainly because Google doesn’t allow the digitised material to be accessed from Europe; it can only provide access to users in the US. The Bodleian had to provide access to the European data itself. Released as a soft launch available only through the catalogue, this is a great chance for the humanities to produce numbers comparable to the usage numbers in the sciences.
And this is important because, whichever way we look at it, size matters. Even though the humanities often use qualitative and hermeneutic methodologies rather than quantitative ones, the fact is that data is important, and it needs structure to provide us with a thorough description. Collaboration matters too: the involvement of colleagues and crowdsourcing make a significant difference to research.
Horstmann maintained that our first challenge is that of diversity. The humanities have a varied typology of research data, often requiring idiographic approaches. Thus, standardisation is difficult, and so is finding computational skills. The second challenge is openness. Competition, privacy and exploitation, he said, are impediments to data sharing. He asked whether the humanities, more than other disciplines, help to maintain the “ivory tower” attitude, and gave an extreme example of a group of archaeologists who didn’t want to reveal their data in case tomb raiders stole from the site. There are, of course, many more obstacles to putting research data into the public domain, not merely being chased by a gigantic rolling boulder.
What we have in our favour is that humanities research data is often more easily understood by the public than science data. In fact, it is more likely to be accessed and preserved than research data in other subject areas. Europeana (http://www.europeana.eu/portal/) does a good job of exploiting that aspect.
Humanities research data also has a web impact advantage, and high societal interest can manifest itself in higher webometric and usage statistic ratings, if we are able to collect that data. One way of doing this is to create effective links between websites, which can generate meaningful usage data for the humanities to draw on in the future. Horstmann said we should pay particular attention to webometric analysis and usage statistics, although there is still standardisation work to be done on this aspect of new metrics.
Horstmann believes that if we simply alter our mindset, we can play a role in the research data discussion, should we want to do so. But firstly, we need to conceptualise text and images as a type of data, and to do so clearly. If we do that, we can claim that the humanities are a data-intensive subject area. From there, we can see a chance for the better accessibility of humanities projects on the web. He recommended exploiting the good accessibility of humanities research themes (through newspapers, exhibitions, crowdsourcing, and citizen science) and making as many research outputs web accessible as possible. We need to invest in and support new metrics, and strengthen the partnership between us and our libraries.
A wonderful observation was made, albeit a little off the beaten track of the lecture, during the questions at the end of Dr Horstmann’s session. There is, one conference attendee said, an increasing desire for us to “acknowledge the telescope through which we see the star”, meaning that we are being pressed to name the tools with which we do our research. Surely, the attendee argued, that doesn’t matter if we’re getting the same end result as we would if we were using the original artefact. “I don’t write ‘I went into my library and I got this book’, so why should I do that if I’m using a digital facsimile?” Dr Horstmann suggested that this was only necessary to provide reference to the point at which we accessed a specific resource, and how we accessed it, which is something we would have to do anyway with a book.
I confess that today has been harder for me. The plenary lecture given by Dr Horstmann was very interesting, but the afternoon session, a practical introduction to data mining given by Glenn Roe, the Mellon Fellow in Digital Humanities at the Oxford e-Research Centre, was like hanging on to a tiger by its tail. Glenn himself acknowledged that the workshop was challenging and highly engaging, and his rather plaintive tweet a day earlier told me all I needed to know about preparing for it. Coming from a literature background, it seems an alien concept to me to strip the text down to single words, but I can acknowledge that doing so allows for new ways of thinking, and can create interesting research questions – which is what it’s all about, after all.
Back at Merton, I am lying in bed with a cup of tea and contemplating (with no small amount of trepidation) what tomorrow will bring.