Research in a Connected World by Alex Voss, Elizabeth Vander Meer, David Fergusson


Chapter 1: Examples of e-Research

1.1 Archaeology

Key Concepts

  • The early emergence of computing in archaeology

  • Data storage and integration – the Archaeology Data Service and Archaeotools

  • The role of ICT in excavation – Virtual Research Environment for Archaeology and geospatial archaeology

  • The critical importance of integrating any new ICT-based tool or procedure with existing research practices

Introduction

“[A]rchaeologists should look beyond the short term when planning how to use a computer. The world of archaeology is likely to be considerably different in twenty years from now (2009), so archaeologists need to plan with future change in mind.”

J. Moffett, in Computing for Archaeologists, Ross et al. (eds.), 1991.

The growth of e-Research methods across the academic spectrum over the last ten years or so, and their impact on archaeology, could not have proved Moffett more correct in his prediction. In many ways, archaeology differs significantly from other arts and humanities disciplines in its uptake, theory and application of computational methods. For one thing, computing has played a central role in the development of archaeology's intellectual traditions for decades, and a coherent community of archaeological computing professionals is now well established. The thirty-six-year-old Computer Applications and Quantitative Methods in Archaeology conference provides a trusted international forum for this community to network and disseminate its outcomes at the cutting edge of technology, and for those experts to undertake critical assessment of that technology (see, e.g., Clark 2007: 11, and http://www.leidenuniv.nl/caa/). More broadly, the history of archaeology as a discipline can be characterized as a progression from 'antiquarian' interest in aesthetically pleasing artefacts, to the development of the principles of typology and the evolution of material culture in the late nineteenth century, to the present-day emphasis on systematic and consistent record-keeping (for an overview, see Lock 2003: 1-13). This progression may be seen as an ongoing, iterative transformation in archaeologists' approach to, and relationship with, information.

In the past, many archaeological approaches have treated information handling, processing and visualization as at best ancillary to, and at worst disconnected from, the interpretive process of understanding the past. This is reflected in early treatments of the subject: in 1985, for example, Martin Carver wrote that '[i]n spite of, or perhaps because of, a great deal of breathless proselytizing, it is the computer's relevance to creative archaeology that is still doubted, and it is the wisdom of investing precious thinking-time in such a potential wild-goose chase that must be weighed' (Carver 1985: 47). The misperceptions Richards and Ryan identified in the same year, that computers are 'black boxes producing magic answers', operated by 'practitioners of some mystical black art' (Richards and Ryan 1985: 11), are likely to be less widespread today, due not least to the exposure of archaeologists to ubiquitous internet, email and other digital technologies, both at home and at work. However, such critical caution is no more misplaced now than it was back then. Indeed, the very existence of the archaeological community's self-awareness of (or perhaps outright scepticism towards) its approach to computing and Information and Communications Technology (ICT) provides an excellent background against which to consider the exponential growth and potential of e-Research in recent years. The following examples demonstrate how, in 2009, e-Research tools and methods can contribute to the critical interpretive process of archaeology. What 2029 will bring is, of course, anyone's guess.

Data: integration and understanding

Archaeology produces vast amounts of data in a wide range of types, including artefact descriptions, measurements, site plans, context plans, photographs, and cartographic and spatial data. Every excavation presents particular challenges for data gathering and recording, and the possible responses of excavators to these challenges are constrained by scale, resources, the type of material, and so on. But there is no question that e-Research methods offer enormous potential for supporting such processes, and few archaeologists would doubt the desirability of integrating data from different sites. Eiteljorg (2004) writes of 'the hope that data storehouses could be used by scholars to retrieve and analyze information from related excavations, thus permitting broader syntheses' (Eiteljorg 2004: 22): broader synthesis is at the core of academic archaeology, and is vital for any interpretation that seeks to embrace any combination of site, inter-site or regional scale. However, there is an obvious tension between the structures and standards any database must impose in order to be useful, and the unordered (and incomplete) nature of the archaeological record (see Lock 2003: 85-98). e-Research technologies can support researchers faced with such problems in a number of ways. One approach is the construction of domain-specific ontologies and controlled vocabularies, which can describe and link concepts, and map between different groups of concepts. Thus, if an artefact of type A is found at site X, then a linked ontological system should be able to identify further examples of type A at site Y, even if those artefacts have been recorded or described differently; the sketch below illustrates the basic idea. This approach has limitations: those producing data still have to describe and/or annotate the information in a way that conforms with, or can be adapted to, the ontology, which imposes extra costs on already-overburdened resources. On the other hand, standardized metadata and data storage systems can be immensely useful and easy to implement if supported by centralized support services and repositories such as the Archaeology Data Service (http://ads.ahds.ac.uk).
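
By way of illustration, here is a minimal Python sketch of vocabulary-mediated retrieval. The terms, sites and mapping are invented for the example; production systems use formal ontologies such as the CIDOC CRM and published heritage thesauri rather than a hand-made dictionary.

```python
# A minimal sketch of vocabulary-mediated retrieval. All terms, sites
# and the concept mapping below are hypothetical examples.

# Local terms used by different (imaginary) excavation teams, mapped
# to one canonical concept.
CONCEPT_MAP = {
    "brooch": "fibula",
    "fibula": "fibula",
    "bow brooch": "fibula",
}

SITE_RECORDS = [
    {"site": "Site X", "find_id": 1, "term": "fibula"},
    {"site": "Site Y", "find_id": 7, "term": "bow brooch"},
    {"site": "Site Y", "find_id": 9, "term": "coin"},
]

def find_by_concept(concept, records):
    """Return records whose local term maps to the requested concept."""
    return [r for r in records
            if CONCEPT_MAP.get(r["term"]) == concept]

# Both fibulae are found, even though they were described differently.
for record in find_by_concept("fibula", SITE_RECORDS):
    print(record["site"], record["find_id"], record["term"])
```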

Other approaches seek to apply Natural Language Processing (NLP) technologies to primary archaeological material and secondary archives. One example is the Archaeotools project, conducted as part of the AHRC-JISC-EPSRC Arts and Humanities e-Science Initiative by the universities of York and Sheffield (http://ads.ahds.ac.uk/project/archaeotools/). Archaeotools identifies and extracts references to 'what', 'when' and 'where' entities in so-called 'grey literature': reports of (usually small-scale) archaeological investigations that have been produced and archived, often never to be seen again. The NLP process allows information to be tagged systematically according to 'what', 'where' and 'when', and structured into facets for faceted browsing. It should therefore be possible, for example, to search across a range of disparate archaeological reports for references to data concerning Early Medieval coins from North Eastern England [when, what, where], even if the information was not tagged or described in such terms at the point of being recorded. In another important development, Archaeotools uses NLP-generated entities to search for information according to the terms in existing controlled vocabularies such as Sites and Monuments Records (SMRs): as will be seen below, integrating e-Research methods within existing practices is essential for archaeology, so allowing researchers to search using the terms and conventions they are already familiar with is critical.
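
The following toy sketch gives the flavour of 'what/when/where' facet extraction. The patterns are invented stand-ins: Archaeotools itself used trained NLP models and real controlled vocabularies, not hand-written regular expressions.

```python
# A rule-based toy sketch of 'what/when/where' entity extraction,
# loosely in the spirit of Archaeotools. The patterns are illustrative
# only and cover just a few hand-picked terms.
import re

PATTERNS = {
    "when": re.compile(r"\b(Early|Late)?\s?(Medieval|Roman|Iron Age)\b"),
    "what": re.compile(r"\b(coins?|pottery|brooch(es)?)\b", re.IGNORECASE),
    "where": re.compile(r"\b(North Eastern England|Yorkshire|Hampshire)\b"),
}

def extract_facets(text):
    """Tag a grey-literature snippet with what/when/where facets."""
    facets = {}
    for facet, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            facets[facet] = match.group(0).strip()
    return facets

report = "Excavations recovered Early Medieval coins at North Eastern England sites."
print(extract_facets(report))
# {'when': 'Early Medieval', 'what': 'coins', 'where': 'North Eastern England'}
```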

Computers and excavation

As indicated above, the data recovered from excavations are often hugely complex, but excavation itself is also a very complex task. When situated within the nexus of data gathering and the realities of excavation practice, e-Research presents both significant challenges and great opportunities. Digital methods have been integrated with excavation practice at a low level for many years. For example, it is common practice for excavators to record the coordinates of particular positions with a Total Station Theodolite (TST), place these in a local data store, and download them to a computer for processing back at base (a sketch of this processing step follows below). However, the ubiquity of networked systems, along with the availability of (often proprietary) software such as ArchaeoData, has meant that e-Research technologies are now being more widely applied in field archaeology. In many cases, this has 'only' meant speeding up and/or facilitating existing work: allowing objects and their contexts to be documented, and this information to be transferred to the excavation's database, faster and more efficiently. In essence, many of the software packages are database-oriented, aiming to support excavation directors and post-excavation researchers in organizing and structuring the site's data according to existing organizing principles and structures.
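
As a hedged sketch of the 'download and process back at base' step, the Python below parses an invented point export. Real TST export formats, column layouts and local grids vary by instrument and project.

```python
# A toy sketch of post-fieldwork processing of surveyed points,
# assuming a hypothetical id,easting,northing,elevation CSV export.
import csv
from io import StringIO

RAW_EXPORT = """point_id,easting,northing,elevation
101,463200.51,138915.22,87.31
102,463201.07,138916.80,87.28
"""

def load_points(stream):
    """Parse exported survey points into numeric records."""
    reader = csv.DictReader(stream)
    return [
        {
            "id": row["point_id"],
            "e": float(row["easting"]),
            "n": float(row["northing"]),
            "z": float(row["elevation"]),
        }
        for row in reader
    ]

points = load_points(StringIO(RAW_EXPORT))
print(len(points), "points; first elevation:", points[0]["z"])
```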

Some projects, however, have considered in greater depth the intellectual and interpretive implications of using such technology, thereby addressing Carver's 'relevance to creative archaeology' critique. Ian Hodder, for example, has reflected on the implications of separating observation from interpretation, noting that '[i]nterpretation occurs at the trowel's edge. And yet, perhaps because of the technologies available to deal with very large sets of data, we have as archaeologists separated excavation methods out and seen them as prior to interpretation. Modern data-management systems perhaps allow some resolution of the contradiction. At any rate, it is time it was faced and dealt with' (Hodder 1997: 693). Hodder's own response to this problem, the online site database of the Çatalhöyük project (http://www.catalhoyuk.com/database/catal/), seeks to present fully and simply all the data about the site, including the free-text interpretations of the recorders.

An archaeological project frequently referenced in the literature is the Roman urban excavation at Silchester in Hampshire, which has trialled the use of e-Research technologies in the Virtual Research Environment for Archaeology project (VERA: see http://vera.rdg.ac.uk, and the case study at http://engage.ac.uk) in conjunction with its existing Integrated Archaeological Database (IADB: http://www.iadb.co.uk/). VERA, funded under JISC's VRE programme, has tested the use of a broadband network at the site and various onsite digital capture methods. Those used earlier in the project, such as PDAs and tablet PCs for recording information about artefacts and plans of trenches and features, proved less successful for a variety of reasons (a major one being that liquid crystal screens perform badly in bright sunlight). Currently, however, the project is trialling the use of digital pens for recording information. These follow exactly the procedure for recording information with 'normal' pens, with the exception that users can 'dock' the digital variety at the end of the working day, downloading the handwriting and converting it to ASCII text using automated handwriting recognition. This greatly speeds up and facilitates the process of entering the data into the IADB; and it may well be that, as the method is further refined and deployed in the field, it will provide some hitherto unforeseen contribution to understanding the data as well. The VERA project has noted that integrating such technologies with existing onsite workflows is critical if they are to stand any chance of wider adoption (see Warwick et al. 2009 for a full discussion).

e-Research methods and technologies have also played a significant role in the development of geospatial archaeology. Geographic Information Systems (GIS) have long been at the forefront of computational archaeology: the large quantities of data from large-scale surveys and site-wide analyses, and the need to reference them within a broader spatial framework such as a global coordinate system, have ensured this. However, the emergence of the so-called 'Geospatial Web' in recent years (for a recent review see Scharl and Tochtermann 2007) has led to new ways of linking, sharing and understanding geospatial information online. The availability of high-quality satellite imagery from services such as Google Earth (GE) has generated a good deal of recent interest in the archaeological community (see Ullmann and Gorokhovich 2006), as have the means of marking up and describing data in such environments. In GE's case this is Keyhole Markup Language (KML), which allows a dataset to be created in a GE view and then shared, updated and added to by another user (a minimal example follows below). Although its impact on field archaeology is not likely to be great in the near future, GE and other 'virtual earth' platforms are undoubtedly of interest to scholars wishing to link and contextualize archaeological data online (e.g. Elliott and Gillies 2009).
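
To make this concrete, the sketch below writes a single KML placemark, the markup Google Earth reads, from Python. The site name, description and coordinates are invented for the example.

```python
# A minimal sketch of sharing an archaeological point of interest
# as KML. The placemark content here is entirely hypothetical.
placemark = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Hypothetical trench A</name>
    <description>2009 season, context 1012</description>
    <Point>
      <!-- KML coordinates are longitude,latitude[,altitude] -->
      <coordinates>-1.0872,51.3578,0</coordinates>
    </Point>
  </Placemark>
</kml>
"""

# Writing the file is enough to share it: Google Earth and other
# 'virtual earth' clients can open .kml files directly.
with open("trench_a.kml", "w", encoding="utf-8") as f:
    f.write(placemark)
```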

Summary: Improving Archaeological Research

Archaeology has always thrived on technological innovation. The increasingly information-rich ways of working into which the UK's academic milieu is moving form a backdrop for the ever more convoluted relationships between archaeologists and their data. Current e-Research technologies will not provide any panaceas: they equate with what Hodder describes (above) as 'modern data-management systems'. They may have yet to prove that they can transfer very 'fuzzy' data from the ground into the highly structured and quality-assured forms that appear in archaeological publications; but there seems little doubt that tools and methods such as relational databases, Natural Language Processing, cultural heritage ontologies, quantitative profiling, geospatial computing and field-based digital data capture form a 'methodological commons'. Whether taken together for the discipline as a whole, or separately in individual projects or research exercises, this collective set of e-Research tools and methods can provide a type of 'enabling support' that is simply unprecedented, allowing archaeologists to undertake the research process in better, faster and possibly completely new ways.

References / Further Reading

Carver, M. 1985: The friendly user. In Cooper, M. A. and Richards, J. D. (eds.), Current issues in archaeological computing. British Archaeological Reports International Series 271: 47-61.

Clark, J. T. 2007: An introduction to digital discovery: Exploring new frontiers in human heritage. In Clark, J. T. and Hagemeister, E. M. (eds.), Digital Discovery: Exploring new frontiers in human heritage. Computer Applications and Quantitative Methods in Archaeology: Proceedings of the 34th conference, Fargo, ND, April 2006: 11-14.

Eiteljorg, H. 2004: Computing for Archaeologists. In Schreibman, S., Siemens, R. and Unsworth, J. (eds.), A Companion to Digital Humanities. Blackwell, London: 20-30.

Elliott, T. and Gillies, S. 2009: Digital geography and classics. In Changing the Center of Gravity: Transforming Classical Studies Through Cyberinfrastructure. Special issue of Digital Humanities Quarterly (Winter 2009: v3 n1), Gregory Crane and Melissa Terras (eds.): http://digitalhumanities.org/dhq/vol/3/1/000031.html.

Hodder, I. 1997: 'Always momentary, fluid and flexible': towards a reflexive excavation methodology. Antiquity 71: 691-700.

Lock, G. 2003: Using computers in archaeology: towards virtual pasts. Routledge, Taylor and Francis, London.

Moffett, J. 1991: Computers in archaeology: approaches and applications past and present. In Ross, S., Moffett, J. and Henderson, J. (eds.), Computing for archaeologists. Oxford University Committee for Archaeology Monograph No. 18. Oxford: 13-39.

Richards, J. D. and Ryan, N. 1985: Data processing in archaeology. Cambridge Manuals in Archaeology. Cambridge University Press, Cambridge.

Scharl, A. and Tochtermann, K. (eds.) 2007: The Geospatial Web. Springer, London.

Ullmann, L. and Gorokhovich, Y. 2006: 'Google Earth and some practical applications for the field of archaeology'. CSA Newsletter Vol. XVIII, No. 3, published online: http://csanet.org/newsletter/winter06/nlw0604.html.

Warwick, C., Baker, M., Clarke, A., Fulford, M., Grove, M., O'Riordan, E. and Rains, M. 2009: iTrench: A study of user reactions to the use of information technology in field archaeology. Literary and Linguistic Computing 24 (2).

1.2 Text Analysis in the Arts and Humanities

Key Concepts

  • Digital scholarship

  • Data-driven research

  • TextGrid and collaborative working

  • HiTHeR (High ThroughPut Computing in Humanities e-Research) and use of e-Infrastructure

Introduction

According to UNESCO reports, Britain tops the European lists of research publications per year in philology, literature and other text-based studies such as philosophy. Worldwide, Britain overtook the U.S. in 2006 in terms of book publications per year, and tops the most recent available list of countries by number of book publications. These figures emphasize the urgent need for the British textual studies communities to explore new ways of dealing with this deluge of research data.

Given these figures, collaboration becomes fundamental to digital scholarship in textual studies: no researcher alone can cope with the plethora of newly published material appearing daily. Furthermore, text analysis in the humanities can be a tedious and time-consuming task, but advanced computer-enabled methods make the process easier for digital or digitised works. Researchers can search large texts rapidly, conduct complex searches and have the results presented in context, as in the keyword-in-context sketch below. The ease brought to the analysis process allows researchers to engage with texts more thoroughly, which can then lead to the development of insightful, well-crafted interpretations of texts.
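
A minimal keyword-in-context (KWIC) concordance, sketched below in Python, illustrates what 'results presented in context' means in practice. Real text-analysis tools add lemmatization, richer query languages and much larger corpora; the sample text here is invented.

```python
# A minimal keyword-in-context (KWIC) sketch: show each hit of a
# keyword with a few words of surrounding context.
def kwic(text, keyword, window=4):
    """Yield each hit with `window` words of context on either side."""
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip('.,;:') == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            yield f"{left} [{word}] {right}"

sample = ("The whale swam on. Ahab watched the whale, "
          "and the crew feared the whale above all else.")
for line in kwic(sample, "whale"):
    print(line)
```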

Various projects have emerged internationally in recent years that allow for a new scale of textual studies research, in keeping with the idea of new data-driven research. Software developed by the US MONK (Metadata Offer New Knowledge) project helps humanities scholars discover and analyze patterns in texts,[1] while its sister project SEASR (Software Environment for the Advancement of Scholarly Research) enables digital humanities developers to design, build, and share software applications that support research and collaboration in textual studies.[2] Aus-e-Lit is an Australian project that allows literary scholars to search seamlessly across relevant databases and archives and to retrieve reliable information on a particular author, topic or publication.[3] These are just three projects quite closely linked to e-Research initiatives, but there are many more. For over fifty years, there has been a worldwide academic movement to work on Digital Humanities, resulting in many achievements, especially in the field of textual studies. It is impossible in the space of this chapter to list all of these projects (for a history of the Digital Humanities and some of the textual scholarship involved, see Schreibman, Siemens et al. 2004). Instead, we shall concentrate on two projects linked to both Digital Humanities and e-Research, which exemplify in very particular ways two major new developments in textual studies research that are directly linked to the shift in methodologies based on data-driven research: the German TextGrid project illustrates the value of new collaborative research in textual studies, while the UK project HiTHeR (High ThroughPut Computing in Humanities e-Research) demonstrates the effective use of e-Infrastructure to support everyday research in the Digital Humanities.

Collaboration in textual studies - TextGrid

TextGrid[4] is primarily concerned with historical-critical editions for modern cross-language research. Such historical-critical editions often form the basis for more lightweight editions for study and reading. They can be very large and very detailed, and cannot be the work of one individual researcher alone; they have to be the result of a collaborative effort. TextGrid's key innovation is to facilitate such (virtual) collaboration across language and national barriers.

In its first phase of funding, TextGrid delivered a modular platform for collaborative textual editing, mainly based on the community standard of the Text Encoding Initiative (TEI).[5] As a community grid for textual studies, TextGrid forms a cornerstone in the emerging German e-Humanities agenda. Its success has also been noted in the UK, where the arts and humanities e-Science initiative allowed researchers to experiment with new technologies to cope with the research data deluge in textual studies. The UK e-Science Scoping Study for textual studies, written by Professor Peter Robinson from Birmingham University, quotes TextGrid as a prime example of how to advance literary and textual studies with new digital services, because it addresses the need for collaborative resource creation, comparison (that is, collation and alignment), analysis and annotation.
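
To give a concrete sense of what TEI-encoded textual data records, here is a minimal Python sketch that parses an invented critical-apparatus entry, in which two witnesses transmit different readings, and lists the variants. Real TEI editions are far richer; the TEI Guidelines define the actual encoding model.

```python
# A minimal sketch of TEI critical-apparatus markup: <app> groups the
# lemma (<lem>) and a variant reading (<rdg>), each tied to a witness
# via the wit attribute. The sigla and text here are invented.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
FRAGMENT = f"""<p xmlns="{TEI_NS}">The
  <app>
    <lem wit="#A">sea</lem>
    <rdg wit="#B">see</rdg>
  </app> was calm.</p>
"""

root = ET.fromstring(FRAGMENT)
# Collate the variant readings recorded in the apparatus entry.
for element in root.iter():
    tag = element.tag.split("}")[-1]  # strip the namespace prefix
    if tag in ("lem", "rdg"):
        print(element.get("wit"), "reads:", element.text)
```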

TextGrid focuses on advancing digital scholarship for a particular community: TEI-based textual studies research. At the centre of its technology innovation is the deployment of an integrated development environment for the creation of critical editions called TextGridLab. Based on the Eclipse platform, TextGridLab uses Grid technologies for storage and retrieval of textual studies resources. It supports all activities, stakeholders and challenges in the textual studies research lifecycle. Resource discovery, via the web interface or TextGridLab modules, is aided by searching across the entire TextGrid data pool – either full text or metadata-restricted.

Decentralized and collaborative work makes sense whenever primary sources grow very large and need to be made available and linked to each other through complex metadata schemes, because the richness of these resources demands the integration of different viewpoints. Additionally, such mass quantities of resources need the support of high-performance technology to investigate how advanced text mining solutions can add to the linking and discovery of textual studies resources. The UK JISC Engage-funded HiTHeR project has taken on this challenge.

Use of e-Infrastructure in textual studies - HiTHeR

In the Digital Humanities, many text-based collections are exposed via searchable websites. One of these resources is the Nineteenth Century Serials Edition (NCSE) in the UK.[6] The NCSE, a free online scholarly edition of nineteenth-century periodicals and newspapers, was created as a collaborative project between Birkbeck, University of London, King's College London, the British Library and Olive Software, funded by the UK Arts and Humanities Research Council from January 2005 to December 2007. The NCSE corpus contains circa 430,000 articles that originally appeared in roughly 3,500 issues of six nineteenth-century periodicals. Published over a span of 84 years, the materials within the corpus exist in numbered editions and include supplements, wrapper materials and visual elements. A key challenge in creating a digital system for managing such a corpus is to develop appropriate and innovative tools that will assist scholars in finding materials that support their research, while at the same time stimulating and enabling innovative approaches to the material. One goal would be to create a 'semantic view' that would allow users of the resource to find information more intuitively. Such a semantic view can be created by offering users articles with common content through a browsing interface: a typical classification task known from many information retrieval and text mining applications (Nentwich 2003).

According to Toms and O'Brien (2008), the work of humanities researchers using digital resources is concerned with access to sources, the presentation of texts and the ability to analyse texts using a well-defined set of analysis tools. HiTHeR promises direct retrieval of relevant primary sources for research on the NCSE collections. It provides an automatically generated browsing interface that supports the crucial 'chain of readings' activity defining much humanities research, in which the discovery of new relevant resources builds on resources already found: HiTHeR offers an interface to primary sources by automatically generating a chain of related documents for reading.

However, the advanced automated methods that could create such a browsing view, using text mining to aid users' information retrieval, require greater processing power than is available in standard desktop environments. Prior to the current case study, we experimented with a simple document similarity index that allows journals with similar contents to be presented next to each other. Initial benchmarks on a stand-alone server led us to conclude that (assuming the test set was representative) a complete set of comparisons for the corpus would take more than 1,000 years!
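
The scale problem is easy to see in code. The following Python sketch computes a naive bag-of-words cosine similarity over a few invented 'articles'; the measure is a stand-in for whatever index the project actually used, and the point is the quadratic number of comparisons rather than the metric itself.

```python
# A toy illustration of why naive pairwise document similarity is
# expensive: every pair of N documents is compared, so the work grows
# as N*(N-1)/2. The articles below are invented.
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "art1": Counter("the corn laws and their repeal".split()),
    "art2": Counter("repeal of the corn laws debated".split()),
    "art3": Counter("notes on a new species of fern".split()),
}

# For the NCSE's ~430,000 articles this loop would run roughly
# 92 billion times, which is what made a stand-alone server infeasible.
for (name_a, doc_a), (name_b, doc_b) in combinations(docs.items(), 2):
    print(name_a, name_b, round(cosine(doc_a, doc_b), 3))
```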

Governments, private enterprise and funding bodies are investing heavily in digitization of cultural heritage and humanities research resources. With advances in the availability of parallel computing resources and the simultaneous need to process large and complicated historical collections, it seems logical to turn attention towards the best parallel computing infrastructures to support work as envisioned in the HiTHeR project. In HiTHeR we set up an infrastructure based on High Throughput Computing (HTC), which uses many computational resources to accomplish a single computational task.

The HiTHeR project created a prototype infrastructure to demonstrate to textual scholars, and indeed to humanities researchers in general, the utility of HTC methods. It uses Condor to set up a Campus Grid built from underutilized computers at two institutions that share a building at King's College London: the Centre for Computing in the Humanities (CCH) and the Centre for e-Research (CeRch). We use two types of computer systems: underutilized standard desktops, and dedicated servers used to present the centres' vast archives and online publications. While the servers contain several terabytes of data, they have underused processing capabilities which can be made available for advanced processing. Additionally, the Condor toolkit can use the UK's national research infrastructure, the National Grid Service (NGS), which is a free service to UK researchers and provides dedicated advanced computing facilities.
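
As a rough illustration of the HTC pattern, the Python sketch below generates a classic HTCondor submit description that fans the pairwise comparisons out as independent jobs. The worker script name, chunk count and file layout are hypothetical; the submit-file keywords shown (universe, executable, arguments, queue) are standard Condor syntax, but consult the HTCondor manual for a real deployment.

```python
# A hedged sketch of fanning out an embarrassingly parallel similarity
# job with HTCondor: split the pair list into chunks and queue one
# process per chunk. compare_chunk.py is a hypothetical worker script.
N_DOCS = 430_000
N_CHUNKS = 1_000   # one Condor job per chunk of document pairs

submit_description = f"""\
universe   = vanilla
executable = compare_chunk.py
arguments  = --chunk $(Process) --of {N_CHUNKS} --docs {N_DOCS}
output     = chunk_$(Process).out
error      = chunk_$(Process).err
log        = similarity.log
queue {N_CHUNKS}
"""

with open("similarity.submit", "w") as f:
    f.write(submit_description)
# Submitted with `condor_submit similarity.submit`, this queues 1,000
# independent jobs that idle desktop machines can pick up.
```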

The evaluation showed that the time used for calculating document similarity could b