
Chapter 3 Managing Complex Data

3.1 Scholarly Communication and the Web

Introduction

The social process of sharing research results underpins the progress of research. For many decades our research has been published in journal articles, conference proceedings, books, theses and professional magazines. With increasing availability of tools to disseminate knowledge digitally, and with increasing participation in the digital world through widespread access to the Web, we are seeing this scholarly knowledge lifecycle become digital too. Although we have seen some welcome changes, including open access publishing which makes material free for all to read, the shared artefact in this lifecycle is predominantly still the academic paper. We might call this “Science 1.0”.

e-Science is taking us into the “Science 2.0” world where we have new mechanisms for sharing (Shneiderman 2008) but also new artefacts to share. The tooling of e-Science produces and consumes data, together with metadata to aid interpretation and reuse, and also the scripts and experiment plans that support automation and the records that make the results interpretable and reusable – our new forms of artefact include data, metadata, scripts, scientific workflows, provenance records and ontologies. Our tools for sharing include the array of collaboration tools from repositories, blogs and wikis to social networking, instant messaging and tweeting that are available on the Web today, though these are not always designed around the new artefacts, nor do they always have the particular needs of the researcher in mind.

These are already the familiar tools of the next generation of researchers and their uptake may seem inevitable, though it may take time for them to be appropriated and embedded in research practice. But crucially the other driver for change is the evolution of research practice as more work is conducted in silico and as we pursue multidisciplinary endeavours in data-intensive science to tackle some of the biggest problems facing society, from climate change to energy.

In this chapter we look at emerging practice in collaboration and scholarly communication by focusing on a case study which exemplifies a number of the principles in the paradigm shift to Science 2.0 and gives us a glimpse into the future needs of researchers.

myExperiment

myExperiment is an open source repository solution for the born-digital items arising in contemporary research practice, in particular in silico workflows (see the contribution by Fisher et al.) and experiment plans (De Roure et al. 2009). Launched in November 2007, the public repository (myexperiment.org) has established a unique collection of workflows and a diverse international user community. The collection serves both researchers and learners, ranging from self-contained, high-value research analysis methods referenced by the journal publications that discuss the results of their use, to training workflows that encode routine best-practice scientific analyses or illustrate new techniques for new kinds of research data.

myExperiment has focused on support for sharing pieces of research method, such as scientific workflows and experimental plans, in order to address a specific need in the research community in both conducting research and training researchers. Experimental plans, standard operating procedures and laboratory protocols are descriptions of the steps of a research process, commonly undertaken manually. Scientific workflows are one of the most recent forms of scientific digital methods, and one that has gained popularity and adoption in a short time – they represent the methods component of modern in silico science and are valuable and important scholarly assets in their own right. Repositories often emphasise curation of data, but in digital research the curation of the process around that data is equally important – methods are crucial intellectual assets of the research life cycle whose stewardship is often neglected (Goble and De Roure 2008), and by focusing on methods, myExperiment provides a mechanism for expert and community curation of process in a rapidly changing landscape.

While it shares many characteristics with other Web 2.0 sites, myExperiment’s distinctive features to meet the needs of its research user base include support for credit, attribution and licensing, fine control over privacy, a federation model and the ability to execute workflows. Hence myExperiment has demonstrated the success of blending modern social curation methods (social tagging, crowdsourcing) with the demands of researchers sharing hard-won intellectual assets and research works within the scholarly communication lifecycle.

Research Objects

The Web 2.0 design patterns (O’Reilly 2005) tell us “Data is the next Intel Inside. Applications are increasingly data-driven. Therefore for competitive advantage, seek to own a unique, hard-to-recreate source of data.” Significantly, myExperiment also recognises that a workflow can be enriched as a sharable item by bundling it with some other pieces which make up the “experiment”. Hence myExperiment supports aggregations of items stored in the myExperiment repository as well as elsewhere. These are called “packs”, and while a pack might aggregate external content stored in multiple specialised repositories for particular content types, the pack itself is a single entity which can be tagged, reviewed, published, shared and so on. For example, a pack might correspond to an experiment, containing input and output data, the experimental plan, and associated publications and presentations, enabling that experiment to be shared. Another example is a pack containing all the evidence corresponding to a particular decision as part of the record of the research process. Packs are described using the Open Archives Initiative’s Object Reuse and Exchange (OAI-ORE) representation, which is based on RDF graphs and was specifically designed with this form of aggregation in mind (Van de Sompel 2009).
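
As a rough illustration of what such an aggregation looks like in practice, the sketch below builds a pack-like OAI-ORE aggregation as an RDF graph using the rdflib library for Python. The vocabulary terms (ore:Aggregation, ore:aggregates) come from the OAI-ORE specification; the pack and item URIs are invented for illustration and are not real myExperiment identifiers.

    # A minimal sketch (not myExperiment's actual implementation) of describing
    # a "pack" as an OAI-ORE aggregation with rdflib. All URIs are hypothetical.
    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF, DCTERMS

    ORE = Namespace("http://www.openarchives.org/ore/terms/")

    g = Graph()
    g.bind("ore", ORE)
    g.bind("dcterms", DCTERMS)

    # The pack itself is modelled as an ore:Aggregation.
    pack = URIRef("http://www.myexperiment.org/packs/123")   # hypothetical identifier
    g.add((pack, RDF.type, ORE.Aggregation))
    g.add((pack, DCTERMS.title, Literal("Example experiment pack")))

    # Aggregated resources may live in myExperiment or in external repositories.
    for item in [
        "http://www.myexperiment.org/workflows/16",          # a workflow (hypothetical ID)
        "http://example.org/data/expression-results.csv",    # an external data set
        "http://example.org/papers/analysis-methods.pdf",    # an associated publication
    ]:
        g.add((pack, ORE.aggregates, URIRef(item)))

    print(g.serialize(format="turtle"))

Because the aggregation is simply RDF, a pack description of this kind can be stored, exchanged and queried with standard Semantic Web tooling, which is what makes packs amenable to the Linked Data approach discussed below.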

While some publishers are looking at how to augment papers with supplemental materials, raising concerns about peer-review and about decay, myExperiment is tackling this from first principles by starting with the digital artefacts and asking “what is the research object that researchers will share in the future?” These Research Objects have important properties:

  • Replayable – go back and see what happened. Experiments are automated and may occur in milliseconds or in months. Either way, the ability to replay the experiment, and to study parts of it, is essential for human understanding of what happened.

  • Repeatable – run the experiment again. There's enough in a Research Object for the original researcher or others to be able to repeat the experiment, perhaps years later, in order to verify the results or validate the experimental environment. This also helps meet the scale of repeated processing that data-intensive science demands.

  • Reproducible – run a new experiment to reproduce the results. To reproduce (or replicate) a result is for a third party to start with the same materials and methods and see if a prior result can be confirmed.

  • Reusable – use as part of new experiments or Research Objects. One experiment may call upon another, and by assembling methods in this way we can conduct research, and ask research questions, at a higher level.

  • Repurposable – reuse the pieces in a new experiment. An experiment which is a black box is only reusable as a black box. By opening the lid we find parts, and combinations of parts, available for reuse, and the way they are assembled is a clue to how they can be reused.

  • Reliable – robust under automation, which brings systematic and unbiased processing, and also “unattended experiments” without a human in the loop. In data-intensive science, Research Objects must not only promote reliable experiments but must themselves be reliable under automated running.

To achieve these behaviours it is crucial to store provenance records and full contextual metadata in the Research Object, so that results can be properly interpreted and replicated. This complete digital chain from laboratory bench to scholarly output is exemplified by the work on repositories and blogs in laboratories (Coles and Carr 2008), and also in the use of electronic laboratory notebooks.

We believe that in the fullness of time, objects such as these will replace academic papers as the entities that researchers share, because they plug straight into the tooling of e-Research. This means it is Research Objects rather than papers that will be collected in our repositories, and as well as a workflow repository, myExperiment has become a prototypical Research Object repository.

Linked Data

To achieve these properties, a Research Object must be self-contained and self-describing – containing enough metadata to have all the above characteristics and have maximal potential for re-use, whether anticipated or unanticipated. To support this, myExperiment provides a SPARQL endpoint (rdf.myexperiment.org) that makes myExperiment content available according to the myExperiment data model – a modularised ontology drawing on a set of emerging ontologies and standards in open repositories, scientific discourse, provenance and social networking.
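
A minimal sketch of querying this endpoint from a script is shown below, using the SPARQLWrapper library for Python. The query is deliberately generic (it only lists the RDF classes the endpoint exposes) so that it does not assume particular predicates from the myExperiment data model; the exact endpoint path is also an assumption to be checked against the service documentation.

    # A generic query against the myExperiment SPARQL endpoint mentioned above.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://rdf.myexperiment.org/sparql")  # assumed endpoint path
    sparql.setQuery("SELECT DISTINCT ?type WHERE { ?s a ?type } LIMIT 20")
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["type"]["value"])   # the RDF classes used in the data model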

myExperiment also aims to be a source of Linked Data so that myExperiment content can be readily integrated with other scientific data. The Linked Data initiative (linkeddata.org) enables people to share structured data on the Web as easily as they can share documents – as with documents, the value and usefulness of data increases the more it is interlinked with other data. To be part of the Linked Data web, data has to be accessible as RDF over the HTTP protocol in line with guidelines. At the time of writing there are 8 billion triples in Linked Data datasets.
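
The sketch below illustrates the basic Linked Data pattern in Python: dereference a resource URI over HTTP, request RDF via content negotiation, and parse the returned triples with rdflib. The resource URI is a placeholder; any URI published according to the Linked Data guidelines would be handled the same way.

    # Dereference a Linked Data URI and parse the RDF it returns.
    # The URI below is a placeholder, not a real Linked Data resource.
    import requests
    from rdflib import Graph

    uri = "http://example.org/resource/workflow-16"  # hypothetical Linked Data URI
    resp = requests.get(uri, headers={"Accept": "application/rdf+xml"})
    resp.raise_for_status()

    g = Graph()
    g.parse(data=resp.text, format="xml")            # parse the returned RDF/XML
    print(f"{len(g)} triples describing {uri}")
    for s, p, o in list(g)[:10]:
        print(s, p, o)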

With Linked Data a user can assemble a workflow in minutes to integrate data and call upon a variety of services from search and computation to visualisation. While the Linked Data movement has persuaded public data providers to deliver RDF, we are now beginning to see the assembly of scripts and workflows that consume it – and the sharing of these on myExperiment. We believe this is an important glimpse of future research practice: the ability to assemble with ease experiments that produce and consume this form of rich scientific content.

Discussion

There is an open debate about the extent to which open publication should be mandated through the project lifecycle. A common pattern is to share artefacts with friends and colleagues and then make them available more broadly at time of publication. myExperiment supports this model, providing privacy and facilitating openness. In contrast, some sites like openwetware.org oblige their members to make everything public and still enjoy considerable adoption, exemplifying the open science approach.

Scholarly communication is evolving (Hey and Hey 2006) but the traditional academic publishing system has reinforced silos and made communication between disciplines more difficult. In contrast, important challenges like climate change research, which cut across different research communities, demand a social infrastructure to support resource sharing in large teams, and new shared artefacts. With this there also needs to be a culture of sharing and of making shared artefacts re-usable. It is clear that the behaviour of researchers is closely related to incentive models, and these are currently set up around traditional publications. The creation of data and digital methods also needs to be rewarded if these are to flourish as powerful enablers of new research.

myExperiment shares many characteristics with social networking sites for scientists and also with open repositories and contemporary content management systems, but it also exemplifies some important principles in developing Science 2.0 solutions (De Roure and Goble 2009). One is the focus on providing a specific solution that meets the immediate requirements of its users while remaining highly configurable to the needs of new communities. Another is that the user can come to myExperiment and find it familiar to use, but equally the myExperiment functionality can be appropriated and integrated into the familiar working environment of the user, be it loosely coupled or tightly integrated. Through Linked Data, myExperiment realises the network effects of scientific data as well as the network effects of the scientific community. It is an example of the kinds of systems that enable Science 2.0.

References

Coles, S. and Carr, L. (2008). Experiences with Repositories & Blogs in Laboratories. Third International Conference on Open Repositories, 1-4 April, Southampton, UK.

De Roure, D., Goble, C., Aleksejevs, S., Bechhofer, S., Bhagat, J., Cruickshank, D., Fisher, P., Hull, D., Michaelides, D., Newman, D., Procter, R., Lin, Y. and Poschen, M. (2009). Towards Open Science: The myExperiment approach. Concurrency and Computation: Practice and Experience. (In Press)

De Roure, D. and Goble, C. (2009). Software Design for Empowering Scientists. IEEE Software, 26(1). January/February 2009. pp. 88-95.

Goble, C. and De Roure, D. (2008). Curating Scientific Web Services and Workflows. Educause Review, 43(5), September/October.

Hey, T. and Hey, J. (2006). e-Science and its implications for the library community. Library Hi Tech, 24(4). pp. 515-528.

O'Reilly, T. (2005). What Is Web 2.0? Design Patterns and Business Models for the Next Generation of Software, September.

Shneiderman, B. (2008). Science 2.0. Science, 319. pp. 1349-1350.

Van de Sompel, H., Lagoze, C., Nelson, M.L., Warner, S., Sanderson, R. and Johnston, P. (2009). Adding eScience Assets to the Data Web. CoRR abs/0906.2135.

3.2 Scientific Workflows

Key Concepts:

  • Scientific workflows

  • Data-intensive research

Introduction

The use of data processing workflows within the business sector has been commonplace for many years. Their use within the scientific community, however, has only just begun. With the uptake of workflows within scientific research, an unprecedented level of data analysis is now at the fingertips of individual researchers, leading to a change in the way research is carried out. This chapter describes the advantages of using workflows in modern biological research, demonstrating research from the field where the application of workflow technologies was vital for understanding the processes involved in resistance and susceptibility to infection by a parasite. Specific attention is drawn to the Taverna Workflow Workbench (Hull et al. 2006), a workflow management system that provides a suite of tools to support the design, execution and management of complex analyses in data-intensive research, for example in the Life Sciences.

Data-Intensive Research in the Life Sciences

In the last decade the field of informatics has moved from the fringes of biological and biomedical sciences to being an essential part of research. From the early days of gene and protein sequence analysis, to the high-throughput sequencing of whole genomes, informatics is integral in the analysis, interpretation, and understanding of biological data. The post-genomic era has been witness to an exponential rise in the generation of biological data; the majority of which is freely available in the public domain, and accessible over the Internet.

New techniques and technologies are continuously emerging to increase the speed of data production. As a result, the effort involved in generating novel biological hypotheses has shifted from the task of data generation to that of data analysis. The results of such high-throughput investigations, and the way they are published and shared, initially benefit the research groups generating the data, yet they are fundamental to many other investigations and research institutes. Their public availability means that they can then be reused in the day-to-day work of many other scientists, as is true for most bioinformatics resources. The overall effect is the accumulation of useful biological resources over time.

The 2009 Database special issue of Nucleic Acids Research listed over 1,000 different biological databases available to the scientific community. Many of these data resources have associated analysis tools and search algorithms, increasing the number of possible tools and resources to several thousand. These resources have been developed over time by different institutions. Consequently, they are distributed and highly heterogeneous, with few standards for data representation or data access. Therefore, despite the availability of these resources, integration and interoperability present significant challenges to researchers.

In bioinformatics, many of the major service providers, including the NCBI, EBI and DDBJ, now offer Web Service interfaces to their resources, and many more are embracing this technology each year. This widespread adoption of Web Services has enabled workflows to become more commonly used within scientific research. Data held at the NCBI can now be analysed with tools available at the EBI within a single analysis pipeline.
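
To make this concrete, the following minimal sketch fetches a protein sequence from the NCBI using the E-utilities efetch service and submits it to an analysis tool at the EBI through its REST job-dispatcher interface. The accession number and e-mail address are placeholders, and the EBI parameters shown are assumptions that should be checked against the providers' current documentation before use.

    # A sketch of the "data from NCBI, analysis at EBI" pattern described above.
    import requests

    # 1. Retrieve a FASTA record from NCBI (E-utilities efetch).
    efetch = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    fasta = requests.get(efetch, params={
        "db": "protein", "id": "NP_001138655",     # placeholder accession
        "rettype": "fasta", "retmode": "text",
    }).text

    # 2. Submit the sequence to an EBI tool (here: BLAST via the EBI
    #    job-dispatcher REST interface, assumed to accept these parameters).
    run = "https://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    job_id = requests.post(run, data={
        "email": "user@example.org",               # placeholder address
        "program": "blastp",
        "stype": "protein",
        "database": "uniprotkb_swissprot",
        "sequence": fasta,
    }).text
    print("Submitted EBI job:", job_id)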

In Silico Workflows

One possible solution to the problem of integrating heterogeneous resources is the use of in silico workflows. The use of workflows in science has only emerged over the last few years and addresses different concerns to workflows used within the business sector. Rather than co-ordinating the management and transactions between corporate resources, scientific workflows are used to automate the analysis of data through multiple, distributed data resources in order to execute complex in silico experiments.

Workflows provide a mechanism for accessing remote third-party services and components. This in turn reduces the overheads of downloading, installing, and maintaining resources locally whilst ensuring access to the latest versions of data and tools. Additionally, much of the computation happens remotely (on dedicated servers). This allows complex and computationally intensive workflows to be executed from basic desktop or laptop computers. As a result, the researchers are not held back by a lack of computational resources or access to data.

A workflow provides an abstracted view over the experiment being performed. It describes what analyses will be executed, not the low-level details of how they will be executed; the user does not need to understand the underlying code, only the scientific protocol. This protocol can be easily understood by others, so it can be reused, or even altered and repurposed. Workflows are a suitable technology wherever scientists need to automate data processing through a series of analysis steps. Such mechanisms have the potential to increase the rate of data analysis from a cottage-scale to an industrial-scale operation.
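
The separation between the "what" and the "how" can be illustrated with a deliberately simple, workflow-system-agnostic sketch in Python: the workflow is nothing more than an ordered list of named steps, and a generic runner takes care of executing them. The step names and functions are hypothetical stand-ins for calls to remote services and analysis tools.

    # A minimal illustration of a workflow as a declarative protocol.
    def fetch_records(accessions):
        # Stand-in for a call to a remote data service.
        return [f"record-for-{a}" for a in accessions]

    def annotate(records):
        # Stand-in for a remote analysis tool.
        return [f"{r}:annotated" for r in records]

    def summarise(records):
        return {"n_records": len(records), "sample": records[:3]}

    WORKFLOW = [fetch_records, annotate, summarise]    # the "what": an ordered protocol

    def run(workflow, data):
        # The "how": pass each step's output to the next step.
        for step in workflow:
            data = step(data)
        return data

    print(run(WORKFLOW, ["Q9GZX7", "P04637"]))         # placeholder accessions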

There are many workflow management systems available in the scientific domain, including Taverna (Hull et al. 2006), Kepler (Altintas et al. 2004) and Triana (Taylor et al. 2003). Taverna, developed by the myGrid consortium (http://www.mygrid.org.uk/), is a workflow system that was built with the Life Sciences in mind but has since been used in other fields as well, including Physics, Astronomy and Chemistry. Like many other workflow systems, the Taverna Workbench provides the following (a sketch of invoking Taverna from a script appears after this list):

  • an environment for designing workflows;

  • an enactment engine to execute workflows locally or remotely;

  • support for workflow design in the form of service and workflow discovery;

  • and provenance services to manage the results and events of workflow invocations.
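
As a rough sketch of non-interactive use, the following Python snippet drives Taverna's command-line execution tool via subprocess. The launcher name, flags, input name and workflow file shown are assumptions for illustration and should be checked against the documentation of the installed Taverna version.

    # Run a Taverna workflow from a script (paths and flags are assumptions).
    import subprocess

    result = subprocess.run(
        [
            "./executeworkflow.sh",                 # assumed Taverna command-line launcher
            "-inputvalue", "gene_list", "Daxx",     # hypothetical workflow input
            "-outputdir", "results/run-001",        # where to write result files
            "workflows/qtl_analysis.t2flow",        # hypothetical workflow file
        ],
        capture_output=True, text=True, check=False,
    )
    print(result.stdout)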

Understanding Disease Resistance in Model Organisms

Taverna workflows are used in many areas of Life Science research, notably for research into genotype-phenotype correlations, proteomics, genome annotation, and Systems Biology. The following case study demonstrates the use of Taverna workflows in the Life Sciences domain for genotype-phenotype studies (Stevens et al. 2008).

Figure 3.1: This figure shows the conversion of a microarray CEL image file to a list of candidate genes, pathways, and pathway publications. The workflow makes use of a local statistical processor and services from the National Center for Biotechnology Information (NCBI) and the Kyoto Encyclopedia of Genes and Genomes (KEGG).

Sleeping sickness (or African trypanosomiasis) is an endemic disease throughout the sub-Saharan region of Africa. It is the result of infection by the trypanosome parasite, affecting a host of organisms. The inability of the agriculturally productive Boran cattle breed to resist trypanosome infection is a major restriction within this region. The N’Dama breed, however, has shown tolerance to infection and its subsequent disease. The low milk yields and lack of physical strength of this breed, unfortunately, limit its use in farming or meat production. A better understanding of the processes that govern the characteristics of resistance or susceptibility in different breeds of cattle will potentially lead to the development of novel therapeutic drugs or the construction of informed selective breeding programs for enhancing agricultural production.

The Wellcome Trust Host-Pathogen project is currently investigating the mechanisms of resistance to this parasitic infection, utilising Taverna workflows for a large-scale analysis of complex biological data (Fisher et al. 2007). The workflows in this study combine two approaches to identify candidate genes and their subsequent biological pathways: classic genetic mapping can identify chromosomal regions that contain genes involved in the expression of a trait (Quantitative Trait Loci or QTL), while transcriptomics can reveal differential gene expression levels in susceptible and resistant species.
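
The logic these workflows combine can be illustrated with a small, self-contained sketch: candidate genes are those that both lie within the QTL region and show a marked change in expression after infection. The gene names, genomic coordinates, fold changes and threshold below are invented purely for illustration and are not data from the study.

    # Candidate selection: genes inside the QTL interval with a strong
    # expression change after infection (all values are invented examples).
    qtl_region = ("chr17", 25_000_000, 32_000_000)      # hypothetical QTL interval

    genes = {
        # name: (chromosome, start position, log2 fold change after infection)
        "GeneA": ("chr17", 26_100_000, -2.4),
        "GeneB": ("chr17", 30_500_000, 0.1),
        "GeneC": ("chr5",  12_300_000, 3.0),
    }

    def in_region(chrom, pos, region):
        r_chrom, r_start, r_end = region
        return chrom == r_chrom and r_start <= pos <= r_end

    candidates = [
        name for name, (chrom, pos, log2fc) in genes.items()
        if in_region(chrom, pos, qtl_region) and abs(log2fc) >= 1.0
    ]
    print("Candidate genes:", candidates)   # -> ['GeneA']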

Previous studies using the mouse as a model organism identified three chromosomal regions statistically linked to resistance to trypanosome infection. One of these regions, the Tir1 QTL, showed the largest effect on survival. Previous investigations using this QTL identified a region shared between the mouse and cow genomes. Because the data analysis task is so large, researchers performing such an analysis manually have tended to triage their data, in this case focusing on this shared region in their search for candidate genes contributing to susceptibility to trypanosome infection. While this approach may be scientifically valid, there is a danger that candidate genes may be missed where additional biological factors contribute to the expression of the phenotype. With a workflow, this triage of data is no longer necessary: all data can be analysed systematically, reducing the risk of missing vital information.

Researchers on the Wellcome Trust Host-Pathogen project conducted a wider analysis of the entire QTL region using a set of workflows to identify pathways whose genes lie within the chosen QTL region and contain genes whose expression level changes. As a result of this research, a key pathway was identified whose component genes showed differential expression following infection with the trypanosome parasite. Further analysis showed that, within this pathway, the Daxx gene is located within the Tir1 QTL region and showed the strongest change in expression level. Subsequent investigations using the scientific literature highlighted the potential role of Daxx in contributing to susceptibility to trypanosome infection. This prompted the re-sequencing of Daxx within the laboratory, leading to the identification of mutations of the gene within the susceptible mouse strains. Previous studies had failed to identify this candidate gene because of the premature triage of the QTL down to the syntenic region.

This example shows that taking this kind of data-driven approach to analysing complex biological data at the level of biological pathways can provide detailed information about the molecular processes contributing to the expression of these traits. The success of this work lay primarily in data integration and in the ability of the workflow to process large amounts of data in a consistent and automated fashion.

Workflow Reuse

Workflows not only provide a description of the analysis being performed, but also serve as a permanent record of the experiment when coupled with the results and provenance of workflow runs. Researchers can verify past results by re-running the workflow or by exploring the intermediate results from past invocations. The same workflow can also be used with new data or modified and reused for further investigations.

The ability to reuse workflows and to automatically record provenance o