Scientific Workflows

Module by: Katy Wolstencroft, Paul Fisher, David De Roure, Carole Goble. Edited by: Alex Voss

Summary: This chapter describes the advantages of using scientific workflows in data-intensive research.

Key Concepts:

  • Scientific workflows
  • Data-intensive research

Introduction

The use of data-processing workflows within the business sector has been commonplace for many years. Their use within the scientific community, however, has only just begun. With the uptake of workflows in scientific research, an unprecedented level of data analysis is now at the fingertips of individual researchers, changing the way research is carried out. This chapter describes the advantages of using workflows in modern biological research, demonstrating a case from the field where the application of workflow technologies was vital for understanding the processes involved in resistance and susceptibility to infection by a parasite. Specific attention is drawn to the Taverna Workflow Workbench (Hull et al. 2006), a workflow management system that provides a suite of tools to support the design, execution, and management of complex analyses in data-intensive research, for example in the Life Sciences.

Data-Intensive Research in the Life Sciences

In the last decade, the field of informatics has moved from the fringes of the biological and biomedical sciences to being an essential part of research. From the early days of gene and protein sequence analysis to the high-throughput sequencing of whole genomes, informatics is integral to the analysis, interpretation, and understanding of biological data. The post-genomic era has witnessed an exponential rise in the generation of biological data, the majority of which is freely available in the public domain and accessible over the Internet.

New techniques and technologies are continuously emerging to increase the speed of data production. As a result, the bottleneck in generating novel biological hypotheses has shifted from data generation to data analysis. The results of such high-throughput investigations, and the way they are published and shared, initially benefit the research groups generating the data; yet they are fundamental to many other investigations and research institutes. Public availability means that the data can be reused in the day-to-day work of many other scientists, as is true of most bioinformatics resources. The overall effect is the accumulation of useful biological resources over time.

The 2009 Database special issue of Nucleic Acids Research listed over 1000 different biological databases available to the scientific community. Many of these data resources have associated analysis tools and search algorithms, increasing the number of available tools and resources to several thousand. These resources have been developed over time by different institutions. Consequently, they are distributed and highly heterogeneous, with few standards for data representation or data access. Despite the availability of these resources, therefore, integration and interoperability present significant challenges to researchers.

In bioinformatics, many of the major service providers, including the NCBI (National Center for Biotechnology Information), the EBI (European Bioinformatics Institute), and the DDBJ (DNA Data Bank of Japan), provide Web Service interfaces to their resources, and many more embrace this technology each year. This widespread adoption of Web Services has enabled workflows to be used more commonly within scientific research. Data held at the NCBI can now be analysed with tools available at the EBI, within a single analysis pipeline.
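Such Web Service access is often as simple as a parameterised HTTP request. As a minimal sketch, the following composes a request to NCBI's public E-utilities interface (the accession number is only an example, and no network call is made here):

```python
from urllib.parse import urlencode

# Hedged sketch: composing a request to NCBI's E-utilities REST interface.
# "NP_000509" is an example protein accession; the request is built but
# deliberately not sent, so the sketch stays self-contained.
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
params = {"db": "protein", "id": "NP_000509", "rettype": "fasta", "retmode": "text"}
url = base + "?" + urlencode(params)
```

A workflow step would issue this request and pass the returned FASTA record on to the next service in the pipeline.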

In Silico Workflows

One possible solution to the problem of integrating heterogeneous resources is the use of in silico workflows. The use of workflows in science has only emerged over the last few years and addresses different concerns to workflows used within the business sector. Rather than co-ordinating the management and transactions between corporate resources, scientific workflows are used to automate the analysis of data through multiple, distributed data resources in order to execute complex in silico experiments.

Workflows provide a mechanism for accessing remote third-party services and components. This in turn reduces the overheads of downloading, installing, and maintaining resources locally whilst ensuring access to the latest versions of data and tools. Additionally, much of the computation happens remotely (on dedicated servers). This allows complex and computationally intensive workflows to be executed from basic desktop or laptop computers. As a result, the researchers are not held back by a lack of computational resources or access to data.

A workflow provides an abstracted view of the experiment being performed. It describes what analyses will be executed, not the low-level details of how they will be executed; the user does not need to understand the underlying code, only the scientific protocol. This protocol can be easily understood by others, so it can be reused, or even altered and repurposed. Workflows are a suitable technology wherever scientists need to automate data processing through a series of analysis steps. Such mechanisms have the potential to increase the rate of data analysis from a cottage-scale to an industrial-scale operation.
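The separation of protocol from implementation can be sketched in a few lines. This is an illustrative toy, not Taverna's API: the two step functions are local stand-ins for remote services, and the gene name and sequence are invented.

```python
def fetch_sequence(gene_id):
    # Stand-in for a remote sequence-database service; data is invented.
    return {"geneX": "ATGCCTATTGGATCCAAAGAGAGG"}[gene_id]

def gc_content(seq):
    # Stand-in for a remote analysis service: fraction of G/C bases.
    return (seq.count("G") + seq.count("C")) / len(seq)

# The scientific protocol -- what is done and in what order, not how.
workflow = [("fetch", fetch_sequence), ("gc_content", gc_content)]

def run(workflow, data):
    """Pass each step's output to the next step."""
    for _name, step in workflow:
        data = step(data)
    return data

result = run(workflow, "geneX")
```

Because the `workflow` list names the steps without fixing their implementations, a step can be swapped for a different service without changing the protocol that others read and reuse.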

There are many workflow management systems available in the scientific domain, including Taverna (Hull et al. 2006), Kepler (Altintas et al. 2004) and Triana (Taylor et al. 2003). Taverna, developed by the myGrid consortium (http://www.mygrid.org.uk/), is a workflow system that was built with the Life Sciences in mind, but it has since been used in other fields as well, including Physics, Astronomy and Chemistry. Like many others, the Taverna Workbench provides:

  • an environment for designing workflows;
  • an enactment engine to execute workflows locally or remotely;
  • support for workflow design in the form of service and workflow discovery;
  • and provenance services to manage the results and events of workflow invocations.

Understanding Disease Resistance in Model Organisms

Taverna workflows are used in many areas of Life Science research, notably for research into genotype-phenotype correlations, proteomics, genome annotation, and Systems Biology. The following case study demonstrates the use of Taverna workflows in the Life Sciences domain for genotype-phenotype studies (Stevens et al. 2008).

Figure 1: This figure shows the conversion of a microarray CEL image file into a list of candidate genes, pathways, and pathway publications. The workflow makes use of a local statistical processor and services from the National Center for Biotechnology Information (NCBI) and the Kyoto Encyclopedia of Genes and Genomes (KEGG).
Figure 1 (graphics1.png)

Sleeping sickness (or African trypanosomiasis) is an endemic disease throughout the sub-Saharan region of Africa. It is the result of infection by the trypanosome parasite, which affects a host of organisms. The inability of the agriculturally productive Boran cattle breed to resist trypanosome infection is a major restriction within this region. The N’Dama breed of cattle, however, has shown tolerance to infection and the subsequent disease; unfortunately, its low milk yields and lack of physical strength limit its use in farming and meat production. A better understanding of the processes that govern resistance or susceptibility in different breeds of cattle could lead to the development of novel therapeutic drugs or to informed selective breeding programmes for enhancing agricultural production.

Research conducted by the Wellcome Trust Host-Pathogen project is investigating the mechanisms of resistance to this parasitic infection, utilising Taverna workflows for a large-scale analysis of complex biological data (Fisher et al. 2007). The workflows in this study combine two approaches to identify candidate genes and their biological pathways: classic genetic mapping can identify chromosomal regions that contain genes involved in the expression of a trait (Quantitative Trait Loci, or QTL), while transcriptomics can reveal differential gene expression levels in susceptible and resistant animals.
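The combined approach amounts to intersecting two pieces of evidence per gene: does it lie in the mapped QTL interval, and does its expression change? A hedged sketch, with invented gene names, positions, and fold-changes purely for illustration:

```python
# Hypothetical QTL interval (chromosome, start, end); values are invented.
qtl_region = ("chr17", 25_000_000, 35_000_000)

genes = [
    # (name, chromosome, position, expression fold-change after infection)
    ("GeneA", "chr17", 27_500_000, 3.1),
    ("GeneB", "chr17", 30_200_000, 1.1),   # in region, but expression unchanged
    ("GeneC", "chr5",  12_000_000, 4.0),   # differentially expressed, wrong region
    ("GeneD", "chr17", 33_900_000, -2.8),
]

def candidates(genes, region, min_fold_change=2.0):
    """Keep genes inside the QTL region whose expression changes markedly."""
    chrom, start, end = region
    return [name for name, c, pos, fc in genes
            if c == chrom and start <= pos <= end and abs(fc) >= min_fold_change]

selected = candidates(genes, qtl_region)   # GeneA and GeneD qualify
```

In the real study this filtering runs over thousands of genes and live database services rather than a four-element list, which is precisely where workflow automation pays off.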

Previous studies using the mouse as a model organism identified three chromosomal regions statistically linked to resistance to trypanosome infection. One of these regions, the Tir1 QTL, showed the largest effect on survival. Earlier investigations of this QTL identified a region shared between the mouse and cow genomes. Because the data analysis task is large, researchers analysing the data manually would often triage it, and in this case tended to focus on this shared region in their search for candidate genes contributing to susceptibility to trypanosome infection. While this approach may be scientifically valid, there is a danger of missing candidate genes where additional biological factors contribute to the expression of the phenotype. With a workflow, this triage is no longer necessary: all data can be analysed systematically, reducing the risk of missing vital information.

Researchers on the Wellcome Trust Host-Pathogen project conducted a wider analysis of the entire QTL region, using a set of workflows to identify pathways that lie within the chosen QTL region and contain genes whose expression level changes. As a result of this research, a key pathway was identified whose component genes showed differential expression following infection by the trypanosome parasite. Further analysis showed that, within this pathway, the Daxx gene is located within the Tir1 QTL region and showed the strongest change in expression level. Subsequent investigation of the scientific literature highlighted the potential role of Daxx in contributing to susceptibility to trypanosome infection. This prompted the re-sequencing of Daxx in the laboratory, leading to the identification of mutations in the gene within the susceptible mouse strains. Previous studies had failed to identify Daxx as a candidate gene because of the premature triage of the QTL down to the syntenic region.

This example shows that conducting this kind of data-driven approach to analysing complex biological data at the level of biological pathways can provide detailed information of the molecular processes contributing to the expression of these traits. The success of this work was primarily in data integration and the ability of the workflow to process large amounts of data in a consistent and automated fashion.

Workflow Reuse

Workflows not only provide a description of the analysis being performed, but also serve as a permanent record of the experiment when coupled with the results and provenance of workflow runs. Researchers can verify past results by re-running the workflow or by exploring the intermediate results from past invocations. The same workflow can also be used with new data or modified and reused for further investigations.
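The mechanics of such a provenance record can be sketched minimally. This is not Taverna's provenance model; the two steps and their data are invented, and the point is only that recording each step's input and output lets past results be inspected and re-derived:

```python
def normalise(values):
    # Toy analysis step: scale values by the maximum.
    return [v / max(values) for v in values]

def threshold(values, cutoff=0.5):
    # Toy analysis step: keep values at or above the cutoff.
    return [v for v in values if v >= cutoff]

def run_with_provenance(steps, data):
    """Execute steps in order, recording each step's input and output."""
    provenance = []
    for name, step in steps:
        result = step(data)
        provenance.append({"step": name, "input": data, "output": result})
        data = result
    return data, provenance

steps = [("normalise", normalise), ("threshold", threshold)]
final, trace = run_with_provenance(steps, [2, 5, 10, 1])

# Intermediate results of the past invocation remain available for inspection,
# and re-running the workflow on the recorded input reproduces the result.
assert trace[0]["output"] == [0.2, 0.5, 1.0, 0.1]
rerun, _ = run_with_provenance(steps, trace[0]["input"])
assert rerun == final
```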

The ability to reuse workflows and to automatically record the provenance of workflow runs gives workflow management systems a large advantage over manual analysis methods and scripting. Manual analysis techniques are inherently difficult to replicate, a problem compounded by poor documentation. An example is the widespread use of ‘link integration’ in bioinformatics (Stein 2003). This practice of hyper-linking through any number of data resources exacerbates the problem of capturing the methods used to obtain in silico results, since it is often difficult to identify the essential data in the chain of hyper-linked resources.

Workflow reuse is also an important area within the sciences, providing a mechanism for sharing methodologies and analysis protocols. As a result, repositories for finding and sharing workflows are emerging. One such resource, myExperiment, developed in collaboration between the Universities of Manchester and Southampton, provides a workflow repository and a collaborative social networking environment to support the in silico experimental process and to enable scientists to connect with others with similar interests. The workflows discussed in the trypanosomiasis case study are available on myExperiment as part of a workflow pack, and many of them have already been reused in other studies. One example is the repurposing of the microarray gene expression workflow to analyse gene expression data from E. coli, appending a further workflow that adds information retrieval for future text-mining applications (shown in Figure 1).

Discussion

Manually processing and examining results in biology is no longer feasible for many scientists. Data is dynamic, distributed, and often very large. This will not change in the near future.

The integration and interoperation of data between different and distributed resources is a vital part of almost all experiments. With the exception of a few supercomputing centres, most institutions do not have the storage, computational, or curation facilities to consider integrating resources locally. The ability to access and utilise many different resources from all over the world is consequently a large advantage of workflow technologies. It allows scientists to access computing resources far beyond the power available through their own desktop machines.

Building workflows is a practical solution to problems involving access to data and applications, but care still needs to be taken to exploit these advantages. Interoperation without integration may lead to unmanageable results which are difficult to analyse. In this event, the problem has not been solved, but simply transferred further downstream. Considering how results will be used and who will be analysing them is important. For example, designing workflows to populate a data model, or to feed into external visualization software, could reduce these problems. The provenance traces of the workflow runs can also help scientists to explore their results.

Designing these ‘advanced’ workflows requires a significant amount of informatics knowledge that many laboratory researchers cannot be expected to have. They do, however, need to use tools and software to analyse their data. The introduction of workflow repositories, like myExperiment, provides the wider research communities with access to pre-configured, complex workflows. Researchers can re-use established analysis protocols by downloading and running them with their own data. In some circumstances, they can even run Taverna workflows through the myExperiment interface.

Increasingly, workflows are becoming applications that are hidden behind web pages like myExperiment, or other domain specific portals. Instead of stand-alone tools, they are becoming integral parts of virtual research environments, or e-Laboratories. Users may not necessarily know they are invoking workflows.

The use of workflows in research can reduce many problems associated with data distribution and size. In the post-genomic era of biology, for example, this is extremely important. Biomedical science is a multidisciplinary activity that can benefit from advances in e-Science in equal measure to advances in laboratory techniques. Sharing workflows and in silico analysis methods, with tools like Taverna and myExperiment, can lead to significant contributions to research in this and other disciplines.

References

Altintas, I. et al. (2004). Kepler: an extensible system for design and execution of scientific workflows. Proceedings of the 16th International Conference on Scientific and Statistical Database Management.

Fisher, P., Hedeler, C., Wolstencroft, K., Hulme, H., Noyes, H., Kemp, S., Stevens, R. and Brass, A. (2007). A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Research, 35(16). pp. 5625-5633.

Hull, D., Wolstencroft, K., Stevens, R., Goble, C., Pocock, M., Li, P. and Oinn, T. (2006). Taverna: a tool for building and running workflows of services. Nucleic Acids Research, vol. 34, Web Server issue, W729-W732.

Stein, L. (2003). Integrating biological databases. Nat Rev Genet, 4(5). pp. 337-345.

Stevens, R. et al. (2004). Exploring Williams-Beuren syndrome using myGrid. Bioinformatics, 20 Suppl 1.

Stevens, R. et al. (2008). Traversing the bioinformatics landscape. W. Dubitzky (ed.) Data Mining Techniques in Grid Computing Environments. John Wiley and Sons. pp. 141-164.

Taylor, I. et al. (2003). Triana Applications within Grid Computing and Peer to Peer Environments. Journal of Grid Computing, 1(2). pp. 199-217.
