O'Really?

June 17, 2009

Nettab 2009 Day Two: Wikis ‘n’ Workflows

This is a brief report and some links from the second day of Network Applications and Tools in Biology (NETTAB 2009) in Catania, Sicily. There were two keynotes, one on the RNA WikiProject [1] by Alex Bateman and one on myExperiment [2] (by me), as well as presentations by (I think, but I wasn’t concentrating enough) Dietlind Gerloff, Giuliano Armano, Frédéric Cadier and Leandro Ciuffo.

Alex Bateman (Wikipedia user:Alexbateman) gave an entertaining talk on the RNA WikiProject: community annotation of RNA families, in which data from the Rfam database [3] has been put into regular Wikipedia. This project got quite a lot of media attention back in February. In this case, the primary advantages of “letting go of data” by giving it to Wikipedia are that it is read by everyone who uses Google (where Wikipedia pages are frequently the top search result) and that Wikipedia gets much more traffic than biological databases like rfam.sanger.ac.uk do. Thanks to wikirank, which tells you what is popular on Wikipedia, it is also possible to quickly compare the popularity of pages: see RNA vs. Ribosomal RNA vs. Micro RNA vs. SnoRNA for an example. The Rfam project has some interesting stats on who makes the most edits to the Rfam pages: it isn’t always the scientists who make important contributions, and anonymous users and machines (e.g. Rfambot, Smackbot and Citation bot) are often doing most of the hard work. There is a very long tail of contributors who make small contributions, which supports the rule that 90% of users in online communities are lurkers who never contribute, and is reminiscent of Citizen Science and Muggles. I wanted to put the slides from this talk on SlideShare, but they contain some unpublished data. You can, however, subscribe to the feed of the Rfam and Pfam blog at xfam.wordpress.com if you’d like to keep up to date on developments in this area.

After the keynote there were presentations by Dietlind Gerloff on Open Knowledge (a new agent-based infrastructure for bioinformatics experimentation, with a nice pictorial introduction using Lego) and by Giuliano Armano on ProDaMa-C, a collaborative web application to generate specialised protein structure datasets.

The next keynote was on myexperiment.org, “Where Experimental Work Flows” – my slides, Who are you? Managing collaborative digital identities in bioinformatics with myExperiment, are embedded below.

I followed this presentation with a live 30-minute demonstration and discussion of myExperiment. The most interesting question people asked was: why use OpenID instead of full-blown Public Key Infrastructure? (Answer: OpenID is currently a lot easier and provides good-enough security.) The rest of the day is a bit of a blur. I’m with Tim Bray in enjoying the monster adrenaline high of public speaking, but with all that ChEBI:28918 coursing through my veins it can be difficult to think straight (immediately before, during or after a talk)… so you’ll have to take a look at the proceedings for the full details of what happened in the afternoon – but it included Make Histri (great name!), SBMM: Systems Biology Metabolic Modeling Assistant [4] by Ismael Navas-Delgado, and Biomedical Applications of the EELA-2 project.

By the evening time, there was some Opera dei Pupi (traditional Sicilian puppet theatre), a trip to Acireale and a delicious Italian feast in a ristorante (the name of which I can’t remember) to round off an enjoyable day.

References

  1. Daub, J., Gardner, P., Tate, J., Ramskold, D., Manske, M., Scott, W., Weinberg, Z., Griffiths-Jones, S., & Bateman, A. (2008). The RNA WikiProject: community annotation of RNA families. RNA, 14(12), 2462–2464. DOI: 10.1261/rna.1200508
  2. De Roure, D., & Goble, C. (2009). Software design for empowering scientists. IEEE Software, 26(1), 88–95. DOI: 10.1109/MS.2009.22
  3. Gardner, P., Daub, J., Tate, J., Nawrocki, E., Kolbe, D., Lindgreen, S., Wilkinson, A., Finn, R., Griffiths-Jones, S., Eddy, S., & Bateman, A. (2009). Rfam: updates to the RNA families database. Nucleic Acids Research, 37(Database issue). DOI: 10.1093/nar/gkn766
  4. Reyes-Palomares, A., Montanez, R., Real-Chicharro, A., Chniber, O., Kerzazi, A., Navas-Delgado, I., Medina, M., Aldana-Montes, J., & Sanchez-Jimenez, F. (2009). Systems biology metabolic modeling assistant: an ontology-based tool for the integration of metabolic data in kinetic modeling. Bioinformatics, 25(6), 834–835. DOI: 10.1093/bioinformatics/btp061

June 10, 2009

Kenjiro Taura on Parallel Workflows

Kenjiro Taura is visiting Manchester next week from the Department of Information and Communication Engineering at the University of Tokyo. He will be giving a seminar, the details of which are below:

Title: Large scale text processing made simple by GXP make: A Unixish way to parallel workflow processing

Date-time: Monday, 15 June 2009 at 11:00 AM

Location: Room MLG.001, mib.ac.uk

In the first part of this talk, I will introduce a simple tool called GXP make. GXP is a general-purpose parallel shell (a process launcher) for multicore machines, unmanaged clusters accessed via SSH, clusters or supercomputers managed by a batch scheduler, distributed machines, or any mixture thereof. GXP make is a ‘make’ execution engine that executes regular UNIX makefiles in parallel. Make, though typically used for software builds, is in fact a general framework for concisely describing workflows consisting of sequential commands. Installation of GXP requires no root privileges and needs to be done only on the user’s home machine. GXP make easily scales to more than 1,000 CPU cores. The net result is that GXP make allows an easy migration of workflows from serial environments to clusters and to distributed environments. In the second part, I will talk about our experience of running a complex text processing workflow developed by Natural Language Processing (NLP) experts. It is an entire workflow that processes MEDLINE abstracts with deep NLP tools (e.g., the Enju parser [1]) to generate search indices for MEDIE, a semantic retrieval engine for MEDLINE. It was originally described in a Makefile with no particular provision for parallel processing, yet GXP make was able to run it on clusters with almost no changes to the original Makefile. The time for processing abstracts published in a single day was reduced from approximately eight hours (with a single machine) to twenty minutes, with a trivial amount of effort. A larger-scale experiment processing all abstracts published so far, and the remaining challenges, will also be presented.
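
The scheduling idea behind running a Makefile in parallel is simple enough to sketch: targets whose prerequisites are all satisfied can run concurrently, in waves. The following Python sketch illustrates that model only – it is not GXP itself, and the targets and recipes are invented for the example:

```python
# A minimal sketch of the scheduling model behind parallel make:
# independent targets in the dependency graph run concurrently,
# the way `make -j` (or GXP make across a cluster) would run them.
from concurrent.futures import ThreadPoolExecutor

# target -> (prerequisites, "recipe") -- a toy MEDLINE-ish workflow
rules = {
    "index":     (["parsed"], "build search index"),
    "parsed":    (["abstracts"], "run the NLP parser"),
    "abstracts": ([], "fetch MEDLINE abstracts"),
    "stats":     (["abstracts"], "count tokens"),  # independent of "parsed"
}

def schedule(rules):
    """Order targets into waves; every target in a wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(rules):
        wave = [t for t, (deps, _) in rules.items()
                if t not in done and all(d in done for d in deps)]
        if not wave:
            raise ValueError("cyclic dependencies")
        waves.append(wave)
        done.update(wave)
    return waves

def run(rules):
    """Execute each wave with a thread pool, logging (target, recipe)."""
    log = []
    for wave in schedule(rules):
        with ThreadPoolExecutor() as pool:  # one wave = one parallel batch
            list(pool.map(lambda t: log.append((t, rules[t][1])), wave))
    return log

if __name__ == "__main__":
    for target, recipe in run(rules):
        print(f"{target}: {recipe}")
```

GXP make applies the same idea, except that each wave of ready targets is dispatched to remote machines (over SSH or a batch scheduler) rather than to local threads.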

References

  1. Miyao, Y., Sagae, K., Saetre, R., Matsuzaki, T., & Tsujii, J. (2009). Evaluating contributions of natural language parsers to protein–protein interaction extraction. Bioinformatics, 25(3), 394–400. DOI: 10.1093/bioinformatics/btn631

December 10, 2008

Congratulations Carole Goble, e-Scientist

At the Microsoft e-Science workshop in Indianapolis earlier this week, Carole Goble was awarded the first Jim Gray 2008 e-Science award, pictured here collecting the prize from Tony Hey of Microsoft Research. You can read all about it in the Seattle Tech Report, which says:

“As director of the U.K.’s myGrid project, Goble helped create Taverna, open source software that allows scientists to analyse complex data sets with a standard computer.”

It is very inspiring when colleagues win prizes and awards. Personally, I would not be here doing what I’m doing if it wasn’t for Carole and myGrid, and neither would many other people who work on (or have worked on) myGrid and related projects.

Carole, you are an inspiration to us all, congratulations! To celebrate your success, I’m off to commit some more of the seven deadly sins of bioinformatics [1]…

References

  1. Carole Goble The Seven Deadly Sins of Bioinformatics
  2. e-Science in Indianapolis: Carole Goble wins the 1st Jim Gray eScience Award
  3. Joseph Tartakoff British professor given first Jim Gray Award, Seattle Post-Intelligencer, Tech Report
  4. Todd Bishop UK prof receives Jim Gray award Tech Flash
  5. Savas Parastatidis Carole Goble as the first recipient of the “Jim Gray eScience Award”
  6. Microsoft Recognise Manchester e-Science Contribution
  7. Deborah Gage Microsoft creates award in the name of Jim Gray San Francisco Chronicle, The Tech Chronicles
  8. Microsoft New tools for Discovery on Display at e-Science workshop

September 5, 2007

WWW2007: Workflows on the Web

The Hitch-hiking novelist Douglas Noel Adams (DNA) once remarked that the World Wide Web (WWW) is the only thing whose shortened form – ‘double-you double-you double-you-dot’ – takes three times longer to say than what it’s “short” for [1]. If he were still with us today, there would be plenty of stuff at the 16th International World Wide Web Conference (WWW2007), currently underway in Banff, to interest him. Here are some short, abbreviated notes on a couple of interesting papers at this year’s conference. They are relevant to bioinformatics and worth reading, whichever type of DNA you’re most interested in.

One full paper [2] by Daniel Goodman describes a scientific workflow language called Martlet. The motivating example is taken from climateprediction.net, but I suspect some of the points made about scientific workflows are relevant to bioinformatics too. Just like the recent post by Boscoh about functional programming, the paper discusses an inspired-by-Haskell functional approach to building and running workflows. Comparisons with other workflow systems like Taverna/SCUFL are drawn; despite what they say, Taverna already uses a functional model (not an imperative one), it just hasn’t been published yet. The paper also draws comparisons between Martlet and other functional systems, like Google’s MapReduce. It concludes that the (allegedly) new Martlet programming model “raises the interesting possibility of a whole set of new algorithms just waiting to be discovered once people start to think about programming in this new way”, which is an exciting possibility.
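
To make the functional-workflow point concrete, here is a toy Python sketch of what “a workflow is just function composition over data” means in practice. The stage names and the unit-conversion example are invented for illustration, and this is not Martlet syntax:

```python
# A workflow as function composition: each stage is a function,
# and map/reduce steps are higher-order functions over the data.
from functools import reduce

def pipeline(*stages):
    """Compose stages left-to-right into a single workflow function."""
    return lambda data: reduce(lambda acc, f: f(acc), stages, data)

def map_stage(f):
    """Apply f to every element of the input collection."""
    return lambda xs: [f(x) for x in xs]

def reduce_stage(f, init):
    """Fold the input collection down to a single value."""
    return lambda xs: reduce(f, xs, init)

# A climate-flavoured example: convert each model run, then sum them.
workflow = pipeline(
    map_stage(lambda run: run * 1.8 + 32),    # Celsius -> Fahrenheit
    reduce_stage(lambda a, b: a + b, 0.0),    # sum the ensemble
)

ensemble = [14.1, 14.3, 13.9]
total = workflow(ensemble)
print(total / len(ensemble))   # the ensemble mean in the new units
```

The appeal of this style is that a workflow engine is free to parallelise the map stage, since each element is processed independently, which is exactly the observation behind MapReduce.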

Another position paper [3] (warning: position paper = arm waving) by Anupriya Ankolekar et al. argues that the Semantic Web and Web-Two-Point-Oh are complementary, rather than competing. Their motivating examples are a bit lame (blogging a movie? Couldn’t they think of something more original?), but they make some interesting (and obvious) points. The authors think that aggregators like Yahoo! Pipes will play an important role in the emerging Semantic Web. Currently, there don’t seem to be many bioinformaticians using Yahoo! Pipes; perhaps they just don’t share their pipes/workflows yet?

Running in parallel to all of the above is the Health Care and Life Sciences Data Integration for the Semantic Web workshop, where more detailed discussion of the bio semweb is underway. As it’s a workshop, there are no full or position papers, but take a look at The State of the Nation in Life Science Data Integration to get a flavour of what is going on.

Whether functional, semantic, Web-enabled or just buzzword-friendly, there is plenty of action in the scientific workflow field right now. If you’re interested in the webby stuff, next year’s conference, WWW2008, is in Beijing, China. I wonder if they will mark the 10th anniversary of the publication of that Google paper at WWW7 back in 1998? The deadline for papers at WWW2008 will probably be sometime in November 2007, but around 90% of submitted papers will be rejected if previous years are anything to go by. If you’re thinking of writing a paper, DON’T PANIC about those intimidating statistics, because bioinformatics is bursting with interesting and hard problems that challenge the state of the art. The kind of stuff that will go down well at Dubya Dubya Dubya.

(Photo credit: Fire Monkey Fish)

References

  1. Douglas Adams (1999) Beyond the Brochure: Build it and we will come
  2. Daniel Goodman (2007) Introduction and Evaluation of Martlet, a Scientific Workflow Language for Abstracted Parallelisation doi:10.1145/1242572.1242705
  3. Anupriya Ankolekar, Markus Krotzsch, Thanh Tran and Denny Vrandecic (2007) The Two Cultures: Mashing up Web 2.0 and the Semantic Web doi:10.1145/1242572.1242684


Creative Commons License

This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.


December 19, 2006

Taverna 1.5.0

Filed under: Uncategorized — Duncan Hull @ 8:26 pm

Happy Christmas from the myGrid team, who are pleased to announce the release of version 1.5.0 of the open source Taverna bioinformatics workflow toolkit [1]. This is now available for download from the SourceForge site and includes some substantial changes since version 1.4.

Taverna 1.5.0 is a small download, but when first run it will download and install the required packages, which can take some time on slow networks. In the near future there will be a mechanism for downloading a bundle of core packages. There are some significant changes in the underlying architecture of Taverna and in how it handles core packages and optional plugins, using a system called Raven; see the release notes below.

The documentation is currently being updated and the user documentation should be complete very soon, with the technical documentation following shortly afterwards. The reason for this is to allow the software to be released with some time to spare before the Christmas holidays.

Release notes:

There have been a number of substantial changes in the underlying architecture of Taverna since the previous release. These include:

  • An overhaul of the User Interface (UI), replacing the unpopular Multiple Document Interface with a cleaner and simpler single-document UI which can be customised using Perspectives. There are built-in perspectives for designing and enacting workflows, and plugins can integrate with the UI by providing perspectives of their own. Users are also able to create their own layouts built from individual components.
  • Taverna now allows multiple workflows to be open and enacted at the same time.
  • Support for version 0.5 of the BioMart data management system, together with backward compatibility for old workflows that used BioMart 0.4.
  • Better provenance generation and browsing support, through a plugin now known as LogBook.
  • Better support for semantic service discovery through the Feta plugin [2].
  • Modularisation of the Taverna code base.
  • Development and integration of an underlying architecture known as Raven. This allows Apache Maven-like declaration of dependencies, which are discovered and incorporated into the Taverna system at runtime. Together with the modularisation of the code base, Raven means that updates can be provided dynamically and incrementally, without the need for monolithic releases as in the past. This allows bug fixes and new features to be delivered within a very short timescale if necessary. It also gives plugin developers a greater degree of autonomy and independence from the core Taverna code base.
  • Improved and more advanced plugin management, with the ability to provide immediate updates, and for plugin providers to publish their plugins via XML descriptions.
  • Numerous bug fixes, including the removal of a number of memory leaks.

JIRA-generated release notes and bug status reports can be found here and here.

References

  1. Peer-reviewed publications about the Taverna workbench in PubMed
  2. Feta: A Light-Weight Architecture for User Oriented Semantic Service Discovery
  3. BioMoby extensions to the Taverna workflow management and enactment software

June 2, 2006

Debugging Web Services

Filed under: biotech,informatics — Duncan Hull @ 11:19 pm

There are a growing number of biomedical services out there on the Wild Wild Web for performing various computations on DNA, RNA and proteins, as well as on the associated scientific literature. Currently, using and debugging these services can be hard work. SOAPUI (SOAP User Interface) is a newish and handy free tool to help debug services and get your in silico experiments and analyses done, hopefully more easily.

So why should bioinformaticians care about Web Services? Three of the most important advantages are:

  1. They can reduce the need to install and maintain hundreds of tools and databases locally on desktop(s) or laboratory server(s) as these resources are programmatically accessible over the web.
  2. They can remove the need for tedious and error-prone screen-scraping, or worse, “cut-and-paste” of data between web applications that don’t have fully programmatic interfaces.
  3. It is possible to compose and orchestrate services into workflows or pipelines, which are repeatable and verifiable descriptions of your experiments that you can share. Needless to say, sharing repeatable experiments has always been an important part of science, and it shouldn’t be any different on the Web of Science.

All this distributed computing goodness comes at a price though and there are several disadvantages of using web services. We will focus on one here: Debugging services, which can be problematic. In order to do this, bioinformaticians need to understand a little bit about how web services work and how to debug them.

Death by specification

Debugging services sounds straightforward, but many publicly available biomedical services are not the simpler RESTian type, but the more complex SOAP-and-WSDL type of web service. Consequently, debugging usually requires a basic understanding of these protocols and interfaces: the so-called “Simple” Object Access Protocol (SOAP) and the Web Services Description Language (WSDL). However, these specifications are big, complicated and being superseded by newer versions, so you might lose the will to live while reading them. Also, individual services described in WSDL are easier for machines to read than for humans, and therefore give humble bioinformaticians a big headache. As an example, have a look at the WSDL for BLAST at the DNA Data Bank of Japan (DDBJ).

So, if you’re not intimately familiar with the WSDL 1.1 specification (frankly, life is too short, and they keep moving the goal-posts anyway), it is not very clear what is going on here. WSDL describes messages, port types, end points, part names, bindings, bla-bla-bla, and lots of other seemingly unnecessary abstractions. To add insult to injury, WSDL is used in several different styles and is expressed in verbose XML. Down with the unnecessary abstractions! But the problems don’t stop there. From looking at this WSDL, you have to make several leaps of imagination to work out what the corresponding SOAP messages that this BLAST service accepts and responds with will look like. So when you are analysing your favourite protein sequence(s) with BLAST, or perhaps InterProScan, it can be difficult or impossible to work out what went wrong.
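
If all you want is a quick list of the operations a WSDL offers, without wading through the whole document, a few lines of standard-library Python will do it. The embedded WSDL fragment below is a made-up miniature for illustration, not the real DDBJ document:

```python
# Pull the human-readable bits (port types and their operations)
# out of a WSDL document using only the standard library.
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# A tiny, invented WSDL fragment standing in for a real one.
wsdl = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="Blast">
  <portType name="BlastPortType">
    <operation name="searchSimple"/>
    <operation name="searchParam"/>
  </portType>
</definitions>"""

def list_operations(wsdl_text):
    """Map each portType name to the list of operations it declares."""
    root = ET.fromstring(wsdl_text)
    return {
        pt.get("name"): [op.get("name")
                         for op in pt.findall(f"{{{WSDL_NS}}}operation")]
        for pt in root.findall(f"{{{WSDL_NS}}}portType")
    }

print(list_operations(wsdl))
# {'BlastPortType': ['searchSimple', 'searchParam']}
```

Pointing the same function at a downloaded copy of a real WSDL gives you a quick index of what the service can do, before you resort to reading the raw XML.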

Using SOAPUI

This is where SOAPUI can make life easier. You can launch SOAPUI using Java Web Start, load in a WSDL, and begin to see what is going on. One of the nice features is that it will show you what the SOAP messages look like, which saves you having to work them out in your head. So, going back to our BLAST example…

  1. Launch the SOAPUI tool and select File then New WSDL Project (give the project a name and save it when prompted).
  2. Right-click on the Project folder and select Add WSDL from URL.
  3. Type in http://xml.nig.ac.jp/wsdl/Blast.wsdl or your own favourite from this list of molecular biology WSDLs.
  4. When asked Create default requests for all operations, select Yes.
  5. The progress bar will whizz away while it imports the file; once it’s done, you can see a list of operations.
  6. If you click on one of them, e.g. searchParam, then Request1, then select Open Request Editor, it spawns two new windows…
  7. The first (left-hand) window shows the SOAP request that is sent to the BLAST service:
    <soapenv:Envelope
    	... boring namespace declarations ... >
    	 <soapenv:Body>
    
    		<blas:searchParam soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
    			<!-- use BLASTp -->
    			<program xsi:type="xsd:string">blastp</program>
    
    			<!-- Use SWISSPROT data  -->
    			<database xsi:type="xsd:string">SWISS</database>
    
    			<!-- protein sequence -->
    			<query xsi:type="xsd:string">MHLEGRDGRR YPGAPAVELL QTSVPSGLAE LVAGKRRLPR GAGGADPSHS</query>
    
    			<!-- no parameters -->
    			<param xsi:type="xsd:string"></param>
    		</blas:searchParam>
    
    	</soapenv:Body>
    </soapenv:Envelope>
  8. When you click on the green request button, this message is sent to the service. Note: you have to fill in the parameter values, as they default to “?”.
  9. After submitting the request above, the SOAP response appears in the second (right-hand) window:
    <soap:Envelope
    ... namespace declarations... >
       <soap:Body>
    
          <n:searchParamResponse xmlns:n="http://tempuri.org/Blast">
             <Result xsi:type="xsd:string">BLASTP 2.2.12 [Aug-07-2005] ...
    		 Sequences producing significant alignments:                      (bits) Value
    		 sp|Q04671|P_HUMAN P protein (Melanocyte-specific transporter pro...   104   8e-23 ...
    		 </Result>
          </n:searchParamResponse>
       </soap:Body>
    </soap:Envelope>

Not all users of web services will want the gory details of SOAP, but for serious users it’s a handy tool for understanding how any given web service works. This can be invaluable in working out what happened if, or more probably when, an individual service behaves unexpectedly. If you know of any other tools that make web services easier to use and debug, I’d be interested to hear about them.
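
For scripted rather than point-and-click debugging, the same kind of request that SOAPUI builds can be assembled with nothing but the Python standard library. The endpoint URL below is just the WSDL address used as a placeholder (the real endpoint is declared in the WSDL’s service section, and the service may have changed since), so treat this as a template rather than a guaranteed working call:

```python
# Build a SOAP 1.1 request by hand: a POST of an XML envelope with
# a text/xml Content-Type and a SOAPAction header.
import urllib.request

# Placeholder endpoint: the real one is in the WSDL's <service> section.
ENDPOINT = "http://xml.nig.ac.jp/wsdl/Blast.wsdl"

ENVELOPE = """<?xml version="1.0"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Body>
    <searchParam xmlns="http://tempuri.org/Blast">
      <program xsi:type="xsd:string">blastp</program>
      <database xsi:type="xsd:string">SWISS</database>
      <query xsi:type="xsd:string">MHLEGRDGRR</query>
      <param xsi:type="xsd:string"></param>
    </searchParam>
  </soapenv:Body>
</soapenv:Envelope>"""

def build_request(endpoint, envelope, soap_action=""):
    """Build (but do not send) the SOAP POST request."""
    return urllib.request.Request(
        endpoint,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": soap_action},
    )

req = build_request(ENDPOINT, ENVELOPE)
print(req.get_method(), req.full_url)  # POST, because data is set
# urllib.request.urlopen(req).read() would actually send it
```

Seeing the request laid out like this is exactly the leg-up SOAPUI gives you interactively: once you know what the envelope should look like, a failing service call is much easier to diagnose.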

Conclusions: It’s not rocket science

In my experience, small tools (like SOAPUI) can make a BIG difference. I’ve used a deliberately simple (and relatively reliable) BLAST service for demonstration purposes, but the interested reader/hacker might want to use this tool to play with more complex programs like the NCBI Web Services or InterProScan at the EBI. Using such services often requires good testing and debugging support, for example when you compose (or “mash up”) services into complex workflows using a client such as the Taverna workbench. This is where SOAPUI might just help you test and debug web services provided by other laboratories and data centres around the world, so you can use them reliably in your in silico experiments.

