O'Really?

June 2, 2006

Debugging Web Services

Filed under: biotech,informatics — Duncan Hull @ 11:19 pm
Tags: bloatware, debug, mashup, soap, soapui, taverna, web services, workflow, WSDL, xml

There are a growing number of biomedical services out there on the Wild Wild Web for performing various computations on DNA, RNA and proteins as well as the associated scientific literature. Currently, using and debugging these services can be hard work. SOAP UI (SOAP User Interface) is a newish and handy free tool to help debug services and get your in silico experiments and analyses done, hopefully more easily.

So why should bioinformaticians care about Web Services? Three of the most important advantages are:

They can reduce the need to install and maintain hundreds of tools and databases locally on desktop(s) or laboratory server(s) as these resources are programmatically accessible over the web.
They can remove the need for tedious and error-prone screen-scraping, or worse, “cut-and-paste” of data between web applications that don’t have fully programmatic interfaces.
It is possible to compose and orchestrate services into workflows or pipelines, which are repeatable and verifiable descriptions of your experiments that you can share. Needless to say, sharing repeatable experiments has always been an important part of science; it shouldn’t be any different on the Web of Science.

All this distributed computing goodness comes at a price though, and there are several disadvantages of using web services. We will focus on one here: debugging, which can be problematic. To debug services, bioinformaticians need to understand a little bit about how web services work.

Death by specification

Debugging services sounds straightforward, but many publicly available biomedical services are not the simpler RESTian type, but the more complex SOAP-and-WSDL type of web service. Consequently, debugging usually requires a basic understanding of these protocols and interfaces, the so-called “Simple” Object Access Protocol (SOAP) and Web Services Description Language (WSDL). However, these specifications are both big and complicated, and are being superseded by newer versions, so you might lose the will-to-live while reading them. Also, individual services described in WSDL are easier for machines to read than for humans, and therefore give humble bioinformaticians a big headache. As an example, have a look at the WSDL for BLAST at the DNA Data Bank of Japan (DDBJ).

So, if you’re not intimately familiar with the WSDL 1.1 specification (frankly, life is too short and they keep moving the goal-posts anyway), it is not very clear what is going on here. WSDL describes messages, port types, end points, part-names, bindings, bla-bla-bla, and lots of other seemingly unnecessary abstractions. To add insult to injury WSDL is used in several different styles and is expressed in verbose XML. Down with the unnecessary abstractions! But the problems don’t stop there. From looking at this WSDL, you have to make several leaps of imagination to understand what the corresponding SOAP messages this BLAST service accepts and responds with will look like. So when you are analysing your favourite protein sequence(s) with BLAST or perhaps InterProScan it can be difficult or impossible to work out what went wrong.
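One way to cut through the abstractions is to ask a WSDL only two questions: which operations exist, and which parts does each one take as input? The following is a minimal sketch of that idea, assuming WSDL 1.1 and using a toy inline document rather than the real DDBJ file. The part names are taken from the BLAST example in this post, but the message names, the port type name and the `urn:example:blast` namespace are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of WSDL 1.1 elements.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hypothetical, heavily trimmed WSDL 1.1 document (NOT the real DDBJ file).
TOY_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:tns="urn:example:blast"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:blast">
  <message name="searchParamRequest">
    <part name="program" type="xsd:string"/>
    <part name="database" type="xsd:string"/>
    <part name="query" type="xsd:string"/>
    <part name="param" type="xsd:string"/>
  </message>
  <message name="searchParamResponse">
    <part name="Result" type="xsd:string"/>
  </message>
  <portType name="BlastPortType">
    <operation name="searchParam">
      <input message="tns:searchParamRequest"/>
      <output message="tns:searchParamResponse"/>
    </operation>
  </portType>
</definitions>"""


def summarise_wsdl(wsdl_text):
    """Map each operation name to the part names of its input message."""
    root = ET.fromstring(wsdl_text)
    # Index every message by name, keeping its part names in document order.
    messages = {
        msg.get("name"): [part.get("name") for part in msg.findall("wsdl:part", WSDL_NS)]
        for msg in root.findall("wsdl:message", WSDL_NS)
    }
    summary = {}
    for op in root.findall("wsdl:portType/wsdl:operation", WSDL_NS):
        input_el = op.find("wsdl:input", WSDL_NS)
        # The message attribute is a QName such as "tns:searchParamRequest";
        # this sketch simply drops the prefix instead of resolving it.
        message_name = input_el.get("message").split(":")[-1]
        summary[op.get("name")] = messages.get(message_name, [])
    return summary


if __name__ == "__main__":
    for operation, parts in summarise_wsdl(TOY_WSDL).items():
        print(operation, parts)
```

This ignores bindings, end points and the different WSDL styles entirely, which is rather the point: for a first look at an unfamiliar service, the operation names and their inputs are usually what you need.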

Using SOAPUI

This is where SOAPUI can make life easier. You can launch SOAPUI using Java Web Start, load a WSDL and begin to see what is going on. One of the nice features is that it will show you what the SOAP messages look like, which saves you having to work it out in your head. So, going back to our BLAST example…

Launch the SOAPUI tool and select File then New WSDL Project (Give project a name and save it when prompted).
Right click on the Project folder and select add WSDL from URL
Type in http://xml.nig.ac.jp/wsdl/Blast.wsdl or your own favourite from this list of molecular biology wsdl.
When asked “Create default requests for all operations”, select Yes
The progress bar will whizz away while it imports the file; once it’s done, you can see a list of operations
If you click on one of them, e.g. searchParam then Request1, then select Open Request Editor, it spawns two new windows…

The first (left-hand) window shows the SOAP request that is sent to the BLAST service:

<soapenv:Envelope
	... boring namespace declarations ... >
	 <soapenv:Body>

		<blas:searchParam soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
			<!-- use BLASTp -->
			<program xsi:type="xsd:string">blastp</program>

			<!-- Use SWISSPROT data  -->
			<database xsi:type="xsd:string">SWISS</database>

			<!-- protein sequence -->
			<query xsi:type="xsd:string">MHLEGRDGRR YPGAPAVELL QTSVPSGLAE LVAGKRRLPR GAGGADPSHS</query>

			<!-- no parameters -->
			<param xsi:type="xsd:string"></param>
		</blas:searchParam>

	</soapenv:Body>
</soapenv:Envelope>

When you click on the green request button, this message is sent to the service. Note: you have to fill in the parameter values, as they default to “?”.
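If you later want to build the same request from a script rather than from the SOAPUI window, something like the following might do. It is a sketch under assumptions: it assumes SOAP 1.1, and because the namespace behind the `blas` prefix is elided in the request above, `urn:example:blast` is a placeholder that you would replace with the value SOAPUI shows you. The function name `build_search_param` is also made up. It only builds the message and refuses to continue if a parameter is still the default “?”; it does not send anything.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 (assumed)
SOAP_ENC = "http://schemas.xmlsoap.org/soap/encoding/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSD = "http://www.w3.org/2001/XMLSchema"
BLAST_NS = "urn:example:blast"  # PLACEHOLDER: the real namespace is elided in the post


def build_search_param(program, database, query, param=""):
    """Return a searchParam SOAP request as a string (hypothetical helper)."""
    values = {"program": program, "database": database, "query": query, "param": param}

    # SOAPUI fills new requests with "?"; catch any that were left in place.
    unfilled = [name for name, value in values.items() if value.strip() == "?"]
    if unfilled:
        raise ValueError("parameters still set to '?': " + ", ".join(unfilled))

    # Ask ElementTree to use the same prefixes as the request shown above.
    ET.register_namespace("soapenv", SOAP_ENV)
    ET.register_namespace("xsi", XSI)
    ET.register_namespace("blas", BLAST_NS)

    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    # "xsd" only appears inside attribute values, so declare it by hand.
    envelope.set("xmlns:xsd", XSD)
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(
        body,
        f"{{{BLAST_NS}}}searchParam",
        {f"{{{SOAP_ENV}}}encodingStyle": SOAP_ENC},
    )
    for name, value in values.items():
        element = ET.SubElement(call, name, {f"{{{XSI}}}type": "xsd:string"})
        element.text = value
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_search_param(
        "blastp", "SWISS",
        "MHLEGRDGRR YPGAPAVELL QTSVPSGLAE LVAGKRRLPR GAGGADPSHS",
    ))
```

Comparing the printed message with what SOAPUI generates for the same operation is a quick way to spot a wrong namespace or a missing parameter before blaming the service.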

After submitting the request above, the SOAP response appears in the second (right-hand) window:

<soap:Envelope
... namespace declarations... >
   <soap:Body>

      <n:searchParamResponse xmlns:n="http://tempuri.org/Blast">
         <Result xsi:type="xsd:string">BLASTP 2.2.12 [Aug-07-2005] ...
		 Sequences producing significant alignments:                      (bits) Value
		 sp|Q04671|P_HUMAN P protein (Melanocyte-specific transporter pro...   104   8e-23 ...
		 </Result>
      </n:searchParamResponse>
   </soap:Body>
</soap:Envelope>

Not all users of web services will want the gory details of SOAP, but for serious users, it’s a handy tool for understanding how any given web service works. This can be invaluable in working out what happened if, or more probably when, an individual service behaves unexpectedly. If you know of any other tools that make web services easier to use and debug, I’d be interested to hear about them.
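Once you have seen what a good response looks like, telling it apart from a bad one in a script is not much work. The sketch below assumes SOAP 1.1, where a failed call comes back as a Fault element with faultcode and faultstring children. The sample response is a trimmed copy of the one shown above with standard namespace declarations filled in; the sample fault and its wording are invented for illustration, as are the names `read_result`, `parse_hits` and `SoapFault`. The hit parser assumes hit lines start with `sp|` and end with a bit score and an E-value, as in the excerpt.

```python
import xml.etree.ElementTree as ET

NS = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}  # SOAP 1.1 (assumed)

# Trimmed copy of the response shown earlier in this post.
SAMPLE_RESPONSE = """<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap:Body>
    <n:searchParamResponse xmlns:n="http://tempuri.org/Blast">
      <Result xsi:type="xsd:string">BLASTP 2.2.12 [Aug-07-2005]
Sequences producing significant alignments:                      (bits) Value
sp|Q04671|P_HUMAN P protein (Melanocyte-specific transporter pro...   104   8e-23
      </Result>
    </n:searchParamResponse>
  </soap:Body>
</soap:Envelope>"""

# INVENTED fault: a real service will word its errors differently.
SAMPLE_FAULT = """<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <soap:Fault>
      <faultcode>soap:Client</faultcode>
      <faultstring>parameter value not recognised</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>"""


class SoapFault(Exception):
    """Raised when the response body carries a SOAP 1.1 Fault."""


def read_result(response_text):
    """Return the text of the Result element, or raise SoapFault."""
    body = ET.fromstring(response_text).find("soap:Body", NS)
    if body is None:
        raise ValueError("not a SOAP 1.1 envelope: no Body element found")
    fault = body.find("soap:Fault", NS)
    if fault is not None:
        code = fault.findtext("faultcode", default="")
        message = fault.findtext("faultstring", default="")
        raise SoapFault(f"{code}: {message}")
    result = body.find(".//Result")
    if result is None or result.text is None:
        raise ValueError("no Result element in response")
    return result.text.strip()


def parse_hits(report):
    """Pull (identifier, bit score, E-value) out of lines starting with 'sp|'."""
    hits = []
    for line in report.splitlines():
        line = line.strip()
        if not line.startswith("sp|"):
            continue
        # The last two whitespace-separated fields are the score and E-value.
        description, bits, evalue = line.rsplit(None, 2)
        hits.append((description.split()[0], float(bits), float(evalue)))
    return hits


if __name__ == "__main__":
    print(parse_hits(read_result(SAMPLE_RESPONSE)))
    try:
        read_result(SAMPLE_FAULT)
    except SoapFault as fault:
        print("service reported a fault:", fault)
```

Raising an exception on a fault, rather than returning the fault text as if it were a BLAST report, stops a broken service call from quietly feeding rubbish into the next step of a workflow.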

Conclusions: It’s not rocket science

In my experience, small tools (like SOAPUI) can make a BIG difference. I’ve used a deliberately simple (and relatively reliable) BLAST service for demonstration purposes, but the interested reader / hacker might want to use this tool to play with more complex programs like the NCBI Web Services or InterProScan at the EBI. Using such services often requires good testing and debugging support, for example, when you compose (or “mashup”) services into complex workflows, using a client such as the Taverna workbench. This is where SOAP UI might just help you test and debug web services provided by other laboratories and data centres around the world, so you can use them reliably in your in silico experiments.

May 26, 2006

BioGrids: From Tim Bray to Jim Gray (via Seymour Cray)

Filed under: biotech — Duncan Hull @ 11:30 pm
Tags: BLAST, FASTA, Globus, Globus Toolkit, Grid, HPC, Jim Gray, myGrid, nodalpoint, sequence jockey, Seymour Cray, Tim Bray

Grid Computing already plays an important role in the life sciences, and will probably continue doing so for the foreseeable future. BioGrid (Japan), myGrid (UK) and CoreGrid (Europe) are just three current examples; there are many more Grid and Super Duper Computer projects in the life sciences. So, is there an accessible Hitch Hiker’s Guide to the Grid for newbies, especially bioinformaticians?

Unfortunately much of the literature of Grid Computing is esoteric and inaccessible, liberally sprinkled with abstract and woolly concepts like “Virtual Organisations” with a large side-order of acronym soup. This makes it difficult or impossible for the everyday bioinformatician to understand or care about. Thankfully, Tim Bray from Sun Microsystems has written an accessible review of the area, “Grids for dummies”, if you like. It’s worth a read if you’re a bioinformatician with a need for more heavyweight distributed computing than the web currently provides, but you find Grid-speak is usually impenetrable nonsense.

One of the things Tim discusses in his review is Microsoftie Jim Gray, who is partly responsible for the 2020 computing initiative mentioned on nodalpoint earlier. Tim describes Jim’s article Distributed Computing Economics. In this, Jim uses a wide variety of examples to illustrate the current economics of grids, from “Megaservices” like Google, Yahoo! and Hotmail to the bioinformaticians’ favourites, BLAST and FASTA. So how might Grids affect the average bioinformatician? There are many different applications of Grid computing, but two areas spring to mind:

Running your in silico experiments (genome annotation, sequence analysis, protein interactions etc), using someone else’s memory, disk space and processors on the Grid. This could mean you will be able to do your experiments more quickly and reliably than you can using the plain ol’ Web.
Executing high-throughput and long-running experiments, e.g. you’ve got a ton of microarray data and it takes hours or possibly days to analyse computationally.

So if you deal with microarray data daily, you probably know all this stuff already, but Tim’s overview and Jim’s commentary are both accessible pieces to pass on to your colleagues in the lab. If this kind of stuff pushes your button, you might also be interested in the eProtein Scientific Meeting and Workshop Proceedings.

[This post was originally published on nodalpoint with comments.]

May 24, 2006

Dub Dub Dub 2006

Filed under: web,web of science — Duncan Hull @ 5:41 pm
Tags: BioMOBY, conferences, functional programming, Harry Halpin, semantic web, w3c, web services, www, www2006, xml, XML Schema

The 15th International World Wide Web conference is currently underway in Edinburgh, Bonny Scotland. As usual, this popular conference has some good papers; only 11%* of submissions are accepted. One particular paper caught my eye: One Document to Bind Them: Combining XML, Web Services, and the Semantic Web. This paper has probably been selected because it will wind people up (sorry I mean “spark a debate”) so it’s an entertaining and sometimes enlightening read.

In this paper, Harry Halpin and Henry Thompson make some observations about the state of the web in 2006:

The Semantic Web stack and Web Service Stack are a long long way from the web of everyday users, or to put it another way, there is too much theory and not enough practice.
The web is in danger of becoming fragmented between XML, Web Services, Semantic Web, Second generation web, Asynchronous JavaScript and XML (AJAX) and microformats like Really Simple Syndication (RSS) etc.

But, according to the authors, it doesn’t have to be this way…

Many (but not all) web services are functions that are available on the web.
The semantic web gives us an elaborate type system, using ontologies, which can extend what we already have with XML Schema.
The combination of the first two, gives us Semantic Web Services which are typed functions. This allows us to invoke web services not just by their URI (e.g. http://xml.nig.ac.jp/xddbj/Blast for a Blast service), but by the type of information they have. E.g. you have an output of type BLAST_report or perhaps InterProScan_report, what services will take this as input? What operations can be performed on this data? This sounds a lot like BioMOBY, with bells on.
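To make the “typed functions” idea concrete, here is a toy sketch of looking services up by the type of data they accept rather than by their URI. Everything in it is hypothetical: the registry, the service names other than Blast and InterProScan, and the type names other than BLAST_report and InterProScan_report are invented, and a real system would take its types from an ontology rather than from plain strings.

```python
# Hypothetical registry: service name -> (input type, output type).
SERVICES = {
    "Blast": ("protein_sequence", "BLAST_report"),
    "InterProScan": ("protein_sequence", "InterProScan_report"),
    "BlastReportParser": ("BLAST_report", "hit_list"),  # invented service
}


def services_accepting(data_type, registry=SERVICES):
    """Names of services whose declared input type matches data_type."""
    return sorted(
        name for name, (input_type, _output_type) in registry.items()
        if input_type == data_type
    )


def next_steps(service_name, registry=SERVICES):
    """Services that could consume the output of service_name."""
    _input_type, output_type = registry[service_name]
    return services_accepting(output_type, registry)


if __name__ == "__main__":
    print(services_accepting("protein_sequence"))
    print(next_steps("Blast"))
```

With exact string matching this is little more than a dictionary lookup; the hard part, as discussed below, is getting everyone to agree on the type names in the first place.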

What Harry and Henry propose is tying all this together using a single XML vocabulary, called Semantic fXML, to put “a unified abstraction of data, types and functions” so that the web can compute. This is all a bit pie-in-the-sky vision of the future stuff, but what might it mean for your average bioinformatician? It would be seriously useful if we could make the current molecular biology web services easier to use, but agreeing on and using an ontology for annotating the types of the inputs and outputs of all the services is a non-trivial task. Bioinformaticians already have a (somewhat limited) universal type system for describing all data in bioinformatics: it’s called string. Persuading them to use something more powerful is not easy unless the benefits are immediately obvious.

At the moment, it is difficult to tell if sfXML will ever have any impact on bioinformatics, but who cares? Regardless, the paper is an enjoyable reminder of what is interesting about services on the Web. They transform the web from a place where we can merely search and browse for data (sequences, genes, proteins, metabolic pathways, systems etc), into “one vast de-centralised computer”, a bit like the one described in “Can computers explain biology?”. This, in my humble opinion, is what makes the web and bioinformatics an exciting place to work in 2006.

* Footnote: Of nearly 700 papers submitted, only 81 research papers were accepted (11%). This is a 25% increase on the number of submissions last year to www2005 in Chiba, Japan.

References

Harry Halpin and Henry S. Thompson (2006) One Document to Bind Them: Combining XML, Web Services, and the Semantic Web. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland. DOI:10.1145/1135777.1135877
This post was originally published on nodalpoint with comments.

May 5, 2006

Bioinformatics at the BBC: Demonstrating the power of metadata

Filed under: informatics — Duncan Hull @ 10:02 pm
Tags: Alan Turing, BBC, bioinformatics, BioRDF, bioruby, FOAF, hclsig, Matt Biddulph, metadata, Paul Nurse, rdf, Ruby, semantic web

The BBC programme catalogue has recently gone online, and provides a demonstration of the applications that can be built using web technology such as Ruby on Rails, RDF, FOAF, web feeds, tag clouds and sparklines. This impressive online catalogue has:

Details of nearly a million BBC radio and TV programmes, dating back 75 years
Over 500,000 subject categories, from DNA and H5N1 avian bird flu to Genomes and Genetics Research
Over a million contributors and appearances, from Nobel Prize winner Paul Nurse and Alan Turing to Craig Venter, Francis Crick and Albert Einstein

Unfortunately this catalogue currently includes only metadata, not data, so there are no audio or video streams yet, as this is an experimental prototype. As mentioned earlier, the catalogue is based on RDF, which will no doubt please Semantic Webhead Tim Berners-Lee, and allows the database to be queried with SPARQL. One of the brains behind this is Matt Biddulph.

I wonder if a similar application could be built using the UniProt protein sequence and annotation data in RDF or the data currently being produced by the W3C BioRDF subgroup? Compared to biological databases the BBC catalogue is relatively small, although there are no figures on the size of the catalogue, which has been extensively hand-curated by experts over the years. The ratio of metadata to data is probably different too, where a typical biological database might have lots of data (e.g. raw protein sequence data) but poor quality and a low quantity of metadata (interactions, structures, functions etc).

However, this catalogue is an interesting prototype, which is addictively fun to play with and might spark a few imaginations in the bioinformatics community.

[update: seeAlso Alf Eaton’s visual TouchGraph of BBC TV/Radio Collaborators which allows you to browse this data more graphically. Unfortunately, this fantastic BBC Database is not always online. This post was originally posted on nodalpoint with comments].

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.