O'Really?

February 21, 2008

Biological Complexity

Filed under: sysbio — Duncan Hull @ 11:00 pm

Details of a two-day conference titled “Biological Complexity: From Molecules to Systems” at University College London (UCL) in June 2008 have recently been announced. Speakers and topics are described in the link above and also by Martyn Amos on his blog.

Speakers from the UK include: Martyn Amos, Cyrus Chothia, Jasmin Fisher, Mike Hoffman / Ewan Birney, Jaroslav Stark, Michael Sternberg and Perdita Stevens.

Speakers from the Weizmann Institute of Science include Nir Friedman, David Harel, Shmuel Pietrokovski, Gideon Schreiber, Eran Segal, Ehud Shapiro and Yoav Soen.

January 18, 2008

One Thousand Databases High (and rising)

Well, it's that time of year again. The 15th annual stamp collecting edition of the journal Nucleic Acids Research (NAR), also known as the 2008 Database issue [1], was published earlier this week. This year there are 1078 databases listed in the collection, 110 more than the previous one (see Figure 1). As we pass the one-thousand-database mark (1kDB), I wonder: what proportion of the data in these databases will never be used?

R.I.P. Biological Data?

It seems highly likely that lots of this data is stored in what Usama Fayyad at Yahoo! Research! Laboratories! calls data tombs [2], because as he puts it:

“Our ability to capture and store data has far outpaced our ability to process and utilise it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again.”

Like last year, let's illustrate the growth with an obligatory graph; see Figure 1.

Figure 1: Data growth: the ability to capture and store biological data has far outpaced our ability to understand it. The vertical axis is the number of databases listed in Nucleic Acids Research [1]; the horizontal axis is the year. (Picture drawn with the Google Charts API, which is OK but, as Alf points out, doesn't do error bars yet.)
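For the curious, here is a rough sketch (in Python) of how a chart like Figure 1 might be put together with the Google Charts API. The endpoint and parameter names (cht, chs, chd, chds, chxt, chxl) are quoted from memory rather than from the official documentation, and only the 2006–2008 database counts mentioned in these posts are plotted, so treat it as a sketch rather than a recipe:

```python
# Sketch of building a Figure 1-style chart URL for the Google Image Charts
# API. Parameter names are from memory and the data covers only the years
# quoted in these posts; treat both as assumptions.
from urllib.parse import urlencode

# Database counts taken from the 2006-2008 NAR Database issue posts.
years = [2006, 2007, 2008]
counts = [858, 968, 1078]

params = {
    "cht": "bvs",                                     # vertical bar chart
    "chs": "400x250",                                 # image size in pixels
    "chd": "t:" + ",".join(str(c) for c in counts),   # text-encoded data
    "chds": "0,1200",                                 # data scaling range
    "chxt": "x,y",                                    # show x and y axes
    "chxl": "0:|" + "|".join(str(y) for y in years),  # x-axis labels
}

chart_url = "http://chart.apis.google.com/chart?" + urlencode(params)
print(chart_url)  # paste into a browser to render the chart
```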

Another day, another dollar database

Does it matter that large quantities of this data will probably never be used? How could you find out how much, and which, of the data was “write-only”? Will biologists ever catch up with the physicists when it comes to Very Large stamp collections Databases? Biological databases are pretty big, but can you imagine handling up to 1,500 megabytes of data per second for ten years, as the physicists will soon be doing? You can already hear the (arrogant?) physicists taunting the biologists: “my database is bigger than yours”. So there.
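To put that fire-hose figure in perspective, a quick back-of-the-envelope calculation, assuming (unrealistically) continuous recording at the quoted rate for the full ten years, gives an upper bound of roughly half an exabyte:

```python
# Back-of-envelope: how much is 1,500 MB/s sustained for ten years?
# Assumes continuous recording, which real detectors don't do, so this is
# an upper bound on the figure quoted above.
rate_bytes_per_s = 1_500 * 10**6           # 1,500 megabytes per second
seconds_per_year = 365.25 * 24 * 3600      # ~3.16e7 seconds
total_bytes = rate_bytes_per_s * seconds_per_year * 10

print(f"{total_bytes:.2e} bytes")                # ~4.73e+17 bytes
print(f"~{total_bytes / 10**15:.0f} petabytes")  # ~473 petabytes
```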

Whichever of these databases you are using, happy data mining in 2008. If you are lucky, the data tombs you are working in will contain hidden treasure that will make you famous and/or rich. Maybe. As any stamp collector will tell you, some stamps can become very valuable. There's gold in them there hills databases, you know…

  1. Galperin, M. Y. (2007). The molecular biology database collection: 2008 update. Nucleic Acids Research, Vol. 36, Database issue, pages D2-D4. DOI:10.1093/nar/gkm1037
  2. Fayyad, U. and Uthurusamy, R. (2002). Evolving data into mining solutions for insights. Communications of the ACM, 45(8):28-31. DOI:10.1145/545151.545174
  3. This post originally published on nodalpoint (with comments)
  4. Stamp collector's picture, top right, thanks to daxiang stef / stef yau

September 5, 2007

WWW2007: Workflows on the Web

The Hitch-hiking novelist Douglas Noel Adams (DNA) once remarked that the World Wide Web (WWW) is the only thing whose shortened form – ‘double-you double-you double-you-dot’ – takes three times longer to say than what it's “short” for [1]. If he were still with us today, there is plenty of stuff at the 16th International World Wide Web conference (WWW2007), currently underway in Banff, that would interest him. Here are some short, abbreviated notes on a couple of interesting papers at this year's conference. They are relevant to bioinformatics and worth reading, whichever type of DNA you're most interested in.

One full paper [2] by Daniel Goodman describes a scientific workflow language called Martlet. The motivating example is taken from climateprediction.net, but I suspect some of the points he makes about scientific workflows are relevant to bioinformatics too. Just like the recent post by Boscoh about functional programming, the paper discusses an inspired-by-Haskell functional approach to building and running workflows. Comparisons with other workflow systems like Taverna / SCUFL are drawn. Despite what the paper says, Taverna already uses a functional model (not an imperative one); it just hasn't been published yet. The paper also draws comparisons between Martlet and other functional systems, like Google's MapReduce. It concludes that the (allegedly) new Martlet programming model “raises the interesting possibility of a whole set of new algorithms just waiting to be discovered once people start to think about programming in this new way”. Which is an exciting possibility.
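For readers wondering what a functional approach to building and running workflows looks like in practice, here is a minimal sketch in plain Python (not Martlet or SCUFL): the workflow is a composition of pure functions mapped over a collection and then reduced, rather than a sequence of steps mutating shared state. The stage names and stand-in services are invented for illustration.

```python
# Minimal sketch of the functional workflow style discussed above: a
# pipeline built from map and reduce over pure functions. This is plain
# Python, not Martlet or SCUFL; the stages are made up.
from functools import reduce

def fetch_sequence(acc):
    # Stand-in for a data-fetching service call.
    return f">{acc}\nACGT"

def run_alignment(fasta):
    # Stand-in for an analysis service; returns a toy score.
    return {"input": fasta, "score": len(fasta)}

def merge_results(a, b):
    # Associative combine step used by the reduce stage.
    return {"score": a["score"] + b["score"]}

def workflow(accessions):
    fetched = map(fetch_sequence, accessions)     # map stage
    aligned = list(map(run_alignment, fetched))   # map stage
    return reduce(merge_results, aligned)         # reduce stage

print(workflow(["P12345", "P67890"]))             # {'score': 24}
```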

Another position paper [3] (warning: position paper = arm waving) by Anupriya Ankolekar et al. argues that the Semantic Web and Web-Two-Point-Oh are complementary, rather than competing. Their motivating examples are a bit lame (blogging a movie? Can't they think of something more original?) …but they make some interesting (and obvious) points. The authors think that aggregators like Yahoo! Pipes! will play an important role in the emerging Semantic Web. Currently, there don't seem to be too many bioinformaticians using Yahoo! Pipes; perhaps they just don't share their pipes / workflows yet?

Running in parallel to all of the above is the Health Care and Life Sciences Data Integration for the Semantic Web workshop, where more detailed discussion on the bio-semweb is underway. As it's a workshop, there are no full or position papers, but take a look at The State of the Nation in Life Science Data Integration to get a flavour of what is going on.

Whether functional, semantic, Web-enabled or just buzzword-friendly, there is plenty of action in the scientific workflow field right now. If you're interested in the webby stuff, next year's conference, WWW2008, is in Beijing, China. I wonder if they will mark the 10th anniversary of the publication of that Google paper at WWW7 back in 1998? The deadline for papers at WWW2008 will probably be sometime in November 2007, but around 90% of submitted papers will be rejected if previous years are anything to go by. If you're thinking of submitting a paper, DON'T PANIC about those intimidating statistics, because bioinformatics is bursting full of interesting and hard problems that challenge the state of the art. The kind of stuff that will go down well at Dubya Dubya Dubya.

(Photo credit: Fire Monkey Fish)

References

  1. Douglas Adams (1999) Beyond the Brochure: Build it and we will come
  2. Daniel Goodman (2007) Introduction and Evaluation of Martlet: a Scientific Workflow Language for Abstracted Parallelisation DOI:10.1145/1242572.1242705
  3. Anupriya Ankolekar, Markus Krötzsch, Thanh Tran and Denny Vrandečić (2007) The Two Cultures: Mashing up Web 2.0 and the Semantic Web DOI:10.1145/1242572.1242684




Semantic Biomedical Mashups with Connotea


Mashup or Shutup

The Journal of Biomedical Informatics (JBI) will soon be publishing their special issue on Semantic Biomedical Mashups (can you fit any more buzzwords into a Call For Papers?!). Ben Good and friends have submitted a paper on their Entity Describer, which extends Connotea using some Semantic Web goodness. They'd appreciate your comments on their submitted manuscript over at i9606. As Ben says, their pre-publication turns out to be an interesting experiment in “figuring out how blogging might fit into the academic publishing landscape”. If this interests you, get commenting now!

Update: Just spotted this interesting graphic of the Elsevier / Evilsevier logo (snigger), who are the publishers of JBI…

May 31, 2007

Google Metabolic Maps

These days, new Google products and code seem to appear on a weekly basis. Take, for example, Google Gears, which takes advantage of SQLite, mentioned on nodalpoint recently. They certainly don't hang about at the Googleplex in Mountain View, California. Wouldn't it be great if Google applied some of that engineering expertise and agility to science and bioinformatics? Just imagine: we could have Google Metabolic Maps, a virtual globe of the cell for scientists everywhere…

Scientists have been drawing metabolic maps for a very long time, but unfortunately when it comes to charting and understanding metabolic pathways, we're still at the “here be dragons” stage of bio-cartography. I'm obviously not the first person to dream of this, but imagine if maps of metabolic pathways looked more like Google Earth or Google Maps than the old-fashioned style of maps many life scientists will be familiar with. Now imagine just a little more: that these maps weren't just available on conventional screens, but were given the Minority Report treatment, courtesy of Mr Bill Gates and his whizzy surface magic at Microsoft. Wouldn't that be great? Metabolic maps on an interactive tabletop computer. Just like Tom Cruise in the movies, we'd be able to effortlessly swish around metabolism (or the metabolome / proteome / genome / [insert-your-favourite]ome). Imagine if it was all open source too: no boundaries, no passports…
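The machinery behind that "virtual globe of the cell" dream is not magic: slippy maps like Google Maps cut the image into a pyramid of fixed-size tiles and only fetch the tiles currently in view. A toy sketch of the tile arithmetic, with a purely hypothetical pathway-map tile server (no such service exists, which is rather the point of this post), might look like this:

```python
# Sketch of the tile arithmetic behind zoomable "slippy" maps: at zoom
# level z the image is cut into a 2^z x 2^z grid of fixed-size tiles and
# the viewer fetches only the tiles in view. The pathway-map URL below is
# hypothetical and purely illustrative.
TILE_SIZE = 256  # pixels per tile edge, as in most web map services

def tile_for_point(x_frac, y_frac, zoom):
    """Map a point given as fractions of the full map (0..1) to the
    (column, row) of the tile containing it at this zoom level."""
    n = 2 ** zoom
    col = min(int(x_frac * n), n - 1)
    row = min(int(y_frac * n), n - 1)
    return col, row

def tile_url(col, row, zoom):
    # Hypothetical tile server for a metabolic map.
    return f"http://example.org/metabolic-map/{zoom}/{col}/{row}.png"

# The tile covering a point two thirds of the way across, one third down:
col, row = tile_for_point(2 / 3, 1 / 3, zoom=4)   # grid is 16 x 16 at zoom 4
print(tile_url(col, row, 4))                      # .../4/10/5.png
```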

Now, you may say that I’m a dreamer, but I’m not the only one [1,2,3].

References

  1. Zhenjun Hu, Joe Mellor, Jie Wu, Minoru Kanehisa, Joshua M. Stuart and Charles DeLisi (2007) Towards zoomable multidimensional maps of the cell Nature biotechnology 25 (5), 547-54. DOI:10.1038/nbt1304
  2. Hiroaki Kitano, Akira Funahashi, Yukiko Matuoka and Kanae Oda (2005) Using process diagrams for the graphical representation of biological networks Nature biotechnology 23 (8), 961-6. DOI:10.1038/nbt1111
  3. John Lennon and Yoko Ono (1971) Imagine
  4. This post originally published on nodalpoint with comments



April 13, 2007

Collaboration, collaboration, collaboration!

What should your three main priorities be as a scientist? Collaboration, collaboration, collaboration. Quentin Vicens and Phil Bourne have just published Ten Simple Rules for a Successful Collaboration [1] to help you do just that, as part of a continuing series [2,3,4,5].

Tony Bliar once said “Ask me my three main priorities for government, and I tell you: education, education, education.” In science, it's not so much about education as collaboration, collaboration, collaboration. The advice in Ten Simple Rules is all useful stuff, but what caught my eye is the fact that collaboration is on the rise, at least according to the number of co-authors on papers published in PNAS. The average number of co-authors has risen from 3.9 in 1981 to 8.4 in 2001. So before you publish or perish, it seems likely that you'll also need to collaborate or commiserate… less laboratory, more collaboratory!
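As a quick sanity check on those PNAS figures, and assuming steady compound growth between 1981 and 2001 (a big assumption, purely for illustration), the implied rate is roughly 4% more co-authors per year:

```python
# What do the figures above imply, assuming steady compound growth in the
# average number of co-authors on PNAS papers between 1981 and 2001?
start, end = 3.9, 8.4          # average co-authors per paper (from [1])
years = 2001 - 1981

annual_growth = (end / start) ** (1 / years) - 1
print(f"~{annual_growth:.1%} per year")            # roughly 3.9% per year

# Extrapolating (naively) another 20 years at the same rate:
print(f"~{end * (1 + annual_growth) ** 20:.1f} co-authors by 2021")  # ~18
```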

Photo credit: Garret Keogh

References

  1. Quentin Vicens and Philip Bourne (2007) Ten Simple Rules for a Successful Collaboration PLOS Computational Biology
  2. Philip Bourne (2006) Ten Simple Rules for Getting Published PLOS Computational Biology
  3. Philip Bourne and Iddo Friedberg (2006) Ten Simple Rules for Selecting a Postdoctoral Position PLOS Computational Biology
  4. Philip Bourne and Leo Chalupa (2006) Ten Simple Rules for Getting Grants PLOS Computational Biology
  5. Philip Bourne and Alon Korngreen (2006) Ten Simple Rules for Reviewers PLOS Computational Biology
  6. This post originally published on nodalpoint with comments


January 22, 2007

DNA mania

Filed under: bio — Duncan Hull @ 10:29 pm

What does DNA do when it’s not being transcribed into RNA? It causes DNA mania…

Quote of the Day

“DNA, you know, is Midas’ gold. Everyone who touches it goes mad.”

Maurice Wilkins

Read the rest in [1,2]

Do you or your colleagues ever suffer from DNA mania [3,4]? A biochemist friend of mine once semi-jokingly remarked that people's manic obsession with DNA is a bit like buying some food and being more interested in the bar-code on the packaging than in the food inside. In his particular area of research, DNA is about as exciting as bar-codes, because it doesn't even leave the nucleus of the cell, at least in eukaryotes. I wonder what readers of nodalpoint think of this analogy? Anyway, as a result of this philosophy, most of his community have developed an unhealthy and manic interest in proteins rather than DNA. You could call this particular obsessive-compulsive disorder “protein mania”.

Depending on the scientific obsession(s) of your particular community, you might need to substitute Protein or RNA for DNA in the above quote, as appropriate. And if that is all too molecular for you, substitute any other of your favourite bioinformatics buzzwords.

References

  1. Horace Freeland Judson (1996) The Eighth Day of Creation: Makers of the Revolution in Biology
  2. Michael Ashburner (2006) Won for All: How the Drosophila Genome Was Sequenced
  3. André Pichot (1999) Histoire de la notion de gène (one of the first documented uses of the phrase “DNA mania”)
  4. Denis Noble (2006) The Music of Life: Biology Beyond the Genome (an antidote to DNA mania and the Dawkinian gene-centric view of Life)
  5. DNA Photograph taken by Unapersona in Ciutat de les Arts i les Ciències, Calatrava building, Valencia, Spain.

January 5, 2007

NAR Database Issue 2007: Not Waving But Drowning?

The 14th annual Nucleic Acids Research (NAR) database issue 2007 has just been published, open access. This year's issue is the largest yet (again), with 968 molecular biology databases listed, 110 more than the previous one (see figure below). In the world of biological databases, are we waving or drowning?

NAR Database Growth 2007

Nine hundred and sixty-eight is a lot of databases, and even that mind-boggling number is not an exhaustive or comprehensive tally. But is counting all these databases waving or drowning [1]? Will we ever stop stamp-collecting the databases and tools we have in molecular biology? What prompted this is that an employee of The Boeing Company once told me they had given up counting their databases because there were just too many. Just think of all the databases of design and technical documentation that accompany the myriad of different aircraft that Boeing manufactures, like the iconic 747 jumbo jet. Now combine that with all the supply chain, customer and employee information, and you can begin to imagine the data deluge that a large multi-national corporation has to handle.

Like Boeing, in biology we've clearly got more data than we know what to do with [2,3]. It won't be news to bioinformaticians, and it's been said many times before, but it's worth repeating again here:

  • We know how many databases we have, but we don't know what a lot of the data in these databases means; think of all those mystery proteins of unknown function. It will obviously take time until we understand it all…
  • Most of the data only begins to make sense when it is integrated or mashed up with other data. However, we still don't know how to integrate all these databases, or as Lincoln Stein puts it, “so far their integration has proved problematic” [4], which is a bit of an understatement. Many grandiose schemes for the “integration” of biological databases have been proposed over the years, but unfortunately none have been practical to the point of implementation [5]; the sketch after this list gives a toy illustration of why even the easy cases are awkward.
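To see why even the simplest kind of integration is a headache, here is a toy sketch of joining two made-up resources on what is nominally the same accession number. The records and identifiers are invented; real data adds synonyms, versioned and retired accessions, and outright conflicts on top of this:

```python
# Toy illustration of the integration problem described above: joining two
# resources on a shared accession. Records and identifiers are made up.
sequence_db = {
    "P12345": {"length": 431},
    "Q99999": {"length": 210},
}
annotation_db = {
    "P12345.2": {"function": "kinase"},     # versioned accession
    "q99999":   {"function": "unknown"},    # different capitalisation
}

def normalise(acc):
    """Crude normalisation: strip the version suffix, upper-case."""
    return acc.split(".")[0].upper()

merged = {}
for acc, annot in annotation_db.items():
    key = normalise(acc)
    if key in sequence_db:
        merged[key] = {**sequence_db[key], **annot}

print(merged)
# {'P12345': {'length': 431, 'function': 'kinase'},
#  'Q99999': {'length': 210, 'function': 'unknown'}}
```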


Despite this, it is still useful to know how many molecular biology databases there are. At least we know how many databases we are drowning in. Thankfully, unlike Boeing, most biological data, algorithms and tools are open source, and more of the literature is becoming open access, which will hopefully make progress more rapid. But biology is more complicated than a Boeing 747, so we've got a long-haul flight ahead of us. OK, I've managed to completely overstretch that aerospace analogy now, so I'll stop there.

Whatever databases you’ll be using in 2007, have a Happy New Year mining, exploring and understanding the data they contain, not drowning in it.

References

  1. Stevie Smith (1957) Not waving but drowning
  2. Michael Galperin (2007) The Molecular Biology Database Collection: 2007 update Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1008
  3. Alex Bateman (2007) Editorial: What makes a good database? Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1051
  4. Lincoln Stein (2003) Biological Database Integration Nature Reviews Genetics. 4 (5), 337-45. DOI:10.1038/nrg1065
  5. Michael Ashburner (2006) Keynote at the Pacific Symposium on Biocomputing (PSB2006) in Hawaii seeAlso Aloha: Biocomputing in Hawaii
  6. This post originally published on nodalpoint with comments



December 19, 2006

Taverna 1.5.0

Filed under: Uncategorized — Duncan Hull @ 8:26 pm

Happy Christmas from the myGrid team, who are pleased to announce the release of version 1.5.0 of the open-source Taverna bioinformatics workflow toolkit [1]. This is now available for download from the SourceForge site and includes some substantial changes compared with version 1.4.

Taverna 1.5.0 is a small download, but when first run it will download and install the required packages, which can take some time on slow networks. In the near future there will be a mechanism for downloading a bundle of core packages. There are some significant changes in the underlying architecture of Taverna and how it handles core packages and optional plugins, using a system called Raven; see the release notes below.

The documentation is currently being updated: the user documentation should be complete very soon, with the technical documentation following shortly afterwards. Releasing now, with the documentation to follow, allows the software to go out with some time to spare before the Christmas holidays.

Release notes:

There have been a number of substantial changes in the underlying architecture of Taverna since the previous release. These include:

  • An overhaul of the User Interface (UI), replacing the unpopular Multiple Document Interface with a cleaner and simpler single-document UI which can be customised using Perspectives. There are built-in perspectives to allow the design and enactment of workflows, and plugins can integrate with the UI by providing perspectives of their own. Together with this, users are able to create their own layouts built from individual components.
  • Taverna now allows for multiple workflows to be open and enacted at the same time.
  • Support for the new BioMart data management system version 0.5, together with backward compatibility for old workflows that used Biomart 0.4.
  • Better provenance generation and browsing support, through a plugin now known as LogBook.
  • Better support for semantic service discovery through the Feta plugin [2].
  • Modularisation of the Taverna code base.
  • Development and integration of an underlying architecture known as Raven (see the sketch after this list). This allows for Apache Maven-like declaration of dependencies, which are discovered and incorporated into the Taverna system at runtime. Together with the modularisation of the Taverna code base, Raven gives the benefit that updates can be provided dynamically and incrementally, without the need for monolithic releases as in the past. This allows bug fixes and new features to be provided within a very short timescale if necessary. It also provides plugin developers with a greater degree of autonomy and independence from the core Taverna code base.
  • Improved and more advanced plugin management, with the ability to provide immediate updates, and for plugin providers to publish their plugins via XML descriptions.
  • Numerous bug-fixes including the removal of a number of memory leaks.
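For the curious, the idea behind Raven can be sketched in a few lines of Python (rather than the Java Taverna is written in): components declare what they need in a small manifest, and the host resolves and loads those dependencies at runtime instead of everything shipping in one monolithic release. This is an illustration of the concept only, not Taverna or Raven code; the manifest format and module names are invented.

```python
# Toy sketch of the Raven idea described above: a component declares its
# dependencies in a manifest and the host resolves and loads them at
# runtime. Not Taverna/Raven code; manifest format and names are invented.
import importlib

plugin_manifest = {
    "name": "example-plugin",
    "requires": ["json", "csv"],   # stand-ins for versioned artifact coordinates
}

def load_plugin(manifest):
    resolved = {}
    for dep in manifest["requires"]:
        # In Raven this would fetch a versioned artifact from a repository;
        # here we simply import an already-installed module by name.
        resolved[dep] = importlib.import_module(dep)
    print(f"{manifest['name']}: loaded {sorted(resolved)}")
    return resolved

load_plugin(plugin_manifest)
```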

JIRA-generated release notes and bug status reports can be found here and here

References

  1. Peer-reviewed publications about the Taverna workbench in PubMed
  2. Feta: A Light-Weight Architecture for User Oriented Semantic Service Discovery
  3. BioMoby extensions to the Taverna workflow management and enactment software

December 12, 2006

Buggotea: Redundant Links in Connotea

Dear Santa, all I want for Christmas* is a better version of Connotea; please can you sort out its duplicated redundant links? In my book this particular bug is “buggotea” number one. Here is the problem… [update: buggotea is partially fixed; see comments from Ian Mulvany at the nodalpoint link in the references below]

There is this handy bioinformatics web application called Connotea which I like to use, built by those nice people in the web team at Nature Publishing Group. Most readers of nodalpoint probably already know about it, but because you're Santa and you've been busy lately, let me explain. Connotea can help scientists (not just bioinformaticians) to organise and share their bibliographic references, whilst discovering what other people with similar interests are reading. It's good, but it has some bugs in it. Since it's open-source software, anyone with the time, inclination and skills can get hold of the Connotea source code and improve it. There is, however, one particularly nasty redundancy bug in Connotea that is bugging me [1]. I think it should be fixable, and that doing so would make Connotea a significantly better application than it already is. Let's illustrate this bug with a little story…

(more…)
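For what it's worth, one way the duplicate-link problem could be attacked is to collapse bookmarked URLs onto a normalised key, so that trivially different URLs for the same article are recognised as one. The sketch below is not Connotea's code and the URLs are invented; a real fix would also need to resolve DOIs and publisher redirects.

```python
# One possible approach to the "buggotea" problem described above: group
# bookmarks by a normalised URL so variants of the same article collapse
# onto one key. Not Connotea's code; the example URLs are made up.
from urllib.parse import urlsplit
from collections import defaultdict

def normalise(url):
    """Crude URL normalisation: lower-case the scheme and host, drop the
    query string and fragment, strip trailing slashes."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"

bookmarks = [
    "http://dx.doi.org/10.1000/example123",
    "http://dx.doi.org/10.1000/example123/",
    "HTTP://DX.DOI.ORG/10.1000/example123?sessionid=42",
]

groups = defaultdict(list)
for url in bookmarks:
    groups[normalise(url)].append(url)

for key, dupes in groups.items():
    print(f"{key}: {len(dupes)} bookmark(s)")   # all three collapse to one key
```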

