July 27, 2010

Twenty million papers in PubMed: a triumph or a tragedy?

pubmed.govA quick search on pubmed.gov today reveals that the freely available American database of biomedical literature has just passed the 20 million citations mark*. Should we celebrate or commiserate passing this landmark figure? Is it a triumph or a tragedy that PubMed® is the size it is? (more…)

February 12, 2010

The 3rd OBO Foundry Workshop 2010, Cambridge, UK

Ultrawide Wellcome Trust Genome Campus, Cambridge by Tim NugentThe Open Biomedical Ontologies (OBO) [1] are a set of reference ontologies for describing all kinds of biomedical data shared in a centralised repository called The OBO Foundry. Every year, users and developers of these ontologies gather from around the globe for a workshop at the EBI near Cambridge, UK. Following on from the first workshop two years ago, the second workshop last year it’s already time for the third workshop on February 15th-16th. All the details and agenda are here if you’re interested. This workshop is possible thanks to sponsorship from the BBSRC funds for Workshop on Data Standards and the EU ELIXIR ‘Data Integration & Interoperability’ Package 7.

[Update: outcomes from the workshop are available here, along with a summary of discussion from Monday and a summary of discussion from Tuesday.]


  1. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L., Eilbeck, K., Ireland, A., Mungall, C., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R., Shah, N., Whetzel, P., & Lewis, S. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration Nature Biotechnology, 25 (11), 1251-1255 DOI: 10.1038/nbt1346

[Ultrawide panoramic picture of the Wellcome Trust Genome Campus by Tim Nugent, as featured on the cover of the EMBL-EBI Annual Scientific Report 2009. Making those pictures looks like a lot of fun.]

February 5, 2010

Classic paper: Montagues and Capulets in Science

Romeo and Juliet by HappyHippoSnacksIn preparation for a joint seminar I’ll be doing with Midori Harris here at the EBI, here’s a classic paper [1,2] on the social problems of building biomedical ontologies. This paper is worth reading (or re-reading) because it makes lots of relevant points about the use and abuse of research and how people misunderstand each other [3]. It’s funny (and available Open Access too) plus how many papers do you read with an abstract written in the style of Big Bard Bill Shakespeare?

ABSTRACT: Two households, both alike in dignity, In fair Genomics, where we lay our scene, (One, comforted by its logic’s rigour, Claims ontology for the realm of pure, The other, with blessed scientist’s vigour, Acts hastily on models that endure), From ancient grudge break to new mutiny, When ‘being’ drives a fly-man to blaspheme. From forth the fatal loins of these two foes, Researchers to unlock the book of life; Whole misadventured piteous overthrows, Can with their work bury their clans’ strife. The fruitful passage of their GO-mark’d love, And the continuance of their studies sage, Which, united, yield ontologies undreamed-of, Is now the hour’s traffic of our stage; The which if you with patient ears attend, What here shall miss, our toil shall strive to mend.

So if you read the paper, you have to ask yourself, are you a Montague or a Capulet?


  1. Carole Goble and Chris Wroe (2004). The Montagues and the Capulets Comparative and Functional Genomics, 5 (8), 623-632 DOI: 10.1002/cfg.442
  2. Carole Goble (2004) The Capulets and Montagues: A plague on both your houses?, SOFG: Standards and Ontologies for Functional Genomics
  3. William Shakespeare (1596) Romeo and Juliet

[Romeo and Juliet picture via Happy Hippo Snacks]

January 15, 2010

Bio2RDF: Large Scale, Distributed Biological Knowledge Discovery

Filed under: ChEBI — Duncan Hull @ 2:11 pm
Tags: , , , , , , ,

Bio2RDFMichel Dumontier was visiting the EBI this week, here’s the details of his seminar Bio2RDF and Beyond! Large Scale, Distributed Biological Knowledge Discovery (slides embedded below) for anyone interested who missed it:

Abstract: The Bio2RDF.org [1] project aims to transform silos of bioinformatics data into a distributed platform for biological knowledge discovery. Initial work focused on building a public database of open-linked data with web-resolvable identifiers that provides information about named entities. This involved a syntactic normalization to convert open data represented in a variety of formats (flatfile, tab, xml, web services) to RDF-based linked data with normalized names (HTTP URIs) and basic typing from source databases. Bio2RDF entities also make reference to other open linked data networks (e.g. dbPedia) thus facilitating traversal across information spaces. However, a significant problem arises when attempting to undertake more sophisticated knowledge discovery approaches such as question answering or symbolic data mining. This is because knowledge is represented in a fundamentally different manner, requiring one to know the underlying data model and reconcile the artefactual differences when they arise. In this talk, we describe our data integration strategy that makes use of both syntactic and semantic normalization to consistently marshal knowledge to a common data model while leveraging explicit logic-based mappings with community ontologies to further enhance the biological knowledgescope. Coupled with the web-service based Semantic Automated Discovery and Integration (SADI) framework, Bio2RDF is well placed to serve up biological data for prediction and analysis.

Some quick notes: Bio2RDF is currently indexing around 5 billion triples, and is built with the open source Virtuoso database. There are some scalability issues in making the system cope with up to a total of 15+ billion triples currently required. There is nothing in Bio2RDF yet that deals with the redundancy problem, e.g. “buggotea” and its friends.


  1. Belleau, F., Nolin, M., Tourigny, N., Rigault, P., & Morissette, J. (2008). Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Journal of Biomedical Informatics, 41 (5), 706-716 DOI: 10.1016/j.jbi.2008.03.004

December 11, 2009

The Semantic Biochemical Journal experiment

utopian documentsThere is an interesting review [1] (and special issue) in the Biochemical Journal today, published by Portland Press Ltd. It provides (quote) “a whirlwind tour of recent projects to transform scholarly publishing paradigms, culminating in Utopia and the Semantic Biochemical Journal experiment”. Here is a quick outline of the publishing projects the review describes and discusses:

  • Blogs for biomedical science
  • Biomedical Ontologies – OBO etc
  • Project Prospect and the Royal Society of Chemistry
  • The Chemspider Journal of Chemistry
  • The FEBS Letters experiment
  • PubMedCentral and BioLit [2]
  • Public Library of Science (PLoS) Neglected Tropical Diseases (NTD) [3]
  • The Elsevier Grand Challenge [4]
  • Liquid Publications
  • The PDF debate: Is PDF a hamburger? Or can we build more useful applications on top of it?
  • The Semantic Biochemical Journal project with Utopia Documents [5]

The review asks what advances these projects have made  and what obstacles to progress still exist. It’s an entertaining tour, dotted with enlightening observations on what is broken in scientific publishing and some of the solutions involving various kinds of semantics.

One conclusion made is that many of the experiments described above are expensive and difficult, but that the costs of not improving scientific publishing with various kinds of semantic markup is high, or as the authors put it:

“If the cost of semantic publishing seems high, then we also need to ask, what is the price of not doing it? From the results of the experiments we have seen to date, there is clearly a need to move forward and still a great deal of scope to innovate. If we fail to move forward in a collaborative way, if we fail to engage the key players, the price will be high. We will continue to bury scientific knowledge, as we routinely do now, in static, unconnected journal articles; to sequester fragments of that knowledge in disparate databases that are largely inaccessible from journal pages; to further waste countless hours of scientists’ time either repeating experiments they didn’t know had been performed before, or worse, trying to verify facts they didn’t know had been shown to be false. In short, we will continue to fail to get the most from our literature, we will continue to fail to know what we know, and will continue to do science a considerable disservice.”

It’s well worth reading the review, and downloading the Utopia software to experience all of the interactive features demonstrated in this special issue, especially the animated molecular viewers and sequence alignments.

Enjoy… the Utopia team would be interested to know what people think, see commentary on friendfeed,  the digital curation blog and youtube video below for more information.


  1. Attwood, T., Kell, D., McDermott, P., Marsh, J., Pettifer, S., & Thorne, D. (2009). Calling International Rescue: knowledge lost in literature and data landslide! Biochemical Journal, 424 (3), 317-333 DOI: 10.1042/BJ20091474
  2. Fink, J., Kushch, S., Williams, P., & Bourne, P. (2008). BioLit: integrating biological literature with databases Nucleic Acids Research, 36 (Web Server) DOI: 10.1093/nar/gkn317
  3. Shotton, D., Portwin, K., Klyne, G., & Miles, A. (2009). Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article PLoS Computational Biology, 5 (4) DOI: 10.1371/journal.pcbi.1000361
  4. Pafilis, E., O’Donoghue, S., Jensen, L., Horn, H., Kuhn, M., Brown, N., & Schneider, R. (2009). Reflect: augmented browsing for the life scientist Nature Biotechnology, 27 (6), 508-510 DOI: 10.1038/nbt0609-508
  5. Pettifer, S., Thorne, D., McDermott, P., Marsh, J., Villéger, A., Kell, D., & Attwood, T. (2009). Visualising biological data: a semantic approach to tool and database integration BMC Bioinformatics, 10 (Suppl 6) DOI: 10.1186/1471-2105-10-S6-S19

November 24, 2009

Semantic Web Applications and Tools for the Life Sciences (SWAT4LS) 2009, Amsterdam

Snow in Amsterdam by Bas van GaalenLast Friday, the Centrum Wiskunde & Informatica (CWI) in Amsterdam hosted a workshop called Semantic Web Applications and Tools for the Life Sciences (SWAT4LS) 2009.

Following on from last year [1], the workshop proceedings will be published at ceur-ws.org and in a special issue of the Journal of Biomedical Semantics, but if you want to find out what happened in the meantime, take a look at the #swat4ls2009 hashtag on twitter. Twitter makes bloggers lazy (they blog less but tweet more), but thankfully Nico Adams has studiously blogged the workshop very extensively.

Disruptive Technologies Director (cool job title!) Anita de Waard from Elsevier was asking what were the conclusions of the workshop. So here is an incomplete summary: Roughly speaking, people agreed to disagree (again). Keynote speaker Barend Mons argued that redundant data should be eliminated through the use of “nano-publications” and micro-attribution in his entertaining but controversial keynote. Some people in the audience disagreed with this. Greg Tyrelle thinks that redundancy is a feature, not a bug, in the Web and we have to deal with it. Alan Ruttenberg argued that semantic web reasoners  are required to clean up and sanity check all the messy and noisy biological data but emphasised the importance of Computer Scientists learning to speak Biologists language.

The good thing about this workshop is its size: small, friendly but internationally attended. Thanks to M. Scott Marshall, Albert Burger, Adrian Paschke, Paolo Romano and Andrea Splendiani for organising another good workshop, hope to see you again next year (if not before).


  1. Burger, A., Romano, P., Paschke, A., & Splendiani, A. (2009). Semantic Web Applications and Tools for Life Sciences, 2008 – Introduction BMC Bioinformatics, 10 (Suppl 10) DOI: 10.1186/1471-2105-10-S10-S1 part of the special issue on SWAT4LS 2008

[CC-licensed picture of Amsterdam in the snow by Bas van Gaalen]

September 4, 2009

XML training in Oxford

XML Summer School 2009The XML Summer School returns this year at St. Edmund Hall, Oxford from 20th-25th September 2009. As always, it’s packed with high quality technical training for every level of expertise, from the Hands-on Introduction for beginners through to special classes devoted to XQuery and XSLT, Semantic Technologies, Open Source Applications, Web 2.0, Web Services and Identity. The Summer School is also a rare opportunity to experience what life is like as a student in one of the world’s oldest university cities while enjoying a range of social events that are a part of the unique summer school experience.

This year, classes and sessions are taught and chaired by:

W3C XML 10th anniversaryThe Extensible Markup Language (XML) has been around for just over ten years, quickly and quietly finding its niche in many different areas of science and technology. It has been used in everything from modelling biochemical networks in systems biology [1], to electronic health records [2], scientific publishing, the provision of the PubMed service (which talks XML) [3] and many other areas. As a crude measure of its importance in biomedical science, PubMed currently has no fewer than 800 peer-reviewed publications on XML. It’s hard to imagine life without it. So whether you’re a complete novice looking to learn more about XML or a seasoned veteran wanting to improve your knowledge, register your place and find out more by visiting xmlsummerschool.com. I hope to see you there…


  1. Hucka, M. (2003). The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models Bioinformatics, 19 (4), 524-531 DOI: 10.1093/bioinformatics/btg015
  2. Bunduchi R, Williams R, Graham I, & Smart A (2006). XML-based clinical data standardisation in the National Health Service Scotland. Informatics in primary care, 14 (4) PMID: 17504574
  3. Sayers, E., Barrett, T., Benson, D., Bryant, S., Canese, K., Chetvernin, V., Church, D., DiCuccio, M., Edgar, R., Federhen, S., Feolo, M., Geer, L., Helmberg, W., Kapustin, Y., Landsman, D., Lipman, D., Madden, T., Maglott, D., Miller, V., Mizrachi, I., Ostell, J., Pruitt, K., Schuler, G., Sequeira, E., Sherry, S., Shumway, M., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusova, T., Wagner, L., Yaschenko, E., & Ye, J. (2009). Database resources of the National Center for Biotechnology Information Nucleic Acids Research, 37 (Database) DOI: 10.1093/nar/gkn741

June 4, 2009

Improving the OBO Foundry Principles

The Old Smithy Pub by loop ohThe Open Biomedical Ontologies (OBO) are a set of reference ontologies for describing all kinds of biomedical data, see [1-5] for examples. Every year, users and developers of these ontologies gather from around the globe for a workshop at the EBI near Cambridge, UK. Following on from the first workshop last year, the 2nd OBO workshop 2009 is fast approaching.

In preparation, I’ve been revisiting the OBO Foundry documentation, part of which establishes a set of principles for ontology development. I’m wondering how they could be improved because these principles are fundamental to the whole effort. We’ve been using one of the OBO ontologies (called Chemical Entities of Biological Interest (ChEBI)) in the REFINE project to mine data from the PubMed database. OBO Ontologies like ChEBI and the Gene Ontology are really crucial to making sense of the massive data which are now common in biology and medicine – so this is stuff that matters.

The OBO Foundry Principles, a sort of Ten Commandments of Ontology (or Obology if you prefer) currently look something like this (copied directly from obofoundry.org/crit.shtml):

  1. The ontology must be open and available to be used by all without any constraint other than (a) its origin must be acknowledged and (b) it is not to be altered and subsequently redistributed under the original name or with the same identifiers.The OBO ontologies are for sharing and are resources for the entire community. For this reason, they must be available to all without any constraint or license on their use or redistribution. However, it is proper that their original source is always credited and that after any external alterations, they must never be redistributed under the same name or with the same identifiers.
  2. The ontology is in, or can be expressed in, a common shared syntax. This may be either the OBO syntax, extensions of this syntax, or OWL. The reason for this is that the same tools can then be usefully applied. This facilitates shared software implementations. This criterion is not met in all of the ontologies currently listed, but we are working with the ontology developers to have them available in a common OBO syntax.
  3. The ontologies possesses a unique identifier space within the OBO Foundry. The source of a term (i.e. class) from any ontology can be immediately identified by the prefix of the identifier of each term. It is, therefore, important that this prefix be unique.
  4. The ontology provider has procedures for identifying distinct successive versions.
  5. The ontology has a clearly specified and clearly delineated content. The ontology must be orthogonal to other ontologies already lodged within OBO. The major reason for this principle is to allow two different ontologies, for example anatomy and process, to be combined through additional relationships. These relationships could then be used to constrain when terms could be jointly applied to describe complementary (but distinguishable) perspectives on the same biological or medical entity. As a corollary to this, we would strive for community acceptance of a single ontology for one domain, rather than encouraging rivalry between ontologies.
  6. The ontologies include textual definitions for all terms. Many biological and medical terms may be ambiguous, so terms should be defined so that their precise meaning within the context of a particular ontology is clear to a human reader.
  7. The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.
  8. The ontology is well documented.
  9. The ontology has a plurality of independent users.
  10. The ontology will be developed collaboratively with other OBO Foundry members.

ResearchBlogging.orgI’ve been asking all my frolleagues what they think of these principles and have got some lively responses, including some here from Allyson Lister, Mélanie Courtot, Michel Dumontier and Frank Gibson. So what do you think? How could these guidelines be improved? Do you have any specific (and preferably constructive) criticisms of these ambitious (and worthy) goals? Be bold, be brave and be polite. Anything controversial or “off the record” you can email it to me… I’m all ears.

CC-licensed picture above of the Old Smithy (pub) by Loop Oh. Inspired by Michael Ashburner‘s standing OBO joke (Ontolojoke) which goes something like this: Because Barry Smith is one of the leaders of OBO, should the project be called the OBO Smithy or the OBO Foundry? :-)


  1. Noy, N., Shah, N., Whetzel, P., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D., Storey, M., Chute, C., & Musen, M. (2009). BioPortal: ontologies and integrated data resources at the click of a mouse Nucleic Acids Research DOI: 10.1093/nar/gkp440
  2. Côté, R., Jones, P., Apweiler, R., & Hermjakob, H. (2006). The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries BMC Bioinformatics, 7 (1) DOI: 10.1186/1471-2105-7-97
  3. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L., Eilbeck, K., Ireland, A., Mungall, C., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R., Shah, N., Whetzel, P., & Lewis, S. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration Nature Biotechnology, 25 (11), 1251-1255 DOI: 10.1038/nbt1346
  4. Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A., & Rosse, C. (2005). Relations in biomedical ontologies Genome Biology, 6 (5) DOI: 10.1186/gb-2005-6-5-r46
  5. Bada, M., & Hunter, L. (2008). Identification of OBO nonalignments and its implications for OBO enrichment Bioinformatics, 24 (12), 1448-1455 DOI: 10.1093/bioinformatics/btn194

June 1, 2009

Scott Marshall on Interoperability

M. Scott MarshallScott Marshall is visiting Manchester this week, he will be doing a seminar on Friday 5th June, here are some details for anyone who is interested in attending:

Speaker: Dr. M. Scott Marshall, The University of Amsterdam

Date/Time: 5th June 2009, 11:00

Location: Room MLG.001 (Lecture Theatre), MIB building, (number 16 on campus map)

Title: Standards Enabled Interoperability: W3C Semantic Web for Health Care and Life Sciences Interest Group

Abstract: The W3C Semantic Web for Health Care and Life Sciences Interest Group (HCLS) has the mission of developing, advocating for, and supporting the use of Semantic Web technologies for biological science, translational medicine and health care. HCLS covers hot topics including data integration and federation, bridging commonly used domain standards such as CDISC and HL7, and the applications of medical terminologies. This talk will introduce the HCLS, as well as provide an overview of the activities that are currently ongoing within the task forces, as well as new developments and the recent Face2Face meeting. The role of information extraction and the current interest in Shared Identifiers will also be discussed.


  1. Ruttenberg, A., Rees, J., Samwald, M., & Marshall, M. (2009). Life sciences on the Semantic Web: the Neurocommons and beyond Briefings in Bioinformatics, 10 (2), 193-204 DOI: 10.1093/bib/bbp004

May 13, 2009

XML Summer School, Oxford

XML Summer School, Oxford, U.K.After a brief absence, it is good to see the XML Summer School is back again this September (20th-25th) at St. Edmund Hall, Oxford. This is  “a unique event for everyone using, designing or implementing solutions using XML and related technologies.” I’ve been both a delegate and a speaker here over the years; back in 2005, with Nick Drummond we presented the Protégé and OWL tutorial which was good fun.  So here is what I.M.H.O. makes the XML summer school worth a look: (more…)

Next Page »

Customized Rubric Theme Blog at WordPress.com.


Get every new post delivered to your Inbox.

Join 1,485 other followers