June 16, 2009

OBO Foundry workshop outcomes 2009

Filed under: conferences — Duncan Hull @ 4:28 pm

Well, I was going to blog about last week’s Open Biomedical Ontologies workshop, but Susanna-Assunta Sansone at the EBI has already done it via some very detailed minutes. See her notes for the:

  1. Overview
  2. Outcomes from day one
  3. Outcomes from day two

Thanks to the organisers of this workshop for hosting another well-run event. I’m only sorry I had to miss the delicious-looking dinner at Cotto in Cambridge (and entertaining company) on the last day… Hope to see you again next year.


  1. Schober, D., Smith, B., Lewis, S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, C., Rocca-Serra, P., & Sansone, S. (2009). Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics, 10 (1). DOI: 10.1186/1471-2105-10-125

[CC-licensed Picture of Haystack OWL by dullhunk].

June 15, 2009

Nettab 2009 Day One: Bio-wikis (and football)

A brief wiki-report and some wiki-links from the first short and introductory day of Network Applications and Tools in Biology (NETTAB 2009) in Sicily, where there was a tutorial on technologies of wiki resources and bio-wikis delivered by Paolo Romano and Elda Rossi. This covered Gene Wiki, WikiProteins, WikiGenes and WikiPathways [1-4].

There is already a bewildering array of different wikitechnology. Thankfully, wikimatrix (“compare them all”) gives wikicomparisons of some of the wikisolutions that are already out there (open vs. closed – more on this later).

The theme of the workshop this year has been Technologies, Tools and Applications for Collaborative and Social Bioinformatics Research and Development, so wikis seem like an obvious place to start.

Since user-driven social software is becoming increasingly important, here is a list of a few of the people involved in this year’s workshop:

  1. Giampaolo Bella
  2. Luca Bortolussi
  3. Leandro Ciuffo
  4. Alfredo Ferro
  5. Rosalba Giugno
  6. Alessandro Lagana
  7. Stefania Parodi
  8. Alfredo Pulvirenti
  9. Paolo Romano
  10. Elda Rossi
  11. Andrea Splendiani

I don’t know about you, but those names sound deliciously exotic to my non-Italian-speaking Inglese ears. When I read the list of names above, it sounds like an elite squad of the Azzurri (football team). You would have Romano as capitano in the middle of the park, joined by Ferro, Ciuffo and Rossi. Then at the back you’ve got the famous Italian catenaccio (locking defence: Paolo Maldini style), the kind that wins World Cups (remember 2006?) – there’s nothing getting past Parodi, Giugno, Pulvirenti and Bortolussi in defence. Last but not least, I’d put Splendiani and Bella up front; they sound like strikers to me, mostly because of their surnames.

What all this footballing nonsense has to do with NETTAB and wikis I don’t know. There’s probably some obvious-but-cliched link between Football and Science (by virtue of them both being collaborative and competitive team sports). But really, I just couldn’t resist a little Italian-inspired post about football. I hope to post some more notes on days two and three of the NETTAB workshop later… where most of the action took place.


  1. Mons, B., Ashburner, M., Chichester, C., van Mulligen, E., Weeber, M., den Dunnen, J., van Ommen, G., Musen, M., Cockerill, M., Hermjakob, H., Mons, A., Packer, A., Pacheco, R., Lewis, S., Berkeley, A., Melton, W., Barris, N., Wales, J., Meijssen, G., Moeller, E., Roes, P., Borner, K., & Bairoch, A. (2008). Calling on a million minds for community annotation in WikiProteins. Genome Biology, 9 (5). DOI: 10.1186/gb-2008-9-5-r89
  2. Hoffmann, R. (2008). A wiki for the life sciences where authorship matters. Nature Genetics, 40 (9), 1047-1051. DOI: 10.1038/ng.f.217
  3. Huss, J., Orozco, C., Goodale, J., Wu, C., Batalov, S., Vickers, T., Valafar, F., & Su, A. (2008). A Gene Wiki for Community Annotation of Gene Function. PLoS Biology, 6 (7). DOI: 10.1371/journal.pbio.0060175
  4. Pico, A., Kelder, T., van Iersel, M., Hanspers, K., Conklin, B., & Evelo, C. (2008). WikiPathways: Pathway Editing for the People. PLoS Biology, 6 (7). DOI: 10.1371/journal.pbio.0060184

Andrea Wiggins on little e-Science

Andrea Wiggins [1,2] from Syracuse University, New York is visiting Manchester this week and will be doing a seminar on “Little e-Science”, the details of which are below.

Date, time: 12 – 2pm on Thursday 18th June

Location: Atlas 1&2, Kilburn building

Title: Little eScience

Abstract: An interdisciplinary community of researchers has started to coalesce around the study of free/libre open source software (FLOSS) development. The research community is in many ways a reflection of the phenomenon of FLOSS practices in both social and technological respects, as many share the open source community’s values that support transparency and democratic participation. As community ties develop, new collaborations have spurred the creation of shared research resources: several repositories provide access to curated research-ready data, working paper repositories provide a means for disseminating early results, and a variety of analysis scripts and workflows connecting the data sets and literature are freely available. Despite these apparently favorable conditions for research collaboration, adoption of the tools and practices associated with eResearch has been slow as yet.

The key issues observed to date seem to stem from the challenges of pre-paradigmatic little science research. Researchers from software engineering, information systems, and even anthropology may examine the same construct, such as FLOSS project success, but will likely proceed from different epistemologies, utilize different data sources, identify different independent variables with varying operationalizations, and employ different research methodologies. In the decentralized and phenomenologically-driven FLOSS research community, creating and maintaining cyberinfrastructure [3] is a substantial effort for a small number of participants. In the little sciences, achieving critical mass of participation may be the most significant factor in creating a viable community of practice around eScience methods.

Update: Slides are embedded below:


  1. Andrea Wiggins (2009). Social Life of Information: We Are Who We Link. Andrea’s blog
  2. Andrea Wiggins, James Howison, & Kevin Crowston (2008). Social dynamics of FLOSS team communication across channels. Open Source Development, Communities and Quality
  3. Lincoln Stein (2008). Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nature Reviews Genetics, 9 (9), 678-688. DOI: 10.1038/nrg2414

June 10, 2009

Kenjiro Taura on Parallel Workflows

Kenjiro Taura is visiting Manchester next week from the Department of Information and Communication Engineering at the University of Tokyo. He will be doing a seminar, the details of which are below:

Title: Large scale text processing made simple by GXP make: A Unixish way to parallel workflow processing

Date-time: Monday, 15 June 2009 at 11:00 AM

Location: Room MLG.001, mib.ac.uk

In the first part of this talk, I will introduce a simple tool called GXP make. GXP is a general purpose parallel shell (a process launcher) for multicore machines, unmanaged clusters accessed via SSH, clusters or supercomputers managed by a batch scheduler, distributed machines, or any mixture thereof. GXP make is a ‘make’ execution engine that executes regular UNIX makefiles in parallel. Make, though typically used for software builds, is in fact a general framework to concisely describe workflows consisting of sequential commands. Installation of GXP requires no root privileges and needs to be done only on the user’s home machine. GXP make easily scales to more than 1,000 CPU cores. The net result is that GXP make allows an easy migration of workflows from serial environments to clusters and to distributed environments.

In the second part, I will talk about our experiences of running a complex text processing workflow developed by Natural Language Processing (NLP) experts. It is an entire workflow that processes MEDLINE abstracts with deep NLP tools (e.g., Enju parser [1]) to generate search indices of MEDIE, a semantic retrieval engine for MEDLINE. It was originally described in a Makefile with no particular provision for parallel processing, yet GXP make was able to run it on clusters with almost no changes to the original Makefile. Time for processing abstracts published in a single day was reduced from approximately eight hours (with a single machine) to twenty minutes with a trivial amount of effort. A larger scale experiment of processing all abstracts published so far, and the remaining challenges, will also be presented.
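To make the “make as a workflow engine” idea a little more concrete, here is a minimal Python sketch of dependency-driven parallel execution: a target runs as soon as all of its prerequisites have finished, and independent targets run side by side. This is only an illustration of the general idea and not GXP make or its interface. The target names (`parse_a`, `parse_b`, `index`) and the `run_workflow` helper are hypothetical, and the sketch assumes Python 3.9 or later for `graphlib`.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_workflow(deps, actions, max_workers=4):
    """Run each target once its prerequisites are done.

    deps maps a target name to the set of targets it depends on.
    actions maps a target name to a zero-argument callable.
    Returns the target names in the order they completed.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    pending = {}  # future -> target name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every target whose prerequisites are all complete.
            for target in sorter.get_ready():
                pending[pool.submit(actions[target])] = target
            # Block until at least one running target finishes.
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any error from the action
                target = pending.pop(future)
                finished.append(target)
                sorter.done(target)  # may unlock dependent targets
    return finished


# Hypothetical workflow: parse two batches independently, then index them.
deps = {
    "parse_a": set(),
    "parse_b": set(),
    "index": {"parse_a", "parse_b"},
}
actions = {name: (lambda name=name: print("running", name)) for name in deps}
order = run_workflow(deps, actions)
```

The two parse steps may finish in either order, but `index` always comes last because it is only submitted after both of its prerequisites are marked as done.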


  1. Miyao, Y., Sagae, K., Saetre, R., Matsuzaki, T., & Tsujii, J. (2008). Evaluating contributions of natural language parsers to protein-protein interaction extraction. Bioinformatics, 25 (3), 394-400. DOI: 10.1093/bioinformatics/btn631

June 4, 2009

Improving the OBO Foundry Principles

The Open Biomedical Ontologies (OBO) are a set of reference ontologies for describing all kinds of biomedical data; see [1-5] for examples. Every year, users and developers of these ontologies gather from around the globe for a workshop at the EBI near Cambridge, UK. Following on from the first workshop last year, the 2nd OBO workshop 2009 is fast approaching.

In preparation, I’ve been revisiting the OBO Foundry documentation, part of which establishes a set of principles for ontology development. I’m wondering how they could be improved because these principles are fundamental to the whole effort. We’ve been using one of the OBO ontologies, Chemical Entities of Biological Interest (ChEBI), in the REFINE project to mine data from the PubMed database. OBO ontologies like ChEBI and the Gene Ontology are really crucial to making sense of the massive amounts of data that are now common in biology and medicine – so this is stuff that matters.

The OBO Foundry Principles, a sort of Ten Commandments of Ontology (or Obology if you prefer) currently look something like this (copied directly from obofoundry.org/crit.shtml):

  1. The ontology must be open and available to be used by all without any constraint other than (a) its origin must be acknowledged and (b) it is not to be altered and subsequently redistributed under the original name or with the same identifiers. The OBO ontologies are for sharing and are resources for the entire community. For this reason, they must be available to all without any constraint or license on their use or redistribution. However, it is proper that their original source is always credited and that after any external alterations, they must never be redistributed under the same name or with the same identifiers.
  2. The ontology is in, or can be expressed in, a common shared syntax. This may be either the OBO syntax, extensions of this syntax, or OWL. The reason for this is that the same tools can then be usefully applied. This facilitates shared software implementations. This criterion is not met in all of the ontologies currently listed, but we are working with the ontology developers to have them available in a common OBO syntax.
  3. The ontology possesses a unique identifier space within the OBO Foundry. The source of a term (i.e. class) from any ontology can be immediately identified by the prefix of the identifier of each term. It is, therefore, important that this prefix be unique.
  4. The ontology provider has procedures for identifying distinct successive versions.
  5. The ontology has a clearly specified and clearly delineated content. The ontology must be orthogonal to other ontologies already lodged within OBO. The major reason for this principle is to allow two different ontologies, for example anatomy and process, to be combined through additional relationships. These relationships could then be used to constrain when terms could be jointly applied to describe complementary (but distinguishable) perspectives on the same biological or medical entity. As a corollary to this, we would strive for community acceptance of a single ontology for one domain, rather than encouraging rivalry between ontologies.
  6. The ontologies include textual definitions for all terms. Many biological and medical terms may be ambiguous, so terms should be defined so that their precise meaning within the context of a particular ontology is clear to a human reader.
  7. The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.
  8. The ontology is well documented.
  9. The ontology has a plurality of independent users.
  10. The ontology will be developed collaboratively with other OBO Foundry members.
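Principle 3 is easy to see in action. The short Python sketch below assumes identifiers written in the familiar `PREFIX:LOCAL_ID` form (for example `CHEBI:15377` or `GO:0008150`) and shows how the prefix alone tells you which ontology a term came from. The helper names `id_prefix` and `group_by_ontology` are made up for this illustration and are not part of any OBO tool.

```python
def id_prefix(term_id: str) -> str:
    """Return the identifier-space prefix of a PREFIX:LOCAL_ID term ID."""
    prefix, sep, local_id = term_id.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a PREFIX:LOCAL_ID identifier: {term_id!r}")
    return prefix


def group_by_ontology(term_ids):
    """Group term IDs by the prefix that names their source ontology."""
    groups = {}
    for term_id in term_ids:
        groups.setdefault(id_prefix(term_id), []).append(term_id)
    return groups


# Example identifiers from two different identifier spaces.
terms = ["CHEBI:15377", "GO:0008150", "CHEBI:16236"]
print(group_by_ontology(terms))
```

If two ontologies shared a prefix, this grouping would silently merge their terms, which is one way of seeing why the principle insists that prefixes be unique.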

I’ve been asking all my frolleagues what they think of these principles and have got some lively responses, including some here from Allyson Lister, Mélanie Courtot, Michel Dumontier and Frank Gibson. So what do you think? How could these guidelines be improved? Do you have any specific (and preferably constructive) criticisms of these ambitious (and worthy) goals? Be bold, be brave and be polite. You can email me anything controversial or “off the record”… I’m all ears.

CC-licensed picture above of the Old Smithy (pub) by Loop Oh. Inspired by Michael Ashburner’s standing OBO joke (Ontolojoke) which goes something like this: Because Barry Smith is one of the leaders of OBO, should the project be called the OBO Smithy or the OBO Foundry? 🙂


  1. Noy, N., Shah, N., Whetzel, P., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D., Storey, M., Chute, C., & Musen, M. (2009). BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research. DOI: 10.1093/nar/gkp440
  2. Côté, R., Jones, P., Apweiler, R., & Hermjakob, H. (2006). The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics, 7 (1). DOI: 10.1186/1471-2105-7-97
  3. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L., Eilbeck, K., Ireland, A., Mungall, C., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R., Shah, N., Whetzel, P., & Lewis, S. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25 (11), 1251-1255. DOI: 10.1038/nbt1346
  4. Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A., & Rosse, C. (2005). Relations in biomedical ontologies. Genome Biology, 6 (5). DOI: 10.1186/gb-2005-6-5-r46
  5. Bada, M., & Hunter, L. (2008). Identification of OBO nonalignments and its implications for OBO enrichment. Bioinformatics, 24 (12), 1448-1455. DOI: 10.1093/bioinformatics/btn194

June 2, 2009

Who Are You? Digital Identity in Science

The organisers of the Science Online London 2009 conference are asking people to propose their own session ideas (see some examples here), so here is a proposal:

Title: Who Are You? Digital Identity in Science

Many important decisions in Science are based on identifying scientists and their contributions. From selecting reviewers for grants and publications, to attributing published data and deciding who is funded, hired or promoted, digital identity is at the heart of Science on the Web.

Despite the importance of digital identity, identifying scientists online is an unsolved problem [1]. Consequently, a significant amount of scientific and scholarly work is not easily cited or credited, especially digital contributions: from blogs and wikis, to source code, databases and traditional peer-reviewed publications on the Web. This (proposed) session will look at current mechanisms for identifying scientists digitally including contributor-id (CrossRef), researcher-id (Thomson), Scopus Author ID (Elsevier), OpenID, Google Scholar [2], Single Sign On, PubMed, FOAF+SSL, LinkedIn, Shared Identifiers (URIs) and the rest. We will introduce and discuss each via a SWOT analysis (Strengths, Weaknesses, Opportunities and Threats). Is digital identity even possible and ethical? Besides the obvious benefits of persistent, reliable and unique identifiers, what are the privacy and security issues with personal digital identity?

If this is a successful proposal, I’ll need some help. Any offers? If you are interested in joining in the fun, more details are at scienceonlinelondon.org


  1. Bourne, P., & Fink, J. (2008). I Am Not a Scientist, I Am a Number. PLoS Computational Biology, 4 (12). DOI: 10.1371/journal.pcbi.1000247
  2. Various Publications about unique author identifiers bookmarked in citeulike
  3. Yours Truly (2009) Google thinks I’m Maurice Wilkins
  4. The Who (1978) Who Are You? Who, who, who, who? (Thanks to Jan Aerts for the reference!)

Michael Ley on Digital Bibliographies

Michael Ley is visiting Manchester this week and will be doing a seminar on Wednesday 3rd June. Here are some details for anyone who is interested in attending:

Date: 3rd Jun 2009

Title: DBLP: How the data get in

Speaker: Dr Michael Ley. University of Trier, Germany

Time & Location: 14:15, Lecture Theatre 1.4, Kilburn Building

Abstract: The DBLP (Digital Bibliography & Library Project) Computer Science Bibliography now includes more than 1.2 million bibliographic records. For Computer Science researchers the DBLP web site is now a popular tool to trace the work of colleagues and to retrieve bibliographic details when composing the lists of references for new papers. Ranking and profiling of persons, institutions, journals, or conferences is another usage of DBLP. Many scientists are aware of this and want their publications to be listed as completely as possible.

The talk focuses on the data acquisition workflow for DBLP. To get ‘clean’ basic bibliographic information for scientific publications remains a chaotic puzzle.

Large publishers are either not interested in cooperating with open services like DBLP, or their policy is very inconsistent. In most cases they are not able or not willing to deliver basic data required for DBLP in a direct way, but they encourage us to crawl their Web sites. This indirection has two main problems:

  1. The organisation and appearance of Web sites change from time to time, which forces a reimplementation of information extraction scripts. [1]
  2. In many cases manual steps are necessary to get ‘complete’ bibliographic information.

For many small information sources it is not worthwhile to develop information extraction scripts. Data acquisition is done manually. There is an amazing variety of small but interesting journals, conferences and workshops in Computer Science which are not under the umbrella of ACM, IEEE, Springer, Elsevier etc. How they get in is often decided very pragmatically.

The goal of the talk and my visit to Manchester is to start a discussion process: The EasyChair conference management system developed by Andrei Voronkov and DBLP are parts of the scientific publication workflow. Should they be connected for mutual benefit?


  1. Lincoln Stein (2002). Creating a bioinformatics nation: screen scraping is torture. Nature, 417 (6885), 119-120. DOI: 10.1038/417119a

Blogging For Profit: Costs and Benefits

The organisers of the Science Online London 2009 conference are asking people to propose their own session ideas (see some examples here), so here is a proposal:

Title: Blogging For Profit: Costs and Benefits

What are the costs and benefits of blogging and how can you make sure the latter justifies the former?

This (proposed) session will look at two kinds of profit, and the costs associated with each.

  1. Research profit (in science and academia), building digital reputations on the Web. Can blogging help your next grant proposal for research funding and if so, how? How can blogging be used to increase the visibility and impact of published research via the likes of ResearchBlogging.org, blogs.nature.com and other aggregators?
  2. Financial profit (in business), making blogging pay the bills. What business models (and infrastructure) exist to support blogging? Including, but not limited to: Nature Network, ScienceBlogs, Google AdSense, “20% time”, “free” tools (WordPress, Blogger, OpenWetWare etc). Going solo vs. joining a club – which business models and tools are right for you?

This could be followed by a general discussion on these benefits. When do they justify their costs (and risks) and make for profitable blogging?

If this is a successful proposal, I’ll need some help. Any offers? If you are interested in joining in the fun, details are at scienceonlinelondon.org

[CC-licensed Business Graph picture by nDevilTV]

June 1, 2009

Scott Marshall on Interoperability

Scott Marshall is visiting Manchester this week and will be doing a seminar on Friday 5th June. Here are some details for anyone who is interested in attending:

Speaker: Dr. M. Scott Marshall, The University of Amsterdam

Date/Time: 5th June 2009, 11:00

Location: Room MLG.001 (Lecture Theatre), MIB building, (number 16 on campus map)

Title: Standards Enabled Interoperability: W3C Semantic Web for Health Care and Life Sciences Interest Group

Abstract: The W3C Semantic Web for Health Care and Life Sciences Interest Group (HCLS) has the mission of developing, advocating for, and supporting the use of Semantic Web technologies for biological science, translational medicine and health care. HCLS covers hot topics including data integration and federation, bridging commonly used domain standards such as CDISC and HL7, and the applications of medical terminologies. This talk will introduce the HCLS and provide an overview of the activities that are currently ongoing within the task forces, as well as new developments and the recent Face2Face meeting. The role of information extraction and the current interest in Shared Identifiers will also be discussed.


  1. Ruttenberg, A., Rees, J., Samwald, M., & Marshall, M. (2009). Life sciences on the Semantic Web: the Neurocommons and beyond. Briefings in Bioinformatics, 10 (2), 193-204. DOI: 10.1093/bib/bbp004

May 26, 2009

Subscribing to O’Really?

Filed under: technology — Duncan Hull @ 4:53 pm

Just a quick note about subscribing: if you are a regular reader of this O’Really blog and you don’t already subscribe, there are two ways you can receive automatic notifications when new posts are published here:

  1. Point your feed reader at http://feeds2.feedburner.com/oreally (the preferred method), or …
  2. Point your feed reader at https://duncan.hull.name/feed/ (the WordPress method), which unfortunately gives unreliable subscriber stats. This feed is linked to by the blue or orange feed icon (pictured right) you should be able to see in your web browser’s address bar.

The first feed is just the second feed re-routed through the magic of FeedBurner, which gives more useful viewing statistics.
