O'Really?

March 16, 2010

DNA, Diversity and You at Cambridge Science Festival

As part of Cambridge Science Festival last weekend, I joined a group of about 40 volunteers from The Sanger and EBI at an event called “DNA, diversity and you”. This was a series of education and outreach events designed to explore how differences in your genetic code make you different from other individuals, and what makes humans different from other living things – with a bit of computational biology thrown in for good measure. Here are some notes on a selection of the activities, in case you ever find yourself trying to explain biology, computer science or bioinformatics to anyone aged 4-18 and beyond. These resources are all tried, tested and fun to work with, for students and teachers alike:

  1. DNA origami: create your own origami DNA molecule, a hands-on way of learning about the double helix structure of DNA.
  2. DNA sequence bracelets (see picture right). Thread coloured beads according to sequence sections from a range of organisms including trout, chimpanzee, butterfly, a flesh-eating microbe and rotting corpse flower.
  3. Yummy gummy DNA (under 5’s): build your own DNA helix out of sweets and cocktail sticks. Then scoff it all afterwards.
  4. What’s my name in DNA? Find out what your name is in DNA, and what the corresponding (hypothetical) protein is, using software from deCODE.
  5. Function Finders: translate DNA into a sequence of amino acids using wooden translator blocks, then find out which organism the amino acid sequence is from.
  6. Genome sizes (with seatbelts): rank organisms (inc. human, zebrafish, mosquito, sugar cane and yeast) and find out if they are in the right order. Results are often not what you would expect.
  7. Play your genes right. A card-based guessing game which compares the number of genes in the human genome with the number of genes from a range of different organisms including the flu virus, E. coli bacteria, armadillo, rice plant and others.
  8. Genome Jigsaws: illustrate the process of finishing supposedly “finished” genomes by putting together a square sequence jigsaw, following base pairing rules, to end up with a complete finished square.
  9. DNA Time Team: examines aspects of ancestry and evolution. The activity encourages people to work out the sequence of a common ancestor by filling in the gaps on a simple evolutionary tree.
  10. Spot the difference with proteins. Compares Heat Shock Protein (HSP) in humans and other organisms to illustrate how different regions of the protein vary between different organisms and how this affects function.
  11. Ready, steady sort: a sorting network that demonstrates one technique that computers use to sort through large amounts of information like sequence data. This comes straight from Computer Science Unplugged by Tim Bell, Mike Fellows and Ian Witten. This activity can be done either as a smaller board game, or as a larger floor game. Either way, it’s a lot of fun, especially if you time people for an added competitive element (see video below)
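For anyone curious what the sorting network in the last activity looks like in code, here is a minimal Python sketch. It is an illustration under assumptions, not the Computer Science Unplugged layout: it uses an odd-even transposition network, a simple design in which each layer only compares neighbouring positions, and the names `transposition_network` and `run_network` are made up for this example. The comparisons within one layer touch different positions, so in principle they could all happen at the same time, which is the parallel computing idea the floor game acts out.

```python
def transposition_network(n):
    """Build the layers of an odd-even transposition network for n inputs.

    Each layer is a list of (i, i + 1) comparators. Even-numbered layers
    start at position 0 and odd-numbered layers start at position 1.
    Using n layers is enough to sort any n inputs.
    """
    layers = []
    for layer_index in range(n):
        start = layer_index % 2
        layers.append([(i, i + 1) for i in range(start, n - 1, 2)])
    return layers


def run_network(values, layers):
    """Pass the values through every comparator, layer by layer."""
    items = list(values)
    for layer in layers:
        # Comparators in one layer never share a position, so a parallel
        # machine (or six people on a floor game) could do them together.
        for low, high in layer:
            if items[low] > items[high]:
                items[low], items[high] = items[high], items[low]
    return items


network = transposition_network(6)
print(run_network([5, 1, 6, 2, 4, 3], network))
```

The sequence of comparisons is fixed in advance and never depends on the data, which is what separates a sorting network from most everyday sorting routines.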

There were a whole bunch of new activities at the festival this year; maybe these will appear on the Your Genome website in the future. Anyway, it was great fun to get involved. There is nothing quite like the challenge of explaining parallel computing to young kids, teenagers and their parents – actually much easier than you’d think if you’ve got access to great teaching materials.

Thanks to Francesca Gale and Louisa Wright for all the hard work that went into organising this fun and successful event.

June 23, 2009

Impact Factor Boxing 2009

[This post is part of an ongoing series about impact factors]

The latest results from the annual impact factor boxing world championship contest are out. This is a combat sport where scientific journals are scored according to their supposed influence and impact in Science. This year’s competition rankings include the first-ever update to the newly introduced Five Year Impact Factor and Eigenfactor™ Metrics [1,2] in Journal Citation Reports (JCR) on the Web (see www.isiknowledge.com/JCR; warning: clunky website requires subscription*), presumably in response to widespread criticism of impact factors. The Eigenfactor™ seems to correlate quite closely with the impact factor scores. Both work at the level of the journal, although they use different methods for measuring a given journal’s impact. However, what many authors are often more interested in is the impact of an individual article, not the journal where it was published. So it would be interesting to see how the figures below tally with Google Scholar (see also comments by Abhishek Tiwari). I’ve included a table below of bioinformatics impact factors, updated for June 2009. Of course, when I say 2009 (today), I mean 2008 (these are the latest figures available based on data from 2007) – so this shiny new information published this week is already out of date [3] and flawed [4,5], but here is a selection of the data anyway: [update: see figures published in June 2010.]

2008 data from isiknowledge.com/JCR; the last two columns are Eigenfactor™ Metrics.

| Journal Title | Total Cites | Impact Factor | 5-Year Impact Factor | Immediacy Index | Articles | Cited Half-life | Eigenfactor™ Score | Article Influence™ Score |
|---|---|---|---|---|---|---|---|---|
| BMC Bioinformatics | 8141 | 3.781 | 4.246 | 0.664 | 607 | 2.8 | 0.06649 | 1.730 |
| OUP Bioinformatics | 30344 | 4.328 | 6.481 | 0.566 | 643 | 4.8 | 0.18204 | 2.593 |
| Briefings in Bioinformatics | 2908 | 4.627 | | 1.273 | 44 | 4.5 | 0.02188 | |
| PLoS Computational Biology | 2730 | 5.895 | 6.144 | 0.826 | 253 | 2.1 | 0.03063 | 3.370 |
| Genome Biology | 9875 | 6.153 | 7.812 | 0.961 | 229 | 4.4 | 0.07930 | 3.858 |
| Nucleic Acids Research | 86787 | 6.878 | 6.968 | 1.635 | 1070 | 6.5 | 0.37108 | 2.963 |
| PNAS | 416018 | 9.380 | 10.228 | 1.635 | 3508 | 7.4 | 1.69893 | 4.847 |
| Science | 409290 | 28.103 | 30.268 | 6.261 | 862 | 8.4 | 1.58344 | 16.283 |
| Nature | 443967 | 31.434 | 31.210 | 8.194 | 899 | 8.5 | 1.76407 | 17.278 |

The internet is radically changing the way we communicate and this includes scientific publishing. As media mogul Rupert Murdoch once pointed out, big will not beat small any more – it will be the fast beating the slow. An interesting question for publishers and scientists is: how can the Web help the faster flyweight and featherweight boxers (smaller journals) compete and punch above their weight with the reigning world champion heavyweights (Nature, Science and PNAS)? Will the heavyweight publishers always have the killer knockout punches? If you’ve got access to the internet, then you already have a ringside seat from which to watch all the action. This fight should be entertaining viewing and there is an awful lot of money riding on the outcome [6-11].

Seconds away, round two…

References

  1. Fersht, A. (2009). The most influential journals: Impact Factor and Eigenfactor. Proceedings of the National Academy of Sciences, 106 (17), 6883-6884. DOI: 10.1073/pnas.0903307106
  2. Bergstrom, C., & West, J. (2008). Assessing citations with the Eigenfactor Metrics. Neurology, 71 (23), 1850-1851. DOI: 10.1212/01.wnl.0000338904.37585.66
  3. Cockerill, M. (2004). Delayed impact: ISI’s citation tracking choices are keeping scientists in the dark. BMC Bioinformatics, 5 (1). DOI: 10.1186/1471-2105-5-93
  4. Allen, L., Jones, C., Dolby, K., Lynn, D., & Walport, M. (2009). Looking for Landmarks: The Role of Expert Review and Bibliometric Analysis in Evaluating Scientific Publication Outputs. PLoS ONE, 4 (6). DOI: 10.1371/journal.pone.0005910
  5. Grant, R.P. (2009). On article-level metrics and other animals. Nature Network
  6. Corbyn, Z. (2009). Do academic journals pose a threat to the advancement of Science? Times Higher Education
  7. Fenner, M. (2009). PLoS ONE: Interview with Peter Binfield. Gobbledygook blog at Nature Network
  8. Hoyt, J. (2009). Who is killing science on the Web? Publishers or Scientists? Mendeley Blog
  9. Hull, D. (2009). Escape from the Impact Factor: The Great Escape? O’Really? blog
  10. Murray-Rust, P. (2009). THE article: Do academic journals pose a threat to the advancement of science? Peter Murray-Rust’s blog: A Scientist and the Web
  11. Wu, S. (2009). The evolution of Scientific Impact. shirleywho.wordpress.com

* This important data should be freely available (e.g. no subscription), since crucial decisions about the allocation of public money depend on it, but that’s another story.

[More commentary on this post over at friendfeed. CC-licensed Fight Night Punch Test by djclear904]

June 4, 2009

Improving the OBO Foundry Principles

The Open Biomedical Ontologies (OBO) are a set of reference ontologies for describing all kinds of biomedical data; see [1-5] for examples. Every year, users and developers of these ontologies gather from around the globe for a workshop at the EBI near Cambridge, UK. Following on from the first workshop last year, the 2nd OBO workshop 2009 is fast approaching.

In preparation, I’ve been revisiting the OBO Foundry documentation, part of which establishes a set of principles for ontology development. I’m wondering how they could be improved, because these principles are fundamental to the whole effort. We’ve been using one of the OBO ontologies, Chemical Entities of Biological Interest (ChEBI), in the REFINE project to mine data from the PubMed database. OBO ontologies like ChEBI and the Gene Ontology are really crucial to making sense of the massive data sets which are now common in biology and medicine – so this is stuff that matters.

The OBO Foundry Principles, a sort of Ten Commandments of Ontology (or Obology if you prefer), currently look something like this (copied directly from obofoundry.org/crit.shtml):

  1. The ontology must be open and available to be used by all without any constraint other than (a) its origin must be acknowledged and (b) it is not to be altered and subsequently redistributed under the original name or with the same identifiers. The OBO ontologies are for sharing and are resources for the entire community. For this reason, they must be available to all without any constraint or license on their use or redistribution. However, it is proper that their original source is always credited and that after any external alterations, they must never be redistributed under the same name or with the same identifiers.
  2. The ontology is in, or can be expressed in, a common shared syntax. This may be either the OBO syntax, extensions of this syntax, or OWL. The reason for this is that the same tools can then be usefully applied. This facilitates shared software implementations. This criterion is not met in all of the ontologies currently listed, but we are working with the ontology developers to have them available in a common OBO syntax.
  3. The ontologies possesses a unique identifier space within the OBO Foundry. The source of a term (i.e. class) from any ontology can be immediately identified by the prefix of the identifier of each term. It is, therefore, important that this prefix be unique.
  4. The ontology provider has procedures for identifying distinct successive versions.
  5. The ontology has a clearly specified and clearly delineated content. The ontology must be orthogonal to other ontologies already lodged within OBO. The major reason for this principle is to allow two different ontologies, for example anatomy and process, to be combined through additional relationships. These relationships could then be used to constrain when terms could be jointly applied to describe complementary (but distinguishable) perspectives on the same biological or medical entity. As a corollary to this, we would strive for community acceptance of a single ontology for one domain, rather than encouraging rivalry between ontologies.
  6. The ontologies include textual definitions for all terms. Many biological and medical terms may be ambiguous, so terms should be defined so that their precise meaning within the context of a particular ontology is clear to a human reader.
  7. The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.
  8. The ontology is well documented.
  9. The ontology has a plurality of independent users.
  10. The ontology will be developed collaboratively with other OBO Foundry members.
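Principle 3 is the easiest one to show in code. Here is a minimal Python sketch, assuming identifiers written in the common `PREFIX:local` form (for example `CHEBI:15377` or `GO:0008150`); the function names `split_obo_id` and `group_by_prefix` are hypothetical helpers invented for this illustration and are not part of any OBO tool.

```python
def split_obo_id(term_id):
    """Split an identifier such as 'CHEBI:15377' into (prefix, local_id)."""
    prefix, separator, local_id = term_id.partition(":")
    if not (separator and prefix and local_id):
        raise ValueError(f"not a PREFIX:local identifier: {term_id!r}")
    return prefix, local_id


def group_by_prefix(term_ids):
    """Group identifiers by the prefix that names their source ontology."""
    groups = {}
    for term_id in term_ids:
        prefix, _ = split_obo_id(term_id)
        groups.setdefault(prefix, []).append(term_id)
    return groups


# Illustrative identifiers only, assumed to follow the PREFIX:local form.
example_ids = ["CHEBI:15377", "GO:0008150", "CHEBI:16236"]
print(group_by_prefix(example_ids))
```

This only works if no two ontologies share a prefix, which is exactly why the principle insists that the prefix be unique.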

I’ve been asking all my frolleagues what they think of these principles and have got some lively responses, including some here from Allyson Lister, Mélanie Courtot, Michel Dumontier and Frank Gibson. So what do you think? How could these guidelines be improved? Do you have any specific (and preferably constructive) criticisms of these ambitious (and worthy) goals? Be bold, be brave and be polite. Anything controversial or “off the record” can be emailed to me… I’m all ears.

CC-licensed picture above of the Old Smithy (pub) by Loop Oh. Inspired by Michael Ashburner’s standing OBO joke (Ontolojoke) which goes something like this: Because Barry Smith is one of the leaders of OBO, should the project be called the OBO Smithy or the OBO Foundry? 🙂

References

  1. Noy, N., Shah, N., Whetzel, P., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D., Storey, M., Chute, C., & Musen, M. (2009). BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research. DOI: 10.1093/nar/gkp440
  2. Côté, R., Jones, P., Apweiler, R., & Hermjakob, H. (2006). The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics, 7 (1). DOI: 10.1186/1471-2105-7-97
  3. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L., Eilbeck, K., Ireland, A., Mungall, C., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R., Shah, N., Whetzel, P., & Lewis, S. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25 (11), 1251-1255. DOI: 10.1038/nbt1346
  4. Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A., & Rosse, C. (2005). Relations in biomedical ontologies. Genome Biology, 6 (5). DOI: 10.1186/gb-2005-6-5-r46
  5. Bada, M., & Hunter, L. (2008). Identification of OBO nonalignments and its implications for OBO enrichment. Bioinformatics, 24 (12), 1448-1455. DOI: 10.1093/bioinformatics/btn194

June 2, 2009

Who Are You? Digital Identity in Science

The organisers of the Science Online London 2009 conference are asking people to propose their own session ideas (see some examples here), so here is a proposal:

Title: Who Are You? Digital Identity in Science

Many important decisions in Science are based on identifying scientists and their contributions. From selecting reviewers for grants and publications, to attributing published data and deciding who is funded, hired or promoted, digital identity is at the heart of Science on the Web.

Despite the importance of digital identity, identifying scientists online is an unsolved problem [1]. Consequently, a significant amount of scientific and scholarly work is not easily cited or credited, especially digital contributions: from blogs and wikis, to source code, databases and traditional peer-reviewed publications on the Web. This (proposed) session will look at current mechanisms for identifying scientists digitally including contributor-id (CrossRef), researcher-id (Thomson), Scopus Author ID (Elsevier), OpenID, Google Scholar [2], Single Sign On, PubMed, FOAF+SSL, LinkedIn, Shared Identifiers (URIs) and the rest. We will introduce and discuss each via a SWOT analysis (Strengths, Weaknesses, Opportunities and Threats). Is digital identity even possible and ethical? Besides the obvious benefits of persistent, reliable and unique identifiers, what are the privacy and security issues with personal digital identity?

If this is a successful proposal, I’ll need some help. Any offers? If you are interested in joining in the fun, more details are at scienceonlinelondon.org

References

  1. Bourne, P., & Fink, J. (2008). I Am Not a Scientist, I Am a Number. PLoS Computational Biology, 4 (12). DOI: 10.1371/journal.pcbi.1000247
  2. Various Publications about unique author identifiers bookmarked in citeulike
  3. Yours Truly (2009) Google thinks I’m Maurice Wilkins
  4. The Who (1978) Who Are You? Who, who, who, who? (Thanks to Jan Aerts for the reference!)

Michael Ley on Digital Bibliographies

Michael Ley is visiting Manchester this week and will be giving a seminar on Wednesday 3rd June. Here are some details for anyone who is interested in attending:

Date: 3rd Jun 2009

Title: DBLP: How the data get in

Speaker: Dr Michael Ley. University of Trier, Germany

Time & Location: 14:15, Lecture Theatre 1.4, Kilburn Building

Abstract: The DBLP (Digital Bibliography & Library Project) Computer Science Bibliography now includes more than 1.2 million bibliographic records. For Computer Science researchers the DBLP web site is now a popular tool to trace the work of colleagues and to retrieve bibliographic details when composing the lists of references for new papers. Ranking and profiling of persons, institutions, journals, or conferences is another usage of DBLP. Many scientists are aware of this and want their publications to be listed as completely as possible.

The talk focuses on the data acquisition workflow for DBLP. Getting ‘clean’ basic bibliographic information for scientific publications remains a chaotic puzzle.

Large publishers are either not interested in cooperating with open services like DBLP, or their policy is very inconsistent. In most cases they are not able or not willing to deliver the basic data required for DBLP in a direct way, but they encourage us to crawl their Web sites. This indirection has two main problems:

  1. The organisation and appearance of Web sites changes from time to time, which forces a reimplementation of information extraction scripts. [1]
  2. In many cases manual steps are necessary to get ‘complete’ bibliographic information.

For many small information sources it is not worthwhile to develop information extraction scripts, so data acquisition is done manually. There is an amazing variety of small but interesting journals, conferences and workshops in Computer Science which are not under the umbrella of ACM, IEEE, Springer, Elsevier etc. How they get in is often decided very pragmatically.

The goal of the talk and my visit to Manchester is to start a discussion process: the EasyChair conference management system developed by Andrei Voronkov and DBLP are both parts of the scientific publication workflow. Should they be connected for mutual benefit?

References

  1. Stein, L. (2002). Creating a bioinformatics nation: screen scraping is torture. Nature, 417 (6885), 119-120. DOI: 10.1038/417119a