Uncategorized | O'Really?

July 6, 2009

Fabio Rinaldi on OntoGene

Filed under: Uncategorized — Duncan Hull @ 8:26 am
Tags: biocreative, Fabio Rinaldi, nactem, OntoGene

Fabio Rinaldi is currently visiting Manchester from the University of Zurich. He will be giving a seminar on Monday 6th July; the details are below.

Title : OntoGene in the BioNLP shared task and in BioCreative II.5

Speaker: Dr Fabio Rinaldi, University of Zurich

Date: Monday 6th July 2009

Time: 14:00

Location: Lecture Theatre – MLG.001, MIB building

Abstract: In this talk I will describe our participation in the BioNLP shared task and the BioCreative II.5 competitions [1]. Our approach is based on a common core: a pipeline of NLP tools and a dependency parser. The adaptation for the BioNLP shared task consisted of suitable input filters and a transformation-based approach which maps syntactic dependencies to event structures. Despite the very simple approach, results were satisfactory (34.78 F-score). The adaptation for BioCreative requires the detection and disambiguation of domain entities, while candidate interactions are proposed on the basis of a simple learning approach.

If time allows I will then describe our approach to finding the ‘focus organisms’ i.e. the organisms in which the experiments have been conducted or which are the source of the interacting proteins. This information is of crucial importance for the correct disambiguation of other entities mentioned in the article.

References

[1] Rinaldi, F., Kappeler, T., Kaljurand, K., Schneider, G., Klenner, M., Clematide, S., Hess, M., von Allmen, J., Parisot, P., Romacker, M., & Vachon, T. (2008). OntoGene in BioCreative II. Genome Biology, 9 (Suppl 2). DOI: 10.1186/gb-2008-9-s2-s13

June 19, 2008

Sixteen (Yes 16!) PhD studentships available in Computer Science

Filed under: Uncategorized — Duncan Hull @ 3:45 pm
Tags: epsrc, mrc, phd, research, studentship

The School of Computer Science of the University of Manchester has up to 16 studentships to offer to highly motivated research students who wish to start a PhD in September 2008 (in exceptional circumstances the start date can be deferred until April 2009). The studentships pay tuition fees and a stipend to cover living expenses for 3 years.

In 2008/09, the stipend will be £12940 per year for students who were UK residents in the 3 years before the start of the PhD, or between £10352 and £12940 per year for students who were not UK residents in the same period and cannot demonstrate a relevant connection to the UK. The stipend is expected to rise in subsequent years. Because of conditions associated with this funding, these studentships are open to students eligible for home fees only; this includes UK and EU nationals.

January 18, 2008

One Thousand Databases High (and rising)

Filed under: informatics,Uncategorized — Duncan Hull @ 12:36 pm
Tags: bioinformatics, data tombs, database, DNA mania, Michael Galperin, NAR, Usama Fayyad

Well it’s that time of year again. The 15th annual stamp collecting edition of the journal Nucleic Acids Research (NAR), also known as the 2008 Database issue [1], was published earlier this week. This year there are 1078 databases listed in the collection, 110 more than the previous one (see Figure 1). As we pass the one thousand databases mark (1kDB) I wonder, what proportion of the data in these databases will never be used?

R.I.P. Biological Data?

It seems highly likely that lots of this data is stored in what Usama Fayyad at Yahoo! Research! Laboratories! calls data tombs [2], because as he puts it:

“Our ability to capture and store data has far outpaced our ability to process and utilise it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again.”

Like last year, let’s illustrate the growth with an obligatory graph (see Figure 1).

Figure 1: Data growth: the ability to capture and store biological data has far outpaced our ability to understand it. The vertical axis is the number of databases listed in Nucleic Acids Research [1]; the horizontal axis is the year. (Picture drawn with the Google Charts API, which is OK but, as Alf points out, doesn’t do error bars yet.)

Another day, another dollar database

Does it matter that large quantities of this data will probably never be used? How could you find out how much, and which, data was “write-only”? Will Biologists ever catch up with the Physicists when it comes to Very Large stamp collections Databases? Biological databases are pretty big, but can you imagine handling up to 1,500 megabytes of data per second for ten years, as the Physicists will soon be doing? You can already hear the (arrogant?) Physicists taunting the Biologists, “my database is bigger than yours”. So there.
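To get a feel for the Physicists' numbers, here is a back-of-the-envelope sketch. It assumes the full 1,500 megabytes per second is sustained around the clock for all ten years (almost certainly an overestimate, so treat the result as an upper bound) and uses decimal units, both of which are my assumptions rather than anything the Physicists have stated.

```python
# Rough upper bound on the data volume quoted above.
# Assumptions (not from the source): continuous recording, 365-day years,
# decimal units (1 PB = 10**9 MB).
MB_PER_SECOND = 1_500
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365
YEARS = 10


def total_megabytes(rate_mb_s=MB_PER_SECOND, years=YEARS):
    """Total megabytes recorded at a constant rate over the given years."""
    return rate_mb_s * SECONDS_PER_DAY * DAYS_PER_YEAR * years


total_mb = total_megabytes()
total_pb = total_mb / 1_000_000_000
print(f"{total_mb:,} MB is roughly {total_pb:.0f} petabytes")
```

Under those assumptions the total comes to several hundred petabytes, which puts the "my database is bigger than yours" taunt in perspective.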

Whichever of these databases you are using, happy data mining in 2008. If you are lucky, the data tombs you are working will contain hidden treasure that will make you famous and/or rich. Maybe. Any stamp collector will tell you, some stamps can become very valuable. There’s Gold in them there hills databases you know…

[1] Galperin, M. Y. (2007). The molecular biology database collection: 2008 update. Nucleic Acids Research, Vol. 36, Database issue, pages D2-D4. DOI:10.1093/nar/gkm1037
[2] Fayyad, U. and Uthurusamy, R. (2002). Evolving data into mining solutions for insights. Communications of the ACM, 45(8):28-31. DOI:10.1145/545151.545174
This post originally published on nodalpoint (with comments)
Stamp collectors picture, top right, thanks to daxiang stef / stef yau

October 17, 2007

The Luxuriant Flowing Hair Club for Scientists (LFHCfS)

Filed under: Uncategorized — Duncan Hull @ 9:18 pm
Tags: Albert Einstein, funny, ignobel, improbable, LFHCfS, Marc Abrahams, Mendeleev

Calling all Scientists: is your hair luxuriant and flowing? Perhaps you’re a bouffant bioinformatician, a hairy hacker or share a lab with somebody who is? If this is you, it’s high time you joined the Luxuriant Flowing Hair Club for Scientists.

To propose somebody for membership, send an email to Marc Abrahams at Harvard University marca /ate/ chem2.harvard.edu. Your email needs to include evidence of your luxuriant, flowing hair (a photo) and your credentials as a scientist. Some current members have impressive hair; see Simon Gregory, Carlisle Landel and Sterling Paramore for examples. Honorary and historical members include Dr. Brian May (Queen guitarist / astrophysicist), Dmitri Mendeleev and Albert Einstein, “Physicist. Bon vivant. A bold experimentalist with hair”.

So, if you are a scientist with a copious coiffure, ask yourself, will you ever get another chance to be in such distinguished company?

September 5, 2007

Semantic Biomedical Mashups with Connotea

Filed under: Uncategorized — Duncan Hull @ 9:26 pm
Tags: Ben Good, bioinformatics, connotea, Elsevier, Entity Describer, Evilsevier, informatics, JBI, mashup, medical informatics, semantic web

The Journal of Biomedical Informatics (JBI) will soon be publishing its special issue on Semantic Biomedical Mashups (can you fit any more buzzwords into a Call For Papers?!). Ben Good and friends have submitted a paper on their Entity Describer, which extends Connotea using some Semantic Web goodness. They’d appreciate your comments on their submitted manuscript over at i9606. As Ben says, their pre-publication turns out to be an interesting experiment “figuring out how blogging might fit into the academic publishing landscape”. If this interests you, get commenting now!

Update: Just spotted this interesting graphic of the Elsevier / Evilsevier logo (snigger), who are the publishers of JBI…

April 13, 2007

Collaboration, collaboration, collaboration!

Filed under: Uncategorized — Duncan Hull @ 10:05 pm
Tags: bioinformatics, Bob Geldof, careers, collaboration, collaboratory, Phil Bourne, PNAS, publish or perish, Ten Simple Rules, tony blair

What should your three main priorities be as a Scientist? Collaboration, collaboration, collaboration. Quentin Vicens and Phil Bourne have just published Ten Simple Rules for a Successful Collaboration [1] to help you do just that, as part of a continuing series [2,3,4,5].

Tony Bliar once said “Ask me my three main priorities for government, and I tell you: education, education, education.” In Science, it’s not so much about education as collaboration, collaboration, collaboration. The advice in Ten Simple Rules is all useful stuff, but what caught my eye is the fact that collaboration is on the rise, at least according to the number of co-authors on papers published in PNAS. The average number of co-authors has risen from 3.9 in 1981 to 8.4 in 2001. So before you publish or perish, it seems likely that you’ll also need to collaborate or commiserate… less laboratory, more collaboratory!

Photo credit Garret Keogh

References

[1] Quentin Vicens and Philip Bourne (2007) Ten Simple Rules for a Successful Collaboration. PLOS Computational Biology
[2] Philip Bourne (2006) Ten Simple Rules for Getting Published. PLOS Computational Biology
[3] Philip Bourne and Iddo Friedberg (2006) Ten Simple Rules for Selecting a Postdoctoral Position. PLOS Computational Biology
[4] Philip Bourne and Leo Chalupa (2006) Ten Simple Rules for Getting Grants. PLOS Computational Biology
[5] Philip Bourne and Alon Korngreen (2006) Ten Simple Rules for Reviewers. PLOS Computational Biology
This post originally published on nodalpoint with comments

This work is licensed under a

Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

February 22, 2007

NSPNAS: Nature, Science or PNAS?

Filed under: publishing,Uncategorized — Duncan Hull @ 10:19 pm
Tags: H-index, NSPNAS, PNAS

A crude score for benchmarking scientists

Have you ever wanted to compare different scientists by their publication record? It’s not always an easy task, but here is a crude and handy way to benchmark people by their journal publications in Nature, Science or PNAS using PubMed. Let’s call it the NSPNAS score. It’s not the h-index and it’s far from perfect, but it can be useful.

Imagine these scenarios:

You’re a young scientist contemplating who to do an undergraduate project, Masters degree or PhD with.
You’ve finished your PhD and are wondering which lab could be your Stairway to PostDoc Heaven [1].
You’re lucky enough to have landed a faculty position and you want to check the credibility of your new colleagues.
You want to do some industrial espionage on your competitors in different labs around the world.
You’re a Scientist dammit, and naturally you’re a curious person who just likes to measure things.

In any of these situations, you’ll probably want to look up the people concerned using Google Scholar which will give you a good idea of their research history. But you’re not interested in publications in the Journal of Few Subscribers or the Proceedings of the Boring Incomprehensible Nonsense Society (BINS), even if Google Scholar lists hundreds of their citations. Instead, you care about counting the Big Bang impact publications they have in the über-journals: Nature, Science and PNAS. You can find these publications in PubMed with this simple query:

Surname+Initials[au]+(nature[journal] or science[journal] or Proc Natl Acad Sci U S A[journal])

…and you can obviously modify this query to include popular journals from your own field as appropriate.
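If you want to check more than a handful of people, the query can be assembled in a few lines of Python. This is only a rough sketch: the function names are made up for illustration, it spells out the AND/OR operators explicitly instead of using plus signs, and it assumes you would send the term to NCBI’s E-utilities ESearch service with rettype=count to get the number of hits back. It builds the URL but deliberately does not fetch it.

```python
from urllib.parse import urlencode

# Journal names exactly as they appear in the query above.
NSPNAS_JOURNALS = ("nature", "science", "Proc Natl Acad Sci U S A")

# NCBI E-utilities ESearch endpoint (assumed here as the way to count hits).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def nspnas_query(surname, initials, journals=NSPNAS_JOURNALS):
    """Build a PubMed search term for one author (hypothetical helper)."""
    author = f"{surname} {initials}[au]"
    journal_clause = " OR ".join(f"{name}[journal]" for name in journals)
    return f"{author} AND ({journal_clause})"


def esearch_url(term):
    """Build an ESearch URL asking only for the hit count (hypothetical helper)."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


term = nspnas_query("Kell", "D")
print(term)
print(esearch_url(term))
```

Swapping in journals from your own field is just a matter of passing a different journals tuple. The count that comes back still includes false positives for common names, so it is no less crude than doing the search by hand.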

Where NSPNAS works

Note: NSPNAS scores were correct at the time of writing in 2007, but will change over time.

When you substitute an author’s name and initials into the beginning of that query, you get their NSPNAS score. So Systems Biologist Douglas Kell for example, surname and initials “Kell+D[au]”, has an NSPNAS score of 6.

If the person in question has a unique or unusual surname and initials, it’s fairly easy to find their score: Nodalpointer Chris Mungall has an NSPNAS score of two while nodalpointer Jason Stajich has an NSPNAS score of three. These results suggest a positive correlation between Californian sunshine and NSPNAS. Meanwhile, back in rainy old Britain, Ensemblian Ewan Birney scores a formidable sixteen, which is just scary for a bloke in his thirties.

Where NSPNAS doesn’t work

Unfortunately, authors with common names like John Smith (who has more than 340 hits) can’t be easily benchmarked with this type of query, without trawling through hundreds of false positives. More importantly, some influential scientists score very low or zero, despite the fact that their work has been important in the world of biomedical science and beyond. This is especially true for Computer Scientists, Mathematicians and Informaticians, for example:

Some bloke called Tim (see picture, top right) scores a measly two and neither of these papers is particularly inspiring or highly cited. Contrary to popular belief, Tim didn’t invent the internet, but did play a leading rôle in the creation of the web. Can you imagine a world without it?
Googler Sergey Brin scores zero (once you exclude the false positive). But bioinformatics, and life generally, without search engines like Google is almost unimaginable. Sergey’s most heavily cited paper, co-authored with Larry Page, describes a prototype search engine called “Google”. This paper was first published at the seventh World Wide Web conference (WWW7) way back in 1998.
Googler Vint Cerf scores a pathetic one despite winning a Turing award (the Nobel Prize for Computer Science) for his co-invention of TCP/IP.
Stanford’s Mark Musen scores zero, but his Protégé Ontology Editor and its derivatives have been influential in biomedical informatics, and will probably play an important rôle in the creation of the next generation of biomedical web applications.
Leading mathematicians, such as Fields Medallists (the Nobel Prize for Mathematics) and winners of the Clay Millennium Prizes [2], typically score zero despite making fundamental, if indirect, contributions to biomedical science.
Desperate PERL hacker Larry Wall scores zero, but bioinformatics without PERL would be quite different.
This list is endless, so we’ll move on…

Many important members of the Dead Scientists Society also have low NSPNAS scores…

Edgar Codd scores zero, but can you imagine biomedical science without his relational database?
Edsger Dijkstra scores zero, but without him we’d probably still be taking the longest path to wherever we’re going.
Charles Darwin scores zero; PNAS didn’t even exist in his lifetime and both Nature and Science were in their infancy when he died.
Albert Einstein scores only one (and has even made it into PubMedCentral).
Alan Turing also scores zero, because none of his biomedical publications were in NSPNAS. Try to imagine science without Computers and Artificial Intelligence, because without Turing, bioinformatics and computational biology might not even exist at all…

Conclusions

All these statistics remind us that many important ideas, techniques and results are not published in Nature, Science or PNAS and others are excluded from the PubMed index completely. They also confirm what we already know about peer-reviewed Journal publications not being the be-all and end-all of Engineering, Science or Medicine [3]. But NSPNAS still has its uses, provided the people you’re benchmarking have a rare name and didn’t snuff it before the PubMed index starts.

What is your NSPNAS score? If, like me, you score a spectacular “nul points”, console yourself with the fact that you’re in good company with that score and, given time, maybe you can change it.

References

[1] Jimmy Page and Robert Plant (1971) Stairway to Heaven
[2] Most of the Clay Mathematics Institute Millennium Prizes are still up for grabs if you get disillusioned with bioinformatics and fancy some fame and a million dollar fortune!
[3] Michael Seringhaus and Mark Gerstein (2007) Publishing perishing? Towards tomorrow’s information architecture. BMC Bioinformatics 2007, 8:17 DOI:10.1186/1471-2105-8-17
This post originally on nodalpoint, with comments

January 5, 2007

NAR Database Issue 2007: Not Waving But Drowning?

Filed under: Uncategorized — Duncan Hull @ 10:43 pm
Tags: bioinformatics, data tombs, database, Lincoln Stein, Michael Galperin, NAR, Not waving but drowning, Open Access, OUP, Stevie Smith

The 14th annual Nucleic Acids Research (NAR) database issue 2007 has just been published, open-access. This year is the largest yet (again) with 968 molecular biology databases listed, 110 more than the previous one (see figure below). In the world of biological databases, are we waving or drowning?

Nine hundred and sixty eight is a lot of databases, and even that mind-boggling number is not an exhaustive or comprehensive tally. But is counting all these databases waving or drowning [1]? Will we ever stop stamp-collecting the databases and tools we have in molecular biology? What prompted this question is that an employee of The Boeing Company once told me they have given up counting their databases because there were just too many. Just think of all the databases of design and technical documentation that accompany the myriad of different aircraft that Boeing manufacture, like the iconic 747 jumbo jet. Now, combine that with all the supply chain, customer and employee information and you can begin to imagine the data deluge that a large multi-national corporation has to handle.

Like Boeing, in Biology we’ve clearly got more data than we know what to do with [2,3]. It won’t be news to bioinformaticians and it’s been said many times before, but it’s worth repeating here:

We know how many databases we have, but we don’t know what a lot of the data in these databases means. Think of all those mystery proteins of unknown function. It will obviously take time until we understand it all…
Most of the data only begins to make sense when it is integrated or mashed-up with other data. However, we still don’t know how to integrate all these databases, or as Lincoln Stein puts it “so far their integration has proved problematic” [4], a bit of an understatement. Many grandiose schemes for the “integration” of biological databases have been proposed over the years, but unfortunately none have been practical to the point of implementation [5].

Despite this, it is still useful to know how many molecular biology databases there are. At least we know how many databases we are drowning in. Thankfully, unlike Boeing, most biological data, algorithms and tools are open-source and more literature is becoming open access which will hopefully make progress more rapid. But biology is more complicated than a Boeing 747, so we’ve got a long-haul flight ahead of us. OK, I’ve managed to completely overstretch that aerospace analogy now so I’ll stop there.

Whatever databases you’ll be using in 2007, have a Happy New Year mining, exploring and understanding the data they contain, not drowning in it.

References

[1] Stevie Smith (1957) Not waving but drowning
[2] Michael Galperin (2007) The Molecular Biology Database Collection: 2007 update. Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1008
[3] Alex Bateman (2007) Editorial: What makes a good database? Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1051
[4] Lincoln Stein (2003) Biological Database Integration. Nature Reviews Genetics. 4 (5), 337-45. DOI:10.1038/nrg1065
[5] Michael Ashburner (2006) Keynote at the Pacific Symposium on Biocomputing (PSB2006) in Hawaii seeAlso Aloha: Biocomputing in Hawaii
This post originally published on nodalpoint with comments

December 19, 2006

Taverna 1.5.0

Filed under: Uncategorized — Duncan Hull @ 8:26 pm
Tags: bioinformatics, biomart, feta, myGrid, semantic web, taverna, workflow

Happy Christmas from the myGrid team, who are pleased to announce the release of version 1.5.0 of the Open Source Taverna bioinformatics workflow toolkit [1]. This is now available for download on the Sourceforge site and includes some substantial changes to version 1.4.

Taverna 1.5.0 is a small download, but when first run it will then download and install the required packages which can take some time on slow networks. In the near future there will be a mechanism for downloading a bundle of core packages. There are some significant changes in the underlying architecture of Taverna and how it handles core packages and optional plugins, using a system called Raven, see release notes below.

The documentation is currently being updated and the user documentation should be complete very soon, with the technical documentation following shortly afterwards. The reason for this is to allow the software to be released with some time to spare before the Christmas holidays.

Release notes:

There have been a number of substantial changes in the underlying architecture of Taverna since the previous release. These include:

An overhaul of the User Interface (UI), replacing the unpopular Multiple Document Interface with a cleaner and simpler single document UI which can be customised using Perspectives. There are built in perspectives to allow the design and enactment of workflows, and plugins can integrate with the UI by providing perspectives of their own. Together with this, users are able to create their own layouts built from individual components.
Taverna now allows for multiple workflows to be open and enacted at the same time.
Support for the new BioMart data management system version 0.5, together with backward compatibility for old workflows that used Biomart 0.4.
Better provenance generation and browsing support, through a plugin now known as LogBook.
Better support for semantic service discovery through the Feta plugin [2].
Modularisation of the Taverna code base.
Development and integration of an underlying architecture known as Raven. This allows for Apache Maven-like declaration of dependencies, which are discovered and incorporated into the Taverna system at runtime. Together with the modularisation of the Taverna code base, Raven gives the benefit that updates can be provided dynamically and incrementally, without the need for monolithic releases as in the past. This allows bug fixes and new features to be provided within a very short timescale if necessary. It also provides plugin developers with a greater degree of autonomy and independence from the core Taverna code base.
Improved and more advanced plugin management with the ability to provide immediate updates, and for plugin providers to publish their plugins via xml descriptions.
Numerous bug-fixes including the removal of a number of memory leaks.

JIRA generated release notes and bug status reports can be found here and here

References

December 1, 2006

NAR Web Server Issue: Walking in a Webby Wonderland

Filed under: Uncategorized — Duncan Hull @ 3:18 pm
Tags: bioinformatics, data tombs, Gary Benson, NAR, OUP, publish or perish, web, Wonderland

Have you recently built a bioinformatics web application useful to the wider community that you’d like to tell the world about? Are you also looking to score brownie points for a rigorously peer-reviewed publication that stands a reasonable chance of being well cited? If that’s you, then you have one month from today (December 1st) to sort your code out, and get your abstract in, for the fifth annual Nucleic Acids Research (NAR) Web Server issue published by Oxford University Press (OUP) in 2007. All articles in this issue are published under an open access model.

As regular visitors to nodalpoint will already know, every year NAR publishes two special issues: one on databases (annually in January since 1993) and the other on web servers (annually in July since 2003). Authors interested in pre-submitting abstracts for the 2007 Web Server Issue should read the Instructions to Authors for Web Server papers in NAR and send an abstract to Gary Benson at Boston University before December 31st 2006. The deadline for final submission of full articles is January 31st 2007. Gary Benson has taken over this year from previous web server issue editor, Nobel laureate and Ig Nobel participant, Richard Roberts [1].

One advantage of publishing your application paper in NAR, instead of alternative open access journals like Source Code for Biology and Medicine (SCFBM), is a listing in the bioinformatics links directory [2] and a bigger impact factor [3] of 7.6, if you care about these things. There are, of course, disadvantages of publishing with OUP in NAR, like the expensive open access publishing fees of $1185 to $2370 per article, which are of debatable value for money. If you’re living in a ‘List A’ developing country these charges are waived, which makes it tempting to set up a laboratory in Malawi to evade payment…

Anyway, does anyone out there know how OUP prices compare with the complicated Biomed Central membership fees which are presumably required for publication in SCFBM? Another leading open access publisher, the Public Library of Science (PLOS) currently charges from $2000 to $2500 for open access publication. Maybe I’m missing something, but aren’t these charges a lot of money to pay an administrator to shuffle a few bits of paper around and run a web server? Don’t let that put you off submitting your paper though, because in Science and academia you will either publish or perish. This is where the web is your friend because free online web availability substantially increases a paper’s impact.

On a lighter note, and now that the festive season is upon us, I’ll hand over to the Christmas crooner Perry Como to sign off:

♫ Sleigh bells ring, are you listening? In the lane, snow is glistening. A beautiful sight, We’re happy tonight, Walking in a webby wonderland. ♫

References
