October 27, 2017

Mirror, mirror on the wall, who is the most viewed of them all?

Wikipedia is a mirror that reflects the world around it. Sometimes the reflections are accurate; at other times they are distorted. [1] Either way, we can use the powerful analytics tools that are part of the platform to see which reflections are being looked at the most.

Two weeks ago, as part of Physiology Friday, I gave a talk examining how biographies of scientists are viewed in Wikipedia, using the crude measure of PageViews.
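For anyone who wants to try the same crude measure, here is a minimal sketch of how a request for page view counts could be put together. It assumes the public Wikimedia Pageviews REST API is the data source (the talk itself may have used other tools), and the function name, default project and example dates are purely illustrative.

```python
from urllib.parse import quote

# Assumption: the Wikimedia Pageviews REST API, which the post does not name.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia.org"):
    """Build a URL for daily page views of one article.

    start and end are dates written as YYYYMMDD strings.
    """
    # Article titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    # "all-access" covers desktop and mobile; "user" excludes known bots.
    return f"{BASE}/{project}/all-access/user/{title}/daily/{start}/{end}"


if __name__ == "__main__":
    # Hypothetical example: views of one article during October 2017.
    print(pageviews_url("Edinburgh Seven", "20171001", "20171027"))
```

Fetching the resulting URL should return JSON with one entry per day; summing the view counts gives a rough popularity figure for a biography.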

Melissa Highton from the University of Edinburgh also gave a talk about the Edinburgh Seven, changing the way stories are told and their Wikipedian in Residence scheme.

Our convenor, Andy Mabbett (normally found on a Brompton) gave a talk introducing Wikimedia since our reason for being there was to recruit and train new editors of Wikipedia.

Thanks to the Physiological Society for having us and Anisha Tailor for putting the program together.


  1. Samoilenko, Anna; Yasseri, Taha (2014). “The distorted mirror of Wikipedia: a quantitative analysis of Wikipedia coverage of academics”. EPJ Data Science. Springer Publishing. 3 (1). arXiv:1310.8508 doi:10.1140/epjds20


December 16, 2015

Review of 2015 @csmcr, anticipating 2016

2015 has been a busy year in the School of Computer Science at the University of Manchester (@csmcr). Here is a brief summary of some key activities during 2015 that will be of interest to employers and alumni, with a quick look ahead at what is coming in 2016 including:

  1. Industrial mentoring in software engineering
  2. Industrial experience
  3. Competitions & hackathons
  4. Guest lectures
  5. Careers fairs
  6. Research
  7. Alumni
  8. Keeping in touch

Find out more in our December 2015 newsletter.

Hackathons have continued to gain popularity during 2015 here in the UK, with lots of help from Major League Hacking (mlh.io). There are so many events to choose from that it isn’t always obvious which are the best (and why). The Economist published an interesting piece arguing that hackathons have entered the corporate mainstream and are no longer just for techies (hat tip to Antonio Marino), which sounds about right.

IMHO, generally hackathons are a good thing, especially for students, but there are some thorny issues around Intellectual Property and unpaid labour that many people brush under the carpet. So if you’re organising or attending a hackathon in 2016, make sure you are clear about who owns the IP.


July 6, 2009

Fabio Rinaldi on OntoGene

Fabio Rinaldi is currently visiting Manchester from the University of Zurich. He will be giving a seminar on Monday 6th July, the details of which are below.

Title: OntoGene in the BioNLP shared task and in BioCreative II.5

Speaker: Dr Fabio Rinaldi, University of Zurich

Date: Monday 6th July 2009

Time: 14:00

Location: Lecture Theatre – MLG.001, MIB building

Abstract: In this talk I will describe our participation in the BioNLP shared task and the BioCreative II.5 competitions [1]. Our approach is based on a common core: a pipeline of NLP tools and a dependency parser. The adaptation for the BioNLP shared task consisted of suitable input filters and a transformation-based approach which maps syntactic dependencies to event structures. Despite the very simple approach, results were satisfactory (34.78 F-score). The adaptation for BioCreative requires the detection and disambiguation of domain entities, while candidate interactions are proposed on the basis of a simple learning approach.

If time allows I will then describe our approach to finding the ‘focus organisms’ i.e. the organisms in which the experiments have been conducted or which are the source of the interacting proteins. This information is of crucial importance for the correct disambiguation of other entities mentioned in the article.


  1. Rinaldi, F., Kappeler, T., Kaljurand, K., Schneider, G., Klenner, M., Clematide, S., Hess, M., von Allmen, J., Parisot, P., Romacker, M., & Vachon, T. (2008). OntoGene in BioCreative II. Genome Biology, 9 (Suppl 2). DOI: 10.1186/gb-2008-9-s2-s13

June 19, 2008

Sixteen (Yes 16!) PhD studentships available in Computer Science

The School of Computer Science of the University of Manchester has up to 16 studentships to offer to highly motivated research students who wish to start a PhD in September 2008 (in exceptional circumstances the start date can be deferred until April 2009). The studentships pay tuition fees and a stipend to cover living expenses for 3 years.

In 2008/09, the stipend will be £12940 per year for students who were UK residents in the 3 years before the start of the PhD, or between £10352 and £12940 per year for students who were not UK residents in the same period and cannot demonstrate a relevant connection to the UK. The stipend is expected to rise in subsequent years. Because of conditions associated with this funding, these studentships are open to students eligible for home fees only; this includes UK and EU nationals.

January 18, 2008

One Thousand Databases High (and rising)

Well, it’s that time of year again. The 15th annual stamp collecting edition of the journal Nucleic Acids Research (NAR), also known as the 2008 Database issue [1], was published earlier this week. This year there are 1078 databases listed in the collection, 110 more than the previous one (see Figure 1). As we pass the one thousand databases mark (1kDB) I wonder, what proportion of the data in these databases will never be used?

R.I.P. Biological Data?

It seems highly likely that lots of this data is stored in what Usama Fayyad at Yahoo! Research! Laboratories! calls data tombs [2], because as he puts it:

“Our ability to capture and store data has far outpaced our ability to process and utilise it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again.”

Like last year, let’s illustrate the growth with an obligatory graph, see Figure 1.

Figure 1: Data growth: the ability to capture and store biological data has far outpaced our ability to understand it. Vertical axis is number of databases listed in Nucleic Acids Research [1], Horizontal axis is the year. (Picture drawn with Google Charts API which is OK but as Alf points out, doesn’t do error bars yet).

Another day, another dollar database

Does it matter that large quantities of this data will probably never be used? How could you find out how much, and which, data was “write-only”? Will Biologists ever catch up with the Physicists when it comes to Very Large stamp collections Databases? Biological databases are pretty big, but can you imagine handling up to 1,500 megabytes of data per second for ten years as the Physicists will soon be doing? You can already hear the (arrogant?) Physicists taunting the Biologists, “my database is bigger than yours”. So there.

Whichever of these databases you are using, happy data mining in 2008. If you are lucky, the data tombs you are working in will contain hidden treasure that will make you famous and/or rich. Maybe. As any stamp collector will tell you, some stamps can become very valuable. There’s Gold in them there hills databases you know…

  1. Galperin, M. Y. (2007). The molecular biology database collection: 2008 update. Nucleic Acids Research, Vol. 36, Database issue, pages D2-D4. DOI:10.1093/nar/gkm1037
  2. Fayyad, U. and Uthurusamy, R. (2002). Evolving data into mining solutions for insights. Communications of the ACM, 45(8):28-31. DOI:10.1145/545151.545174
  3. This post originally published on nodalpoint (with comments)
  4. Stamp collectors picture, top right, thanks to daxiang stef / stef yau

October 17, 2007

The Luxuriant Flowing Hair Club for Scientists (LFHCfS)

Falk Schuch, Andreas Linsner and Kai Jung
Calling all Scientists, is your hair luxuriant and flowing? Perhaps you’re a bouffant bioinformatician, a hairy hacker or share a lab with somebody who is? If this is you, it’s high time you joined the Luxuriant Flowing Hair Club for Scientists.

To propose somebody for membership, send an email to Marc Abrahams at Harvard University marca /ate/ chem2.harvard.edu. Your email needs to include evidence of your luxuriant, flowing hair (a photo) and your credentials as a scientist. Some current members have impressive hair, see Simon Gregory, Carlisle Landel and Sterling Paramore for examples. Honorary and historical members include Dr. Brian May (Queen guitarist / astrophysicist), Dmitri Mendeleev and Albert Einstein, “Physicist. Bon vivant. A bold experimentalist with hair”.

So, if you are a scientist with a copious coiffure, ask yourself, will you ever get another chance to be in such distinguished company?

September 5, 2007

Semantic Biomedical Mashups with Connotea

Mashup or Shutup

The Journal of Biomedical Informatics (JBI) will soon be publishing its special issue on Semantic Biomedical Mashups (can you fit any more buzzwords into a Call For Papers?!). Ben Good and friends have submitted a paper on their Entity Describer, which extends Connotea using some Semantic Web goodness. They’d appreciate your comments on their submitted manuscript over at i9606. As Ben says, their pre-publication turns out to be an interesting experiment “figuring out how blogging might fit into the academic publishing landscape”. If this interests you, get commenting now!

Update: Just spotted this interesting graphic of the Elsevier / Evilsevier logo (snigger), who are the publishers of JBI…

April 13, 2007

Collaboration, collaboration, collaboration!

What should your three main priorities be as a Scientist? Collaboration, collaboration, collaboration. Quentin Vicens and Phil Bourne have just published Ten Simple Rules for a Successful Collaboration [1] to help you do just that, as part of a continuing series [2,3,4,5].

Tony Bliar once said “Ask me my three main priorities for government, and I tell you: education, education, education.” In Science, it’s not so much about education as collaboration, collaboration, collaboration. The advice in Ten Simple Rules is all useful stuff, but what caught my eye is the fact that collaboration is on the rise, at least according to the number of co-authors on papers published in PNAS. The average number of co-authors has risen from 3.9 in 1981 to 8.4 in 2001. So before you publish or perish, it seems likely that you’ll also need to collaborate or commiserate… less laboratory, more collaboratory!

Photo credit Garret Keogh


  1. Quentin Vicens and Philip Bourne (2007) Ten Simple Rules for a Successful Collaboration PLOS Computational Biology
  2. Philip Bourne (2006) Ten Simple Rules for Getting Published PLOS Computational Biology
  3. Philip Bourne and Iddo Friedberg (2006) Ten Simple Rules for Selecting a Postdoctoral Position PLOS Computational Biology
  4. Philip Bourne and Leo Chalupa (2006) Ten Simple Rules for Getting Grants PLOS Computational Biology
  5. Philip Bourne and Alon Korngreen (2006) Ten Simple Rules for Reviewers PLOS Computational Biology
  6. This post originally published on nodalpoint with comments

February 22, 2007

NSPNAS: Nature, Science or PNAS?

A crude score for benchmarking scientists

Have you ever wanted to compare different scientists by their publication record? It’s not always an easy task, but here is a crude and handy way to benchmark people by their journal publications in Nature, Science or PNAS using PubMed. Let’s call it the NSPNAS score. It’s not the h-index and it’s far from perfect, but it can be useful.

Imagine these scenarios:

  1. You’re a young scientist contemplating who to do an undergraduate project, Masters degree or PhD with.
  2. You’ve finished your PhD and are wondering which lab could be your Stairway to PostDoc Heaven [1].
  3. You’re lucky enough to have landed a faculty position and you want to check the credibility of your new colleagues.
  4. You want to do some industrial espionage on your competitors in different labs around the world.
  5. You’re a Scientist dammit, and naturally you’re a curious person who just likes to measure things.

In any of these situations, you’ll probably want to look up the people concerned using Google Scholar which will give you a good idea of their research history. But you’re not interested in publications in the Journal of Few Subscribers or the Proceedings of the Boring Incomprehensible Nonsense Society (BINS), even if Google Scholar lists hundreds of their citations. Instead, you care about counting the Big Bang impact publications they have in the über-journals: Nature, Science and PNAS. You can find these publications in PubMed with this simple query:

Surname +Initials[au]+(nature[journal] or science[journal] or Proc Natl Acad Sci U S A[journal])

…and you can obviously modify this query to include popular journals from your own field as appropriate.
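As a rough sketch, the query above could be generated for any author and any list of journals like this. The function names are made up for illustration, the explicit AND/OR operators are my spelling of the query rather than a copy of it, and the use of the NCBI E-utilities esearch endpoint to get a hit count is an assumption, since the post only describes typing the query into PubMed by hand.

```python
from urllib.parse import urlencode

# Journals counted by the score; swap in popular journals from your own field.
JOURNALS = ("nature", "science", "Proc Natl Acad Sci U S A")


def nspnas_query(surname, initials, journals=JOURNALS):
    """Build the PubMed search term for one author (hypothetical helper)."""
    journal_clause = " OR ".join(f"{name}[journal]" for name in journals)
    return f"{surname} {initials}[au] AND ({journal_clause})"


def esearch_url(term):
    """Wrap a search term in an E-utilities esearch URL that asks for a count.

    Assumption: the NCBI E-utilities endpoint; nothing is fetched here.
    """
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    term = nspnas_query("Kell", "D")
    print(term)
    print(esearch_url(term))
```

The number of hits reported for that URL would be the NSPNAS score, with the same caveats about common names and false positives that apply to the hand-typed query.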

Where NSPNAS works

Note, NSPNAS scores were correct at the time of writing in 2007, but will change over time.

When you substitute an author’s name and initials into the beginning of that query, you get their NSPNAS score. So Systems Biologist Douglas Kell for example, surname and initials “Kell+D[au]”, has an NSPNAS score of 6.

If the person in question has a unique or unusual surname and initials, it’s fairly easy to find their score: Nodalpointer Chris Mungall has an NSPNAS score of two while nodalpointer Jason Stajich has an NSPNAS score of three. These results suggest a positive correlation between Californian sunshine and NSPNAS. Meanwhile, back in rainy old Britain, Ensemblian Ewan Birney scores a formidable sixteen, which is just scary for a bloke in his thirties.

Where NSPNAS doesn’t work

Unfortunately, authors with common names like John Smith (who has more than 340 hits) can’t be easily benchmarked with this type of query, without trawling through hundreds of false positives. More importantly, some influential scientists score very low or zero, despite the fact that their work has been important in the world of biomedical science and beyond. This is especially true for Computer Scientists, Mathematicians and Informaticians, for example:

Many important members of the Dead Scientists Society also have low NSPNAS scores…


All these statistics remind us that many important ideas, techniques and results are not published in Nature, Science or PNAS and others are excluded from the PubMed index completely. They also confirm what we already know about peer-reviewed Journal publications not being the be-all and end-all of Engineering, Science or Medicine [3]. But NSPNAS still has its uses, provided the people you’re benchmarking have a rare name and didn’t snuff it before the PubMed index starts.

What is your NSPNAS score? If, like me, you score a spectacular “nul points”, console yourself with the fact that you’re in good company with that score and, given time, maybe you can change it.


  1. Jimmy Page and Robert Plant (1971) Stairway to Heaven
  2. Most of the Clay Mathematics Institute Millennium Prizes are still up for grabs if you get disillusioned with bioinformatics and fancy some fame and a million-dollar fortune!
  3. Michael Seringhaus and Mark Gerstein (2007) Publishing perishing? Towards tomorrow’s information architecture BMC Bioinformatics 2007, 8:17 DOI:10.1186/1471-2105-8-17
  4. This post originally on nodalpoint, with comments

January 5, 2007

NAR Database Issue 2007: Not Waving But Drowning?

The 14th annual Nucleic Acids Research (NAR) database issue 2007 has just been published, open-access. This year is the largest yet (again) with 968 molecular biology databases listed, 110 more than the previous one (see figure below). In the world of biological databases, are we waving or drowning?

NAR Database Growth 2007

Nine hundred and sixty eight is a lot of databases, and even that mind-boggling number is not an exhaustive or comprehensive tally. But is counting all these databases waving or drowning [1]? Will we ever stop stamp-collecting the databases and tools we have in molecular biology? What prompted this question is that an employee of The Boeing Company once told me they had given up counting their databases because there were just too many. Just think of all the databases of design and technical documentation that accompany the myriad different aircraft that Boeing manufactures, like the iconic 747 jumbo jet. Now, combine that with all the supply chain, customer and employee information and you can begin to imagine the data deluge that a large multi-national corporation has to handle.

Like Boeing, in Biology we’ve clearly got more data than we know what to do with [2,3]. It won’t be news to bioinformaticians and it’s been said many times before, but it’s worth repeating here:

  • We know how many databases we have but we don’t know what a lot of the data in these databases means: think of all those mystery proteins of unknown function. It will obviously take time until we understand it all…
  • Most of the data only begins to make sense when it is integrated or mashed-up with other data. However, we still don’t know how to integrate all these databases, or as Lincoln Stein puts it “so far their integration has proved problematic” [4], a bit of an understatement. Many grandiose schemes for the “integration” of biological databases have been proposed over the years, but unfortunately none have been practical to the point of implementation [5].

Despite this, it is still useful to know how many molecular biology databases there are. At least we know how many databases we are drowning in. Thankfully, unlike Boeing, most biological data, algorithms and tools are open-source and more literature is becoming open access which will hopefully make progress more rapid. But biology is more complicated than a Boeing 747, so we’ve got a long-haul flight ahead of us. OK, I’ve managed to completely overstretch that aerospace analogy now so I’ll stop there.

Whatever databases you’ll be using in 2007, have a Happy New Year mining, exploring and understanding the data they contain, not drowning in it.


  1. Stevie Smith (1957) Not waving but drowning
  2. Michael Galperin (2007) The Molecular Biology Database Collection: 2007 update Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1008
  3. Alex Bateman (2007) Editorial: What makes a good database? Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1051
  4. Lincoln Stein (2003) Biological Database Integration Nature Reviews Genetics. 4 (5), 337-45. DOI:10.1038/nrg1065
  5. Michael Ashburner (2006) Keynote at the Pacific Symposium on Biocomputing (PSB2006) in Hawaii, see also Aloha: Biocomputing in Hawaii
  6. This post originally published on nodalpoint with comments

