A quick search on pubmed.gov today reveals that the freely available American database of biomedical literature has just passed the 20 million citations mark*. Should we celebrate or commiserate passing this landmark figure? Is it a triumph or a tragedy that PubMed® is the size it is?
PubMed triumphs
Let’s start with the reasons to celebrate the triumphantly relentless growth of PubMed. Crack open the champagne!
- A central index freely available globally: Many biomedical scientists probably take PubMed for granted, but try to imagine biology and medicine without it – we would struggle to find anything. Unlike other bibliographic indexes (Scopus and ISI WOK etc), PubMed is freely available to anyone, anywhere with an internet connection – it is an essential scientific service that many depend on every day to do their research.
- Twenty million citations: That’s a lot of data and it’s growing at a rate of about one paper per minute (on average). This kind of big data can lead to big discoveries, bigger data could mean bigger discoveries – hopefully. Data can be unreasonably effective [1] so the more the merrier.
- More than a billion searches in 2009: That’s an average of about 3.5 million searches per day, or 40 searches per second. Around 767 million of these queries were done interactively by users on the web …
- Entrez Utilities: … the other 514 million of those 1.3 billion searches were executed programmatically by machines rather than people. None of this web-enabled goodness would be possible without the reliable and successful Entrez Utilities services – which allow the data to be easily re-used by other software. Lots of useful applications have been built this way.
- A treasure trove of weird and wonderful things: A quick browse through NCBI Rolling On the Floor Laughing (ROFL) reveals all manner of strange reports indexed by PubMed (alongside the regular serious stuff).
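For anyone who hasn't tried the Entrez Utilities mentioned above, here is a minimal sketch of what one of those programmatic searches looks like. It only builds the request URL for the ESearch service and does not send it. The function name `build_esearch_url` and the example search term are my own illustrative choices, not anything NCBI prescribes.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Return an ESearch URL asking PubMed for records matching `term`.

    `build_esearch_url` is a hypothetical helper for illustration only.
    """
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # the query, as you would type it on the web
        "retmax": retmax,    # maximum number of PubMed IDs to return
        "retmode": "json",   # ask for JSON rather than the default XML
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("author name disambiguation", retmax=5)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) returns a list of matching PubMed IDs. If you do this in bulk, check NCBI's usage guidelines first so you don't hammer their servers.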
PubMed tragedies
But there are also some reasons to commiserate the tragically relentless growth of PubMed. Pass round the anti-depressants…
- PubMed is too big and full of noise: Theodore Sturgeon’s law states that 90% of everything is rubbish. If correct, this means around 18 million records in PubMed are worthless junk. But that won’t stop them cluttering up the database and your search results, making it harder to find what you want when you need it. Many of the papers indexed by PubMed are “salami-sliced” by publication-hungry scientists into the least publishable unit and are of little or no actual scientific value. Cameron Neylon calls this the discovery deficit, but however you describe it, finding the information you need in PubMed can be frustratingly difficult (or impossible), despite the redesigns. There is so much in PubMed that it is impossible to keep up.
- PubMed is too small: Some people argue that an overly conservative indexing and editorial policy prevents PubMed from including lots of biomedically relevant literature that is published in physics, chemistry, mathematics, engineering and computer science journals. Currently much of this data is excluded from the database. Actually, what we really need is PubSCIENCE (covering non-medical sciences) but that idea got tragically axed back in 2002.
- Identity crisis, ambiguous authors: One of the most useful ways to navigate the mountain of information that is PubMed is not to search by journal(s) or by keyword(s) but to search by author(s). Authors like Barack Obama are easy to find (because of their unique name) but poor John Smith (and many others like him) are much harder to find. A recent study has estimated that almost two thirds of authors in PubMed have ambiguous names [2] – where their last name and first initial are shared with one or more other authors. Another recent study has shown that search by author is one of the three most frequent types of searches on PubMed [3] but unfortunately the precision and recall of these searches are typically poor due to ambiguous authors. This isn’t just a problem for PubMed, but for scientific publishing generally. Hopefully ORCID (or something like it) will solve that problem one day…
- Identity crisis, missing document identifiers: There are over forty million unique document IDs in the form of DOIs. They are a useful way to uniquely identify papers on the Web and link directly to their full content wherever they were originally published. But you might have trouble using DOIs in PubMed. Sometimes DOIs get left out of records altogether (see some random examples here). When they are included, they can get buried and are not very accessible. For example this record has a DOI but you won’t find it anywhere in the default page served by PubMed, which means you can’t easily click through to the full text of the article that the DOI would take you to. This means PubMed is not as well integrated with other databases as it could and should be.
- Mostly abstracts only: PubMed has 20 million freely available abstracts rather than 20 million full text papers. Imagine how the rate of scientific discovery and invention might increase (and the cost might decrease) if it were PubMed Central that had 20 million citations instead of just PubMed. Alas, PubMed Central is currently closer to the 2 million mark than the 20 million mark, but it is growing rapidly thanks to deposition mandates and open access publishing.
- Ranking results: By default PubMed ranks search results by date – but if Google did the same, very few people would bother to use it. Ranking results by relevance, using an algorithm more like PageRank, would be much more useful to many users, as demonstrated by Pierre Lindenbaum.
- Text mining and ontologies: We’ve still a long way to go before fully exploiting the possibilities offered by text-mining and ontologies to allow PubMed users to semantically search and browse the data. MeSH is just the beginning but that’s another story…
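To make the ambiguous-author problem from the list above concrete, here is a small sketch of the "last name plus first initial" test. The author names are toy data I made up, and `ambiguous_fraction` is a hypothetical helper, not the method used in the study cited as [2]. It assumes every author has a non-empty first name.

```python
from collections import defaultdict


def ambiguous_fraction(authors):
    """Fraction of authors whose (last name, first initial) key is shared.

    `authors` is an iterable of (last_name, first_name) pairs, one per
    distinct person. Two people with identical full names would be
    collapsed into one here, so this understates the real problem.
    """
    groups = defaultdict(set)
    for last, first in authors:
        key = (last.lower(), first[0].lower())  # e.g. ("smith", "j")
        groups[key].add((last, first))

    total = sum(len(people) for people in groups.values())
    shared = sum(len(people) for people in groups.values() if len(people) > 1)
    return shared / total if total else 0.0


# Toy data, invented for illustration only.
toy_authors = [
    ("Smith", "John"),
    ("Smith", "Jane"),
    ("Smith", "Adam"),
    ("Obama", "Barack"),
]

fraction = ambiguous_fraction(toy_authors)
print(fraction)  # John and Jane Smith collide on "Smith J", so 2 of 4
```

A search for "Smith J" cannot tell John from Jane, which is why author searches suffer in both precision and recall.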
So should we celebrate or commiserate passing the 20 million mark in PubMed®? The triumphs far outweigh the tragedies, many of which are either beyond the control of PubMed or can be sorted out (hint hint: anyone from NCBI reading this?). PubMed represents a substantial fourteen years of work, which continues to bring significant benefits to many scientists around the world. There is plenty of room for improvement, but it’s hard to imagine Life® without PubMed®.
The complete catalogue of PubMed triumphs and tragedies is much longer than the above list, so if you think I missed any important ones, please leave a comment below.
References
1. Alon Halevy, Peter Norvig, & Fernando Pereira (2009). The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24 (2), 8-12 DOI: 10.1109/MIS.2009.36
2. Vetle Torvik & Neil Smalheiser (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3 (3) PMID: 20072710
3. Rezarta Islamaj Dogan, G. Craig Murray, Aurelie Neveol, & Zhiyong Lu (2009). Understanding PubMed user search behavior through log analysis. Database: the journal of biological databases and curation, 2009 PMID: 20157491
* These statistics were correct at the time of writing in July 2010 but will rapidly change over time. See more commentary on this piece over at friendfeed.
I think it’s a tragedy for not exploiting all the information already in there. As you say, text mining has not really been used to its fullest. Simple applications using these methods can spark ideas and experiments that can result in big discoveries. All we have to do is look at what we already have.
Comment by Paul Fisher — July 27, 2010 @ 5:29 pm |
Duncan, thanks a lot for the nice summary. I completely agree with your analysis, especially the part about ranking results and the DOI. Ranking results by date is almost useless; services like Scopus at least allow ranking by citation count.
Comment by Martin Fenner — July 27, 2010 @ 7:30 pm |
Martin, I’m glad somebody shares my PubMed pain! I feel a lot better now I’ve aired my grievances in public…
Comment by Duncan — July 27, 2010 @ 10:27 pm |
One problem with PubMed seems to be “Not Invented Here” syndrome. That sometimes makes it difficult for them to integrate with others, e.g. the DOI.
Something else I like about PubMed, or rather PubMed Central, is the NLM-DTD, an XML standard for storing and displaying papers.
Comment by Martin Fenner — July 27, 2010 @ 11:17 pm
I bet all 514M e-utils searches were Fisher with his workflows.
Comment by Paul Dobson — July 27, 2010 @ 8:59 pm |
Paul, it can’t be Fisher’s workflows, he’s probably been blacklisted by the NCBI by now for hammering their servers.
Comment by Duncan — July 27, 2010 @ 10:26 pm |
Not just yet
Comment by Paul Fisher — July 28, 2010 @ 2:06 pm
I’ve been using Medline for almost twenty years as a Health Research Analyst. HOW a Health Research Analyst could even HOPE to exist WITHOUT the access to these far too few medical studies I cannot even imagine. ANYONE who argues there to be TOO MANY medical studies is OBVIOUSLY **incapable** OF health research analysis. Thusly the opinions expressed are of no consequence since Medline exists SOLELY FOR health research analysis. Imho.
Comment by Tom Hennessy — July 28, 2010 @ 12:39 pm |
I completely disagree – being a bioinformatician that has no interest in health research analysis at all.
Comment by Paul Fisher — July 28, 2010 @ 1:01 pm |
…..or the herbivorous status of Human beings: http://network.nature.com/profile/ironjustice .
Comment by Paul Fisher — July 28, 2010 @ 1:17 pm
YOU would make the PERFECT ‘Health Research Analyst’ in that you will go in with NO ‘preconceived notions’. The PROBLEM though for YOU is the FACT what **information** AVAILABLE to you is the ONLY way you can do the job. IF the information fed to you is INCORRECT then the ONLY conclusion you can come to will be false or “inconclusive” or or or .. ?
IF you could be guaranteed the information given to you is not false then and only then can you be confident IN your area of expertise.
Linus Pauling discovered it seems to be increased oxidation in man which leads to most if not all disease.
The theory of anti-oxidants and oxidation is shared by many.
I simply say I have found iron to be the defining factor in oxidation.
Iron rusts.
Simple.
Prove me wrong.
Comment by Tom Hennessy — July 29, 2010 @ 1:28 am
Duncan’s tongue-in-cheek comment about size is clearly about quality and the consequences for searching of “paper machine” researchers cluttering up good work with low quality nonsense. This he makes explicit later on. Therefore the opinions expressed are of tremendous relevance to health research analysts, who I’m sure dislike nonsensical hits as much as the rest of the very many other flavours of people who use Medline (it **really** doesn’t just exist for health research analysis).
Comment by Paul Dobson — July 28, 2010 @ 1:13 pm |
Quote: researchers cluttering up good work with low quality nonsense
Answer: One could argue a researcher eating fly spit and earning himself a Nobel Prize might be one of the articles YOU ‘may’ have considered to BE of “low quality” seeing everyone else did.
Comment by Tom Hennessy — July 28, 2010 @ 10:51 pm |
When I say ‘low quality’ I’m talking about poor experimental design, obviously manipulated stats, ludicrous conclusions that aren’t supported by evidence – not off-the-wall, unexpected research, which is fine in my book so long as it is well executed.
It seems to me that Paul Fisher doesn’t necessarily disagree with your iron findings but the extrapolation from that to humans being herbivores, which most people with incisors will find a bit of a stretch. You might like this paper, also about the major role of iron in disease…
http://www.biomedcentral.com/1755-8794/2/2/
Comment by Paul Dobson — July 29, 2010 @ 10:23 am |
I find your hypothesis about iron being involved in disease perfectly plausible – but, I have to add that you have taken this FAR beyond the realm OF scientific discourse by stating that you ALONE have: “….found iron to be the defining factor in oxidation”. I have absolutely no idea what you wish me to say about iron rusting, and how this credibly backs up your hypothesis – which it does not. Taking this further to say that iron rusts; well, I have to say that I don’t think the iron in my body rusts – for one thing, I’m slightly pinkish and not a red-brown colour. I have a sore throat at the moment, and am clearly in a diseased state (being INFECTED by some respiratory pathogen). My body IS obviously mounting an immune response using oxidative stress as a key component of this process – hence the sore feeling in my throat as a result of cellular and tissue damage. Yet I am still pink. Iron may BE involved at some level IN the process of oxidative stress, by DONATING and accepting electrons, but I very much doubt that the iron in my blood will undergo the same process of oxidation as you say in your above comment. I think you may have misinterpreted the process of oxidation and oxidative stress.
I wholeheartedly agree with Dr. Dobson and say you have not provided any clear evidence that the process of oxidative stress makes us all herbivores. I would also hasten to add that humans evolved: forward facing eyes as a means of judging distances to prey (not vegetables); incisors as a means of consuming such prey; and a complex digestive system that actively breaks down animal tissue into the necessary parts our bodies need. If indeed your conclusions are correct, and we are in fact herbivores, I have to say Darwin’s daft idea on so-called ‘Evolution’ was wrong. I am inclined to think, however, that your hypothesis is based on a biased view, since you are yourself (please correct me if I’m wrong) a vegetarian.
Can I ALSO add THAT I find your USE of UPPERCASE text extremely annoying and patronising. It does not make a point any more than lowercase text, and it shows how you mean to undermine what I have written. I have added such text into my own response to show how useless it can be.
Comment by Paul Fisher — July 29, 2010 @ 5:27 pm
Quote: extremely annoying and patronising
Answer: You ASSUME the “case of the letters” mean something untoward you ?
You assume too much . It is called **emphasis** and has been used that way for quite some time. IF you are LOOKING for trouble pal I will OBLIGE .. pal.
NOW do you think you UNDERSTOOD that CORRECTLY ? I bet you did.
Comment by Tom Hennessy — July 29, 2010 @ 9:01 pm |
Er… that’s the first flame war we’ve had at O’Really? Thanks guys! Seems to have gone a little off topic though?
Comment by Duncan — July 30, 2010 @ 11:03 am |
Indeed.
Comment by Paul Fisher — July 30, 2010 @ 12:38 pm
Please don’t call me pal. I’m neither your friend nor associate. I’d hate to be associated with someone like you.
Comment by Paul Fisher — August 11, 2010 @ 1:37 pm |