A quick search on pubmed.gov today reveals that the freely available American database of biomedical literature has just passed the 20 million citations mark*. Should we celebrate or commiserate passing this landmark figure? Is it a triumph or a tragedy that PubMed® is the size it is?
Let’s start with the reasons to celebrate the triumphantly relentless growth of PubMed. Crack open the champagne!
- A central index freely available globally: Many biomedical scientists probably take PubMed for granted, but try to imagine biology and medicine without it – we would struggle to find anything. Unlike other bibliographic indexes (Scopus, ISI Web of Knowledge, etc.), PubMed is freely available to anyone, anywhere with an internet connection – it is an essential scientific service that many depend on every day to do their research.
- Twenty million citations: That’s a lot of data, and it’s growing at a rate of about one paper per minute (on average). This kind of big data can lead to big discoveries, and bigger data could mean bigger discoveries – hopefully. Data can be unreasonably effective, so the more the merrier.
- More than a billion searches in 2009: That’s an average of around 3.5 million searches per day, or 40 searches per second. Around 767 million of these queries were entered interactively by users on the web …
- Entrez Utilities: … while the other 514 million of those ~1.3 billion searches were executed programmatically, by machines rather than people. None of this web-enabled goodness would be possible without the reliable and successful Entrez Utilities services, which allow the data to be easily re-used by other software. Lots of useful applications have been built this way.
- A treasure trove of weird and wonderful things: A quick browse through NCBI Rolling On the Floor Laughing (ROFL) reveals all manner of strange reports indexed by PubMed (alongside the regular serious stuff).
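To give a flavour of what those machine-executed queries look like, here is a minimal Python sketch that builds an Entrez esearch URL for the PubMed database. The endpoint and the `db`/`term`/`retmax` parameters come from the E-utilities service; the example query itself is just an illustration, not anything from this post:

```python
from urllib.parse import urlencode

# Base URL of the NCBI Entrez E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=20):
    """Build an esearch query URL against the PubMed database.

    Fetching the resulting URL returns XML listing the PMIDs
    that match `term` (up to `retmax` of them).
    """
    params = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{params}"

# An illustrative query: MeSH term plus a date-of-publication field tag.
url = build_esearch_url("apoptosis[mh] AND 2009[dp]")
```

Real applications calling the service in anger should also identify themselves with the `tool` and `email` parameters, as NCBI’s usage guidelines ask.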
But there are also some reasons to commiserate the tragically relentless growth of PubMed. Pass round the anti-depressants…
- PubMed is too big and full of noise: Theodore Sturgeon’s law states that 90% of everything is rubbish. If correct, this means around 18 million records in PubMed are worthless junk – but that won’t stop them cluttering up the database and your search results, making it harder to find what you want when you need it. Many of the papers indexed by PubMed are “salami-sliced” by publication-hungry scientists into the least publishable unit and are of little or no actual scientific value. Cameron Neylon calls this the discovery deficit, but however you describe it, finding the information you need in PubMed can be frustratingly difficult (sometimes impossible) – despite the redesigns. There is so much in PubMed that it is impossible to keep up.
- PubMed is too small: Some people argue that an overly conservative indexing and editorial policy prevents PubMed from including much biomedically relevant literature published in physics, chemistry, mathematics, engineering and computer science journals; currently much of this literature is excluded from the database. What we really need is PubSCIENCE (covering the non-medical sciences), but that idea was tragically axed back in 2002.
- Identity crisis, ambiguous authors: One of the most useful ways to navigate the mountain of information that is PubMed is to search not by journal(s) or keyword(s) but by author(s). Authors like Barack Obama are easy to find (because of their unique name), but poor John Smith (and many others like him) is much harder to find. A recent study has estimated that almost two thirds of authors in PubMed have ambiguous names – where their last name and first initial are shared with one or more other authors. Another recent study has shown that search by author is one of the three most frequent types of search on PubMed, but unfortunately the precision and recall of these searches are typically poor because of ambiguous authors. This isn’t just a problem for PubMed, but for scientific publishing generally. Hopefully ORCID (or something like it) will solve the problem one day…
- Identity crisis, missing document identifiers: There are over forty million unique document IDs in the form of DOIs. They are a useful way to uniquely identify papers on the Web and link directly to their full content wherever it was originally published, but you might have trouble using DOIs in PubMed. Sometimes DOIs get left out of records altogether (see some random examples here). When they are included, they can get buried and are not very accessible: this record, for example, has a DOI, but you won’t find it anywhere in the default page served by PubMed, which means you can’t easily click through to the full text of the article that the DOI would take you to. Even simple URL identifiers get broken in PubMed (though it’s not always PubMed’s fault). What all this means is that PubMed is not as well integrated with other databases as it could and should be.
- Mostly abstracts only: PubMed has 20 million freely available abstracts rather than 20 million full-text papers. Imagine how the rate of scientific discovery and invention might increase (and the cost might decrease) if it were PubMed Central that had 20 million citations instead of just PubMed. Alas, PubMed Central is currently closer to the 2 million mark than the 20 million mark, but it is growing rapidly thanks to deposition mandates and open access publishing.
- Ranking results: By default PubMed ranks search results by date – but if Google did the same, very few people would bother to use it. Ranking results by relevance, using an algorithm more like PageRank, would be much more useful to many users, as demonstrated by Pierre Lindenbaum.
- Text mining and ontologies: We’ve still a long way to go before fully exploiting the possibilities offered by text-mining and ontologies to allow PubMed users to semantically search and browse the data. MeSH is just the beginning but that’s another story…
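The ambiguous-author problem above is easy to demonstrate in a few lines of Python: PubMed has traditionally indexed authors as last name plus first initial, so distinct people can collapse onto the same key. The names below are of course hypothetical, chosen purely to illustrate the collision:

```python
from collections import defaultdict

def ambiguity_groups(authors):
    """Group (surname, forename) pairs by the 'surname + first initial'
    key that PubMed has traditionally indexed; any key shared by more
    than one distinct person is ambiguous."""
    groups = defaultdict(set)
    for surname, forename in authors:
        key = f"{surname} {forename[0]}".lower()
        groups[key].add((surname, forename))
    return groups

# Hypothetical authors: two different J. Smiths collide on the same
# key, while a distinctive name remains unambiguous.
authors = [
    ("Smith", "John"),
    ("Smith", "Jane"),
    ("Obama", "Barack"),
]
groups = ambiguity_groups(authors)
ambiguous = {key for key, people in groups.items() if len(people) > 1}
```

An `[au]` search for “smith j” retrieves papers from every person behind that key – which is exactly why precision and recall suffer.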
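And on missing document identifiers: when a DOI is present in a PubMed record at all, it lives in the ArticleIdList of the XML that efetch returns. Here is a minimal sketch of digging it out – the XML fragment is trimmed for illustration, and the identifiers in it are made up (10.1000/xyz123 is the DOI Handbook’s own example DOI):

```python
import xml.etree.ElementTree as ET

# A trimmed, illustrative fragment of a PubMed record as returned by
# efetch; the real response nests this inside <PubmedArticleSet>, and
# the IDs here are placeholders, not a real record.
SAMPLE = """
<PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">12345678</ArticleId>
      <ArticleId IdType="doi">10.1000/xyz123</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
"""

def extract_doi(record_xml):
    """Return the DOI from a PubMed record's ArticleIdList, or None
    if the record has no ArticleId with IdType="doi"."""
    root = ET.fromstring(record_xml)
    for article_id in root.iter("ArticleId"):
        if article_id.get("IdType") == "doi":
            return article_id.text
    return None

doi = extract_doi(SAMPLE)
```

When `extract_doi` comes back with `None` – as it does for the records with missing DOIs mentioned above – there is nothing to resolve via dx.doi.org, and the link to the publisher’s full text is lost.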
So should we celebrate or commiserate passing the 20 million mark in PubMed®? The triumphs far outweigh the tragedies, many of which are either beyond the control of PubMed or can be sorted out (hint hint: anyone from NCBI reading this?). PubMed represents a substantial fourteen years of work which continues to have significant benefits for many scientists around the world. There is plenty of room for improvement, but it’s hard to imagine Life® without PubMed®.
The complete catalogue of PubMed triumphs and tragedies is much longer than the above list, so if you think I missed any important ones, please leave a comment below.
- Alon Halevy, Peter Norvig & Fernando Pereira (2009). The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2), 8–12. DOI: 10.1109/MIS.2009.36
- Vetle Torvik & Neil Smalheiser (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). PMID: 20072710
- Rezarta Islamaj Dogan, G. Craig Murray, Aurelie Neveol & Zhiyong Lu (2009). Understanding PubMed user search behavior through log analysis. Database: The Journal of Biological Databases and Curation, 2009. PMID: 20157491
* These statistics were correct at the time of writing in July 2010 but will rapidly change over time.