A quick search on pubmed.gov today reveals that the freely available American database of biomedical literature has just passed the 20 million citations mark*. Should we celebrate or commiserate passing this landmark figure? Is it a triumph or a tragedy that PubMed® is the size it is?
Let’s start with the reasons to celebrate the triumphantly relentless growth of PubMed. Crack open the champagne!
- A central index freely available globally: Many biomedical scientists probably take PubMed for granted, but try to imagine biology and medicine without it – we would struggle to find anything. Unlike other bibliographic indexes (Scopus, ISI Web of Knowledge, etc.), PubMed is freely available to anyone, anywhere with an internet connection – it is an essential scientific service that many depend on every day to do their research.
- Twenty million citations: That’s a lot of data, and it’s growing at a rate of about one paper per minute (on average). This kind of big data can lead to big discoveries, and bigger data could mean bigger discoveries – hopefully. Data can be unreasonably effective, so the more the merrier.
- More than a billion searches in 2009: That’s an average of around 3.5 million searches per day, or 40 searches per second. Around 767 million of these queries were entered interactively by users on the web …
- Entrez Utilities: … while the other 514 million of those ~1.3 billion searches were executed programmatically, by machines rather than people. None of this web-enabled goodness would be possible without the reliable and successful Entrez Utilities services, which allow the data to be easily re-used by other software. Lots of useful applications have been built this way.
- A treasure trove of weird and wonderful things: A quick browse through NCBI Rolling On the Floor Laughing (ROFL) reveals all manner of strange reports indexed by PubMed (alongside the regular serious stuff).
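To give a flavour of what those machine-executed queries look like, here is a minimal Python sketch that builds an Entrez esearch URL for the PubMed database. The endpoint and the `db`/`term`/`retmax` parameters come from the E-utilities service; the example query itself is just an illustration, not anything from this post:

```python
from urllib.parse import urlencode

# Base URL of the NCBI Entrez E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=20):
    """Build an esearch query URL against the PubMed database.

    Fetching the resulting URL returns XML listing the PMIDs
    that match `term` (up to `retmax` of them).
    """
    params = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{params}"

# An illustrative query: MeSH term plus a date-of-publication field tag.
url = build_esearch_url("apoptosis[mh] AND 2009[dp]")
```

Real applications calling the service in anger should also identify themselves with the `tool` and `email` parameters, as NCBI’s usage guidelines ask.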
But there are also some reasons to commiserate the tragically relentless growth of PubMed. Pass round the anti-depressants…
- PubMed is too big and full of noise: Theodore Sturgeon’s law states that 90% of everything is rubbish. If correct, this means around 18 million records in PubMed are worthless junk – but that won’t stop them cluttering up the database and your search results, making it harder to find what you want when you need it. Many of the papers indexed by PubMed are “salami-sliced” by publication-hungry scientists into the least publishable unit and are of little or no actual scientific value. Cameron Neylon calls this the discovery deficit, but however you describe it, finding the information you need in PubMed can be frustratingly difficult (sometimes impossible) – despite the redesigns. There is so much in PubMed that it is impossible to keep up.
- PubMed is too small: Some people argue that an overly conservative indexing and editorial policy prevents PubMed from including much biomedically relevant literature published in physics, chemistry, mathematics, engineering and computer science journals; currently much of this literature is excluded from the database. What we really need is PubSCIENCE (covering the non-medical sciences), but that idea was tragically axed back in 2002.
- Identity crisis, ambiguous authors: One of the most useful ways to navigate the mountain of information that is PubMed is to search not by journal(s) or keyword(s) but by author(s). Authors like Barack Obama are easy to find (because of their unique name), but poor John Smith (and many others like him) is much harder to find. A recent study has estimated that almost two thirds of authors in PubMed have ambiguous names – where their last name and first initial are shared with one or more other authors. Another recent study has shown that search by author is one of the three most frequent types of search on PubMed, but unfortunately the precision and recall of these searches are typically poor because of ambiguous authors. This isn’t just a problem for PubMed, but for scientific publishing generally. Hopefully ORCID (or something like it) will solve the problem one day…
- Identity crisis, missing document identifiers: There are over forty million unique document IDs in the form of DOIs. They are a useful way to uniquely identify papers on the Web and link directly to their full content wherever it was originally published, but you might have trouble using DOIs in PubMed. Sometimes DOIs get left out of records altogether (see some random examples here). When they are included, they can get buried and are not very accessible: this record, for example, has a DOI, but you won’t find it anywhere in the default page served by PubMed, which means you can’t easily click through to the full text of the article that the DOI would take you to. Even simple URL identifiers get broken in PubMed (though it’s not always PubMed’s fault). What all this means is that PubMed is not as well integrated with other databases as it could and should be.
- Mostly abstracts only: PubMed has 20 million freely available abstracts rather than 20 million full-text papers. Imagine how the rate of scientific discovery and invention might increase (and the cost might decrease) if it were PubMed Central that had 20 million citations instead of just PubMed. Alas, PubMed Central is currently closer to the 2 million mark than the 20 million mark, but it is growing rapidly thanks to deposition mandates and open access publishing.
- Ranking results: By default PubMed ranks search results by date – but if Google did the same, very few people would bother to use it. Ranking results by relevance, using an algorithm more like PageRank, would be much more useful to many users, as demonstrated by Pierre Lindenbaum.
- Text mining and ontologies: We’ve still a long way to go before fully exploiting the possibilities offered by text-mining and ontologies to allow PubMed users to semantically search and browse the data. MeSH is just the beginning but that’s another story…
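The ambiguous-author problem above is easy to demonstrate in a few lines of Python: PubMed has traditionally indexed authors as last name plus first initial, so distinct people can collapse onto the same key. The names below are of course hypothetical, chosen purely to illustrate the collision:

```python
from collections import defaultdict

def ambiguity_groups(authors):
    """Group (surname, forename) pairs by the 'surname + first initial'
    key that PubMed has traditionally indexed; any key shared by more
    than one distinct person is ambiguous."""
    groups = defaultdict(set)
    for surname, forename in authors:
        key = f"{surname} {forename[0]}".lower()
        groups[key].add((surname, forename))
    return groups

# Hypothetical authors: two different J. Smiths collide on the same
# key, while a distinctive name remains unambiguous.
authors = [
    ("Smith", "John"),
    ("Smith", "Jane"),
    ("Obama", "Barack"),
]
groups = ambiguity_groups(authors)
ambiguous = {key for key, people in groups.items() if len(people) > 1}
```

An `[au]` search for “smith j” retrieves papers from every person behind that key – which is exactly why precision and recall suffer.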
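And on missing document identifiers: when a DOI is present in a PubMed record at all, it lives in the ArticleIdList of the XML that efetch returns. Here is a minimal sketch of digging it out – the XML fragment is trimmed for illustration, and the identifiers in it are made up (10.1000/xyz123 is the DOI Handbook’s own example DOI):

```python
import xml.etree.ElementTree as ET

# A trimmed, illustrative fragment of a PubMed record as returned by
# efetch; the real response nests this inside <PubmedArticleSet>, and
# the IDs here are placeholders, not a real record.
SAMPLE = """
<PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">12345678</ArticleId>
      <ArticleId IdType="doi">10.1000/xyz123</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
"""

def extract_doi(record_xml):
    """Return the DOI from a PubMed record's ArticleIdList, or None
    if the record has no ArticleId with IdType="doi"."""
    root = ET.fromstring(record_xml)
    for article_id in root.iter("ArticleId"):
        if article_id.get("IdType") == "doi":
            return article_id.text
    return None

doi = extract_doi(SAMPLE)
```

When `extract_doi` comes back with `None` – as it does for the records with missing DOIs mentioned above – there is nothing to resolve via dx.doi.org, and the link to the publisher’s full text is lost.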
So should we celebrate or commiserate passing the 20 million mark in PubMed®? The triumphs far outweigh the tragedies, many of which are either beyond the control of PubMed or can be sorted out (hint hint: anyone from NCBI reading this?). PubMed represents a substantial fourteen years of work which continues to have significant benefits for many scientists around the world. There is plenty of room for improvement, but it’s hard to imagine Life® without PubMed®.
The complete catalogue of PubMed triumphs and tragedies is much longer than the above list, so if you think I missed any important ones, please leave a comment below.
- Alon Halevy, Peter Norvig & Fernando Pereira (2009). The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2), 8–12. DOI: 10.1109/MIS.2009.36
- Vetle Torvik & Neil Smalheiser (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). PMID: 20072710
- Rezarta Islamaj Dogan, G. Craig Murray, Aurelie Neveol & Zhiyong Lu (2009). Understanding PubMed user search behavior through log analysis. Database: The Journal of Biological Databases and Curation, 2009. PMID: 20157491
* These statistics were correct at the time of writing in July 2010 but will rapidly change over time.