O'Really?

September 1, 2010

How many unique papers are there in Mendeley?

[Image: Lex Macho Inc. by Dan DeChiaro on Flickr. How many people in this picture?]

Mendeley is a handy piece of desktop and web software for managing and sharing research papers [1]. This popular tool has been getting a lot of attention lately, and with some impressive statistics it’s not difficult to see why. At the time of writing, Mendeley claims to have over 36 million papers, added by just under half a million users working at more than 10,000 research institutions around the world. That’s impressive considering the startup company behind it has only been going for a few years. The major established commercial players in the field of bibliographic databases (Web of Knowledge, or WoK, and Scopus) currently have around 40 million documents each, so if Mendeley continues to grow at this rate, they’ll be more popular than Jesus (and Elsevier and Thomson) before you can say “bibliography”. But to get a real handle on how big Mendeley is, we need to know how many of those 36 million documents are unique, because if there are lots of duplicated documents then the overall head count is inflated.

An obvious place to start looking for duplicates is a personal bibliography, but I’m not a regular user of Mendeley. However, I do have a collection of stuff on citeulike. Thankfully, users of citeulike can synchronise data with their Mendeley accounts, to save re-entering publications (again). So I pulled my data from citeulike, entered my citeulike username into the importer and Bingo! I’m a Mendeley user – nice and easy. Looking closely at some of the papers, it’s easy to spot quite a few duplicates, and this problem hinges on the thorny issue of identity. Any given paper can be identified in different ways and these need to be resolved. For example, all the identifiers below point to the same paper in different ways:

  1. http://pubmed.gov/18974831 (PubMed)
  2. http://dx.doi.org/10.1371/journal.pcbi.1000204 (DOI)
  3. http://ukpmc.ac.uk/articlerender.cgi?tool=EBI&pubmedid=18974831 (UK PMC)
  4. http://www.ploscompbiol.org/article/info:doi:10.1371/journal.pcbi.1000204 (PLoS)
  5. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2568856 (US PMC)

Part of what all reference management software does is resolve multiple identities to the same thing. Database people call this normalisation – and it can be tricky to do with web data. In citeulike, for example, all of these different IDs are ultimately recognised and normalised to the same unique thing: citeulike.org/article/3467077, which has been saved by 297 users. Citeulike, which currently stands at just over 4 million articles, is doing a reasonable job of detecting and merging duplicates, although it’s not perfect – it’s a hard problem. How does Mendeley compare on the same problem?
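To make the idea concrete, here is a minimal sketch of how the five URLs above might be reduced to one key. It is not how citeulike or Mendeley actually do it: the function names `extract_identifier` and `canonical` and the hand-built `CROSSWALK` table are assumptions made for illustration, and a real system would need a lookup service to translate between PubMed IDs, PMC IDs and DOIs.

```python
import re
from urllib.parse import urlparse, parse_qs

# The five URLs listed in the post, all pointing at the same paper.
URLS = [
    "http://pubmed.gov/18974831",
    "http://dx.doi.org/10.1371/journal.pcbi.1000204",
    "http://ukpmc.ac.uk/articlerender.cgi?tool=EBI&pubmedid=18974831",
    "http://www.ploscompbiol.org/article/info:doi:10.1371/journal.pcbi.1000204",
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2568856",
]

# Hypothetical crosswalk: in practice this mapping would come from a
# lookup service, here it is typed in by hand for the example paper.
CROSSWALK = {
    ("pmid", "18974831"): ("doi", "10.1371/journal.pcbi.1000204"),
    ("pmcid", "PMC2568856"): ("doi", "10.1371/journal.pcbi.1000204"),
}


def extract_identifier(url):
    """Return a (scheme, value) pair describing the identifier in a URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)

    # A DOI is "10.", a numeric registrant prefix, a slash, then a suffix.
    doi = re.search(r"10\.\d{4,9}/\S+", parsed.path)
    if doi:
        return ("doi", doi.group(0).lower())
    # UK PMC passes the PubMed ID as a query parameter.
    if "pubmedid" in query:
        return ("pmid", query["pubmedid"][0])
    # pubmed.gov puts the PubMed ID directly in the path.
    if parsed.netloc.lower() == "pubmed.gov":
        return ("pmid", parsed.path.strip("/"))
    # US PMC uses identifiers of the form PMC followed by digits.
    pmc = re.search(r"PMC\d+", parsed.path)
    if pmc:
        return ("pmcid", pmc.group(0))
    # Nothing recognised: fall back to the raw URL.
    return ("url", url)


def canonical(url):
    """Map a URL to a single key, preferring the DOI when one is known."""
    ident = extract_identifier(url)
    return CROSSWALK.get(ident, ident)


unique_keys = {canonical(u) for u in URLS}
print(unique_keys)
```

Extracting the identifier is the easy half. The crosswalk between identifier schemes is where most of the work lies, and it only helps for papers that have such identifiers in the first place.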

If you search Mendeley for the same example paper, you currently get at least seven different results that are not recognised as the same thing:

  1. defrosting-digital-library-bibliographic-tools-next-generation-web-240/ (saved by 99 users)
  2. defrosting-digital-library-bibliographic-tools-next-generation-web-252/ (saved by 123 users)
  3. defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/ (saved by 399 users)
  4. defrosting-the-digital-library-a-survey-of-bibliographic-tools-for-the-next-generation-web/ (saved by 1 user)
  5. sr-kell-db-defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/ (saved by 93 users)
  6. a-framework-for-scientific-knowledge-generation/ (ignore misleading title, saved by 93 users. Same as above? Might be)
  7. tropical-forest-fragmentation-and-the-local-extinction/ (ignore misleading title, saved by 7 users)

This particular paper may be an extreme case, but this kind of duplication is certainly not uncommon. A quick search for some other papers reveals that many have at least one duplicate in Mendeley – which means there is room for improvement. But popular papers like Error bars in experimental biology (saved by 894 users) don’t seem to have any duplicates at all – maybe that’s why they appear to be so popular?
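Going back to the seven results listed above, a rough sketch shows why title matching alone does not collapse them. The code below is an illustration only, not Mendeley's method: the `title_key` function and the stopword list are assumptions, and the slugs are copied from the list above.

```python
from collections import defaultdict

# Small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "the", "of", "for"}

# The seven Mendeley result slugs from the post.
SLUGS = [
    "defrosting-digital-library-bibliographic-tools-next-generation-web-240/",
    "defrosting-digital-library-bibliographic-tools-next-generation-web-252/",
    "defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/",
    "defrosting-the-digital-library-a-survey-of-bibliographic-tools-for-the-next-generation-web/",
    "sr-kell-db-defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/",
    "a-framework-for-scientific-knowledge-generation/",
    "tropical-forest-fragmentation-and-the-local-extinction/",
]


def title_key(slug):
    """Reduce a slug to a crude title key for grouping."""
    words = slug.strip("/").split("-")
    # Drop a trailing numeric suffix such as "-240".
    if words and words[-1].isdigit():
        words = words[:-1]
    # Drop stopwords so "the digital library" matches "digital library".
    return " ".join(w for w in words if w not in STOPWORDS)


groups = defaultdict(list)
for slug in SLUGS:
    groups[title_key(slug)].append(slug)

for key, members in groups.items():
    print(len(members), key)
```

With this crude key only the first three slugs fall into the same group. The variant with the subtitle, the one with author initials in front, and the two with misleading titles all stay separate, so they would need an identifier match or fuzzier comparison to be merged.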

So how many unique papers are there in Mendeley? It depends on how many duplicates there are, and that’s quite difficult to calculate accurately. Some papers have zero duplicates, others have as many as seven. So Mendeley might have as few as ~20 million unique documents or it might have as many as ~30 million, who knows? But it’s probably not as many as 36 million. Well, at least not just yet anyway…
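The guess above can be written down as a back-of-the-envelope calculation. The sketch below assumes a deliberately simple model that does not come from Mendeley: every unique paper has the same average number of extra copies, so the unique count is the total divided by one plus that average. The function names are made up for the example.

```python
# Total documents claimed at the time of writing (rounded, from the post).
TOTAL_DOCUMENTS = 36_000_000


def unique_estimate(total, avg_duplicates):
    """Unique papers if each one has avg_duplicates extra copies on average."""
    return total / (1 + avg_duplicates)


def implied_avg_duplicates(total, unique):
    """Average extra copies per paper implied by a guessed unique count."""
    return total / unique - 1


for rate in (0.0, 0.2, 0.8, 1.0):
    millions = unique_estimate(TOTAL_DOCUMENTS, rate) / 1e6
    print(f"{rate:.1f} duplicates per paper -> about {millions:.0f} million unique")
```

Under this model, ~30 million unique documents corresponds to an average of 0.2 duplicates per paper and ~20 million to an average of 0.8. An average of one duplicate per paper would halve the total to 18 million.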

References

  1. Victor Henning, & Jan Reichelt (2008). Mendeley – A Last.fm For Research? IEEE Fourth International Conference on eScience, 327-328 DOI: 10.1109/eScience.2008.128

[The duplicitous Lex Macho Inc. by Dan DeChiaro on Flickr, see extra commentary on this post over at friendfeed and Data duplication in Mendeley from Egon Willighagen.]

23 Comments »

  1. Mendeley have been called out on this problem numerous times over the past several years and always dodge it. One must wonder about their other figures as well. The company claims to have 500,000 users, but the API seems to be revealing some problems with those numbers. http://pipes.yahoo.com/pipes/pipe.info?_id=6a5ba92b83b777964009f8eaf341ad9f only shows a handful of people from institutions like Harvard (19), Princeton (21), and Yale (2). Something doesn’t smell right. Reminds one of Soviet industrial and agricultural output figures.

    Comment by BWG — September 1, 2010 @ 11:54 am | Reply

    • That’s interesting, so the most reliable data is “how many users”. The number of institutions and papers are both pretty questionable…

      Comment by Duncan — September 1, 2010 @ 1:13 pm | Reply

      • Your recent blog post “How many journal articles have been published (ever)?” also makes me uneasy about M’s claimed article figures. Do they (or we) really think they have even 10% of the cumulative scholarly output, let alone 70%, which is what 35M of 50M would of course be?

        Comment by BWG — September 1, 2010 @ 2:07 pm

    • Another way to look at this is with (my possibly faulty) math, and even using M’s figures it doesn’t add up. 36,109,930 papers / 487,953 users = 74. That means, on average, every single M user has 74 entirely unique papers, held by no one else. It would be surprising if each user had on average 74 papers at all in their libraries, but 74 unique ones seems highly implausible.

      Comment by BWG — September 1, 2010 @ 2:17 pm | Reply

      • Yup, 70% of 50 million papers ever published does seem pretty unlikely, as does an average of 74 unique papers per user. But Mendeley also includes books and book chapters too, not just papers.

        Comment by Duncan — September 1, 2010 @ 2:35 pm

  2. Every database has to deal with issues of duplicate content, but as far as duplicate papers go, we’re currently collapsing the duplicates into canonical papers and have this issue mostly solved, as you’ll see over the next few weeks. The stats we currently deliver on the web page are as complete as we can technically make them right now, and shouldn’t be that far off from the “true” number. Because the documents continue to come in at an exponential rate, time will substantiate the picture we’re painting. Likewise, when I search for Harvard at Mendeley, I get 90 results. Some of these will be mentions of Harvard in other places on their profile, but it’s quite easy to see that it’s higher than the 19 returned by the pipe, and remember that most users don’t yet list their institution on their profile, so it’s some multiple of that number.

    The issues raised by BWG are important ones, but they’ve been asked and answered before (by the same people, even!) For those who may not be familiar with the conversation, just google “BWG Mendeley”.

    Duncan, if anything is unsatisfactory to you about the answers I’ve given, please let me know. As far as I know, there’s no hard data on how many unique papers a researcher has in their reference manager (or filing cabinet), but it doesn’t seem unreasonable to me that it would be 74 or 100. We should do a better job making usage stats available, but I’d especially caution against trying to extrapolate in too global a fashion from the small but rapidly growing sample of users that Mendeley represents.

    Every database has some level of duplicates. Facebook, for example, has inconsistencies in the number of users they report, but no one really complains about this, because the story is that they’re far and away the largest of a service whose most valuable attribute is the number of people who use it. Likewise, Mendeley has built a research catalog that will soon surpass the Web of Knowledge and we’re letting anyone, man or machine, query it for free.

    Comment by Mr. Gunn — September 1, 2010 @ 3:50 pm | Reply

    • Hi William, thanks for your comments. First up, I’ve nothing against Mendeley, I think the move to make this data more open and queryable is a good and exciting one.

      I’m disputing the 36 million nearly-as-big-as-Scopus-and-WoK claim because I don’t think the data currently supports this based on that example paper. If we assume the average number of duplicates is 1, this halves the size of Mendeley in terms of unique publications.

      I look forward to seeing the new de-duplicated Mendeley, and I’d expect the figure of 36 million to go down a bit – by how much will be interesting to see. As for duplication being common, yes it’s a problem for everyone (citeulike suffers too, but it currently does a much better job than Mendeley IMHO).

      Databases like Scopus and WoK presumably spend a lot of time and money manually and automatically recognising and removing duplicates, and I haven’t found any duplicates in their database for individual *papers* yet. It’s different for individual *author* duplicates of course, but that’s a different story…

      Comment by Duncan — September 1, 2010 @ 5:09 pm | Reply

    • Mr. Gunn: Looks like in the past someone from Mendeley has claimed these press release figures were unique or a true number, so either that person was wrong or the story has changed or something else:

      http://eu.techcrunch.com/2009/11/18/mendeley-the-last-fm-of-research-could-be-world%E2%80%99s-largest-online-research-paper-database-by-early-2010/#comment-282335

      Comment by BWG — September 1, 2010 @ 5:52 pm | Reply

  3. Duncan – The number on the site is correct. The average number of duplicates is not 1, rather some small fraction of that, as duplicates aren’t that common to begin with and there’s been some deduplication already applied to the results. Duplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to one, because the variability is higher. However, even if the number were larger than 1, that still wouldn’t make news, IMO, because there’s no other crowd-sourced open research catalog even close to compiling what we’ve now released for free.

    That said, it’s understandable that people will try to come up with their own back-of-the-envelope calculations where our data is sparse, and we should do a better job of defining what the numbers mean.

    Comment by Mr. Gunn — September 1, 2010 @ 6:38 pm | Reply

    • Hi William, I’m not disputing that there are 36 million documents, just that there are 36 million *unique* documents meaning Mendeley will overtake the 40m papers at Thomson Reuters Web of Knowledge by the end of the year.

      I’ve not done a comprehensive survey, but based on a few sample papers from my own library it seems duplication is fairly common. Here are two more examples: paper a and paper b.

      I can’t see anything unusual about these papers, they are all in pubmed and have DOI’s (which many other scientific papers do) – so it seems quite likely that this is a widespread problem in Mendeley, not just a quirk of my own library.

      Comment by Duncan — September 1, 2010 @ 11:58 pm | Reply

      • Duncan – Google’s index lags our catalog by some degree, so the extra link was probably indexed before our internal de-duplication became active. There are also many pages in our catalog that will not have yet appeared in Google’s index, so I would recommend against using this means of searching Mendeley Web, at least for now. As you can see from the same search at Mendeley.com, there are only two results reported and not the three you see in Google’s index. Two isn’t one, of course, but from what we can tell, the average number of duplicates is much less than one. We should know more about this in a few weeks.

        Having made this observation about duplicates, how would you recommend it guide our development efforts? Should we stop accepting new documents until we have this sorted? Should we halt development on the citation style editor or the API and focus solely on deduplication? Should we not brag about ourselves to the media?

        Here’s the thing: Mendeley is about 30 people, all gathered together in one room, trying to disrupt the stagnant, old publishing infrastructure which we all complain about. At the end of a long day, we need to believe that what we’re doing really can change the world, even if it’s just to explain to our spouses why we’re putting in another late night. Does that make sense?

        Comment by Mr. Gunn — September 2, 2010 @ 1:57 am

    • Mr. Gunn… you write that “Duplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to one, because the variability is higher.” … but I am pretty sure my papers are not as popular as the given example… still, I see quite a few duplications.

      More importantly, please let us know how we can manually report duplication… I could not find on the website how to do this. Why not crowd-source that part of the database too? You’ll probably find out that people are eager to report such issues, as merging duplicates in one’s own set of publications will raise the readermeter.org statistics…

      Comment by Egon Willighagen — September 2, 2010 @ 11:10 pm | Reply

      • Egon,

        Yes, duplicates affect all papers, popular or not. I think the idea of crowd-sourcing the merging is a great one. It would be great to see Mendeley and Citeulike allow their users to do this.

        Comment by Duncan — September 3, 2010 @ 11:10 am

      • Just had a discussion with the developers here, and they agree crowdsourcing would be a great way to handle this problem. In fact, they already had a prototype of this under development. Does anyone have clever ideas for how to prevent abusive uses of a crowdsourced duplicate detection approach?

        Comment by Mr. Gunn — September 6, 2010 @ 2:51 pm

      • @Mr Gunn… the crowd-sourcing could be restricted to identification of duplicates, that could be curated by people at Mendeley… (e.g. with priority to paying customers)… I guess many incorrect mergers can be easily detected… many of the fields must have a high similarity, and fields like DOI must be identical…

        Or are you thinking of a different kind of abuse?

        Comment by Egon Willighagen — September 6, 2010 @ 2:56 pm

      • DOI, actually, isn’t the infallible indicator I used to think, because many people mis-enter the DOI manually to look up the metadata, so while the metadata and the DOI are consistent, the attached PDF is a different paper. If we could only use the DOIs extracted from papers, that would go some way toward handling the issue, but there remain so many things that don’t have DOIs that we really need a better solution.

        The abuse potential via disgruntled academics trying to bury competitors is distant, but we still want to architect the system such that we can more easily handle problems when they arise. Probably a good start would be to allow Mendeley users to flag items that can’t be automatically identified, and do the cleanup manually on the back-end for those.

        Comment by Mr. Gunn — September 6, 2010 @ 3:45 pm

      • Excellent point of the DOI as metadata! Didn’t think of that.

        Comment by Egon Willighagen — September 6, 2010 @ 3:51 pm

  4. Hi William,

    Mendeley (and other groups of people) are doing a great job to “disrupt the stagnant, old publishing infrastructure”… many people want this (including me), in fact it’s surprising it’s taking such a long time to happen. So here’s my wish list for Mendeley:

    1. Sort out duplication, to my mind, this is a serious problem that needs sorting out – it will never be 100% perfect, but it could be much much better. If Mendeley was as good as citeulike in doing merges, then that would be a start.

    2. Better statistics which means that size isn’t everything, quality matters too. I’d like to see more realistic statistics on how many unique documents there are in Mendeley. It would also be useful to know how much overlap there is between Mendeley/Scopus/WoK/PubMed etc. When Mendeley hits 40 million documents, will it really completely subsume them all?

    Comment by Duncan — September 2, 2010 @ 11:18 am | Reply

  5. Agreed on all points, Duncan. Deduplication is being sorted now, at least for documents which can be matched via title or that have an identifier, and further work is prioritized.

    The stats is something where we have to balance usefulness vs. completeness. We’d like to get out the message that we’re growing, but releasing numbers often brings about a sort of horse-race mentality which I think is really distracting, would take a large amount of time and effort to maintain, and may invite comparisons between sets of numbers that aren’t really equivalent, such as between Mendeley and other sites that don’t really address the same populations, have the same business model, etc. The numbers are for press releases, but the value of Mendeley is how much it improves your personal workflow.

    Comment by Mr. Gunn — September 2, 2010 @ 1:13 pm | Reply

  6. […] bibliographic databases, is an interesting issue that sparked a heated debate (see this post by Duncan Hull and the ensuing […]

    Pingback by Academic Productivity » ReaderMeter: Crowdsourcing research impact — September 22, 2010 @ 6:05 pm | Reply

  7. […] bibliographic databases, is an interesting issue that sparked a heated debate (see this post by Duncan Hull and the ensuing […]

    Pingback by Another researcher index? ReaderMeter looks to answer with Mendeley | Mendeley Blog — September 22, 2010 @ 7:06 pm | Reply

  8. […] many unique papers are there in Mendeley? http://duncan.hull.name/2010/09/01/mendeley/#more-3374 Possibly related posts: (automatically generated)Google Scholar Vs Web of Science Vs ScopusScopus […]

    Pingback by Mendeley: soon more popular than WoS and Scopus? « Science Intelligence and InfoPros — December 17, 2010 @ 10:58 pm | Reply

  9. […] details. These are my qualitative impressions — I have not backed them up with analysis as an old chum did a while back — but they are enough to make me wary of relying on Mendeley […]

    Pingback by Mendeley vs Bookends: No Contest — January 4, 2012 @ 4:05 pm | Reply

