Mendeley is a handy piece of desktop and web software for managing and sharing research papers. This popular tool has been getting a lot of attention lately, and with some impressive statistics it’s not difficult to see why. At the time of writing, Mendeley claims to have over 36 million papers, added by just under half a million users working at more than 10,000 research institutions around the world. That’s impressive considering the startup company behind it has only been going for a few years. The major established commercial players in the field of bibliographic databases (WoK and Scopus) currently have around 40 million documents, so if Mendeley continues to grow at this rate, it’ll be more popular than Jesus (and Elsevier and Thomson) before you can say “bibliography”. But to get a real handle on how big Mendeley is, we need to know how many of those 36 million documents are unique. If lots of documents are duplicated, the headline figure overstates the real head count.
An obvious place to start looking for duplicates is a personal bibliography, but I’m not a regular user of Mendeley. However, I do have a collection of stuff on citeulike. Thankfully, users of citeulike can synchronise data with their Mendeley accounts, which saves re-entering publications. So I pulled my data from citeulike, entered my citeulike username into the importer and Bingo! I’m a Mendeley user – nice and easy. Looking closely at some of the papers, it’s easy to spot quite a few duplicates, and this problem hinges on the thorny issue of identity. Any given paper can be identified in different ways and these need to be resolved. For example, all the identifiers below are different ways of identifying the same paper:
- http://pubmed.gov/18974831 (PubMed)
- http://dx.doi.org/10.1371/journal.pcbi.1000204 (DOI)
- http://ukpmc.ac.uk/articlerender.cgi?tool=EBI&pubmedid=18974831 (UK PMC)
- http://www.ploscompbiol.org/article/info:doi:10.1371/journal.pcbi.1000204 (PLoS)
- http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2568856 (US PMC)
Part of what all reference management software does is resolve multiple identities to the same thing. Database people call this normalisation – and it can be tricky to do with web data. In citeulike, for example, all of these different IDs are ultimately recognised and normalised to the same unique thing: citeulike.org/article/3467077, which has been saved by 297 users. Citeulike, which currently stands at just over 4 million articles, is doing a reasonable job of detecting and merging duplicates, although it’s not perfect – it’s a hard problem. How does Mendeley compare on the same problem?
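To make the idea of normalisation a bit more concrete, here is a minimal Python sketch that maps the five URLs above onto a single key. It is only an illustration, not how citeulike or Mendeley actually do it: the helper names (`extract_identifier`, `canonical_key`) are made up for this post, the regular expressions are deliberately loose, and the `ALIASES` table is hand-written for this one paper, where a real system would need a lookup service or database to know that a given PubMed ID and DOI belong together.

```python
import re
from urllib.parse import urlparse, parse_qs

# Loose patterns, good enough for the example URLs (assumption, not a spec).
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+")
PMCID_RE = re.compile(r"PMC\d+")

# Hand-written cross-reference for the one example paper. A real system
# would have to look this up somewhere (hypothetical stand-in).
CANONICAL = ("doi", "10.1371/journal.pcbi.1000204")
ALIASES = {
    ("pmid", "18974831"): CANONICAL,
    ("pmcid", "PMC2568856"): CANONICAL,
}


def extract_identifier(url):
    """Return a (kind, value) pair found in the URL, or None."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    # UK PMC style: the PubMed ID travels in a query parameter.
    if "pubmedid" in query:
        return ("pmid", query["pubmedid"][0])
    doi = DOI_RE.search(url)
    if doi:
        return ("doi", doi.group(0).lower())
    pmcid = PMCID_RE.search(parsed.path)
    if pmcid:
        return ("pmcid", pmcid.group(0))
    # pubmed.gov style: the PubMed ID is the whole path.
    if parsed.netloc.endswith("pubmed.gov"):
        pmid = parsed.path.strip("/")
        if pmid.isdigit():
            return ("pmid", pmid)
    return None


def canonical_key(url):
    """Map a URL to one key per paper, using the alias table when it applies."""
    ident = extract_identifier(url)
    if ident is None:
        return None
    return ALIASES.get(ident, ident)


urls = [
    "http://pubmed.gov/18974831",
    "http://dx.doi.org/10.1371/journal.pcbi.1000204",
    "http://ukpmc.ac.uk/articlerender.cgi?tool=EBI&pubmedid=18974831",
    "http://www.ploscompbiol.org/article/info:doi:10.1371/journal.pcbi.1000204",
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2568856",
]
keys = {canonical_key(u) for u in urls}
print(len(keys), keys)
```

With the alias table in place, all five URLs collapse to one key. The hard part in practice is building that table, which is exactly where duplicates creep in.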
If you search Mendeley for the same example paper, you currently get at least seven different results that are not recognised as the same thing:
- defrosting-digital-library-bibliographic-tools-next-generation-web-240/ (saved by 99 users)
- defrosting-digital-library-bibliographic-tools-next-generation-web-252/ (saved by 123 users)
- defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/ (saved by 399 users)
- defrosting-the-digital-library-a-survey-of-bibliographic-tools-for-the-next-generation-web/ (saved by 1 user)
- sr-kell-db-defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/ (saved by 93 users)
- a-framework-for-scientific-knowledge-generation/ (ignore misleading title, saved by 93 users. Same as above? Might be)
- tropical-forest-fragmentation-and-the-local-extinction/ (ignore misleading title, saved by 7 users)
This particular paper may be an extreme case, but this kind of duplication is certainly not uncommon. A quick search for some other papers reveals that many have at least one duplicate in Mendeley – which means there is room for improvement. But popular papers like Error bars in experimental biology (saved by 894 users) don’t seem to have any duplicates at all – maybe that’s why they appear to be so popular?
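Several of the seven entries listed above differ only in small words and trailing numbers, so one plausible way to catch them is to compare titles after a bit of cleaning. The sketch below is an assumption about how that could be done, not a description of Mendeley's matching: the stop-word list, the 0.7 threshold and the function names are all arbitrary choices for illustration, and throwing away every numeric token would be too crude for titles that contain meaningful numbers.

```python
# Illustrative choices only: neither the stop words nor the threshold
# come from Mendeley or citeulike.
STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "and"})
THRESHOLD = 0.7


def title_tokens(slug):
    """Split a URL slug into a set of significant words."""
    words = slug.strip("/").lower().split("-")
    return frozenset(
        w for w in words if w and w not in STOP_WORDS and not w.isdigit()
    )


def jaccard(a, b):
    """Jaccard similarity: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def looks_like_duplicate(slug_a, slug_b, threshold=THRESHOLD):
    """True when two slugs share enough significant words."""
    return jaccard(title_tokens(slug_a), title_tokens(slug_b)) >= threshold


slugs = [
    "defrosting-digital-library-bibliographic-tools-next-generation-web-240/",
    "defrosting-digital-library-bibliographic-tools-next-generation-web-252/",
    "defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/",
    "defrosting-the-digital-library-a-survey-of-bibliographic-tools-for-the-next-generation-web/",
    "sr-kell-db-defrosting-the-digital-library-bibliographic-tools-for-the-next-generation-web/",
    "a-framework-for-scientific-knowledge-generation/",
    "tropical-forest-fragmentation-and-the-local-extinction/",
]

reference = slugs[2]
for slug in slugs:
    print(looks_like_duplicate(reference, slug), slug)
```

Under these assumptions the first five entries are flagged as the same paper, while the two with misleading titles are not, because they share almost no words with the real title. Title matching alone cannot rescue those; they need an identifier such as a DOI or PubMed ID.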
So how many unique papers are there in Mendeley? It depends on how many duplicates there are, and that’s quite difficult to calculate accurately. Some papers have zero duplicates, others have as many as seven. So Mendeley might have as few as ~20 million unique documents or it might have as many as ~30 million, who knows? But it’s probably not as many as 36 million. Well, at least not just yet anyway…
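For what it’s worth, the back-of-envelope relationship behind those guesses is just total records divided by unique papers. The snippet below works it through for the two rough figures above; the ~20 and ~30 million values are guesses from this post rather than measurements, and the function name is made up for illustration.

```python
def mean_records_per_paper(total_records, unique_papers):
    """Average number of stored records for each distinct paper."""
    return total_records / unique_papers


TOTAL_RECORDS = 36_000_000  # figure claimed by Mendeley at the time of writing

# The two guesses for the number of unique papers (not measured values).
for unique_guess in (20_000_000, 30_000_000):
    ratio = mean_records_per_paper(TOTAL_RECORDS, unique_guess)
    print(unique_guess, round(ratio, 2))
```

In other words, ~20 million unique papers would mean each paper is stored about 1.8 times on average, and ~30 million would mean about 1.2 times.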
- Victor Henning, & Jan Reichelt (2008). Mendeley – A Last.fm For Research? IEEE Fourth International Conference on eScience, 327-328 DOI: 10.1109/eScience.2008.128
[The duplicitous Lex Macho Inc. by Dan DeChiaro on Flickr, see extra commentary on this post over at friendfeed and Data duplication in Mendeley from Egon Willighagen.]