September 1, 2010
How many unique papers are there in Mendeley?
Mendeley is a handy piece of desktop and web software for managing and sharing research papers [1]. This popular tool has been getting a lot of attention lately, and with some impressive statistics it’s not difficult to see why. At the time of writing, Mendeley claims to have over 36 million papers, added by just under half a million users working at more than 10,000 research institutions around the world. That’s impressive considering the startup company behind it has only been going for a few years. The major established commercial players in the field of bibliographic databases (Web of Knowledge and Scopus) currently have around 40 million documents, so if Mendeley continues to grow at this rate, they’ll be more popular than Jesus (and Elsevier and Thomson) before you can say “bibliography”. But to get a real handle on how big Mendeley is, we need to know how many of those 36 million documents are unique, because lots of duplicated documents would inflate the overall head count. (more…)
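One crude way to get at that number is to normalise each record’s identifier (or, failing that, its title) and count the distinct values. The snippet below is just a sketch of the idea; the record format is an assumption for illustration, not Mendeley’s actual schema.

```python
# A crude sketch of estimating how many records in a document catalogue are
# unique: normalise the DOI where one exists, otherwise fall back to a
# normalised title. The record format here is an assumption for illustration,
# not Mendeley's actual schema.
import re

def dedup_key(record):
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    # Fall back to a normalised title: lower-case, alphanumerics only.
    # Title matching is deliberately crude, so near-duplicates can slip through.
    title = re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()
    return ("title", title)

records = [
    {"doi": "10.1371/journal.pcbi.1000204", "title": "Defrosting the Digital Library"},
    {"doi": "10.1371/JOURNAL.PCBI.1000204", "title": "Defrosting the digital library"},
    {"doi": "", "title": "Defrosting the Digital Library: Bibliographic Tools..."},
]

unique = {dedup_key(r) for r in records}
print(f"{len(records)} records, roughly {len(unique)} unique")
```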
July 15, 2010
How many journal articles have been published (ever)?

According to some estimates, there are fifty million articles in existence as of 2010. Picture of a fifty million dollar note by ZeroOne on Flickr.
Earlier this year, the scientific journal PLoS ONE published its 10,000th article. Ten thousand articles is a lot of papers, especially when you consider that PLoS ONE only started publishing four short years ago, in 2006. But scientists have been publishing in journals for at least 350 years [1], so it might make you wonder: how many articles have been published in scientific and learned journals since time began?
PubMed Central, a full-text archive of journals freely available to all, currently holds over 1.7 million articles. But these articles are only a tiny fraction of the total literature, since much of the rest is locked up behind publishers’ paywalls and is inaccessible to many people. (more…)
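For a rough sense of scale: if the fifty million estimate above is in the right ballpark, then PubMed Central’s 1.7 million free articles amount to only about 1.7/50 ≈ 3.4% of everything ever published.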
September 18, 2009
Popular, personal and public data: Article-level metrics at PLoS
The Public Library of Science (PLoS) is a non-profit organisation committed to making the world’s scientific and medical literature freely accessible to everyone via open access publishing. As recently announced, they have just published the first article-level metrics (e.g. web server logs and related information) for all articles in their library. This is novel, interesting and potentially useful data, not currently made publicly available by other publishers. Here is a selection of some of the data, taken from the full dataset here (large file), which includes the “top ten” papers by viewing statistics.
Article-level metrics for some papers published in PLoS (August 2009)
*The rank is based on the 12,549 papers for which viewing data (combined usage of HTML + PDF + XML) are available.
**Citation counts are via PubMedCentral (data from CrossRef and Scopus is also provided, see Bora’s comments and commentary at Blue Lab Coats.)
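If you want to poke at the numbers yourself, here is a minimal sketch of how the combined-usage ranking above could be recomputed from the downloadable spreadsheet. The file name and column names are assumptions for illustration; check them against the headers in the actual dataset before running it.

```python
# A minimal sketch of reproducing the combined-usage ranking from the PLoS
# article-level metrics spreadsheet. The file name and the column names
# ("DOI", "HTML views", "PDF downloads", "XML downloads") are assumptions;
# check them against the real dataset's headers first.
import csv

rows = []
with open("plos_article_level_metrics.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        try:
            combined = (int(row["HTML views"])
                        + int(row["PDF downloads"])
                        + int(row["XML downloads"]))
        except (KeyError, ValueError):
            continue  # skip articles with missing usage data
        rows.append((combined, row["DOI"]))

# Rank by combined HTML + PDF + XML usage, highest first
rows.sort(reverse=True)
for rank, (views, doi) in enumerate(rows[:10], start=1):
    print(f"{rank:2d}. {views:8d}  {doi}")
```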
Science is not a popularity contest but…
Analysing this data is not straightforward. Some highly-viewed articles are never cited (reviews, editorials, essays, opinion pieces, etc.). Likewise, popularity and importance are not the same thing. Some articles get lots of citations but few views, which suggests that people are not actually reading the papers before citing them. As described on the PLoS website article-level-metrics.plos.org:
“When looking at Article-Level Metrics for the first time bear the following points in mind:
- Online usage is dependent on the article type, the age of the article, and the subject area(s) it is in. Therefore you should be aware of these effects when considering the performance of any given article.
- Older articles normally have higher usage than younger ones simply because the usage has had longer to accumulate. Articles typically have a peak in their usage in the first 3 months and usage then levels off after that.
- Spikes of usage can be caused by media coverage, usage by large numbers of people, out of control download scripts or any number of other reasons. Without a detailed look at the raw usage logs it is often impossible to tell what the reason is and so we encourage you to regard usage data as indicative of trends, rather than as an absolute measure for any given article.
- We currently have missing usage data for some of our articles, but we are working to fill the gaps. Primarily this affects those articles published before June 17th, 2005.
- Newly published articles do not accumulate usage data instantaneously but require a day or two before data are shown.
- Article citations as recorded by the Scopus database are sometimes undercounted because there are two records in the database for the same article. We’re working with Scopus to correct this issue.
- All metrics will accrue over time (and some, such as citations, will take several years to accrue). Therefore, recent articles may not show many metrics (other than online usage, which accrues from day one). ”
So all the usual caveats apply when using this bibliometric data. Despite the limitations, it is more revealing than the useful (but simplistic) “highly accessed” papers at BioMed Central, which doesn’t always give full information on what “highly” actually means next to each published article. It will be interesting to see if other publishers now follow the lead of PLoS and BioMed Central and also publish their usage data combined with other bibliometric indicators such as blog coverage. For authors publishing with PLoS, this data has an added personal dimension too: it is handy to see how many views your paper has.
As paying customers of the services that commercial publishers provide, should scientists and their funders be demanding more of this kind of information in the future? I reckon they should. You have to wonder why this kind of innovation has taken so long to happen, but it is a welcome addition.
[More commentary on this post over at friendfeed.]
References
- Ioannidis, J. (2005). Why Most Published Research Findings Are False PLoS Medicine, 2 (8) DOI: 10.1371/journal.pmed.0020124
- Kirsch, I., Deacon, B., Huedo-Medina, T., Scoboria, A., Moore, T., & Johnson, B. (2008). Initial Severity and Antidepressant Benefits: A Meta-Analysis of Data Submitted to the Food and Drug Administration PLoS Medicine, 5 (2) DOI: 10.1371/journal.pmed.0050045
- Lacasse, J., & Leo, J. (2005). Serotonin and Depression: A Disconnect between the Advertisements and the Scientific Literature PLoS Medicine, 2 (12) DOI: 10.1371/journal.pmed.0020392
- Levy, S., Sutton, G., Ng, P., Feuk, L., Halpern, A., Walenz, B., Axelrod, N., Huang, J., Kirkness, E., Denisov, G., Lin, Y., MacDonald, J., Pang, A., Shago, M., Stockwell, T., Tsiamouri, A., Bafna, V., Bansal, V., Kravitz, S., Busam, D., Beeson, K., McIntosh, T., Remington, K., Abril, J., Gill, J., Borman, J., Rogers, Y., Frazier, M., Scherer, S., Strausberg, R., & Venter, J. (2007). The Diploid Genome Sequence of an Individual Human PLoS Biology, 5 (10) DOI: 10.1371/journal.pbio.0050254
- Holy, T., & Guo, Z. (2005). Ultrasonic Songs of Male Mice PLoS Biology, 3 (12) DOI: 10.1371/journal.pbio.0030386
- Franzen, J., Gingerich, P., Habersetzer, J., Hurum, J., von Koenigswald, W., & Smith, B. (2009). Complete Primate Skeleton from the Middle Eocene of Messel in Germany: Morphology and Paleobiology PLoS ONE, 4 (5) DOI: 10.1371/journal.pone.0005723
- The PLoS Medicine Editors (2006). The Impact Factor Game PLoS Medicine, 3 (6) DOI: 10.1371/journal.pmed.0030291
- Voight, B., Kudaravalli, S., Wen, X., & Pritchard, J. (2006). A Map of Recent Positive Selection in the Human Genome PLoS Biology, 4 (3) DOI: 10.1371/journal.pbio.0040072
- Hagmann, P., Cammoun, L., Gigandet, X., Meuli, R., Honey, C., Wedeen, V., & Sporns, O. (2008). Mapping the Structural Core of Human Cerebral Cortex PLoS Biology, 6 (7) DOI: 10.1371/journal.pbio.0060159
- Bourne, P. (2005). Ten Simple Rules for Getting Published PLoS Computational Biology, 1 (5) DOI: 10.1371/journal.pcbi.0010057
- Lawrence, P. (2006). Men, Women, and Ghosts in Science PLoS Biology, 4 (1) DOI: 10.1371/journal.pbio.0040019
- Hull, D., Pettifer, S., & Kell, D. (2008). Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web PLoS Computational Biology, 4 (10) DOI: 10.1371/journal.pcbi.1000204
- Beltrao, P., & Serrano, L. (2007). Specificity and Evolvability in Eukaryotic Protein Interaction Networks PLoS Computational Biology, 3 (2) DOI: 10.1371/journal.pcbi.0030025
- Beltrao, P., & Serrano, L. (2005). Comparative Genomics and Disorder Prediction Identify Biologically Relevant SH3 Protein Interactions PLoS Computational Biology, 1 (3) DOI: 10.1371/journal.pcbi.0010026
- Ho, B., & Dill, K. (2006). Folding Very Short Peptides Using Molecular Dynamics PLoS Computational Biology, 2 (4) DOI: 10.1371/journal.pcbi.0020027
- Saunders, N., Beltrão, P., Jensen, L., Jurczak, D., Krause, R., Kuhn, M., & Wu, S. (2009). Microblogging the ISMB: A New Approach to Conference Reporting PLoS Computational Biology, 5 (1) DOI: 10.1371/journal.pcbi.1000263
- Ho, B., & Agard, D. (2009). Probing the Flexibility of Large Conformational Changes in Protein Structures through Local Perturbations PLoS Computational Biology, 5 (4) DOI: 10.1371/journal.pcbi.1000343
- Wolschin, F., & Gadau, J. (2009). Deciphering Proteomic Signatures of Early Diapause in Nasonia PLoS ONE, 4 (7) DOI: 10.1371/journal.pone.0006394
June 2, 2009
Michael Ley on Digital Bibliographies
Michael Ley is visiting Manchester this week and will be giving a seminar on Wednesday 3rd June. Here are some details for anyone who is interested in attending:
Date: 3rd Jun 2009
Title: DBLP: How the data get in
Speaker: Dr Michael Ley, University of Trier, Germany
Time & Location: 14:15, Lecture Theatre 1.4, Kilburn Building
Abstract: The DBLP (Digital Bibliography & Library Project) Computer Science Bibliography now includes more than 1.2 million bibliographic records. For Computer Science researchers, the DBLP web site is now a popular tool to trace the work of colleagues and to retrieve bibliographic details when composing the lists of references for new papers. Ranking and profiling of persons, institutions, journals, or conferences is another usage of DBLP. Many scientists are aware of this and want their publications to be listed as completely as possible.
The talk focuses on the data acquisition workflow for DBLP. Getting ‘clean’ basic bibliographic information for scientific publications remains a chaotic puzzle.
Large publishers are either not interested in cooperating with open services like DBLP, or their policies are very inconsistent. In most cases they are not able or not willing to deliver the basic data required for DBLP directly, but they encourage us to crawl their Web sites. This indirection has two main problems:
- The organisation and appearance of Web sites change from time to time, which forces a reimplementation of the information extraction scripts. [1]
- In many cases manual steps are necessary to get ‘complete’ bibliographic information.
For many small information sources it is not worthwhile to develop information extraction scripts, so data acquisition is done manually. There is an amazing variety of small but interesting journals, conferences and workshops in Computer Science which are not under the umbrella of ACM, IEEE, Springer, Elsevier etc. How they get in is often decided very pragmatically.
The goal of the talk and my visit to Manchester is to start a discussion process: the EasyChair conference management system developed by Andrei Voronkov and DBLP are both parts of the scientific publication workflow. Should they be connected for mutual benefit?
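As an aside, reference [1] below explains why this kind of screen scraping is such torture. The toy sketch that follows is purely illustrative (the HTML and class names are made up, not any real publisher’s markup): a one-line change to the page layout is enough to break the extraction rule.

```python
# A minimal, hypothetical sketch of the kind of screen scraping DBLP relies on.
# The HTML below stands in for a publisher's table-of-contents page; real pages,
# markup and class names differ from publisher to publisher and change over time.
from html.parser import HTMLParser

TOC_HTML = """
<html><body>
  <div class="toc">
    <span class="title">Creating a bioinformatics nation</span>
    <span class="authors">Lincoln Stein</span>
    <span class="title">DBLP: How the data get in</span>
    <span class="authors">Michael Ley</span>
  </div>
</body></html>
"""

class TitleScraper(HTMLParser):
    """Collect the text of every <span class="title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # This rule is the fragile part: if the publisher renames the class or
        # switches to a different tag, the scraper silently finds nothing.
        if tag == "span" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(TOC_HTML)
print(scraper.titles)  # ['Creating a bioinformatics nation', 'DBLP: How the data get in']
```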
References
- Lincoln Stein (2002). Creating a bioinformatics nation: screen scraping is torture Nature, 417 (6885), 119-120 DOI: 10.1038/417119a
May 19, 2009
Defrosting the John Rylands University Library
For anyone who missed the original bioinformatics seminar, I’ll be doing a repeat of the “Defrosting the Digital Library” talk, this time for the staff in the John Rylands University Library (JRUL). This is the main academic library in Manchester with (quote) “more than 4 million printed books and manuscripts, over 41,000 electronic journals and 500,000 electronic books, as well as several hundred databases, the John Rylands University Library is one of the best-resourced academic libraries in the country.” The journal subscription budget of the library is currently around £4 million per year, and that’s before they’ve even bought any books! Here is the abstract for the talk:
After centuries with little change, scientific libraries have recently experienced massive upheaval. From being almost entirely paper-based, most libraries are now almost completely digital. This information revolution has all happened in less than 20 years and has created many novel opportunities and threats for scientists, publishers and libraries.
Today, we are struggling with an embarrassing wealth of digital knowledge on the Web. Most scientists access this knowledge through some kind of digital library; however, these can be cold, impersonal, isolated, and inaccessible places. Many libraries are still clinging to obsolete models of identity, attribution, contribution, citation and publication.
Based on a review published in PLoS Computational Biology (pubmed.gov/18974831), this talk will discuss the current chilly state of digital libraries for biologists, chemists and informaticians, including PubMed and Google Scholar. We highlight problems and solutions to the coupling and decoupling of publication data and metadata, with a tool called citeulike.org. This software tool (and many other tools just like it) exploits the Web to make digital libraries “warmer”: more personal, sociable, integrated, and accessible places.
Finally, issues that will help or hinder the continued warming of libraries in the future, particularly the accurate identity of authors and their publications, are briefly introduced. These are discussed in the context of the BBSRC-funded REFINE project at the National Centre for Text Mining (NaCTeM.ac.uk), which is linking biochemical pathway data with evidence for pathways from the PubMed database.
Date: Thursday 21st May 2009. Time: 13.00. Location: Parkinson Room (inside the main entrance, first on the right), John Rylands University (Main) Library, Oxford Road, University of Manchester (number 55 on the Google map of the Manchester campus). Please come along if you are interested…
References
- Hull, D., Pettifer, S., & Kell, D. (2008). Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web PLoS Computational Biology, 4 (10) DOI: 10.1371/journal.pcbi.1000204
[CC licensed picture above, the John Rylands Library on Deansgate by dpicker: David Picker]
October 31, 2008
Defrosting the Digital Library
Bibliographic Tools for the Next Generation Web
We started writing this paper [1] over a year ago, so it’s great to see it finally published today. Here is the abstract:
“Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as “thought in cold storage,” and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.”
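As a small illustration of the coupling of publication data and metadata that the review discusses, the sketch below asks the DOI resolver for machine-readable metadata about the paper itself via content negotiation. It assumes a working network connection and that the DOI’s registration agency (CrossRef, in this case) honours the requested BibTeX format.

```python
# A sketch of fetching bibliographic metadata for a DOI via content negotiation
# on doi.org, instead of scraping the publisher's landing page. Assumes network
# access and that the DOI's registration agency supports the requested format.
import urllib.request

DOI = "10.1371/journal.pcbi.1000204"  # Defrosting the Digital Library

request = urllib.request.Request(
    f"https://doi.org/{DOI}",
    headers={"Accept": "application/x-bibtex"},  # ask for BibTeX, not HTML
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
```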
Thanks to Kevin Emamy, Richard Cameron, Martin Flack, and Ian Mulvany for answering questions on the CiteULike and Connotea mailing lists; and Greg Tyrelle for ongoing discussion about metadata and the semantic Web nodalpoint.org. Also thanks to Timo Hannay and Tim O’Reilly for an invitation to scifoo, where some of the issues described in this publication were discussed. Last but not least, thanks to Douglas Kell and Steve Pettifer for helping me write it and the BBSRC for funding it (grant code BB/E004431/1 REFINE project). We hope it is a useful review, and that you enjoy reading it as much as we enjoyed writing it.
References
- Duncan Hull, Steve Pettifer and Douglas B. Kell (2008). Defrosting the digital library: Bibliographic tools for the next generation web. PLoS Computational Biology, 4(10):e1000204+. DOI:10.1371/journal.pcbi.1000204, pmid:18974831, pmcid:2568856, citeulike:3467077
- Also mentioned (in no particular order) by NCESS, Wowter, Twine, Stephen Abram, Rod Page, Digital Koans, Twitter, Bora Zivkovic, Digg, reddit, Library Intelligencer, OpenHelix, Delicious, friendfeed, Dr. Shock, GribbleLab, Nature Blogs, Ben Good, Rafael Sidi, Scholarship 2.0, Subio, up2date, SecondBrain, Hubmed, BusinessExchange, CiteGeist, Connotea and Google
[Sunrise Ice Sculptures picture from Mark K.]
June 20, 2008
A Brief Review of RefWorks
There is no shortage of bibliographic management tools out there, which ultimately aim to save you time managing the papers and books in your personal library. I’ve just been to a demo and sales pitch for one of them, a tool called RefWorks. RefWorks claims to be “an online research management, writing and collaboration tool — designed to help researchers easily gather, manage, store and share all types of information, as well as generate citations and bibliographies”. It looks like a pretty good tool, similar to the likes of EndNote, but with more web-based features in common with CiteULike and Connotea. Here are some ultra-brief notes: RefWorks in five minutes, the good, the bad and the ugly.
The Good…
RefWorks’ finer features
- RefWorks is web-based: you can use it from any computer with an internet connection, without having to install any software. Platform independent: Mac, Windows, Linux, Blackberry, iPhone, Woteva. This feature is becoming increasingly common; see Martin Fenner’s Online reference managers, not quite there yet article at Nature Network.
- Share selected references and bibliographies on the Web via RefShare
- It imports and exports all the things you would expect: EndNote (definitely), XML, feeds (RSS), flat files, BibTeX (check?), RIS (check?) and several others via the screen-scraping tool RefGrab-It
- Interfaces closely with PubMed and Scopus (and many other databases), e.g. you can search these directly from your RefWorks library. You can also export from Scopus to RefWorks…
- Not part of the Reed-Elsevier global empire (yet), currently part of ProQuest, based in California.
- Free 30 day trial is available
- Just like EndNote, it can be closely integrated with Microsoft Word to cite-while-you-write