O'Really?

July 27, 2010

Twenty million papers in PubMed: a triumph or a tragedy?

A quick search on pubmed.gov today reveals that the freely available American database of biomedical literature has just passed the 20 million citation mark*. Should we celebrate or commiserate passing this landmark figure? Is it a triumph or a tragedy that PubMed® is the size it is? (more…)

June 2, 2009

Michael Ley on Digital Bibliographies

Michael Ley is visiting Manchester this week and will be giving a seminar on Wednesday 3rd June. Here are the details for anyone interested in attending:

Date: 3rd Jun 2009

Title: DBLP: How the data get in

Speaker: Dr Michael Ley, University of Trier, Germany

Time & Location: 14:15, Lecture Theatre 1.4, Kilburn Building

Abstract: The DBLP (Digital Bibliography & Library Project) Computer Science Bibliography now includes more than 1.2 million bibliographic records. For Computer Science researchers, the DBLP web site is now a popular tool for tracing the work of colleagues and for retrieving bibliographic details when composing the lists of references for new papers. Ranking and profiling of persons, institutions, journals, or conferences is another use of DBLP. Many scientists are aware of this and want their publications to be listed as completely as possible.

The talk focuses on the data acquisition workflow for DBLP. Getting ‘clean’ basic bibliographic information for scientific publications remains a chaotic puzzle.

Large publishers are either not interested in cooperating with open services like DBLP, or their policy is very inconsistent. In most cases they are unable or unwilling to deliver the basic data required for DBLP directly, but they encourage us to crawl their web sites. This indirection has two main problems:

  1. The organisation and appearance of web sites changes from time to time, which forces a reimplementation of the information extraction scripts. [1]
  2. In many cases manual steps are necessary to get ‘complete’ bibliographic information.
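The brittleness described above can be illustrated with a minimal sketch (this is not DBLP's actual code, and the publisher page markup is invented for the example): an extraction script hard-codes the class names of the publisher's current markup, so any site redesign silently breaks it.

```python
# A minimal sketch of why screen scraping is brittle: bibliographic
# fields are extracted by matching the page's *current* markup, so a
# redesign breaks the extractor and forces a rewrite.
from html.parser import HTMLParser

# Hypothetical publisher table-of-contents snippet (invented markup).
TOC_HTML = """
<div class="article">
  <span class="title">Creating a bioinformatics nation</span>
  <span class="author">Lincoln Stein</span>
</div>
"""

class TocScraper(HTMLParser):
    """Collects text from <span> elements whose class names are hard-coded."""
    def __init__(self):
        super().__init__()
        self.field = None   # which field the next text node belongs to
        self.record = {}    # extracted bibliographic record

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "span" and classes in ("title", "author"):
            self.field = classes

    def handle_data(self, data):
        if self.field and data.strip():
            self.record[self.field] = data.strip()
            self.field = None

scraper = TocScraper()
scraper.feed(TOC_HTML)
print(scraper.record)
# If the publisher renames class="title" to class="article-title",
# the record comes back incomplete and the script must be rewritten.
```

A feed of structured records (XML, RSS, or an API) would avoid this fragility entirely, which is exactly the argument made in reference [1].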

For many small information sources it is not worthwhile to develop information extraction scripts, so data acquisition is done manually. There is an amazing variety of small but interesting journals, conferences and workshops in Computer Science which are not under the umbrella of ACM, IEEE, Springer, Elsevier etc. How their data get in is often decided very pragmatically.

The goal of the talk, and of my visit to Manchester, is to start a discussion: the EasyChair conference management system developed by Andrei Voronkov and DBLP are both parts of the scientific publication workflow. Should they be connected for mutual benefit?

References

  1. Lincoln Stein (2002). Creating a bioinformatics nation: screen scraping is torture. Nature, 417(6885), 119–120. DOI: 10.1038/417119a
