Michael Ley is visiting Manchester this week, he will be doing a seminar on Wednesday 3rd June, here are some details for anyone who is interested in attending:
Date: 3rd Jun 2009
Title: DBLP: How the data get in
Speaker: Dr Michael Ley. University of Trier, Germany
Time & Location: 14:15, Lecture Theatre 1.4, Kilburn Building
Abstract: The DBLP (Digital Bibliography & Library Project) Computer Science Bibliography now includes more than 1.2 million bibliographic records. For Computer Science researchers the DBLP web site now is a popular tool to trace the work of colleagues and to retrieve bibliographic details when composing the lists of references for new papers. Ranking and profiling of persons, institutions, journals, or conferences is another usage of DBLP. Many scientists are aware of this and want their publications being listed as complete as possible.
The talk focuses on the data acquisition workflow for DBLP. To get ‘clean’ basic bibliographic information for scientific publications remains a chaotic puzzle.
Large publishers are either not interested to cooperate with open services like DBLP, or their policy is very inconsistent. In most cases they are not able or not willing to deliver basic data required for DBLP in a direct way, but they encourage us to crawl their Web sites. This indirection has two main problems:
- The organisation and appearance of Web sites changes from time to time, this forces a reimplementation of information extraction scripts. 
- In many cases manual steps are necessary to get ‘complete’ bibliographic information.
For many small information sources it is not worthwhile to develop information extraction scripts. Data acquisition is done manually. There is an amazing variety of small but interesting journals, conferences and workshops in Computer Science which are not under the umbrella of ACM, IEEE, Springer, Elsevier etc. How they get it often is decided very pragmatically.
The goal of the talk and my visit to Manchester is to start a discussion process: The EasyChair conference management system developed by Andrei Voronkov and DBLP are parts of scientific publication workflow. They should be connected for mutual benefit?
- Lincoln Stein (2002). Creating a bioinformatics nation: screen scraping is torture Nature, 417 (6885), 119-120 DOI: 10.1038/417119a