NLP | O'Really?

June 10, 2009

Kenjiro Taura on Parallel Workflows

Filed under: informatics,seminars — Duncan Hull @ 7:24 am
Tags: bioinformatics, dmake, dsh, EC2, enju, falkon, Globus, gluepy, GXP, gxp make, Kenjiro Taura, make, makefile, MEDIE, Medline, nactem, NLP, pdsh, pubmed, qmake, ssh, taktuk, University of Tokyo, unixish, workflow

Kenjiro Taura is visiting Manchester next week from the Department of Information and Communication Engineering at the University of Tokyo. He will be giving a seminar, the details of which are below:

Title: Large scale text processing made simple by GXP make: A Unixish way to parallel workflow processing

Date-time: Monday, 15 June 2009 at 11:00 AM

Location: Room MLG.001, mib.ac.uk

In the first part of this talk, I will introduce a simple tool called GXP make. GXP is a general-purpose parallel shell (a process launcher) for multicore machines, unmanaged clusters accessed via SSH, clusters or supercomputers managed by a batch scheduler, distributed machines, or any mixture thereof. GXP make is a ‘make’ execution engine that executes regular UNIX makefiles in parallel. Make, though typically used for software builds, is in fact a general framework for concisely describing workflows that consist of sequential commands. Installation of GXP requires no root privileges and needs to be done only on the user’s home machine. GXP make easily scales to more than 1,000 CPU cores. The net result is that GXP make allows an easy migration of workflows from serial environments to clusters and to distributed environments.

In the second part, I will talk about our experiences of running a complex text processing workflow developed by Natural Language Processing (NLP) experts. It is an entire workflow that processes MEDLINE abstracts with deep NLP tools (e.g., Enju parser [1]) to generate the search indices of MEDIE, a semantic retrieval engine for MEDLINE. It was originally described in a Makefile with no particular provision for parallel processing, yet GXP make was able to run it on clusters with almost no changes to the original Makefile. The time for processing the abstracts published in a single day was reduced from approximately eight hours (on a single machine) to twenty minutes with a trivial amount of effort. A larger-scale experiment processing all abstracts published so far, and the remaining challenges, will also be presented.
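To make the idea in the abstract a bit more concrete, here is a minimal Python sketch of what "running a makefile in parallel" amounts to: a set of targets with prerequisites, where every target whose prerequisites are finished can be started straight away. This is only an illustration under my own assumptions and not how GXP make is implemented. The function name `run_workflow`, the target names and the use of local threads in place of remote processes are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_workflow(rules, actions, max_workers=4):
    """Run each target as soon as all of its prerequisites have finished.

    rules:   dict mapping target -> list of prerequisite targets
             (assumption: every prerequisite is itself a key of rules)
    actions: dict mapping target -> zero-argument callable (the "recipe")
    Returns the targets in the order in which they completed.
    """
    remaining = {target: set(deps) for target, deps in rules.items()}
    done = set()    # targets that have finished
    order = []      # completion order, for inspection
    running = {}    # future -> target currently executing

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Start every target whose prerequisites are all complete.
            ready = [t for t, deps in remaining.items() if deps <= done]
            for target in ready:
                del remaining[target]
                running[pool.submit(actions[target])] = target

            if not running:
                # Nothing is running and nothing can start.
                raise ValueError("cyclic or unsatisfiable dependencies")

            # Block until at least one running recipe finishes.
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:
                future.result()  # re-raise any error from the recipe
                target = running.pop(future)
                done.add(target)
                order.append(target)
    return order


if __name__ == "__main__":
    # Hypothetical two-document pipeline: parse each file, then build an index.
    rules = {
        "a.parsed": [],
        "b.parsed": [],
        "index": ["a.parsed", "b.parsed"],
    }
    actions = {t: (lambda t=t: print("building", t)) for t in rules}
    print(run_workflow(rules, actions))
```

The two parse steps do not depend on each other, so they may run at the same time and finish in either order, while the index step always runs last. A workflow with thousands of independent per-abstract steps has the same shape, which is presumably why a makefile written with no thought for parallelism can still be spread across many cores.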

References

[1] Miyao, Y., Sagae, K., Saetre, R., Matsuzaki, T., & Tsujii, J. (2008). Evaluating contributions of natural language parsers to protein-protein interaction extraction. Bioinformatics, 25(3), 394-400. DOI: 10.1093/bioinformatics/btn631

October 27, 2006

MEDIE: MEDLINE++

Filed under: informatics — Duncan Hull @ 10:21 pm
Tags: Jun'ichi Tsujii, MEDIE, Medline, MIB, nactem, NLP, nodalpoint, pubmed, software, text mining

MEDIE is an “intelligent” semantic search engine that retrieves biomedical correlations from over 14 million articles in MEDLINE. You can find abstracts and sentences in MEDLINE by specifying the semantics of correlations; for example, What activates tumour suppressor protein p53? So just how useful is MEDIE and is it at the cutting edge?

At the Manchester Interdisciplinary Biocentre (MIB) launch yesterday, Professor Jun’ichi Tsujii gave a presentation on Linking text with knowledge – challenges for Text Mining in Biology. As part of this presentation he gave a demonstration of Medie: an intelligent search engine for Medline. This tool looks quite impressive if you experiment with some sample queries. I wonder what nodalpointers, especially hardened text-miners, natural language processing (NLP) nerds and computational linguists, make of Medie?

[This post was originally published on nodalpoint, with comments]