Title: Large scale text processing made simple by GXP make: A Unixish way to parallel workflow processing
Date-time: Monday, 15 June 2009 at 11:00 AM
Location: Room MLG.001, mib.ac.uk
In the first part of this talk, I will introduce a simple tool called GXP make. GXP is a general purpose parallel shell (a process launcher) for multicore machines, unmanaged clusters accessed via SSH, clusters or supercomputers managed by batch scheduler, distributed machines, or any mixture thereof. GXP make is a ‘make‘ execution engine that executes regular UNIX makefiles in parallel. Make, though typically used for software builds, is in fact a general framework to concisely describe workflows constituting sequential commands. Installation of GXP requires no root privileges and needs to be done only on the user’s home machine. GXP make easily scales to more than 1,000 CPU cores. The net result is that GXP make allows an easy migration of workflows from serial environments to clusters and to distributed environments. In the second part, I will talk about our experiences on running a complex text processing workflow developed by Natural Language Processing (NLP) experts. It is an entire workflow that processes MEDLINE abstracts with deep NLP tools (e.g., Enju parser ) to generate search indices of MEDIE, a semantic retrieval engine for MEDLINE. It was originally described in Makefile without a particular provision to parallel processing, yet GXP make was able to run it on clusters with almost no changes to the original Makefile. Time for processing abstracts published in a single day was reduced from approximately eight hours (with a single machine) to twenty minutes with a trivial amount of efforts. A larger scale experiment of processing all abstracts published so far and remaining challenges will also be presented.
- Miyao, Y., Sagae, K., Saetre, R., Matsuzaki, T., & Tsujii, J. (2008). Evaluating contributions of natural language parsers to protein-protein interaction extraction Bioinformatics, 25 (3), 394-400 DOI: 10.1093/bioinformatics/btn631