The Unreasonable Effectiveness of Google

April 17, 2009

Filed under: Googleology — Duncan Hull @ 4:00 pm

Via the Official Google Research Blog at the University of Google, Alon Halevy, Peter Norvig and Fernando Pereira have published an interesting expert opinion piece in the March/April 2009 edition of IEEE Intelligent Systems: computer.org/intelligent. The paper talks about embracing complexity and making use of “the unreasonable effectiveness of data” [1], drawing an analogy with the “unreasonable effectiveness of mathematics” [2]. There is plenty to agree and disagree with in this provocative article, which makes it an entertaining read. So what can we learn from those expert Googlers in the Googleplex?

Well, you should go and read the paper for yourself of course, but here is a summary of my personal likes and dislikes:

Agreeable opinion: yes, yes, yes!

The following opinions in the paper are agreeable:

“The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available.”

This is especially true in biomedical informatics. If I had a dollar (or a €uro) for every time somebody said to me “if only we had more annotated data”, “if only we had a better annotated corpus” or “if only we could get better metadata”, I’d be richer than Larry Page and Sergey Brin put together. Annotation of data and curation of metadata is a slow, painful and expensive process. Often annotation just doesn’t scale, because you need an army of annotators and curators to put various kinds of labels on data. So yes, hoping for annotated data can be a futile approach – we need techniques for using the data we already have. This is where semantics comes into play. The paper goes on to discuss semantics, saying:

“The semantic web is a convention for formal representation languages that lets software services interact with each other “without needing artificial intelligence”….

…The problem of understanding human speech and writing – the semantic interpretation problem – is quite different from the problem of service interoperability…

…Unfortunately the fact that the word “semantic” appears in both “Semantic Web” and “semantic interpretation” means that the two problems have often been conflated, causing needless and endless consternation and confusion”

The definition of the semantic web is a bit incomplete here, because it is about more than just services; it’s about data too. But the fact that people use the word “semantic” to mean completely different things is deeply ironic and very true. The text mining community (see Semantic Enrichment of the Scientific Literature (SESL), for example) uses the word “semantic” in the sense of semantic interpretation; see Andrew Clegg’s notes on day 1 and day 2 of this conference and the David Shotton paper [3] for some examples. The database, ontology and semantic web communities often use the word “semantic” in the context of deductive reasoning, which has quite a different meaning altogether. So yes, this causes endless consternation, and these two different meanings are just the beginning: there are many other meanings of the word “semantic” [4]. Which brings us neatly on to the topic of Ontological Warfare:

“In some domains, competing factions each want to promote their own ontology. This is a problem in diplomacy, not technology.”

Absolutely, most work in this area is about politics, not science or technology. So having agreed with some of what the paper says, I’m now going to disagree with a whole lot more…

Disagreeable opinion: no, no, no!

The following opinions in the paper are disagreeable; see also the reactions from Frank van Harmelen and Stefano Mazzocchi [5,6]. There are echoes of Peter Norvig vs. Tim Berners-Lee at AAAI’06 here:

“The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available.”

Well yes (see above), but “large-scale data” isn’t always available. If you are text-mining the biomedical literature in the PubMed database, for example, you are almost entirely restricted to mining the abstracts, which are just a small fraction of the total data. This is slowly changing, thanks to PubMedCentral, but we’ve still got a long way to go before the large-scale scientific data locked up in journal articles is freely available for unrestricted mining.

Another problem with the advice to make use of available data is that it implies we should give up on annotating data completely, and stop striving for better data because it is futile and hopeless. I’d have to disagree with this: while it might be expensive, difficult and time-consuming to describe and annotate scientific data, this doesn’t mean we shouldn’t bother. A couple of groups making lots of progress with the annotation of scientific data are the SBML.org and Biomodels.net communities. The Biomodels people just had a weekend workshop, which I recently attended; Allyson Lister has blogged this extensively, see day 1 and day 2 for all the gory details. To my mind, SBML and Biomodels are an existence proof that the annotation of data (in this case metabolic models) is achievable, desirable and worth the high cost.

Back to the paper, which goes on to say:

“simple models and a lot of data trump more elaborate models based on less data.”

This is true some of the time, but not all of the time. There should be an “often” in there: “Simple models and a lot of data often trump more elaborate models…”. I can think of lots of science and technology that is based on elaborate models and small amounts of data. In fact, the elaborate-model-based-on-less-data approach is how the Web came to be in the first place.
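For anyone wondering what a “simple model” fed by raw data looks like in practice, here is a minimal sketch in Python. It is my own illustration, not code from the paper: the function names `train_bigrams` and `predict_next` and the four-sentence corpus are made up for the example. The “model” is nothing more than a table of counts of which word follows which, built from unlabeled text with no annotation at all.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count which word follows which; the table of counts is the whole model."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair each word with its successor in the same sentence.
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequently observed next word, or None if the word is unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy, made-up corpus (hypothetical sentences, purely for illustration).
corpus = [
    "the gene encodes a protein",
    "the gene regulates a pathway",
    "the protein binds the receptor",
    "the gene encodes an enzyme",
]

model = train_bigrams(corpus)
print(predict_next(model, "gene"))      # "encodes" follows "gene" most often here
print(predict_next(model, "ontology"))  # never seen in the corpus, so no prediction
```

With four sentences the table has nothing to say about any word it has not seen, which is rather the point: this kind of model only starts to look clever when there is a very large amount of text behind it.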

The paper moves on to talk about the “O” word (ontologies) and makes some pretty sweeping statements, claiming that:

“Ontology writing: The important and easy cases have been done…

bioformats.org defines chromosomes, species, and gene sequences…

…but there’s a long tail of rarely used concepts that are too expensive to formalize with current technology.”

Yes, some important ontologies already exist, but lots of very important cases in the biomedical and chemical domains and elsewhere remain unfinished, or even unstarted. Bioformats is an odd choice here, because it’s not exactly the leading example of a biomedical ontology. What about the Gene Ontology (GO) and the other Open Biomedical Ontologies (OBO)? Finally, yes, there is a long tail of concepts that are too expensive to formalise, but I think many ontologies are nowhere near that long tail yet.

The paper concludes:

“So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.”

That’s easy to say if you’re Google Inc. It’s not so easy for scientists generally. Some of Google’s data comes from (quote) “user queries in search logs”: following patterns in what people type into Google and which results they click on. This is data most scientists just don’t have and, as I mentioned earlier, a lot of other important data is not yet freely available, locked away behind publishers’ paywalls.

So, not every problem can be easily solved by unsupervised learning on unlabeled data. While it is clearly a very powerful and lucrative solution for Google, it is not always so effective for science generally; some have called it “Bad Science” [7]. Still, Bad Science can be entertaining sometimes (good journal club fodder), so go and enjoy the paper.

[More commentary on this post over at friendfeed.]

References

1. Alon Halevy, Peter Norvig and Fernando Pereira (2009) The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, Vol. 24, No. 2, pp. 8-12. DOI:10.1109/MIS.2009.36
2. Eugene Wigner (1960) The Unreasonable Effectiveness of Mathematics in the Natural Sciences. Communications on Pure and Applied Mathematics, Vol. 13, No. 1, pp. 1-14. DOI:10.1002/cpa.3160130102
3. David Shotton, Katie Portwin, Graham Klyne, Alistair Miles (2009) Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article. PLoS Computational Biology, Vol. 5, No. 4, e1000361. DOI:10.1371/journal.pcbi.1000361
4. Michael Uschold (2003) Where are the semantics in the semantic web? AI Magazine, Vol. 24, No. 3, pp. 25-36.
5. Frank van Harmelen (2009) The Unreasonable Effectiveness of Fake Controversies. LarKC Weblog
6. Stefano Mazzocchi (2009) Unreasonable Hypocrisy. Stefano’s Linotype blog
7. Adam Kilgarriff (2007) Googleology is Bad Science. Computational Linguistics, Vol. 33, No. 1, pp. 147-151. DOI:10.1162/coli.2007.33.1.147
8. Massimo Pigliucci (2009) The end of theory in science? EMBO reports, Vol. 10, No. 6, p. 534. DOI:10.1038/embor.2009.111

5 Comments »

There’s something nerdily hilarious about the ambiguous meaning of the word ‘semantic’. Douglas Hofstadter would be proud.

Comment by Andrew Clegg – April 17, 2009 @ 4:37 pm

Thanks for the nice post. The article has been on my to_read list for a little while but I’ve been preoccupied with the whole graduating and becoming unemployed business. You say the article suggests that we give up on developing (I will refrain from using “semantic”) good metadata. Is it really so extreme? After all, Google continues to try to accumulate good annotations for images on the Web (http://images.google.com/imagelabeler/).

Comment by ben – April 17, 2009 @ 6:10 pm

@Andrew yes, I know exactly what you mean (in a nerdy way of course!)

@Ben glad you found this post useful. As for giving up on good metadata, it’s just my interpretation – perhaps I have read too much between the lines. As for semantic, it has become a “dirty word” in certain circles. Tim Berners-Lee’s presentations have gone from mentioning the word “semantic” in every slide, to barely mentioning it at all. See Tim’s TED Talk for an example… it’s an interesting change of language.

Comment by Duncan – April 20, 2009 @ 4:33 pm

[…] authoring An interesting post from Duncan Hull The Unreasonable Effectiveness of Google about the challenges of a semantic web of data. Since I am talking on the Chemical Semantic Web at […]

Pingback by Unilever Centre for Molecular Informatics, Cambridge - Semantic authoring « petermr’s blog – April 20, 2009 @ 10:22 pm

[…] manual annotations. However it does a decent job and I think it’s in line with a recent post by Duncan Hull, where he quotes a paper from […]

Pingback by PubChem Bioassay Annotation Poster at So much to do, so little time – April 21, 2009 @ 1:12 pm

O'Really?