Michael Galperin | O'Really?

January 18, 2008

One Thousand Databases High (and rising)

Filed under: informatics,Uncategorized — Duncan Hull @ 12:36 pm
Tags: bioinformatics, data tombs, database, DNA mania, Michael Galperin, NAR, Usama Fayyad

Well it’s that time of year again. The 15th annual stamp collecting edition of the journal Nucleic Acids Research (NAR), also known as the 2008 Database issue [1], was published earlier this week. This year there are 1078 databases listed in the collection, 110 more than the previous one (see Figure 1). As we pass the one thousand databases mark (1kDB) I wonder, what proportion of the data in these databases will never be used?

R.I.P. Biological Data?

It seems highly likely that lots of this data is stored in what Usama Fayyad at Yahoo! Research! Laboratories! calls data tombs [2], because as he puts it:

“Our ability to capture and store data has far outpaced our ability to process and utilise it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again.”

Like last year, let’s illustrate the growth with an obligatory graph (see Figure 1).

Figure 1: Data growth: the ability to capture and store biological data has far outpaced our ability to understand it. The vertical axis is the number of databases listed in Nucleic Acids Research [1]; the horizontal axis is the year. (Picture drawn with the Google Charts API, which is OK but, as Alf points out, doesn’t do error bars yet.)
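For anyone who wants the numbers behind the graph, here is a minimal Python sketch that tabulates year-on-year growth. Only three data points are used, and they are assumptions in the following sense: 1078 is quoted in this post, 968 comes from last year's post, and the 2006 figure is not quoted anywhere on this page but inferred by subtracting the 110 that last year's post reports. The function name `yearly_growth` is made up for illustration.

```python
# Database counts taken from the two posts on this page.
# The 2006 value is inferred (968 - 110), not quoted directly.
counts = {2006: 968 - 110, 2007: 968, 2008: 1078}


def yearly_growth(counts):
    """Return {year: (absolute increase, percentage increase)} for consecutive years."""
    years = sorted(counts)
    growth = {}
    for prev, cur in zip(years, years[1:]):
        delta = counts[cur] - counts[prev]
        growth[cur] = (delta, 100.0 * delta / counts[prev])
    return growth


for year, (delta, pct) in yearly_growth(counts).items():
    print(f"{year}: +{delta} databases ({pct:.1f}% up on the previous year)")
```

The absolute increase is the same in both years, so the relative growth rate is actually falling slightly, even as the total passes one thousand.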

Another day, another dollar, er, database

Does it matter that large quantities of this data will probably never be used? How could you find out how much of the data, and which data, was “write-only”? Will Biologists ever catch up with the Physicists when it comes to Very Large stamp collections, er, Databases? Biological databases are pretty big, but can you imagine handling up to 1,500 megabytes of data per second for ten years, as the Physicists will soon be doing? You can already hear the (arrogant?) Physicists taunting the Biologists: “my database is bigger than yours”. So there.
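To get a feel for that figure, here is a back-of-the-envelope sketch in Python. It assumes the peak rate of 1,500 megabytes per second is sustained around the clock for the full ten years (which is certainly an overestimate, since no instrument runs non-stop), an average year of 365.25 days, and decimal units (1 petabyte = 10^9 megabytes). The function name `total_petabytes` is purely illustrative.

```python
# Assumption: an average year of 365.25 days, with data recorded non-stop.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def total_petabytes(mb_per_second, years):
    """Upper-bound data volume in petabytes for a constant rate over a number of years."""
    total_mb = mb_per_second * SECONDS_PER_YEAR * years
    return total_mb / 1e9  # decimal units: 1 PB = 10^9 MB


print(f"{total_petabytes(1500, 10):.0f} PB")
```

Under those assumptions the upper bound comes out at several hundred petabytes, which is the scale the taunt is about.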

Whichever of these databases you are using, happy data mining in 2008. If you are lucky, the data tombs you are working in will contain hidden treasure that will make you famous and/or rich. Maybe. As any stamp collector will tell you, some stamps can become very valuable. There’s gold in them there hills, er, databases, you know…

References

1. Galperin, M. Y. (2007). The molecular biology database collection: 2008 update. Nucleic Acids Research, Vol. 36, Database issue, pages D2-D4. DOI:10.1093/nar/gkm1037
2. Fayyad, U. and Uthurusamy, R. (2002). Evolving data into mining solutions for insights. Communications of the ACM, 45(8):28-31. DOI:10.1145/545151.545174
This post originally published on nodalpoint (with comments)
Stamp collectors picture, top right, thanks to daxiang stef / stef yau

January 5, 2007

NAR Database Issue 2007: Not Waving But Drowning?

Filed under: Uncategorized — Duncan Hull @ 10:43 pm
Tags: bioinformatics, data tombs, database, Lincoln Stein, Michael Galperin, NAR, Not waving but drowning, Open Access, OUP, Stevie Smith

The 14th annual Nucleic Acids Research (NAR) database issue, for 2007, has just been published, open-access. This year’s issue is the largest yet (again), with 968 molecular biology databases listed, 110 more than the previous one (see figure below). In the world of biological databases, are we waving or drowning?

Nine hundred and sixty eight is a lot of databases, and even that mind-boggling number is not an exhaustive tally. But is counting all these databases waving or drowning [1]? Will we ever stop stamp-collecting the databases and tools we have in molecular biology? What prompted this question is that an employee of The Boeing Company once told me they have given up counting their databases because there were just too many. Just think of all the databases of design and technical documentation that accompanies the myriad different aircraft that Boeing manufacture, like the iconic 747 jumbo jet. Now combine that with all the supply chain, customer and employee information and you can begin to imagine the data deluge that a large multinational corporation has to handle.

Like Boeing, in Biology we’ve clearly got more data than we know what to do with [2,3]. It won’t be news to bioinformaticians, and it’s been said many times before, but it’s worth repeating here:

We know how many databases we have, but we don’t know what a lot of the data in these databases means: think of all those mystery proteins of unknown function. It will obviously take time until we understand it all…
Most of the data only begins to make sense when it is integrated or mashed-up with other data. However, we still don’t know how to integrate all these databases, or as Lincoln Stein puts it, “so far their integration has proved problematic” [4], which is a bit of an understatement. Many grandiose schemes for the “integration” of biological databases have been proposed over the years, but unfortunately none have been practical to the point of implementation [5].

Despite this, it is still useful to know how many molecular biology databases there are. At least we know how many databases we are drowning in. Thankfully, unlike at Boeing, most biological data, algorithms and tools are open-source, and more literature is becoming open access, which will hopefully make progress more rapid. But biology is more complicated than a Boeing 747, so we’ve got a long-haul flight ahead of us. OK, I’ve managed to completely overstretch that aerospace analogy now so I’ll stop there.

Whatever databases you’ll be using in 2007, have a Happy New Year mining, exploring and understanding the data they contain, not drowning in it.

References

1. Stevie Smith (1957) Not waving but drowning
2. Michael Galperin (2007) The Molecular Biology Database Collection: 2007 update. Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1008
3. Alex Bateman (2007) Editorial: What makes a good database? Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1051
4. Lincoln Stein (2003) Biological Database Integration. Nature Reviews Genetics, 4 (5), 337-45. DOI:10.1038/nrg1065
5. Michael Ashburner (2006) Keynote at the Pacific Symposium on Biocomputing (PSB2006) in Hawaii. See also: Aloha: Biocomputing in Hawaii
This post originally published on nodalpoint with comments

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
