data mining | O'Really?

September 9, 2014

Punning with the Pub in PubMed: Are there any decent NCBI puns left? #PubMedPuns

Filed under: data mining,Googleology,paperware,publishing,Science,technology — Duncan Hull @ 10:31 am
Tags: Casey Bergman, defrosting, Elizabeth Gibney, gastropub, google scholar, Johanna McEntyre, Karsten Hokamp, Ken Wolfe, Leo Chalupa, Mark Gerstein, Phil Bourne, portmanteau, pubbit, Pubble, PubBrawl, pubby, PubCast, pubchase, pubclean, PubCrawl, pubcrawler, pubfetch, PubFig, PubFight, pubgames, publican, PubLick, PubLons, publunch, Publy, PubManteau, PubMatch, pubmed, PubMedication, PubMine, pubnet, pubpeer, PubQuiz, PubSCIENCE, pubsearch, PubSnacks, PubSnax, PubSoft, PubSort, Pubsy, Richard van Noorden, RSS, text mining, twitter, twitterbot

PubMedication: do you get your best ideas in the Pub? CC-BY-ND image via trombone65 on Flickr.

Many people claim they get all their best ideas in the pub, but for lots of scientists their best ideas probably come from PubMed.gov – the NCBI’s monster database of biomedical literature. Consequently, the database has spawned a whole slew of tools that riff off the PubMed name with puns and portmanteaus (aka “PubManteaus”), and pub-based wordplay is especially common. [1,2]

All of this might make you wonder, are there any decent PubMed puns left? Here’s an incomplete collection:

PubCrawler pubcrawler.ie “goes to the library while you go to the pub…” [3,4]
PubChase pubchase.com is a “life sciences and medical literature recommendations engine. Search smarter, organize, and discover the articles most important to you.” [5]
PubCast scivee.tv/pubcasts allow users to “enliven articles and help drive more views” (to PubMed) [6]
PubFig has nothing to do with PubMed, but it is research on face and image recognition that happens to be indexed by PubMed. [7]
PubGet pubget.com is a “comprehensive source for science PDFs, including everything you’d find in Medline.” [8]
PubLons publons.com OK, not much to do with PubMed directly, but PubLons helps “you record, showcase, and verify all your peer review activity.”
PubMine “supports intelligent knowledge discovery” [9]
PubNet pubnet.gersteinlab.org is a “web-based tool that extracts several types of relationships returned by PubMed queries and maps them into networks” aka a publication network graph utility. [10]
GastroPub repackages and re-sells ordinary PubMed content disguised as high-end luxury data at a higher premium, similar to a Gastropub.
PubQuiz is either the new name for NCBI database search www.ncbi.nlm.nih.gov/gquery or a quiz where you’re only allowed to use PubMed to answer questions.
PubSearch & PubFetch allow users to “store literature, keyword, and gene information in a relational database, index the literature with keywords and gene names, and provide a Web user interface for annotating the genes from experimental data found in the associated literature” [11]
PubScience is either “peer-reviewed drinking” courtesy of pubsci.co.uk or an ambitious publishing project tragically axed by the U.S. Department of Energy (DoE). [12,13]
PubSub is anything that makes use of the publish–subscribe pattern, such as NCBI feeds. [14]
PubLick, as far as I can see, hasn’t been used yet, unless you count this @publick on twitter. If anyone was launching a startup, working in the area of “licking” the tastiest data out of PubMed, that could be a great name for their data-mining business. Alternatively, it could be a catchy new nickname for PubMedCentral (PMC) or Europe PubMedCentral (EuropePMC) [15] – names which don’t exactly trip off the tongue. Since PMC is a free digital archive of publicly accessible full-text scholarly articles, PubLick seems like an appropriate moniker.

PubLick Cat got all the PubMed cream. CC-BY image via dizznbonn on flickr.

There are probably lots more PubMed puns and portmanteaus out there just waiting to be used. Pubby, Pubsy, PubLican, Pubble, Pubbit, Publy, PubSoft, PubSort, PubBrawl, PubMatch, PubGames, PubGuide, PubWisdom, PubTalk, PubChat, PubShare, PubGrub, PubSnacks and PubLunch could all work. If you know of any other decent (or dodgy) PubMed puns, leave them in the comments below and go and build a scientific twitterbot or cool tool using the same name, if you haven’t already.

References

1. Lu Z. (2011). PubMed and beyond: a survey of web tools for searching biomedical literature. Database: The Journal of Biological Databases and Curation, http://pubmed.gov/21245076
2. Hull D., Pettifer S.R. & Kell D.B. (2008). Defrosting the digital library: bibliographic tools for the next generation web. PLOS Computational Biology, http://pubmed.gov/18974831
3. Hokamp K. & Wolfe K.H. (2004). PubCrawler: keeping up comfortably with PubMed and GenBank. Nucleic Acids Research, http://pubmed.gov/15215341
4. Hokamp K. & Wolfe K. (1999). What’s new in the library? What’s new in GenBank? let PubCrawler tell you. Trends in Genetics, http://pubmed.gov/10529811
5. Gibney E. (2014). How to tame the flood of literature. Nature, 513 (7516) http://pubmed.gov/25186906
6. Bourne P. & Chalupa L. (2008). A new approach to scientific dissemination. Materials Today, 11 (6) 48-48. DOI:10.1016/s1369-7021(08)70131-7
7. Kumar N., Berg A., Belhumeur P.N. & Nayar S. (2011). Describable Visual Attributes for Face Verification and Image Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, http://pubmed.gov/21383395
8. Featherstone R. & Hersey D. (2010). The quest for full text: an in-depth examination of Pubget for medical searchers. Medical Reference Services Quarterly, 29 (4) 307-319. http://pubmed.gov/21058175
9. Kim T.K., Wan-Sup Cho, Gun Hwan Ko, Sanghyuk Lee & Bo Kyeng Hou (2011). PubMine: An Ontology-Based Text Mining System for Deducing Relationships among Biological Entities. Interdisciplinary Bio Central, 3 (2) 1-6. DOI:10.4051/ibc.2011.3.2.0007
10. Douglas S.M., Montelione G.T. & Gerstein M. (2005). PubNet: a flexible system for visualizing literature derived networks. Genome Biology, http://pubmed.gov/16168087
11. Yoo D., Xu I., Berardini T.Z., Rhee S.Y., Narayanasamy V. & Twigger S. (2006). PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature. Current Protocols in Bioinformatics, http://pubmed.gov/18428773
12. Seife C. (2002). Electronic publishing. DOE cites competition in killing PubSCIENCE. Science (New York, N.Y.), 297 (5585) 1257-1259. http://pubmed.gov/12193762
13. Jensen M. (2003). Another loss in the privatisation war: PubScience. Lancet, 361 (9354) 274. http://pubmed.gov/12559859
14. Dubuque E.M. (2011). Automating academic literature searches with RSS Feeds and Google Reader(™). Behavior Analysis in Practice, 4 (1) http://pubmed.gov/22532905
15. McEntyre J.R., Ananiadou S., Andrews S., Black W.J., Boulderstone R., Buttery P., Chaplin D., Chevuru S., Cobley N., Coleman L.A. et al. (2010). UKPMC: a full text article resource for the life sciences. Nucleic Acids Research, http://pubmed.gov/21062818

http://twitter.com/Richvn/status/509370496375619584

@dullhunk @McDawg @PubChase @Publick @pubget @Richvn @LizzieGibney @NatureNews nobody written PubCrawl yet? Shame

— Bob O'Hara (@BobOHara) September 9, 2014

http://twitter.com/i_am_kilpatrick/status/509275423415738368


April 1, 2014

The Serene Scientists Serenity Prayer via Jon Butterworth

Filed under: data mining,engineering — Duncan Hull @ 2:38 pm
Tags: big data, Giles Fraser, Jon Butterworth, linked data, Open Data, Reinhold Niebuhr, religion, Science, serenity prayer, UCL, wikidata

The Church of Banksy

Whatever your religious preferences, the Serenity Prayer by Reinhold Niebuhr captures a certain wisdom about life in general. So it is good to see that physicist Jon Butterworth at UCL has adapted it [1] for scientists:

“Give me grace to accept with serenity the things that cannot be understood,

Data to investigate the things which can be understood,

And the Wisdom to know the difference.”

Amen!

References

Jon Butterworth (2014). Giles Fraser says scientists are replacing theologians. Some thoughts on that. The Gruaniad, 2014-03-31


August 3, 2012

Where did all the BBC programme metadata go? The Infax catalogue online

Filed under: data mining,engineering,informatics,publishing,search,technology — Duncan Hull @ 8:25 am
Tags: Alan Turing, Andrius Butkus, Ant Miller, BBC, BBC Sport, BBC2012, Ben Hammersley, big data, Big Metadata, Danny Boyle, data, Dave Beckett, David Rogers, Gary Lineker, Gavin Bell, Huw Edwards, infax, Jake Humphrey, Jennifer Ehle, John Cooper Clarke, Karen Loasby, linked data, LinkedIn, Matt Biddulph, metadata, Michael Petersen, Mishal Husain, Open Data, Radio, Radio Times, Rowan Atkinson, Sue Barker, television, Tom Coates, Tom Loosemore, TomskiTomski, TV-Anytime

bbc.co.uk/programmes as a QR Code by /Sizemore/ Mike Atherton on Flickr available under a creative commons licence

Over at @BBCSport and @BBC2012 there are some Olympian feats of big data wrestling going on behind the scenes for London 2012 [1]. While we all enjoy the Olympics on a range of platforms and devices, a team of twenty engineers is busy making it all happen. It’s great that the BBC, unlike other large organisations, can talk openly about their technology and share hard-won knowledge widely.

Back in 2006 the BBC published another impressive application that allowed users to search and browse over 75 years of programme data. The application was built from metadata: not the actual audio and visual data from TV and radio, but the data about that data, namely information about the programmes held in an internal database known as Infax [2,3].

The web app was published at open.bbc.co.uk/catalogue and built by a crack team of experts led by Matt Biddulph @MattB (and including Tom Coates @TomCoates, Ben Hammersley @BenHammersley, Gavin Bell @zzGavin, Tom Loosemore @TomskiTomski and several others – see comments below).

It allowed users to find weird and wonderful things. For example, you could browse all the programmes that Alan Turing or Albert Einstein had appeared in or search for all the programmes with Jennifer Ehle. You could query it as well, to list all episodes of Dr Who in the order they were aired. It wasn’t so much Big Data as Big Metadata, [4,5] potentially useful for improving the viewing and listening experience of the audience.

At the time of its launch, Dave Beckett @dajobe blogged about it, Matt Biddulph wrote some release notes, Tom Loosemore said a few words, backstage clocked it and I scribbled some notes too. Being a proof-of-concept “experimental prototype”, the app eventually disappeared into the great bit bucket in the sky. The only visible trace of the catalogue today is the blog posts above and the message below which greets you when you visit the site:

“Thank you for your continued interest in the BBC Programme Catalogue. The BBC is now looking into how this data can be incorporated into its programme information pages.”

You can still get some BBC programme metadata from bbc.co.uk/programmes and bbc.co.uk/archive. Every programme has publicly available metadata, but only a fraction of what was in the open catalogue. Although the app has gone, lots of the data must still be there somewhere. Take, for example, the opening ceremony for the Olympic Games…

The metadata that is currently available

The metadata for each BBC programme can be found via its own page, so the opening ceremony programme has metadata available in xml and rdf which tells you several things including this synopsis:

“Coverage of the opening ceremony, which officially starts at 9.00, with the eyes of the world focused on the Olympic Stadium as the 30th Olympiad is officially declared open by Her Majesty the Queen. Film director Danny Boyle is set to produce a stunning cultural show ahead of the athletes’ parade, during which over 200 countries are expected to be represented. This is followed by the official opening, the arrival of the torch and the lighting of the cauldron.”

The metadata also tells you that this particular programme was presented by Sue Barker, Huw Edwards, Gary Lineker, Jake Humphrey and Mishal Husain. The Executive Producer was Paul Davies and there’s a bunch of other stuff: date of first broadcast, links to related information and clips but that’s about it.

The metadata that used to be available

The great thing about the open catalogue was that it went into lots more detail than above. So, for the Olympics ceremony, the participants in the programme would have been listed as Danny Boyle, Daniel Craig, Thomas Heatherwick, Elizabeth II, Paul McCartney, Rowan Atkinson, Bradley Wiggins, Kenneth Branagh, Steve Redgrave, J.K. Rowling and so on. For each contributor, you could see what other programmes they had been involved in, not just recently broadcast ones, but those going back 75 years. You could also see who had collaborated with whom, when their first broadcast was and so on. It didn’t just document the über-famous people either; it went into just as much detail about other people you might not necessarily have heard of, like Frank Cottrell Boyce, Callum Airlie and Jordan Duckitt. It was great stuff, but neither the archive nor the current programmes pages seem to have this level of detail.
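The per-contributor lookups described above are easy to picture as a small data structure. What follows is only a rough sketch in Python: the record layout, the helper names `build_contributor_index` and `first_collaborations`, and the placeholder programmes and contributors are all made up for illustration, and say nothing about how Infax or the open catalogue actually stored things.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

# Hypothetical programme records; titles, dates and names are placeholders,
# not real catalogue entries.
PROGRAMMES = [
    {"title": "Programme A", "first_broadcast": date(1975, 3, 1),
     "contributors": ["Alice", "Bob"]},
    {"title": "Programme B", "first_broadcast": date(1990, 6, 15),
     "contributors": ["Bob", "Carol"]},
    {"title": "Programme C", "first_broadcast": date(2012, 7, 27),
     "contributors": ["Alice", "Bob", "Carol"]},
]


def build_contributor_index(programmes):
    """Map each contributor to their programme titles, oldest broadcast first."""
    index = defaultdict(list)
    for prog in sorted(programmes, key=lambda p: p["first_broadcast"]):
        for name in prog["contributors"]:
            index[name].append(prog["title"])
    return dict(index)


def first_collaborations(programmes):
    """Map each pair of contributors to the date they first appeared together."""
    first_seen = {}
    for prog in sorted(programmes, key=lambda p: p["first_broadcast"]):
        for pair in combinations(sorted(prog["contributors"]), 2):
            # setdefault keeps the earliest date because we iterate oldest first.
            first_seen.setdefault(pair, prog["first_broadcast"])
    return first_seen
```

With an index like this, "everything a contributor has appeared in" is a dictionary lookup, and "who worked with whom, and since when" falls out of the pair table.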

Meta-conclusions

It’s a bit of a mystery where all the lovely BBC metadata went. It’s probably just sitting on some servers somewhere, inaccessible to the outside world. With my licence-fee-payer hat on, this seems a bit of a waste. I’ve asked everyone I know, including people at the Beeb, but have drawn a blank. Most have shrugged their shoulders and pointed to the useful but slightly impoverished /programmes and /archive, which is why I’m writing this post on t’interwebs.

Maybe the Olympic task of curating all that data makes it unsustainable. Perhaps somebody decided there is no point competing with Wikipedia, where wiki-nerds curate programme data for free? It’s possible you can’t justify serving big metadata without giving the actual data (programmes) too? Maybe there’s a shiny new application in the pipeline to replace the catalogue, currently being worked on, or an upgrade to @ArchiveAtBBC & @programmes so they include much more data. Could there be issues with publishing this kind of personal data on the web which meant the whole thing got canned? Nasty copyright issues could probably sink a project like this too. Who knows…

Does anyone reading this know the answers? If you do, I’d love to hear from you.

[Update: In October 2014 the Big British Castle launched their BBC Genome project at genome.ch.bbc.co.uk, which covers much of the same data as Infax, from 1923 to 2009. There’s a pretty decent Wikipedia article on the BBC programme catalogue and its ancestors, including Infax.]

References

1. David Rogers (2012). Building the Olympic Data Services, BBC Internet blog
2. Ant Miller (2012). Opening up the Archives: Finding content using metadata from Infax catalogue and Radio Times, BBC Research & Development blog
3. Ant Miller (2012). Opening up the Archives: New kinds of metadata, BBC Research & Development blog
4. Karen Loasby (2006). Changing approaches to metadata at bbc.co.uk: From chaos to control and then letting go again, Bulletin of the American Society for Information Science and Technology, 33 (1) 26. DOI: 10.1002/bult.2006.1720330109
5. Andrius Butkus and Michael Petersen (2007). Semantic Modelling Using TV-Anytime Genre Metadata, Lecture Notes in Computer Science, 4471 234. DOI: 10.1007/978-3-540-72559-6_24

@dullhunk its v tricky issue.@bbcbackstage tried EVERYTHING to get infax back,can say that @ArchiveAtBBC @BBCRD @InsideBBCFM @adew @meeware

— Ian | #blacklivesmatter | @cubicgarden@mas.to (@cubicgarden) August 3, 2012

@cubicgarden @dullhunk @archiveatbbc @bbcrd infax was a tool for an internal job, not fit for public. Alternatives being worked on, promise!

— meeware (@meeware) August 3, 2012

@dullhunk have left a comment on your BBC Infax post http://t.co/EGw7O56t cc. @dajobe @mattb @benhammersley @ldodds @tomcoates @zzgavin

— Tom Loosemore (@tomskitomski) August 3, 2012

@dullhunk @tomskitomski There's stuff on infax that you'd use internally but couldn't be made public. It just doesn't have the checks on it.

— Tom Coates (@tomcoates) August 3, 2012

Comments (5)

April 2, 2012

Open Data Manchester: Twenty Four Hour Data People

Filed under: cycling,data mining,Science,technology — Duncan Hull @ 12:38 pm
Tags: AHRC, API, Birmingham, bolton, Boris Johnson, Bury, cornerhouse, data, data.gov, e-government, flickr, Francis Maude, Future Everything, geek, greater manchester, hackathon, Howard Bernstein, Isabelle Croissant, Jennifer Molloy, Jennifer Pahlka, Julian Tait, London, madlab, Manchester, MDDA, MediaCityUK, Mike Whitby, NESTA, OKFN, Oldham, omniversity, Open Data, Paul Dobson, Rachel Leech, Robert Zoellick, Rochdale, salford, Sean Ryder, stockport, Tameside, tfgm, Trafford, Wigan

Shaun Ryder at the Hacienda by Tangerine Dream on flickr

Shaun Ryder, the original twenty-four hour Manchester party person of the Happy Mondays, spins the discs at the Wickerman festival in 2008. Creative commons licensed image via Tangerine Dream on flickr.com

According to Francis Maude, Open Data is the raw material for the “next industrial revolution”. Now you should obviously take everything politicians say with a large pinch of salt (especially Maude) but despite the political hyperbole, when it comes to data he is onto something.

According to Wikipedia, which is considerably more reliable than politicians, Open Data is:

“the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.”

Open Data is slowly having an impact in the world of science [1] and also in wider society. Initiatives like data.gov in the U.S. and data.gov.uk in the UK, also known as e-government or government 2.0, have put huge amounts of data in the public domain and there is plenty more data in the pipeline. All of this data makes novel applications possible, like cycling injury maps showing accident black spots, and many others just like it.

To discuss the current status of Open Data in Greater Manchester there were two events last week:

The Open Data Manchester meetup “24 hour data people” [2] at the Manchester Digital Laboratory (“madlab”), which recently made BBC headlines with the DIY bio project
The Discover Open Data event at the Cornerhouse cinema

Here is a brief and incomplete summary of what went on at these events:

(more…)


February 15, 2012

The Open Access Irony Awards: Naming and shaming them

Filed under: data mining,publishing,Science,technology — Duncan Hull @ 11:23 am
Tags: BioMed Central, Bruce Alberts, cameron neylon, Casey Bergman, citeulike, DOAJ, Eefke Smit, Elias Zerhouni, Elsevier, Evilsevier, flickr, foldit, hypocrisy, irony, Jihyun Kim, Jocelyn Kaiser, Joe Dunckley, Jonathan Eisen, Josh Sommer, Keith Epstein, Macmillan, Mark Wolpert, Matthew Cockerill, Mendeley, NIH, Open Access, PLoS, Sage Bionetworks, Springer, Stephen Curry, Stephen Friend, Wiley-Blackwell

Open Access (OA) publishing aims to make the results of scientific research available to the widest possible audience. Scientific papers that are published in Open Access journals are freely available for crucial data mining and for anyone or anything to read, wherever they may be.

In the last ten years, the Open Access movement has made huge progress in allowing:

“any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers.”

But there is still a long way to go, as much of the world’s scientific knowledge remains locked up behind publishers’ paywalls, unavailable for re-use by text-mining software and inaccessible to the public, who often funded the research through taxation.

Openly ironic?

Ironically, some of the papers that are inaccessible discuss or even champion the very Open Access movement itself. Sometimes the lack of access is deliberate, other times accidental – but the consequences are serious. Whether deliberate or accidental, restricted access to public scientific knowledge is slowing scientific progress [1]. Sometimes the best way to make a serious point is to have a laugh and joke about it. This is what the Open Access Irony Awards do: by gathering all the offenders in one place, we can laugh and make a serious point at the same time by naming and shaming the papers in question.

To get the ball rolling, here are some examples:

The Lancet owned by Evilsevier, sorry I mean Elsevier, recently published a paper on “the case for open data” [2] (please login to access article). Login?! Not very open…
Serial offender and über-journal Science has an article by Elias Zerhouni on the NIH public access policy [3] (Subscribe/Join AAAS to View Full Text), another on “making data maximally available” [4] (Subscribe/Join AAAS to View Full Text) and another on a high profile advocate of open science [5] (Buy Access to This Article to View Full Text) Irony of ironies.
From Nature Publishing Group comes a fascinating paper about harnessing the wisdom of the crowds to predict protein structures [6]. Not only have members of the tax-paying public funded this work, they actually did some of the work too! But unfortunately they have to pay to see the paper describing their results. Ironic? Also, another published in Nature Medicine proclaims the “delay in sharing research data is costing lives” [1] (instant access only $32!)
From the British Medical Journal (BMJ) comes the worrying news of dodgy American laws that will lock up valuable scientific data behind paywalls [7] (please subscribe or pay below). Ironic? *
The “green” road to Open Access publishing involves authors uploading their manuscript to self-archive the data in some kind of public repository. But there are many social, political and technical barriers to this, and they have been well documented [8]. You could find out about them in this paper [8], but it appears that the author hasn’t self-archived the paper or taken the “gold” road and published in an Open Access journal. Ironic?
Last, but not least, it would be interesting to know what commercial publishers make of all this text-mining magic in Science [9], but we would have to pay $24 to find out. Ironic?

These are just a small selection from amongst many. If you would like to nominate a paper for an Open Access Irony Award, simply post it to the group on Citeulike or group on Mendeley. Please feel free to start your own group elsewhere if you’re not on Citeulike or Mendeley. The name of this award probably originated from an idea by Jonathan Eisen, picked up by Joe Dunckley and Matthew Cockerill at BioMed Central (see tweet below). So thanks to them for the inspiration.

"The delay in sharing research data is costing lives" @NatureMedicine must win #oa irony award (via @ste http://twitpic.com/297m0v

— Matthew Cockerill (@opentechmatt) July 27, 2010

For added ironic amusement, take a screenshot of the offending article and post it to the Flickr group. Sometimes the shame is too much, and articles are retrospectively made open access so a screenshot will preserve the irony.

Join us in poking fun at the crazy business of academic publishing, while making a serious point about the lack of Open Access to scientific data.

References

1. Sommer, Josh (2010). The delay in sharing research data is costing lives. Nature Medicine, 16 (7), 744-744 DOI: 10.1038/nm0710-744
2. Boulton, G., Rawlins, M., Vallance, P., & Walport, M. (2011). Science as a public enterprise: the case for open data. The Lancet, 377 (9778), 1633-1635 DOI: 10.1016/S0140-6736(11)60647-8
3. Zerhouni, Elias (2004). Information Access: NIH Public Access Policy. Science, 306 (5703), 1895-1895 DOI: 10.1126/science.1106929
4. Hanson, B., Sugden, A., & Alberts, B. (2011). Making Data Maximally Available. Science, 331 (6018), 649-649 DOI: 10.1126/science.1203354
5. Kaiser, Jocelyn (2012). Profile of Stephen Friend at Sage Bionetworks: The Visionary. Science, 335 (6069), 651-653 DOI: 10.1126/science.335.6069.651
6. Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., Popović, Z., & players, F. (2010). Predicting protein structures with a multiplayer online game. Nature, 466 (7307), 756-760 DOI: 10.1038/nature09304
7. Epstein, Keith (2012). Scientists are urged to oppose new US legislation that will put studies behind a pay wall. BMJ, 344 (jan17 3) DOI: 10.1136/bmj.e452
8. Kim, Jihyun (2010). Faculty self-archiving: Motivations and barriers. Journal of the American Society for Information Science and Technology DOI: 10.1002/asi.21336
9. Smit, Eefke, & Van Der Graaf, M. (2012). Journal article mining: the scholarly publishers’ perspective. Learned Publishing, 25 (1), 35-46 DOI: 10.1087/20120106

[CC licensed picture “ask me about open access” by mollyali.]

* Please note, some research articles in BMJ are available by Open Access, but news articles like [7] are not. Thanks to Trish Groves at BMJ for bringing this to my attention after this blog post was published. Also, some “articles” here are in a grey area for open access, particularly “journalistic” stuff like news, editorials and correspondence, as pointed out by Becky Furlong. See tweets below…

MT “@dullhunk Open Access Irony Awards: http://t.co/VlBS4eZ4 @bmcmatt @phylogenomics @PLoS @steinsky #openaccess” <- #BMJ research is OA!

— Trish Groves (@trished) February 19, 2012

@dullhunk I entirely agree with concept of #OA irony but many of the examples were journalistic – perhaps a bit unfair?

— Becky Furlong (@becky_furlong) February 20, 2012

@dullhunk @becky_furlong @trished just don't see why any journal would *choose* to put an editorial about increased access behind a paywall

— Matthew Cockerill (@opentechmatt) February 21, 2012

@dullhunk @becky_furlong @trished it's not that it's inconsistent or morally wrong or anything… It's just, er, ironic…

— Matthew Cockerill (@opentechmatt) February 21, 2012

Comments (23)

December 17, 2010

Planet Facebook

Filed under: data mining — Duncan Hull @ 1:59 pm
Tags: Apache Hive, facebook, farcebook, narcissism, paul butler, refusenik, self-esteem, Soraya Mehdizadeh

Whatever your views on Facebook [1], you can’t deny that from space, “Planet Facebook” looks rather intriguing. The wonderful diagram below of Facebook connections has been made by Paul Butler. Even miserable Facebook refuseniks (like me) can’t help but go “ooh that’s pretty” while marvelling at the masterful use of the R language to construct this beautiful map…

References

John H. Tucker (2010). Status update: “I’m so glamorous”. A study of facebook users shows how narcissism and low self-esteem can be interrelated. Scientific American, 303 (5) PMID: 21033279, see also original research by Soraya Mehdizadeh at DOI:10.1089/cyber.2009.0257

Comments (2)

September 1, 2010

How many unique papers are there in Mendeley?

Filed under: data mining,publishing — Duncan Hull @ 10:17 am
Tags: buggotea, citeulike, connotea, DOI, Elsevier, identity, identity crisis, Jan Reichelt, last.fm, Mendeley, More Popular than Jesus, normalisation, PMID, pubmed, redundancy, scopus, Thomson, thomson-reuters, Victor Henning, wok

Mendeley is a handy piece of desktop and web software for managing and sharing research papers [1]. This popular tool has been getting a lot of attention lately, and with some impressive statistics it’s not difficult to see why. At the time of writing Mendeley claims to have over 36 million papers, added by just under half a million users working at more than 10,000 research institutions around the world. That’s impressive considering the startup company behind it have only been going for a few years. The major established commercial players in the field of bibliographic databases (WoK and Scopus) currently have around 40 million documents, so if Mendeley continues to grow at this rate, they’ll be more popular than Jesus (and Elsevier and Thomson) before you can say “bibliography”. But to get a real handle on how big Mendeley is we need to know how many of those 36 million documents are unique because if there are lots of duplicated documents then it will affect the overall head count. (more…)
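One crude way to get a feel for the duplication problem is to normalise whatever identifiers each record carries and count the distinct keys. The Python below is only a sketch under loud assumptions: the `doi`, `pmid` and `title` field names and the helper functions are hypothetical, and this is not how Mendeley (or anyone else) actually deduplicates its catalogue.

```python
import re


def normalise_doi(doi):
    """Lower-case a DOI and strip common prefixes (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def record_key(record):
    """Pick the best available identity key for a (hypothetical) record dict."""
    if record.get("doi"):
        return ("doi", normalise_doi(record["doi"]))
    if record.get("pmid"):
        return ("pmid", str(record["pmid"]).strip())
    # Last resort: the title with case, punctuation and spacing removed.
    title = record.get("title", "").lower()
    return ("title", re.sub(r"[^a-z0-9]", "", title))


def count_unique(records):
    """Count distinct identity keys across all records."""
    return len({record_key(r) for r in records})
```

Note the obvious weakness: a paper saved once with only a DOI and once with only a PMID still counts twice, because nothing here links the two identifiers. Real catalogues need some form of cross-referencing or fuzzy matching on top of this, which is exactly where the head count gets slippery.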

Comments (29)

July 27, 2010

Twenty million papers in PubMed: a triumph or a tragedy?

Filed under: data mining,publishing,web — Duncan Hull @ 3:37 pm
Tags: 20 million, Alon Halevy, Anna Kushnir, Barack Obama, bibliography, cameron neylon, database, discovery deficit, Entrez, Fernando Pereira, filter failure, information overload, ISI WOK, Least Publishable Unit, Medline, MESH, NCBI, Neil Smalheiser, ontology, Open Researcher & Contributor ID, ORCID, PageRank, Peter Norvig, prozac, publish or perish, pubmed, PubMed Central, pubmed tragedies, PubMed triumphs, PubSCIENCE, Rezarta Islamaj, ROFL, scopus, tragedy, triumph, Vetle Torvik

A quick search on pubmed.gov today reveals that the freely available American database of biomedical literature has just passed the 20 million citations mark*. Should we celebrate or commiserate passing this landmark figure? Is it a triumph or a tragedy that PubMed® is the size it is? (more…)

Comments (29)

June 22, 2010

Impact Factor Boxing 2010

[This post is part of an ongoing series about impact factors. See this post for the latest impact factors published in 2012.]

Roll up, roll up, ladies and gentlemen, Impact Factor Boxing is here again. As with last year (2009), the metrics used in this combat sport are already a year out of date. But this doesn’t stop many people from writing about impact factors and it’s been an interesting year [1] for the metrics used by many to judge the relative value of scientific work. The Public Library of Science (PLoS) launched their article level metrics within the last year following the example of BioMedCentral’s “most viewed” articles feature. Next to these new style metrics, the traditional impact factors live on, despite their limitations. Critics like Harold Varmus have recently pointed out that (quote):

“The impact factor is a completely flawed metric and it’s a source of a lot of unhappiness in the scientific community. Evaluating someone’s scientific productivity by looking at the number of papers they published in journals with impact factors over a certain level is poisonous to the system. A couple of folks are acting as gatekeepers to the distribution of information, and this is a very bad system. It really slows progress by keeping ideas and experiments out of the public domain until reviewers have been satisfied and authors are allowed to get their paper into the journal that they feel will advance their career.”

To be fair though, it’s not the metric that is flawed, more the way it is used (and abused) – a subject covered in much detail in a special issue of Nature at http://nature.com/metrics [2,3,4,5]. It’s much harder than it should be to get hold of these metrics, so I’ve reproduced some data below (fair use? I don’t know, I am not a lawyer…) to minimise the considerable frustrations of using Journal Citation Reports (JCR).
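For anyone who has forgotten what is actually being measured, the standard two-year impact factor is just a ratio. The symbols here are introduced purely for illustration: for a journal J, C is the number of citations received in 2009 by items J published in 2007 and 2008, and N is the number of citable items J published in a given year.

```latex
\mathrm{IF}_{2009}(J) = \frac{C_{2009}(J;\,2007,\,2008)}{N_{2007}(J) + N_{2008}(J)}
```

So a hypothetical journal that published 100 citable items across 2007 and 2008, and whose items from those two years were cited 250 times during 2009, would score 250 / 100 = 2.5. The 5-year impact factor follows the same pattern with a five-year publication window.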

Love them, loathe them, use them, abuse them, ignore them or obsess over them … here’s a small selection of the 7347 journals that are tracked in JCR ordered by increasing impact.

Journal Title	2009 data from isiknowledge.com/JCR						Eigenfactor™ Metrics
Journal Title	Total Cites	Impact Factor	5-Year Impact Factor	Immediacy Index	Articles	Cited Half-life	Eigenfactor™ Score	Article Influence™ Score
RSC Integrative Biology	34			0.596	57		0.00000
Communications of the ACM	13853	2.346	3.050	0.350	177	>10.0	0.01411	0.866
IEEE Intelligent Systems	2214	3.144	3.594	0.333	33	6.5	0.00447	0.763
Journal of Web Semantics	651	3.412		0.107	28	4.6	0.00222
BMC Bioinformatics	10850	3.428	4.108	0.581	651	3.4	0.07335	1.516
Journal of Molecular Biology	69710	3.871	4.303	0.993	916	9.2	0.21679	2.051
Journal of Chemical Information and Modeling	8973	3.882	3.631	0.695	266	5.9	0.01943	0.772
Journal of the American Medical Informatics Association (JAMIA)	4183	3.974	5.199	0.705	105	5.7	0.01366	1.585
PLoS ONE	20466	4.351	4.383	0.582	4263	1.7	0.16373	1.918
OUP Bioinformatics	36932	4.926	6.271	0.733	677	5.2	0.16661	2.370
Biochemical Journal	50632	5.155	4.365	1.262	455	>10.0	0.10896	1.787
BMC Biology	1152	5.636		0.702	84	2.7	0.00997
PLoS Computational Biology	4674	5.759	6.429	0.786	365	2.5	0.04369	3.080
Genome Biology	12688	6.626	7.593	1.075	186	4.8	0.08005	3.586
Trends in Biotechnology	8118	6.909	8.588	1.407	81	6.4	0.02402	2.665
Briefings in Bioinformatics	2898	7.329	16.146	1.109	55	5.3	0.01928	5.887
Nucleic Acids Research	95799	7.479	7.279	1.635	1070	6.5	0.37108	2.963
PNAS	451386	9.432	10.312	1.805	3765	7.6	1.68111	4.857
PLoS Biology	15699	12.916	14.798	2.692	195	3.5	0.17630	8.623
Nature Biotechnology	31564	29.495	27.620	5.408	103	5.7	0.14503	11.803
Science	444643	29.747	31.052	6.531	897	8.8	1.52580	16.570
Cell	153972	31.152	32.628	6.825	359	8.7	0.70117	20.150
Nature	483039	34.480	32.906	8.209	866	8.9	1.74951	18.054
New England Journal of Medicine	216752	47.050	51.410	14.557	352	7.5	0.67401	19.870
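
As a rough illustration of why the choice of metric matters, here is a minimal Python sketch that ranks a handful of rows copied from the table above, first by Impact Factor and then by Article Influence™ Score. The function names (`load`, `rank_by`) and the shortened column set are my own assumptions for illustration, not anything provided by JCR; empty cells are treated as "not reported" and skipped.

```python
import csv
import io

# A few rows copied from the table above (tab-separated). Empty cells are
# metrics that are not reported for that journal in the table.
ROWS = (
    "Journal Title\tImpact Factor\tArticle Influence Score\n"
    "RSC Integrative Biology\t\t\n"
    "Science\t29.747\t16.570\n"
    "Cell\t31.152\t20.150\n"
    "Nature\t34.480\t18.054\n"
    "New England Journal of Medicine\t47.050\t19.870\n"
)


def to_float(cell):
    """Return the cell as a float, or None when the cell is empty."""
    return float(cell) if cell.strip() else None


def load(text):
    """Parse the tab-separated text into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "journal": row["Journal Title"],
            "impact_factor": to_float(row["Impact Factor"]),
            "article_influence": to_float(row["Article Influence Score"]),
        }
        for row in reader
    ]


def rank_by(journals, key):
    """Journal titles sorted by the given metric, highest first.

    Journals with no value for the metric are left out of the ranking.
    """
    scored = [j for j in journals if j[key] is not None]
    ordered = sorted(scored, key=lambda j: j[key], reverse=True)
    return [j["journal"] for j in ordered]


journals = load(ROWS)
print(rank_by(journals, "impact_factor"))
print(rank_by(journals, "article_influence"))
```

On these five rows the two rankings disagree at the top: the New England Journal of Medicine leads on Impact Factor, while Cell leads on Article Influence™ Score.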

Maybe next year Thomson Reuters, who publish this data, could start attaching large government health warnings (like on cigarette packets) and long disclaimers to this data? WARNING: Abusing these figures can seriously damage your Science – you have been warned!

References

Rizkallah, J., & Sin, D. (2010). Integrative Approach to Quality Assessment of Medical Journals Using Impact Factor, Eigenfactor, and Article Influence Scores. PLoS ONE, 5 (4). DOI: 10.1371/journal.pone.0010204
Abbott, A., Cyranoski, D., Jones, N., Maher, B., Schiermeier, Q., & Van Noorden, R. (2010). Metrics: Do metrics matter? Nature, 465 (7300), 860-862. DOI: 10.1038/465860a
Van Noorden, R. (2010). Metrics: A profusion of measures. Nature, 465 (7300), 864-866. DOI: 10.1038/465864a
Braun, T., Osterloh, M., West, J., Rohn, J., Pendlebury, D., Bergstrom, C., & Frey, B. (2010). How to improve the use of metrics. Nature, 465 (7300), 870-872. DOI: 10.1038/465870a
Lane, J. (2010). Let’s make science metrics more scientific. Nature, 464 (7288), 488-489. DOI: 10.1038/464488a

[Creative Commons licensed picture of Golden Gloves Prelim Bouts by Kate Gardiner]


April 30, 2010

Daniel Cohen on The Social Life of Digital Libraries

Filed under: biocuration,data mining,publishing — Duncan Hull @ 7:12 am
Tags: Arcadia, Cambridge, citeulike, Clare College, connotea, dancohen, Daniel Cohen, defrosting the digital library, digital library, Firefox, First Monday, George Mason University, GMU, John Naughton, Mekentosj, Mendeley, refworks, scholarometer, Zotero

Daniel Cohen is giving a talk in Cambridge today on The Social Life of Digital Libraries, abstract below:

The digitization of libraries had a clear initial goal: to permit anyone to read the contents of collections anywhere and anytime. But universal access is only the beginning of what may happen to libraries and researchers in the digital age. Because machines as well as humans have access to the same online collections, a complex web of interactions is emerging. Digital libraries are now engaging in online relationships with other libraries, with scholars, and with software, often without the knowledge of those who maintain the libraries, and in unexpected ways. These digital relationships open new avenues for discovery, analysis, and collaboration.

Daniel J. Cohen is an Associate Professor at George Mason University and has been involved in the development of the Zotero extension for the Firefox browser that enables users to manage bibliographic data while doing online research. Zotero [1] is one of many new tools [2] that are attempting to add a social dimension to scholarly information on the Web, so this should be an interesting talk.

If you’d like to come, the talk starts at 6pm in Clare College, Cambridge, and you need to RSVP by email via the talks.cam.ac.uk page.

References

Cohen, D.J. (2008). Creating scholarly tools and resources for the digital ecosystem: Building connections in the Zotero project. First Monday, 13 (8)
Hull, D., Pettifer, S., & Kell, D. (2008). Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web. PLoS Computational Biology, 4 (10). DOI: 10.1371/journal.pcbi.1000204

