5 star open phylogenetic data

23 Nov

I’ve recently come across the idea of stars for open data quality thanks to Steve Moss. The table below is from 5stardata:

★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context


How does this relate to phylogenetic data? Here is my suggestion for a star system for phylogenetic data (anyone want to suggest changes?):

★ publish a picture of your tree in a journal article
★★ make sequence alignment, tree & metadata available in supplementary data with the paper
★★★ as 2 star but save as XML (e.g. NeXML, PhyloXML) in supplementary data with the paper
★★★★ as 3 star but place open access NeXML file on Figshare or Dryad with URIs
★★★★★ as 4star and link your data to other data to provide context


1 star: Surely we are past the point where people do not archive their Newick tree file? Or am I being too optimistic?

2 star: This seems to be the current standard. Metadata often means a table in a Word doc or an Excel spreadsheet. Unfortunately, these complex and fragile proprietary file formats create a barrier to machine reading of the data. A simple CSV file would be much better, and you could still open it in Excel if you really insist. Surely open access publication is a prerequisite for 2 stars?
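To illustrate why plain CSV is easy for machines to read, here is a minimal Python sketch using only the standard library. The column names and sample rows are hypothetical, made up purely for illustration; a real metadata table would have whatever fields your study needs.

```python
import csv
import io

# Hypothetical sample metadata; the column names are illustrative assumptions.
rows = [
    {"sample_id": "S1", "species": "Homo sapiens", "ncbi_taxon_id": "9606"},
    {"sample_id": "S2", "species": "Mus musculus", "ncbi_taxon_id": "10090"},
]


def write_metadata_csv(rows, handle):
    """Write a list of dicts as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def read_metadata_csv(handle):
    """Read CSV back into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


# Round trip through an in-memory buffer; a real file opened with
# open("metadata.csv", "w", newline="") would work the same way.
buffer = io.StringIO()
write_metadata_csv(rows, buffer)
buffer.seek(0)
round_trip = read_metadata_csv(buffer)
```

Nothing beyond the standard library is needed to read the table back, which is the point: no proprietary parser stands between the data and the next person's script.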

3 star: Want to increase your star rating? This would be an easy step to take for most people. Many good programs now support rich new standard formats like NeXML and PhyloXML, and we should hassle the authors of software that does not.
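As a rough idea of what moving from Newick to XML involves, here is a small Python sketch that parses a simple Newick string and writes PhyloXML-style nested clade elements. It is an illustration under several assumptions: the parser handles only plain labels and branch lengths (no quoted names, comments or support values), the tree is assumed to be rooted, the function names are my own, and the output has not been validated against the PhyloXML schema. For real work, an established library or tree program is the better choice.

```python
import xml.etree.ElementTree as ET


def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Assumption: no quoted labels, comments or support values.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def read_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            while True:
                node["children"].append(read_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # skip ")"
                break
        # Optional label
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node["name"] = text[start:pos].strip()
        # Optional branch length after ":"
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return read_node()


def to_clade(node):
    """Convert a parsed node into a <clade> element, recursively."""
    clade = ET.Element("clade")
    if node["name"]:
        ET.SubElement(clade, "name").text = node["name"]
    if node["length"] is not None:
        ET.SubElement(clade, "branch_length").text = str(node["length"])
    for child in node["children"]:
        clade.append(to_clade(child))
    return clade


def newick_to_phyloxml(text):
    """Return a PhyloXML-style string (not schema-validated)."""
    root = ET.Element("phyloxml", xmlns="http://www.phyloxml.org")
    # Assumption: the input tree is rooted.
    phylogeny = ET.SubElement(root, "phylogeny", rooted="true")
    phylogeny.append(to_clade(parse_newick(text)))
    return ET.tostring(root, encoding="unicode")


xml_text = newick_to_phyloxml("((A:0.1,B:0.2):0.05,C:0.3);")
```

The gain over Newick is that every value sits in a named element, so extra annotations can be attached to a clade without inventing a new bracket syntax.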

4 star: Again, this is an easy win. Make sure your data is open access, machine findable and machine readable. Figshare is ridiculously powerful and easy to work with. Your files (all of them) can be bulk uploaded. You will get a repository DOI link to quote in your manuscript and share with people. Individual files have DOI links too.

5 star: This is more difficult to do well. Some of this may have been achieved by use of XML files, but how much? I have a lot to learn here about linked data. Having files that use the NCBI taxon IDs and official gene names allows automatic link-outs to be created. XML files can do exactly this. But how well does this work? The potential of linked data is also bigger than this, and I have more reading to do. I like Tim Berners-Lee’s bag of chips (crisps!) analogy.
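As a small example of the kind of automatic link-out that taxon IDs make possible, here is a Python sketch that turns NCBI taxon IDs into NCBI Taxonomy Browser URLs. The function name and the sample records are my own assumptions for illustration; only the IDs themselves (9606 for Homo sapiens, 10090 for Mus musculus) and the browser address are real.

```python
from urllib.parse import urlencode

# Public NCBI Taxonomy Browser endpoint.
NCBI_TAXONOMY_BROWSER = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi"


def taxon_link(taxon_id):
    """Build a Taxonomy Browser URL for a numeric NCBI taxon ID."""
    query = urlencode({"id": int(taxon_id)})  # int() rejects non-numeric IDs
    return f"{NCBI_TAXONOMY_BROWSER}?{query}"


# Hypothetical tip labels paired with their NCBI taxon IDs.
tips = {"Homo sapiens": "9606", "Mus musculus": "10090"}
links = {name: taxon_link(tid) for name, tid in tips.items()}
```

A free-text species name cannot be linked this reliably, because spelling variants and synonyms break the lookup; a stable identifier does not have that problem.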


The idea of data stars originated with Tim Berners-Lee, and there are nice descriptions of the system on 5stardata.info and a YouTube “bag of chips” video of TBL explaining many of the ideas.

Edit: I liked the idea of starting the list at zero, no stars, because (A) that’s how computer languages count and (B) you don’t deserve any stars at all for just putting a picture of your data in a publication. But it seemed too petty.


