10 Mar How can we ensure the persistence of analysis software?
I’ve been thinking about sustainable and accessible archiving of bioinformatics software, I’m pretty scandalized at the current state of affairs, and had a bit of a complain about it before. I thought I’d post some links to other people’s ideas and talk a bit about the situation and action that is needed right now.
Hilmar Lapp posted on G+ on the similarity of bioinformatics software persistence to the DataDryad archiving policy implemented by a collection of evolutionary biology journals. That policy change is described in a DataDryad blog post here: http://blog.datadryad.org/2011/01/14/journals-implement-data-archiving-policy/ and the policies with links to the journal editorials here http://datadryad.org/pages/jdap
The journal Computers & Geosciences has a code archiving policy and provides author instructions (PDF) for uploading code when the paper is accepted.
So this is all very nice, many people seem to agree its important, but what is actually happening? What can be done? Well Casey has led the way with action rather than just words by forking public GitHub repositories mentioned in article abstracts to BioinformaticsArchive. I really support this but we can’t rely on Casey to manage all this indefinitely, he has (aspirations) to have a life too!
What I would like to see
My thoughts aren’t very novel, others have put forward many of these ideas:
1. A publisher driven version of the Bioinformatics Archive
I would like to see bioinformatics journals taking a lead on this. Not just recommending but actually enforcing software archiving just as they enforce submission of sequence data to GenBank. A snapshot at time of publication is the minimum required. Even in cases where the code is not submitted (bad), an archive of the program binary so it can actually be found and used later is needed. Hosting on authors’ websites just isn’t good enough. There are good studies of how frequently URLs cited in the biomed literature decay with time [cite source=’pubmed’]17238638[/cite] and the same is certainly true for links to software. Use of the standard code repositories is what we should expect for authors, just as we expect submission of sequence data to a standard repository not hosting on the authors’ website.
I think there is great merit to using a GitHub public repository owned by a consortium of publishers and maybe also academic community representatives. Discuss. An advantage of using a version control system like GitHub is that it would apply not too subtle pressure to host code rather than just the binary.
2. Redundancy to ensure persistence in the worst case scenario
Archive persistence and preventing deletion is a topic that needs careful consideration. Casey discusses this extensively; authors must be prevented from deleting the archive either intentionally or accidentally. If the public repository was owned by the journals’ “Bioinformatics Software Archiving Consortium” (I just made up this consortium, unfortunately it doesn’t exist) then authors could not delete the repository. Sure they could delete their own repository, but the fork at the community GitHub would remain. It is the permanent community fork that must be referenced in the manuscript, though a link to the authors’ perhaps more up to date code repository could be included in the archived publication snapshot via a wiki page, or README document.
Perhaps this archive could be mirrored to BitBucket or similar for added redundancy? FigShare and DataDryad could also be used for archiving, although it would be suboptimal re-inventing the wheel for code. I would like to see FigShare and DataDryad guys enter the discussion and offer advice since they are experts at data archiving.
3. The community to initiate actual action
A conversation with the publishers of bioinformatics software needs to be started right now. Even just PLOS, BMC, and Oxford Journals adopting a joint policy would establish a critical mass for bioinformatics software publishing. I think maybe an open letter signed by as many people as possible might convince these publishers. Pressure on Twitter and Google+ would help too, as it always does. Who can think of a cool hashtag? Though if anyone knows journal editors an exploratory email conversation might be very productive too. Technically this is not challenging, Casey did a version himself at BioinformaticsArchive. There is very little if any monetary cost to implementing this. It wouldn’t take long.
But can competing journals really be organised like this? Yes, absolutely for sure, there is clear precedent in the 2011 action of >30 ecology and evolutionary biology journals. Also, forward-looking journals will realize it is their interests to make this happen. By implementing this they will seem more modern and professional by comparison to journals not thinking along these lines. Researchers will see strict archiving policy as a reason to trust publications in those journals as more than just ephemeral vague descriptions. These will become the prestige journals, because ultimately we researchers determine what the good journals are.
So what next? Well I think gathering solid advice on good practice is important, but we also need action. I’d discussions with the relative journals ASAP. I’m really not sure if I’m the best person to do this, and there may be better ways of doing it than just blurting it all out in a blog like this, but we do need action soon. It feels like the days before GenBank, and I think we should be ashamed of maintaining this status quo.