09 Nov Reproducible phylogenetics part 2a; what
In part 1 of this series I wrote about why we need reproducible phylogenetics; here I write about what we actually need to do.
We need only a few classes of things to make our work reproducible: open reusable archiving of all data, information provenance, and recording of data treatments and software environments. Many of the necessary approaches are already routinely used in computational sciences, but rarely in phylogenetics, which is a shame.
Reproducibility is something that needs to be considered from the start of your work, in the same way that you would incorporate controls from the very start of your experimental design for a wet lab experiment. Perhaps make a written reproducibility plan alongside, and integrated with, your experimental design?
What do we need to make phylogenetics reproducible?
I think the only things we really need are: data provenance, recording of data treatments, archiving of software with recorded versions and settings, and public archiving of all data and outputs.
This is a surprisingly short list, yet I think that an experiment implementing all of it could be fully reproducible. I have seen other lists for descriptions of phylogenetic experiments (Leebens-Mack et al. 2006), and for data sharing (Cranston et al. 2014), but I think that although these types of lists are OK, sometimes excellent, they don’t reflect a mature discipline that has learned adequately from the computational sciences.
Open reusable data archiving
Recently this has become pretty easy. Journals accept enormous amounts of supplementary data. We have repositories like the excellent DataDryad and FigShare. We also have GitHub which, although not a repository per se, can make quite a good one. Can you link to your data? Is it permanently archived in replicated repositories? Is it easy to use? My personal preference is FigShare, but you should investigate for yourself if you’re not routinely placing your research data somewhere sensible.
Data provenance
What if we had a computational environment where, whenever we did something to a data file to generate an output (e.g. build a sequence alignment, construct a phylogeny), exactly which data had generated which output was recorded for us accurately, in the background, without need for our intervention or use of a complex file naming scheme? This is data provenance: the management and tracking of information, recording which file provided information for which output. These tools already exist, and are used in other fields, but not usually in phylogenetics.
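To make the idea concrete, here is a minimal sketch of what a provenance record might contain. It is only an illustration, not one of the existing tools mentioned above. The function names (`sha256_of`, `record_provenance`), the log file name and the toy FASTA content are all assumptions made for this example. Each entry links the checksums of the input files to the checksums of the outputs they produced, so file names stop mattering.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_provenance(step, inputs, outputs, log_path):
    """Append one entry linking input files to the outputs they produced."""
    entry = {
        "step": step,
        "time": datetime.now(timezone.utc).isoformat(),
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "outputs": {str(p): sha256_of(p) for p in outputs},
    }
    # One JSON object per line, so the log can simply grow with each step.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


# Demo in a throwaway directory with made-up files (hypothetical names).
workdir = Path(tempfile.mkdtemp())
seqs = workdir / "sequences.fasta"
aln = workdir / "alignment.fasta"
seqs.write_text(">taxon_a\nACGT\n>taxon_b\nACGA\n")
# Stand-in for the output of a real alignment program.
aln.write_text(">taxon_a\nACGT\n>taxon_b\nACGA\n")

entry = record_provenance("align", [seqs], [aln], workdir / "provenance.jsonl")
print(json.dumps(entry, indent=2))
```

In a real workflow the call to `record_provenance` would sit inside the code that launches each analysis, so that the record is written without the researcher having to remember it.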
Recording data treatments
What if we had a computer environment that not only tracked all data files but also every change made to those data files (even if you use the same file name)? This would allow you to return to any version of the file made at any point in time, and your experimental record would specify exactly what file version was used. This is the essence of version control, a truly powerful tool for managing data that is ubiquitous in the computational sciences, though most biologists have barely heard of it.
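The core idea can be shown with a toy sketch. This is not how you should do version control in practice (a real system such as Git does far more), and the `snapshot` and `restore` helpers and file names are hypothetical, invented for illustration. Each version of a file is stored under the hash of its contents, so an earlier version can be recovered even though the file name never changed.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def snapshot(path, store):
    """Copy the current contents of `path` into `store`, named by content hash."""
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    target = store / digest
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(path, target)
    return digest


def restore(digest, store, destination):
    """Write the stored version identified by `digest` back to `destination`."""
    shutil.copyfile(Path(store) / digest, destination)


# Demo in a throwaway directory with made-up content.
workdir = Path(tempfile.mkdtemp())
store = workdir / "versions"
alignment = workdir / "alignment.fasta"

alignment.write_text(">taxon_a\nACGT\n")
v1 = snapshot(alignment, store)

alignment.write_text(">taxon_a\nAC-T\n")  # same file name, new content
v2 = snapshot(alignment, store)

restore(v1, store, alignment)  # go back to the first version
print(alignment.read_text())
```

The identifiers `v1` and `v2` are what an experimental record would cite: they pin down exactly which version of the file an analysis used.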
Recording data treatments can be done manually, by writing down each parameter for each analysis program, but what if absolutely all steps for all programs were automatically recorded without you needing to do anything? This will happen by default if your analysis is scripted, i.e. carried out in a computer pipeline, because the script itself is a complete record of every program call and every setting. A computer pipeline is so called because it carries information the way a physical pipeline carries water, connecting different places. Pipelining the flow of data between different analysis programs is seen as normal practice in many numerical disciplines, and while the use of scripts is increasing in phylogenetics, pipelines are still too rare.
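As a rough sketch of what a scripted analysis looks like, the example below runs two steps in sequence and logs the exact command line used for each. The `run_step` helper and the log file name are assumptions for this illustration, and the two commands are harmless stand-ins (they just call the Python interpreter to print some text) where a real pipeline would call an aligner and a tree-building program.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(name, command, log_path):
    """Run one pipeline step and append its exact command line to a log."""
    # check=True stops the pipeline if the program exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    record = {"step": name, "command": command, "returncode": result.returncode}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return result.stdout


log_path = Path(tempfile.mkdtemp()) / "pipeline_log.jsonl"

# Stand-ins for real programs: each just prints a line of text.
align_cmd = [sys.executable, "-c", "print('aligned 2 sequences')"]
tree_cmd = [sys.executable, "-c", "print('(taxon_a,taxon_b);')"]

run_step("align", align_cmd, log_path)
newick = run_step("build_tree", tree_cmd, log_path)

# The log now holds every command, with every argument, in order.
records = [json.loads(line) for line in log_path.read_text().splitlines()]
for record in records:
    print(record["step"], record["command"])
```

Because every parameter is an argument in the script, rerunning the script repeats the analysis, and the log records what was actually run.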
Recording of software environments
What if we had a computer environment where we could just press “save” and everything would be saved, yes everything? We would not be expected to make a full list of all the software used, with their versions. We would not have to record our operating system, and we would not have to find all the dependencies for programs and scripts to run properly. Our operating system, including all the programs we had used at every stage of the analysis, would be archived with a single command. This whole environment could then be archived as an experimental record, with a DOI, allowing other scientists to download and open up this ‘image’ of our machine and carry on where we left off. These “virtual machine images” exist, and are commonly used, but not often in phylogenetics.
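A virtual machine image captures everything, and the sketch below does not: it is a much lighter substitute, shown only to illustrate the kind of information an image preserves automatically. Assuming an analysis run from Python, it records the operating system, the interpreter version and the versions of all installed Python packages using only the standard library. The `describe_environment` function name is an invention for this example, and programs installed outside Python would not be listed.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Collect OS, Python version and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


environment = describe_environment()
print(json.dumps(environment, indent=2))
```

Saving this output alongside the results gives a reader a starting point for rebuilding the environment, though unlike a machine image it does not let them simply open it and carry on.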
I am suggesting that most of the issues surrounding reproducible phylogenetics are solved problems in other disciplines. The things that are still challenging are not about achieving reproducibility but about achieving it easily, irrespective of computational experience, such that reproducibility becomes the default behaviour.
Next I write about some problems and the importance of reusability in addition to reproducibility.
Cranston K, Harmon LJ, O’Leary MA, Lisle C: Best practices for data sharing in phylogenetic research. PLoS Curr 2014, 6.
Leebens-Mack J, Vision T, Brenner E, Bowers JE, Cannon S, Clement MJ, Cunningham CW, dePamphilis C, deSalle R, Doyle JJ, Eisen JA, Gu X, Harshman J, Jansen RK, Kellogg EA, Koonin EV, Mishler BD, Philippe H, Pires JC, Qiu Y-L, Rhee SY, Sjölander K, Soltis DE, Soltis PS, Stevenson DW, Wall K, Warnow T, Zmasek C: Taking the first steps towards a standard for reporting on phylogenies: Minimum Information About a Phylogenetic Analysis (MIAPA). OMICS 2006, 10:231–237.