27 Aug Reproducible research in phylogenetics
I’ve been reading a lot recently about reproducible research (RR) in bioinformatics on several blogs, and Google+ and Twitter. The idea is that it is important that someone is easily able to reproduce* your results (and even figures) from your publication using your provided code and data. I’ve been thinking that this is a movement that urgently needs to spread to phylogenetics research- Reproducible Research in Phylogenetics or RRphylo.
The current state of affairs
The problem is that although methods sections of phylogenetics papers are typically fairly clear, and probably provide all the information required to pretty much replicate the work, it would be a very time-consuming process- lots of hands on time, lots of manual data manipulation. Moreover, many of the settings are assumed (defaults) rather than explicitly specified. ‘Have I replicated what they did?’ would then be judged by a qualitative assessment of whether your tree looked like the one published, and thats not really good enough.
Why it matters
I’m assuming here that what is currently published will allow replication, though in some cases it might not, as Ross Mounce described in his blog. I have had experience of this too, but thats another long and painful story. So why does it matter? It matters because of the next step in research. If the result of your work is important you or someone else will probably want to add genes or species to the analysis, vary the parameters of phylogenetic reconstruction, or alignment, or the model of sequence evolution, or reassess the support values, or something else relevant. Unless the process of analysis has been explicitly characterised in a way that allows replication without extensive manual intervention and guesswork then this cannot be achieved. If you want your work to be a foundation for rapid advancement, rather than just a nice observation, then it must be done reproducibly.
What should be done?
Briefly; pipelines, workflows and/or script-based automation. It is quite possible to create a script-based workflow that recreates your analyses in their entirety, perhaps using one of the Open-Bio projects (e.g. BioPerl, BioPython) or DendroPy, or Hal. This would go from sequence download, manipulation and alignment (though this could be replaced by provision of an alignment file), to phylogenetic analysis, to tree annotation and visualisation. Such a script must, by definition, include all the parameters used at each step- and that is mostly why I prefer that it includes sequence retrieval and manipulation rather than an alignment file.
Phylogeneticists are sometimes much less comfortable with scripting than are bioinformaticians, but this is something that has no option but to change. The scale of phylogenomic data now appearing just cannot (and should not) be handled in the GUI packages that have typically enabled tree building (e.g. MEGA, DAMBE, Mesquite).
There is a certain attraction to GUI approaches though and something that may well increase are GUI workflow builders. The figure above is from Armadillo, which looks interesting, but unfortunately doesn’t seem to be able to save workflows in an accessible form, making it an inappropriate way forward. Galaxy is another good example, able to save standard workflows, but not (yet) well provisioned for phylogenetic analyses.
At the moment a script-based approach linking together analyses is the best approach for RRphylo.
‘So you’ve been doing this, right?’
Err no, I’ve been bad, really I have. Some of my publications are very poor from the view of RRphylo. The reasons for that would take too long to go into, I’m not claiming the moral high ground here, but past mistakes do give me an appreciation of both the need and the challenges in implementation. I absolutely see that this is how I will have to do things in future, not just because (I hope) new community standards will demand it, but also because iterating over minor modifications of analysis is how good phylogenetics is done, and that is best implemented in an automated way.
Automated Methods sections for manuscripts?
One interesting idea, that is both convenient and rigorous, is to have the analysis pipeline write the first draft of your Methods section for you. An RRphylo script that fetched sequences from genbank, aligned them, built a phylogeny, and then annotated a tree figure should be able to describe this itself in a human-readable text format suitable for your manuscript.
The full Methods pipeline is archived at doi:12345678 and is described briefly as follows: sequences were downloaded from NCBI nucleotide database (27 August 2012) with the Entrez query cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]. These 5,611 sequences were aligned with MUSCLE v1.2.3 with settings……
This is a brief and made up Methods that doesn’t contain the detail it could. Parameter values from the script have been inserted in blue. This sort of an output could also fit very well with the MIAPA project (Minimum Information About a Phylogenetic Analysis). NB it is not this methods information that needs to be distributed, it is the script that carried out these analyses (and produced this human-readable summary as an extra output) that is the real record of distribution.
This won’t be implemented tomorrow, even if everyone immediately agrees with me that it is really important. It is much easier for most people to just write the same old methods section they always have- a general description of what they did that people in the field will understand. I went today to read a lot of Methods sections from phylogeny papers. Some were better than others in describing the important details, but none sounded relevant to the new era of large scale analysis. They sounded like historical legacies, which of course is true of scientific paper writing in general.
It will take a community embarrassment to effect a change; an embarrassment that even the best papers in the field are still vague, still passing the methodological burden to the next researcher, still amateur compared to modern bioinformatics papers, still ultimately irreproducible.
The major barrier to RRphylo is the need to write scripts, a skill with which many phylogeneticists are unfamiliar and uncomfortable. This may be helped by Galaxy or the like, allowing the easy GUI linking of phylogenetic modules and publication of a standard format workflows to MyExperiment (I think Galaxy is the future, for reasons I won’t go into here). Alternatively, maybe some cutting-edge labs will put together a lot of scripts and user-guides allowing even moderately computer literate phylogeneticists to piece together a reproducible workflow. Hal and DendroPy seem the places to start at present, and I shall have to try them out as soon as I can. Other places for workflows that are worth investigating are Ruffus, Sumatra, and Snakemake. At the moment I’ve done a decent amount of Googling and absolutely no testing so I’d be really interested in other suggestions and views on these options.
I think that Reproducible Research in Phylogenetics is incredibly important. Not just to record exactly what you did, not just to make things easier for the next researcher, but because all science should be fully reproducible- of course it should. I’m coming round to the idea that not implementing RRphylo is equivalent to not releasing your new analysis package and just describing what the program does. But maybe I’m just a lone voice?
Our approach to replication in computational science, C Titus Brown
Reproducible Research: A Bioinformatics Case Study Robert Gentleman 2004
‘Next-generation sequencing data interpretation: enhancing reproducibility and accessibility‘ Nekrutenko & Taylor Nature Reviews Genetics 13, 667-672 (September 2012) | doi:10.1038/nrg3305 (subscription required)
Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Goeks et al Genome Biology 2010, 11:R86 doi:10.1186/gb-2010-11-8-r86 (open access)
Reproducible research with bein, BiocodersHub
* I am not going to distinguish between replication and reproducibility in this blog post. See here. There are differences, but both are needed.