17 Oct Reproducible phylogenetics part 1; why
We are still largely missing the benefits of reproducibility in phylogenetics. I think that this makes our lives unnecessarily difficult and makes us particularly poorly prepared to confront modern data-rich phylogenetics. In this first post “Why” I want to talk about why we need reproducible phylogenetics. Then, in part two, “What“, I’m going to talk about some possible approaches to reproducible phylogenetics. In part three, “How“, I’m going to look at some existing software solutions. Lastly, in “ReproPhylo“, I’m going to write about the work my lab is doing to bring these approaches together to create new reproducible phylogenetics solutions.
tldr; Reproducibility helps us do better phylogenetics and do it more easily. There are a number of partial solutions out there. We introduce the ReproPhylo framework for easier + better reproducible experimental phylogenetics.
Why do we need reproducible phylogenetics?
Three short answers, then I explain below.
Answer 1: to do better quality science. This is achieved by being able to build on and extend other people’s work. It is also achieved by being able to take an experimental approach to phylogenetic methodologies.
Answer 2: to make your life much easier. The person most likely to reproduce your work is “future you”, making it easy to reproduce, and then modify, your analysis will save you lots of time. It will help you to do more actual research and less reformatting of files and coaxing belligerent applications to read them.
Answer 3: If your work isn’t fully reproducible is it really science? Sure its nice work that clarifies some important issues, you’re a bright person and its likely correct, but if its not reproducible …. what the hell are you thinking? Is this why you got into science? To accept stories from other scientists based on them being bright and it sounding right? You are much more sceptical than that about the science you read, so shouldn’t people also be sceptical of your work. Yeah, exactly.
Standing on the shoulders of giants
The ability to extend the work of others, to stand on their ‘shoulders’  and reach higher, is how progress is made. “Wow, if I just added these species to that tree, used their analytical approach, I could actually test [whatever]”. But can you add species and use that approach? Or do you have to start from scratch, collecting sequences from GenBank and trying to reproduce their work before extending it?
How much do you want your work to have impact going forward? Make it easy for people to extend your work and you will be influential.
This refers to approaches where we test the influence of method, parameter choice, and data inclusion on our tree structure. How many studies have you seen where people explore parameter space exhaustively and explicitly compare the phylogenies produced. Not many I would guess. Any? The reason is that it is too difficult to experiment when manual approaches to phylogenetics are used. Have you ever experimented with alignment parameters in your MSA program of choice? Most phylogeneticists usually only run the defaults, you can check the Methods sections of papers to confirm this. If I have to align sequences with 6 different programs, each with 50 different combinations of parameters, and then compare some characteristics of the 300 alignments and resulting trees this will be a truly mammoth piece of work. If however I have an entirely reproducible pipeline that will iterate over parameter space and produce a detailed report with clear summaries of alignment characteristics and tree variability then this becomes not an exceptional piece of work but just something I would typically do before getting down to detailed analysis of the question at hand. If a reviewer or critic thinks I have chosen the wrong range of parameters to optimise they can simply add others and hit RUN on my pipeline to compare to my optimum values. The robustness of science improves.
Who will actually reproduce my work?
It could be anyone. What if Professor Big loves your work and wants to extend it, great! But the reality is that the person who will certainly need to reproduce and extend your work is Future You! Make life easy for yourself by starting out reproducibly, anyone else calling you a giant and wanting to stand on your shoulders is a bonus.
What makes you think phylogenetics isn’t pretty much OK now?
Long and painful experience makes me think that. Why don’t you try reproducing 3 phylogenetics results from papers and then your perspective will have changed. Can you get their data easily, or is it a list of Genbank Identifiers in a supplementary Word table that you then have to type in to NCBI website? Can you run their software? If its an old paper can you even find their software? Do you know the dependencies to run it? Do you know what version they ran? Do you know the parameters they used? Are the default parameters now the same as then? Did they exactly record data transformations? Maybe they changed sequence names between the original files and the tree figure. Maybe some taxa were excluded as problematic. Maybe the alignment was manually improved. Maybe some alignment regions were masked. All that is fine, but do you know exactly what they did and how? Did they archive the final tree or only a picture of it? This would only allow you compare by visual inspection to see if you have reproduced a previous study. It is estimated that >60% of phylogenetics studies are ‘lost to science’ . This is a problem.
What is Reproducibility?
I’m not going to cover the semantic differences between reproducibility, replication, repeatability etc. Here I take a practical view of reproducibility as a term used routinely to represent the above terms. I really like this video of Carole Goble explaining the concepts of reproducible research.
Reproducibility is the correct way to do science.
Reproducibility is so integral to what we consider the scientific process that it is hard even to make a counter case here, so I won’t really try. So why isn’t reproducibility the norm? Well a technically poor form of reproducibility is the norm, the Methods section of the journal article. Later in this series I suggest that technical challenges have prevented complete and efficient reproducibility in the past, it hasn’t been your fault, but now those challenges are pretty much solved (part3; How) and we should grasp and benefit from these new possibilities.
 Standing on the shoulders of giants. Wikipedia. Available: http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants.
 Magee AF, May MR, Moore BR: The Dawn of Open Access to Phylogenetic Data. arXiv [q-bioPE] 2014. http://arxiv.org/abs/1405.6623