04 Feb How to visualize a phylogeny with thousands of tips?
What abilities should a phylogenetic visualisation tool have? What is important when you have so many tips (OTUs) that the tree is too big to print out or even scroll through on the screen? Several of my research projects fall into this last category. In no particular order, here are some things that seem important to me:
- It should still be “snappy” when dealing with tens of thousands of OTUs. I think it should be standalone not web-based for tasks like this.
- It should be open-source with an active development community. Can we really keep relying on single program authors for development? No.
- It must interact with an associated data file. This data file can be common to a number of trees. It could be parsed from GenBank and keep ALL field data plus user data. This data file is essential for data-driven OTU renaming, searching, collapsing and exporting
- It should collapse OTUs into groups defined in an associated data file and name these groups, i.e. automatically group OTUs into “mammalia”, “rotifera”, “arthropoda”, “diptera”. Collapse and name options could be parsed from GenBank taxonomy. See GRUNT.
- It should be able to collapse nodes automatically to form polytomies. These could be clades below a given support value, or below a certain node length.
- It should be able to reroot, either by the user clicking on an OTU or clade, or by midpoint rooting (the default).
- It should be able to test for monophyly of groups. It could colour these groups accordingly. So if all descendant taxa of a node are called mammalia in the taxonomy file then the group is labeled “mammalia”. If another mammal is found outside the mammalia clade then the group is flagged as non-monophyletic.
- It should let you see both the details and the whole picture. At the very least, click to zoom in and out. An inset showing where in the tree you are, with a clickable interface to go somewhere else, is probably vital. See Rod Page’s ideas on visualisation of large trees on a web page.
- It needs to have search facilities. These should be able to search both the tree and the associated data files, with Boolean logic: find this text string in these fields AND this one in that.
- User-definable tip names. It should be easy to switch between different tip names (taken from the data file), such as accession number, species name, etc. It should be possible to apply rules to this: if this and that, then name the tip like this.
- It must be able to export reliably, in all tree formats, with appropriately considered tip names etc. It must also export graphics, with SVG, PDF, EMF etc. supported. Exported graphics must be available in collapsed format too.
- It should be scriptable. It’s very useful for it to be able to be incorporated into a bioinformatics pipeline. So “program open treefile, collapse according to this datafile and criteria, rename tips according to this, export as SVG”.
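To make a couple of the items above concrete (collapsing poorly supported nodes into polytomies, and testing a group for monophyly), here is a minimal sketch in plain Python. It is not taken from any existing tree viewer: the `Node` class, the function names, the toy tree and the threshold of 50 are all made up for illustration, and a real tool would read the tree from a file rather than build it by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal tree node: tips have a name, internal nodes a support value."""
    name: str = ""
    support: Optional[float] = None
    children: List["Node"] = field(default_factory=list)

    def is_tip(self) -> bool:
        return not self.children


def tips(node: Node) -> List[str]:
    """Names of all tips descending from this node."""
    if node.is_tip():
        return [node.name]
    out: List[str] = []
    for child in node.children:
        out.extend(tips(child))
    return out


def collapse_low_support(node: Node, threshold: float) -> Node:
    """Return a copy in which internal nodes with support below the threshold
    are dissolved, so their children attach to the parent (a polytomy)."""
    if node.is_tip():
        return Node(node.name)
    new_children: List[Node] = []
    for child in node.children:
        c = collapse_low_support(child, threshold)
        if not c.is_tip() and c.support is not None and c.support < threshold:
            new_children.extend(c.children)  # splice grandchildren into this node
        else:
            new_children.append(c)
    return Node(node.name, node.support, new_children)


def is_monophyletic(root: Node, group) -> bool:
    """True if some node in the (rooted) tree has exactly `group` as its tips."""
    wanted = set(group)

    def visit(node: Node) -> bool:
        if set(tips(node)) == wanted:
            return True
        return any(visit(child) for child in node.children)

    return visit(root)


def to_newick(node: Node) -> str:
    """Newick-style string without branch lengths or the trailing semicolon."""
    if node.is_tip():
        return node.name
    inner = ",".join(to_newick(child) for child in node.children)
    label = "" if node.support is None else f"{node.support:g}"
    return f"({inner}){label}"


# Toy tree: ((mouse,rat)95,((human,chimp)40,fly)80)
tree = Node(children=[
    Node(support=95, children=[Node("mouse"), Node("rat")]),
    Node(support=80, children=[
        Node(support=40, children=[Node("human"), Node("chimp")]),
        Node("fly"),
    ]),
])

collapsed = collapse_low_support(tree, 50)
print(to_newick(collapsed) + ";")
print(is_monophyletic(tree, {"mouse", "rat"}))
print(is_monophyletic(tree, {"mouse", "rat", "human", "chimp"}))
```

In the toy tree the fly sits inside the mammals, so the four mammals do not form a clade and would be flagged as non-monophyletic. Recomputing the tip set at every node is wasteful for tens of thousands of OTUs; a serious implementation would cache tip sets or work bottom-up.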
Am I asking a lot? Not really. All of this can be implemented with current code; people just generally don’t. Any suggestions for more? Any stuff you don’t agree with?
Many programs claim to deal with hundreds or thousands of tips on a tree. My cichlid mtDNA tree has approx 4000 OTUs. The NJ tree would, if printed out, fill more than 40 pages. There are several programs that can deal with this and feel reasonably fast, but it is almost impossible to get a meaningful look at the phylogenetic relationships. There is too much data on the screen; I can’t see the wood for the trees. It is essential to be able to collapse down the hundreds of almost identical mtDNA sequences coming from Lake Victoria fish and just label the resulting triangle “Victoria Superflock”. Immediately I can start to see their relationship to others without an enormous amount of scrolling. The datafile would allow me to have this done across the tree with taxonomic names. Imagine a big tree of birds presorted into orders, and labeled accordingly! Immediately you would be able to see what’s going on and begin the actual biological interpretation of your data.
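As a rough illustration of the datafile-driven collapsing described above, here is a small Python sketch. The tree is just nested tuples of tip names and the “datafile” is a dictionary from tip name to group; the tip names, the `collapse_by_group` function and the label format are all invented for the example, not taken from any particular program.

```python
def collapse_by_group(node, group_of):
    """Collapse every clade whose tips all share one group in `group_of`.

    node: a tip name (str) or a tuple of child nodes.
    group_of: dict mapping tip name -> group label (the "datafile").
    Returns (collapsed_node, group_or_None, number_of_tips).
    """
    if isinstance(node, str):
        # A single tip is never collapsed; tips missing from the datafile get None.
        return node, group_of.get(node), 1

    results = [collapse_by_group(child, group_of) for child in node]
    groups = {group for _, group, _ in results}
    n_tips = sum(count for _, _, count in results)

    if len(groups) == 1 and None not in groups:
        # Every tip below this node is in the same group: replace the whole
        # clade with one labelled placeholder (the "triangle").
        group = groups.pop()
        return f"{group} [{n_tips} tips]", group, n_tips

    # Mixed clade: keep the structure, but with any collapsed children in place.
    return tuple(sub for sub, _, _ in results), None, n_tips


# Hypothetical tip names and group assignments, for illustration only.
tree = (("Tanganyika_sp1", (("Victoria_h1", "Victoria_h2"), "Victoria_h3")), "outgroup")
group_of = {
    "Victoria_h1": "Victoria Superflock",
    "Victoria_h2": "Victoria Superflock",
    "Victoria_h3": "Victoria Superflock",
}

collapsed, _, total = collapse_by_group(tree, group_of)
print(collapsed)
print(total)
```

Only whole clades are collapsed, so a member of the group sitting elsewhere in the tree would stay visible as its own tip, which is itself a useful hint that the group is not monophyletic.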
There are 2 or 3 programs I am aware of that (almost) do all the above. In other posts I will discuss them, and how I’m currently using them for large scale phylogenetics and informatics. My favourites at the moment are ARB and Treedyn. There is a list of tree viewers at the Treedyn site that seems quite good, perhaps getting a little old now though.
I’ll describe my thoughts on current software, pros and cons, and “the future” in an upcoming posting.