24 Jan Pride before a (data) fall
I’m pretty proud of some parts of my workflow: electronic lab notebook, reproducibility, open data files, (semi-obsessive) automated data backups etc etc. But pride often comes before a fall. I had a bad experience this week where I thought I had lost some important phylogenetic data files (I found them eventually), and I’m writing this to work through what I did wrong, and what I need to change in my work routine.
About 4 years ago I built a phylogenetic tree of some cichlid fish. It was a small, relatively simple analysis, just to describe the phylogenetic relationship between some test fish species in a collaborative genomic experiment. The pretty picture of the phylogeny was a manuscript figure, the analysis had been written up in my ELN as I did it, data files were backed up, no need to worry. Time passed, the paper stalled, and then went through a tediously long review, but now I need to submit the files to Dryad and TreeBASE ASAP. Fortunately I have the original sequence alignments, the notes on the analysis, and the tree file. Or do I?
This was a few years ago and the way I do phylogenies has changed quite a bit since then (subject of another post). My problem was that I had carried out three similar cichlid projects around the same time: three cichlid phylogenies using the same genes to address three different questions. Each data set had accumulated many versions of its analysis files as I worked through different parameters and approaches. I had made at least two errors:
I had been very unimaginative in my file and folder naming schemes. Lots of things called cich_phylo1 or Bayes_new or cichlid_ND2. Lots of almost identical files in nested folders to search in order to find the one that had generated the figure. Someone I once worked with had the strategy of creating enormously long filenames detailing the analyses that had generated the file. It was impressive but I’m not sure I could do it, though it might work well for generated rather than manually created files. Maybe I’ll adopt it for folder names?
Full file paths
But surely if I had a good experimental record it would identify the exact file I was dealing with. Yes, it does, though I shamefully often neglected to give the full file path. It shouldn’t have been a problem, but the analysis was actually done in a very unusual location on my HD (a folder shared with a Linux partition) that fell outside my search. The reason I failed to give full file paths was that I was working fast, to a short deadline, and running more than one analysis simultaneously. I was pasting data details, ML parameters, results, and next moves into my lab book, but I was doing it all too fast. If I had had an easy way to insert full paths I might have used it, but I didn’t. In OSX you can create shortcuts to copy the full file path and this is now only a right click away, so I have no excuse for not saving full paths into my ELN now.
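For files handled from a script or terminal rather than the Finder, the same habit can be supported with a few lines of Python. This is only a minimal sketch, not something I used at the time: the function name `eln_line` and the output layout are made up for illustration. It resolves whatever path it is given to an absolute path and adds the file size and modification time, producing one line that can be pasted into a notebook entry.

```python
from datetime import datetime
from pathlib import Path


def eln_line(path):
    """Return a one-line record of a file: absolute path, size, modification time.

    The name and the output format are illustrative assumptions.
    """
    # strict=True raises FileNotFoundError if the file does not exist,
    # so a mistyped path is caught before it reaches the notebook.
    p = Path(path).expanduser().resolve(strict=True)
    stat = p.stat()
    modified = datetime.fromtimestamp(stat.st_mtime).isoformat(timespec="seconds")
    return f"{p} | {stat.st_size} bytes | modified {modified}"
```

Calling `print(eln_line("cichlid_ND2.nex"))` from inside the working folder would print the full path even though only the short name was typed (the filename here is just an example).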
NB, file modification dates do not always persist
Also, although I had the creation date of the files from my ELN, I wasn’t confident that the dates still persisted. OSX sometimes changes modification dates for no apparent reason. Open a file, read, close, and bam! Its folder now has today’s date. Also, some backups I did a few years ago used FTP (by accident), obliterating the file dates. Although the files in question did have the right dates, I had realised early on in my search that I couldn’t rely on that.
What would have helped?
- Full file paths
- Unique descriptive names, of folders if not files as well
- Annotation of ELN blog posts with a “submitted file” tag or some way to differentiate experimental iterations from the final thing.
- Writing a text file for the final folder detailing the publication of files (effectively tagging the final version for system searches)
- Uploading final files immediately to Figshare. This would have given them each a DOI, and that could have been put in the figure legend immediately. Useful.
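The "text file for the final folder" idea from the list above could be automated along these lines. This is a hedged sketch rather than a tested routine from my own workflow: the manifest name `SUBMITTED_FILES.txt`, the function names, and the file layout are all assumptions. It records a SHA-256 checksum for every file, which identifies the exact file contents even when modification dates have been lost.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large alignments need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder, note, manifest_name="SUBMITTED_FILES.txt"):
    """Write a plain-text manifest into `folder` listing every file it contains.

    `manifest_name` is a hypothetical but distinctive name, chosen so that a
    system-wide search for it finds the final folder.
    """
    folder = Path(folder).resolve()
    lines = [
        f"Note: {note}",
        f"Written: {date.today().isoformat()}",
        f"Folder: {folder}",
        "",
    ]
    for item in sorted(folder.rglob("*")):
        # Skip directories and the manifest itself.
        if item.is_file() and item.name != manifest_name:
            relative = item.relative_to(folder).as_posix()
            lines.append(f"{sha256_of(item)}  {relative}")
    manifest = folder / manifest_name
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Because the checksum depends only on file contents, the same manifest can later confirm that a copy pulled from a backup is byte-for-byte the file behind the figure, whatever its date stamp says.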
This all speaks to a bigger issue: reproducibility. In documenting all my analyses I was relying entirely on myself and my ‘good behaviour’. But one bad day, one error, one interruption by a student just as you are writing something, and the record is lost. This is not an issue, however, with workflows that automatically generate reports along with the results and figures. Here the generation of a complete and detailed record never fails, as long as the script is working correctly. This is the way I now try to do things.
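As a rough illustration of what "automatically generate reports" can mean at its simplest, an analysis script can write its own record as its last step. The sketch below is an assumption about how one might do this, not a description of my actual scripts: the function name `write_run_report`, the JSON layout, and every field name are invented for the example.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_run_report(report_path, parameters, inputs, outputs):
    """Save a JSON record of one analysis run next to its results.

    All field names are illustrative; `parameters` is any JSON-serialisable dict.
    """
    record = {
        "run_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "script": str(Path(sys.argv[0]).resolve()),
        "python": platform.python_version(),
        "parameters": parameters,
        # Store absolute paths so the record still makes sense years later.
        "inputs": [str(Path(p).resolve()) for p in inputs],
        "outputs": [str(Path(p).resolve()) for p in outputs],
    }
    Path(report_path).write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return record
```

A script would call it once at the end, for example `write_run_report("run_report.json", {"model": "GTR+G"}, ["alignment.fas"], ["tree.nex"])`, where the parameter and file names are placeholders. The record is then written on every run, whether or not I remember to update the ELN.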
My final point would be: you don’t have a real backup system until you’ve tested it in the real world. A thought experiment of how you would recover isn’t going to be enough. I thought I was fine, but I nearly wasn’t.