Releasing uncurated datasets is essential for reproducible phylogenomics
Phylogenetic analysis of sequence data is a keystone of most evolutionary studies. The deluge of data from next-generation sequencing has led to the routine use of phylogenomics to resolve evolutionary relationships between lineages and to investigate diverse evolutionary questions [1–3]. We warn that the current practice of publishing only curated phylogenomic datasets risks erroneous data selection in subsequent analyses and hampers reproducibility.
The power of phylogenomics derives from the assumption that most single genes possess some true, albeit sometimes weak, vertically inherited historical signal. By combining hundreds of genes, this phylogenetic signal is amplified and can provide robust support for the correct phylogeny (assuming the use of an appropriate evolutionary model). This assumption holds true only if the datasets for individual genes include sequences representing vertical evolution (orthologues). The inclusion of data reflecting different evolutionary events, such as gene duplication (paralogues) or lateral gene transfer (LGT, which leads to the existence of xenologues), or of contamination will produce conflicting signals that can lead to the inference of an incorrect species tree [4]. It was initially argued that as long as enough vertical data were analysed, a few erroneously included paralogues or contaminants would not impact the results. This belief has largely been refuted, with data showing that even a few loci in datasets that contain hundreds of genes can massively impact the resulting topology and statistical support.
Salomaki E.D., Eme L., Brown M.W., Kolísko M. 2020: Releasing uncurated datasets is essential for reproducible phylogenomics. Nature Ecology and Evolution (in press). [IF=12.541]