B. cereus group Multi-Datatype Supertree Reconstruction

This page briefly describes the strategy used to reconstruct multi-gene and multi-datatype supertrees for the B. cereus group. The same technique was used to build a supertree based on the two DNA typing methods, MLST and AFLP, and a global supertree based on MLST, AFLP, and MLEE. The MLST+AFLP supertree is used as the reference multi-datatype supertree in the HyperCAT database. It is more reliable than the MLST+AFLP+MLEE supertree because MLEE, which is based on protein profiles, has much lower resolving power than MLST and AFLP. The MLST+AFLP+MLEE supertree was built to allow exploration of the phylogenetic positions of isolates that have been typed only by MLEE.

Supertrees were built according to the widely used Matrix Representation with Parsimony (MRP) method. Briefly, for each of the 26 MLST gene fragments, the three AFLP datasets, and the one MLEE dataset, a phylogenetic tree is reconstructed using an appropriate method (see below). Each individual tree is then recoded into a binary matrix representing its branching order (i.e., the phylogenetic groupings). All tree matrices are concatenated into a supermatrix, in which isolates missing from a particular tree are coded with the "?" character, representing unknown data. In this supermatrix, the sequence of 0's, 1's, and ?'s defines the branching profile of a strain. Closely related strains have similar branching profiles. Supertrees are then generated from the supermatrix by the Maximum Parsimony technique using the Tree analysis using New Technology (TNT) software. The Maximum Parsimony step infers the trees that require the minimum number of changes between the branching profiles of all isolates, where the unknown characters can take either of the two possible states, 0 or 1 (they are not treated as gaps). As several trees can be equally parsimonious, the final supertree is taken as the strict consensus of all most-parsimonious trees.
Because MLST, AFLP, and MLEE are based on different amounts of genetic information, the supertree procedure conducted here is weighted accordingly. In the AFLP studies of Ticknor et al. 2001 and Hill et al. 2004, genetic profiles were based on 40 genomic fragments, which can be considered as 40 loci, while the AFLP analysis of Guinebretiere et al. 2008 was based on 68 genomic fragments. Therefore, in the MRP supermatrix, the groupings coming from the first two AFLP studies and from the Guinebretiere study were given weights of 40 and 68, respectively, for the parsimony search. Each of the 26 MLST genes represents one locus and was given a weight of 1.
The MLEE study relied on 13 enzyme loci; to take into account the fact that MLEE is based on proteins, the weight of the MLEE tree was set to 4 (i.e., 13/3 ≈ 4.3, rounded to 4). Note that TNT is specifically designed for the analysis of large datasets and permits ultra-fast supertree building, allowing the MLST+AFLP and MLST+AFLP+MLEE weighted supertrees to be built in about 7 hours each on this webserver. In addition, TNT showed improved accuracy over other parsimony programs (including PAUP and PHYLIP), since its speed and algorithms enable a broader and more efficient exploration of tree space, allowing the program to find more parsimonious trees (see Goloboff 1999). To compute branch lengths and obtain statistical support values for all groupings (i.e., internal branches) in the supertree, the Maximum Likelihood method as implemented in the PHYML 3.0 program was employed. Branch confidence was assessed using approximate likelihood-ratio tests (aLRTs) with Shimodaira-Hasegawa-like support values, which estimate for each branch the probability (or p-value) that the branch is significant. The branch supports were computed in about four hours.
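The matrix-representation step described above can be illustrated with a minimal Python sketch. It is not the code used for HyperCAT. It assumes rooted input trees written as nested tuples of isolate names, and the function names (leaves, clades, mrp_supermatrix) and the toy isolates A to E are hypothetical. Each non-root internal node becomes one binary character: 1 for isolates inside the grouping, 0 for other isolates in the same tree, and "?" for isolates absent from that tree. Each character carries the weight of its source tree.

```python
# Minimal MRP coding sketch (assumed representation: rooted trees as nested
# tuples of isolate names; all names here are hypothetical, for illustration).

def leaves(tree):
    """Return the set of isolate names found under a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Return the leaf sets of all internal nodes except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found

def mrp_supermatrix(weighted_trees):
    """Build branching profiles and per-character weights.

    weighted_trees: list of (tree, weight) pairs.
    Returns (rows, weights): rows maps each isolate to a string of 0/1/?,
    weights holds one weight per matrix column.
    """
    all_taxa = sorted(set().union(*(leaves(t) for t, _ in weighted_trees)))
    rows = {taxon: "" for taxon in all_taxa}
    weights = []
    for tree, weight in weighted_trees:
        present = leaves(tree)
        for clade in clades(tree):
            weights.append(weight)
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon] += "?"  # isolate not typed in this dataset
                else:
                    rows[taxon] += "1" if taxon in clade else "0"
    return rows, weights

# Toy input: one MLST-like gene tree (weight 1) and one AFLP-like tree
# (weight 40) that lacks isolates B and D but includes isolate E.
gene_tree = (("A", "B"), ("C", "D"))
aflp_tree = (("A", "C"), "E")
rows, weights = mrp_supermatrix([(gene_tree, 1), (aflp_tree, 40)])

for taxon, profile in rows.items():
    print(taxon, profile)
print("weights:", weights)
```

In this toy case the third column, which comes from the AFLP-like tree, counts 40 times as much as either of the first two columns in a weighted parsimony search. The resulting matrix and weights would then be handed to a parsimony program; the input format of any particular program is not shown here.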
Multi-datatype supertrees can be reconstructed because MRP consists only of merging trees, which can be built individually from heterogeneous data. Here, trees for each MLST gene were reconstructed by the Maximum Likelihood method with the PHYML 3.0 program, using the Felsenstein 1984 (F84) nucleotide substitution model supplemented with a gamma distribution (F84+Γ). This model allows for unequal base frequencies, transition/transversion rate bias, and gamma-distributed substitution rate variation among sites. Phylogenetic trees based on AFLP were taken from the large-scale studies of Ticknor et al. 2001, Hill et al. 2004, and Guinebretiere et al. 2008. The tree based on MLEE was obtained from an unpublished analysis conducted by Erlendur Helgason, a former member of our research group at the University of Oslo. This analysis was an extension of the study reported in Helgason et al. 2000, to which a number of additional isolates were subsequently added. In all AFLP and MLEE studies, the trees were reconstructed from the AFLP and MLEE profiles using the unweighted pair group method with arithmetic mean (UPGMA).
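For readers unfamiliar with UPGMA, the clustering used for the AFLP and MLEE profiles can be sketched in a few lines of Python. This is a generic illustration, not the software used in the cited studies. It assumes that pairwise distances between profiles have already been computed (the distance measure is not specified here), and the function name upgma and the toy distances are hypothetical. At each step the two closest clusters are merged, and the distance from the merged cluster to every other cluster is the size-weighted mean of the distances of its two parts.

```python
# Minimal UPGMA sketch (assumed input: a complete table of pairwise distances;
# the function name and example values are hypothetical).
from itertools import combinations

def upgma(labels, distances):
    """Cluster labels by UPGMA and return the topology as nested tuples.

    labels: list of isolate names.
    distances: dict mapping (name_a, name_b) pairs to a distance; every
               unordered pair of labels must appear exactly once.
    """
    # Active clusters: key -> (subtree, number of leaves in the subtree).
    clusters = {label: (label, 1) for label in labels}
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    next_id = 0
    while len(clusters) > 1:
        # Find the closest pair of active clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        tree_a, n_a = clusters.pop(a)
        tree_b, n_b = clusters.pop(b)
        merged = ("cluster", next_id)  # tuple key cannot clash with a label
        next_id += 1
        # Average linkage: size-weighted mean distance to each other cluster.
        for other in clusters:
            d[frozenset((merged, other))] = (
                n_a * d[frozenset((a, other))] + n_b * d[frozenset((b, other))]
            ) / (n_a + n_b)
        clusters[merged] = ((tree_a, tree_b), n_a + n_b)
    tree, _ = next(iter(clusters.values()))
    return tree

# Toy distances between four isolates.
toy_distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], toy_distances)
print(tree)
```

The sketch returns only the branching order; branch heights, which UPGMA also provides, are omitted for brevity. A rooted topology of this kind is the sort of tree that can then be recoded for MRP.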

A schematic overview of the multi-datatype supertree reconstruction procedure is shown below.

[Figure: summary of the multi-datatype supertree reconstruction procedure]
