ParallelEvolCCM usage

ParallelEvolCCM is a tool for the identification of coordinated gain and loss of features. The method is described in detail in the following publication:

The Community Coevolution Model with Application to the Study of Evolutionary Relationships between Genes Based on Phylogenetic Profiles

If you use ParallelEvolCCM in your analysis, please cite the above publication.

ParallelEvolCCM inputs

The ParallelEvolCCM tool requires two inputs:

A phylogenetic tree in Newick format
A presence/absence table in TSV format

The presence/absence TSV must have a 'genome_id' column whose values match the genome names in the tree. Every other column represents a feature that is absent (0) or present (1) in each genome. For example:

genome_id   plasmid_AA155   plasmid_AA161
ED010   0   0
ED017   0   1
ED040   0   0
ED073   0   1
ED075   1   1
ED082   0   1
ED142   0   1
ED178   0   1
ED180   0   0
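
Mismatches between the table and the tree are easier to catch before a run. The following Python sketch is one way to check the requirements described above; it is not part of ParallelEvolCCM. The function names newick_tip_names and check_profile are hypothetical, and the tip-name extraction assumes a simple Newick string with unquoted labels.

```python
import csv
import io
import re


def newick_tip_names(newick):
    """Extract leaf labels from a simple Newick string (unquoted labels only)."""
    # A leaf label directly follows "(" or ","; internal node labels follow ")".
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def check_profile(tsv_text, tips):
    """Return a list of problems found; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if "genome_id" not in (reader.fieldnames or []):
        return ["missing 'genome_id' column"]
    rows = list(reader)
    ids = {row["genome_id"] for row in rows}
    problems = []
    for name in sorted(ids - tips):
        problems.append(f"{name}: in table but not in tree")
    for name in sorted(tips - ids):
        problems.append(f"{name}: in tree but not in table")
    for row in rows:
        for feature, value in row.items():
            # Every column other than genome_id must hold 0 or 1.
            if feature != "genome_id" and value not in ("0", "1"):
                problems.append(
                    f"{row['genome_id']}/{feature}: expected 0 or 1, got {value!r}"
                )
    return problems
```

Calling check_profile(table_text, newick_tip_names(tree_text)) returns an empty list when the genome names agree and every feature value is 0 or 1.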

Using ParallelEvolCCM with ARETE

The ParallelEvolCCM tool is also available through the evolccm entry in ARETE, which makes it possible to run the tool with Docker or Singularity.

To execute the ParallelEvolCCM tool with ARETE, run the following command:

nextflow run beiko-lab/ARETE \
  -entry evolccm \
  --core_gene_tree core_gene_alignment.tre \
  --feature_profile feature_profile.tsv.gz \
  -profile docker

The parameters are:

--core_gene_tree - The reference tree, built from a core genome alignment, such as the one generated by the phylo entry in ARETE.
--feature_profile - A presence/absence TSV matrix of features in genomes, such as the one created by ARETE's annotation entry.
-profile - The profile to use. In this case, docker.

For more information, check the full ARETE documentation.

Using ParallelEvolCCM by itself

The ParallelEvolCCM tool is a command line tool written in R. It is available through the bin/ParallelEvolCCM.R script.

To download the tool and make it executable, run:

wget https://raw.githubusercontent.com/beiko-lab/arete/master/bin/ParallelEvolCCM.R
chmod +x ParallelEvolCCM.R

ParallelEvolCCM.R has several R package dependencies, which should be installed automatically the first time you run the script. On Debian or Ubuntu, you may first need to install missing system libraries with the following command:

sudo apt-get install libssl-dev libfontconfig1-dev libharfbuzz-dev \
libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev \
libopenblas-dev

If you prefer to install the dependencies manually, you can do so by running the following R commands:

install.packages(c('ape', 'dplyr', 'phytools', 'foreach', 'doParallel', 'gplots', 'remotes'))
remotes::install_github('beiko-lab/evolCCM')

You can then run the tool like this:

./ParallelEvolCCM.R --intree tree.nwk --intable feature_table.tsv.gz --cores -1

--intree specifies the phylogenetic tree in Newick format.
--intable specifies the feature table in compressed TSV format.
--cores specifies the number of cores to use. Use -1 to use all available cores.
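
If your feature table currently lives in memory or in another format, it has to be written out as a compressed TSV first. The Python sketch below shows one way to do that; it assumes gzip compression (as the .tsv.gz extension in the example suggests), and write_feature_table is a hypothetical helper name, not part of the tool.

```python
import csv
import gzip


def write_feature_table(path, profiles):
    """Write {genome_id: {feature: 0 or 1}} as a gzip-compressed TSV."""
    features = sorted({f for row in profiles.values() for f in row})
    # Mode "wt" opens the gzip stream as text; newline="" lets csv control line endings.
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["genome_id", *features])
        for genome_id, row in profiles.items():
            # ASSUMPTION: a feature not listed for a genome is treated as absent (0).
            writer.writerow([genome_id, *(row.get(f, 0) for f in features)])
```

The resulting file can then be passed to --intable.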

Additional parameters are listed when you run ./ParallelEvolCCM.R with no arguments.

ParallelEvolCCM Release Package

We also provide test data that can be used to run EvolCCM along with some useful scripts for downstream analyses. These can be found in a .tar.gz file, which you can download like this:

wget https://raw.githubusercontent.com/beiko-lab/arete/master/assets/ParallelEvolCCM_supplement.tar.gz
tar -xzf ParallelEvolCCM_supplement.tar.gz

Structure

The tarball contains three subfolders:

Scripts/ - The R scripts used to generate feature and statistical histograms, and the Python script used to build the GraphML files from ParallelEvolCCM (PECCM) output.
100Bifido/ - The results of the 100-genome analysis described in the paper.
1000Bifido/ - The results of the 1000-genome analysis described in the paper.

Each of the results folders contains the following subdirectories and files. X is a placeholder for the size (i.e., 100 or 1000):

SourceFiles/ - The source files.
- Bifido_X.tre: The Newick-formatted tree
- Bifido_X_feature_profile: The tab-separated feature file. This is the input file for PECCM_BuildFeatureHistogram.R; see usage below.
Results/ - The results produced by PECCM and the helper scripts.
- EvolCCM_Bifido_X.tre: The tree used by EvolCCM (with midpoint rooting and multifurcating node resolution if necessary)
- EvolCCM_Bifido_X_feature_profile.tsv: A tab-separated file with the p-values and statistics for all pairwise comparisons between features. This is the input file for the scripts PECCM_BuildStatHistogram.R and PECCM_Build_GraphML.py; see below.
- EvolCCM_Bifido_X_feature_profile.tsv.pvals and EvolCCM_Bifido_X_feature_profile.tsv.X2: Tab-separated matrices showing the p-values and X2 scores for all features.
- EvolCCM_Bifido_X.graphml: GraphML-formatted file with connections between features.
- Four .jpg files: 'a' is the output feature profile, and 'b', 'c', and 'd' are the feature and statistical distributions.

Additional Scripts

First, in order to recreate the results of the 100-genome dataset, run the command below (specifying any reasonable number of cores):

ParallelEvolCCM.R --intree Bifido_100.tre \
  --intable Bifido_100_feature_profile.tsv \
  --min_abundance 0.05 --max_abundance 0.95 --cores 8
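
The minimum and maximum abundance thresholds in the command above (0.05 and 0.95) restrict which features are compared. As a rough illustration of what such a filter does, here is a Python sketch. It assumes that abundance means the fraction of genomes in which a feature is present and that both bounds are inclusive; the script itself may define these differently, and filter_by_abundance is a hypothetical name.

```python
def filter_by_abundance(profiles, min_abundance=0.05, max_abundance=0.95):
    """Keep features whose presence fraction lies within [min, max].

    profiles maps a feature name to a list of 0/1 values, one per genome.
    ASSUMPTION: abundance is the fraction of genomes carrying the feature,
    and both bounds are inclusive.
    """
    kept = {}
    for feature, values in profiles.items():
        if not values:
            continue  # no genomes, so no abundance can be computed
        abundance = sum(values) / len(values)
        if min_abundance <= abundance <= max_abundance:
            kept[feature] = values
    return kept
```

Under these assumptions, features present in every genome or in none are dropped, since they carry no gain or loss signal to compare.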

Four output files will be produced. All will be prefixed with 'EvolCCM' to distinguish them from the input files.

.tre file: The tree used by EvolCCM (with midpoint rooting and multifurcating node resolution if necessary).
.tsv file: Statistics associated with the EvolCCM comparisons, with one line for each pairwise comparison.
.tsv.pvals file: A matrix showing the p-values from all-versus-all comparisons between features.
.tsv.X2 file: A matrix showing the X2 values from all-versus-all comparisons between features.
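
For a quick look at the p-value matrix without the helper scripts, a short Python sketch such as the following could list the feature pairs below a chosen threshold. The exact layout of the .tsv.pvals file is an assumption here: a tab-separated square matrix with feature names in the header row and in the first column, and missing values written as NA or left empty. The function name significant_pairs and the default threshold are hypothetical.

```python
import csv
import io


def significant_pairs(matrix_text, alpha=0.05):
    """Return (feature_a, feature_b, p) tuples with p < alpha, each pair once."""
    rows = list(csv.reader(io.StringIO(matrix_text), delimiter="\t"))
    if len(rows) < 2:
        return []
    header = rows[0]
    # ASSUMPTION: the header either starts with a cell for the row-name column
    # or is one cell shorter than the data rows (a common R write.table layout).
    columns = header if len(header) == len(rows[1]) - 1 else header[1:]
    seen = set()
    pairs = []
    for row in rows[1:]:
        row_name, cells = row[0], row[1:]
        for col_name, cell in zip(columns, cells):
            key = frozenset((row_name, col_name))
            # Skip the diagonal, pairs already read, and missing values.
            if row_name == col_name or key in seen or cell in ("", "NA"):
                continue
            seen.add(key)
            p = float(cell)
            if p < alpha:
                pairs.append((row_name, col_name, p))
    return sorted(pairs, key=lambda item: item[2])
```

Passing the text of the .tsv.pvals file returns the pairs sorted by ascending p-value.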

Next, PECCM_BuildFeatureHistogram.R can be used to generate the feature distribution histogram. Usage is:

Rscript PECCM_BuildFeatureHistogram.R infile

Where ‘infile’ is the input feature table (for example, Bifido_100_feature_profile.tsv). There are no other command-line options. A single .jpg file will be produced.

PECCM_BuildStatHistogram.R is used to generate the statistical summary histograms. Usage is:

Rscript PECCM_BuildStatHistogram.R infile

Where ‘infile’ is the input table of results (‘EvolCCM...tsv’). Three .jpg files will be produced.

PECCM_Build_GraphML.py is used to generate a graph from the pairwise comparisons, with optional p-value thresholding. You can also use the --attribute_name_length option to truncate attribute names for visual purposes.

The optional ‘type_underscore’ argument treats the first part of each attribute name (up to the first underscore) as its type. For example, ‘plasmid_ABC’ and ‘plasmid_def’ would both be treated as objects of type ‘plasmid’, with names ‘ABC’ and ‘def’, respectively.

Usage:

python ../PECCM_Build_GraphML.py \
  --attribute_name_length 10 \
  --type_underscore EvolCCM_Bifido_100_feature_profile.tsv \
  EvolCCM_Bifido_100.graphml
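
The naming convention behind the type and name-length options can be sketched in a few lines of Python. This is an illustration of the convention described above, not the script's actual code: split_type is a hypothetical helper, and it assumes that truncation applies to the name after the type has been removed and that an attribute without an underscore has no type.

```python
def split_type(attribute, name_length=None):
    """Split 'type_name' at the first underscore; optionally truncate the name."""
    # partition() splits only once, so later underscores stay in the name.
    kind, sep, name = attribute.partition("_")
    if not sep:
        # ASSUMPTION: an attribute without an underscore has no type.
        kind, name = None, attribute
    if name_length is not None:
        # ASSUMPTION: truncation applies to the name, not to the type prefix.
        name = name[:name_length]
    return kind, name
```

For instance, split_type("plasmid_AA155") yields the type "plasmid" and the name "AA155".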