ParallelEvolCCM usage
ParallelEvolCCM is a tool for the identification of coordinated gain and loss of features. The method is described in detail in the following publication:
If you use ParallelEvolCCM in your analysis, please cite the above publication.
ParallelEvolCCM inputs
The ParallelEvolCCM tool requires two inputs:
- A phylogenetic tree in Newick format
- A presence/absence table in TSV format.
The presence/absence TSV must contain a 'genome_id' column whose values match the genome names in the tree. Every other column represents a feature that is absent (0) or present (1) in each genome. For example:
genome_id plasmid_AA155 plasmid_AA161
ED010 0 0
ED017 0 1
ED040 0 0
ED073 0 1
ED075 1 1
ED082 0 1
ED142 0 1
ED178 0 1
ED180 0 0
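Because mismatched genome names are an easy mistake to make, it can help to check the two inputs against each other before a run. The following is a minimal Python sketch, not part of ParallelEvolCCM; the function names `newick_tip_names` and `check_feature_table` are hypothetical, and the tip-name extraction assumes unquoted Newick labels without spaces.

```python
import csv
import io
import re


def newick_tip_names(newick: str) -> set:
    """Collect leaf labels: names that directly follow '(' or ','.

    Assumption: labels are unquoted and contain no whitespace.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def check_feature_table(tsv_text: str, tips: set) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if not reader.fieldnames or "genome_id" not in reader.fieldnames:
        return ["missing 'genome_id' column"]

    problems = []
    seen = set()
    for row in reader:
        genome = row["genome_id"]
        seen.add(genome)
        # Every non-ID column must hold a 0/1 presence flag.
        for feature, value in row.items():
            if feature != "genome_id" and value not in ("0", "1"):
                problems.append(
                    f"{genome}/{feature}: expected 0 or 1, got {value!r}"
                )

    # Genome names must agree between the tree and the table.
    for name in sorted(tips - seen):
        problems.append(f"{name}: in tree but not in table")
    for name in sorted(seen - tips):
        problems.append(f"{name}: in table but not in tree")
    return problems
```

Calling `check_feature_table(open("feature_table.tsv").read(), newick_tip_names(open("tree.nwk").read()))` returns an empty list when the two files agree; the file names here are placeholders.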
Using ParallelEvolCCM with ARETE
ParallelEvolCCM is also available through the evolccm entry in ARETE, which makes it possible to run the tool with Docker or Singularity.
To execute the ParallelEvolCCM tool with ARETE, run the following command:
nextflow run beiko-lab/ARETE \
-entry evolccm \
--core_gene_tree core_gene_alignment.tre \
--feature_profile feature_profile.tsv.gz \
-profile docker
The parameters are:
- --core_gene_tree: The reference tree, coming from a core genome alignment, like the one generated by the phylo entry in ARETE.
- --feature_profile: A presence/absence TSV matrix of features in genomes, like the one created in ARETE's annotation entry.
- -profile: The profile to use. In this case, docker.
For more information, check the full ARETE documentation.
Using ParallelEvolCCM by itself
The ParallelEvolCCM tool is a command line tool written in R. It is available through the bin/ParallelEvolCCM.R script.
To download the tool and make it executable, run:
chmod +x ParallelEvolCCM.R
ParallelEvolCCM.R has several dependencies, which should automatically be installed the first time you run the script. You may need to install missing Linux packages using the following command:
sudo apt-get install libssl-dev libfontconfig1-dev libharfbuzz-dev \
libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev \
If you prefer to install the dependencies manually, you can do so by running the following R commands:
install.packages(c('ape', 'dplyr', 'phytools', 'foreach', 'doParallel', 'gplots', 'remotes'))
You can then run the tool like this:
./ParallelEvolCCM.R --intree tree.nwk --intable feature_table.tsv.gz --cores -1
- --intree specifies the phylogenetic tree in Newick format.
- --intable specifies the feature table in compressed TSV format.
- --cores specifies the number of cores to use. Use -1 to use all available cores.
Additional parameters can be listed by running ./ParallelEvolCCM.R with no arguments.
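The example command above passes a gzip-compressed feature table. If your table is a plain .tsv file, one way to produce the compressed version is sketched below in Python; the helper name `gzip_table` is hypothetical and not part of ParallelEvolCCM, and running `gzip` from the shell would do the same job.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gzip_table(src: str, dst: Optional[str] = None) -> Path:
    """Write a gzip-compressed copy of src and return the new path.

    If dst is not given, '.gz' is appended to the source file name.
    The original file is left untouched.
    """
    src_path = Path(src)
    dst_path = Path(dst) if dst else src_path.with_name(src_path.name + ".gz")
    with open(src_path, "rb") as fin, gzip.open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst_path
```

For example, `gzip_table("feature_table.tsv")` writes feature_table.tsv.gz next to the original file.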
ParallelEvolCCM Release Package
We also provide test data that can be used to run EvolCCM along with some useful scripts for downstream analyses. These can be found in a .tar.gz file, which you can download like this:
tar -xzf ParallelEvolCCM_supplement.tar.gz
The tarball contains three subfolders:
- The R scripts used to generate feature and statistical histograms, and the Python script that is used to build the GraphML files from PECCM output.
- The results of the 100-genome analysis described in the paper.
- The results of the 1000-genome analysis described in the paper.
Each of the results folders contains the following subdirectories and files. X is a placeholder for the dataset size (i.e., 100 or 1000):
- The source files:
  - The Newick-formatted tree.
  - The tab-separated feature file. This is the input file for PECCM_BuildFeatureHistogram.R (see usage below).
- The results produced by PECCM and the helper scripts:
  - The tree used by EvolCCM (with midpoint rooting and multifurcating node resolution if necessary).
  - A tab-separated file with the p-values and statistics for all pairwise comparisons between features. This is the input file for PECCM_BuildStatHistogram.R (see below).
  - Tab-separated matrices showing the p-values and X2 scores for all features.
  - A GraphML-formatted file with connections between features.
  - Four .jpg files: 'a' is the output feature profile, and 'b', 'c', and 'd' are the feature and statistical distributions.
Additional Scripts
First, in order to recreate the results of the 100-genome dataset, run the command below (specifying any reasonable number of cores):
ParallelEvolCCM.R --intree Bifido_100.tre \
--intable Bifido_100_feature_profile.tsv \
--min_abundance 0.05 --max_abundance 0.95 --cores 8
Four output files will be produced. All are prefixed with 'EvolCCM' to distinguish them from the input files:
- .tre file: The tree used by EvolCCM (with midpoint rooting and multifurcating node resolution if necessary).
- .tsv file: Statistics associated with the EvolCCM comparisons, with one line for each pairwise comparison.
- .tsv.pvals file: A matrix showing the p-values from all-versus-all comparisons between features.
- .tsv.X2 file: A matrix showing the X2 values from all-versus-all comparisons between features.
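For a quick look at the p-value matrix outside of the helper scripts, a short Python sketch such as the one below can list the feature pairs under a chosen threshold. It is an illustration only: the function name `significant_pairs` is hypothetical, and the layout of the .tsv.pvals file is an assumption (a tab-separated square matrix with feature names in the header row and in the first column, where the header may or may not include a leading corner cell). Check the actual file before relying on it.

```python
import csv
import io
import math


def significant_pairs(matrix_text: str, alpha: float = 0.05) -> list:
    """Return (feature_a, feature_b, p) tuples with p < alpha, smallest p first.

    Assumed layout: tab-separated, feature names in the header row and in
    the first column. Non-numeric cells (for example NA) are ignored.
    """
    rows = [r for r in csv.reader(io.StringIO(matrix_text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]

    best = {}  # unordered pair -> first numeric p-value seen
    for row in body:
        # Some writers omit the corner cell, leaving the header one field short.
        cols = header if len(header) == len(row) - 1 else header[1:]
        row_name = row[0]
        for col_name, value in zip(cols, row[1:]):
            if row_name == col_name:
                continue  # skip the diagonal
            try:
                p = float(value)
            except ValueError:
                continue  # NA or empty cell
            if math.isnan(p):
                continue
            key = tuple(sorted((row_name, col_name)))
            best.setdefault(key, p)

    hits = [(a, b, p) for (a, b), p in best.items() if p < alpha]
    return sorted(hits, key=lambda item: item[2])
```

Each unordered pair is reported once, so the sketch behaves the same whether the matrix is symmetric or only one triangle is filled in.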
Next, PECCM_BuildFeatureHistogram.R can be used to generate the feature distribution histogram. Usage is:
Rscript PECCM_BuildFeatureHistogram.R infile
Where ‘infile’ is the input feature table (for example, Bifido_100_feature_profile.tsv). There are no other command-line options. A single .jpg file will be produced.
PECCM_BuildStatHistogram.R is used to generate the statistical summary histograms. Usage is:
Rscript PECCM_BuildStatHistogram.R infile
Where ‘infile’ is the input table of results (‘EvolCCM...tsv’). Three .jpg files will be produced.
The Python script is used to generate a GraphML graph from the pairwise comparisons, with optional p-value thresholding. You can also use the --attribute_name_length option to truncate attribute names for visual purposes.
The optional --type_underscore argument treats the first part of each attribute name (up to the first underscore) as its type: for example, ‘plasmid_ABC’ and ‘plasmid_def’ would both be treated as objects of type ‘plasmid’, with names ‘ABC’ and ‘def’, respectively.
python ../ \
--attribute_name_length 10 \
--type_underscore EvolCCM_Bifido_100_feature_profile.tsv \
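To make the effect of the type and truncation options concrete, the sketch below mimics the described behavior in plain Python. It is not the code of the graph-building script: the function name `split_attribute` is hypothetical, and applying the length limit to the name after the type has been removed is an assumption.

```python
def split_attribute(attribute, type_underscore=True, name_length=None):
    """Split an attribute into (type, name) at the first underscore.

    With type_underscore disabled, or when there is no underscore,
    the type is empty and the whole attribute is used as the name.
    name_length, if given, truncates the name for display.
    """
    attr_type, name = "", attribute
    if type_underscore and "_" in attribute:
        # Only the first underscore separates type from name.
        attr_type, name = attribute.split("_", 1)
    if name_length is not None:
        name = name[:name_length]
    return attr_type, name


print(split_attribute("plasmid_ABC"))
print(split_attribute("plasmid_AA155_long_suffix", name_length=10))
```

Underscores after the first one stay in the name, so an attribute such as ‘plasmid_AA155_long_suffix’ keeps the type ‘plasmid’ and only its name is shortened.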