ARETE and dataset size

Currently ARETE has three distinct profiles that change the pipeline execution in some ways: The default profile (which we can call small), the medium profile and the large profile.

These three profiles were developed based on the size and diversity of the input dataset and change some parameter defaults based on tests we have performed on similar-sized datasets.

If you want to first gauge the potential diversity of your dataset and have some input assemblies you can try the PopPUNK entry. One of the outputs will provide insight into how many clusters, or lineages, your dataset divides into.

The sizes are:

For the default or small profile, we expect datasets with 100 samples/assemblies or fewer. It runs on the default pipeline parameters, with no changes.
For the medium profile, we expect datasets with >100 and <1000 samples. It increases the default resource requirements for most processes and also uses PPanGGoLiN for pangenome construction, instead of Panaroo.
For the large profile, we expect datasets with >1000 samples. It also increases default resource requirements for some processes and uses PPanGGoLin. Additionally, it enables PopPUNK subsampling, with default parameters.
For the light profile, we expect datasets with at most 12 samples. This is a profile primarily designed to run on personal computers and it disables most ARETE processes.