Lihouck Flavien
MC_MSA

Internship
Working repository for versioning the work carried out during the internship.

Requirements
This tool requires the Conda environment manager (distributed with Anaconda) to run.
To install Conda, see: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Installation
You only need to download this repository to launch the pipeline. The first run sets up every tool through Conda.

Usage
usage: mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]
Creates the config file then runs the Meta-consensus pipeline
Required arguments:

-input INPUT, -i INPUT

Reads file

-output OUTPUT, -o OUTPUT

Target directory for the pipeline results

Standard arguments:

-h, --help

show this help message and exit

-reference REFERENCE, -r REFERENCE

Reference for alignment and statistics

-tools TOOLS, -t TOOLS

The list of tools to use in the meta-consensus

(default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])

-cores CORES, -c CORES

The number of cores to use in the pipeline run (default: 1)

Advanced arguments:

-list LIST

A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)

-size SIZE, -s SIZE

The desired region size (default: maximum)

-consensus_threshold CONSENSUS_THRESHOLD, -ct CONSENSUS_THRESHOLD

Threshold(s) used for the MSA consensus step (default: [70])

-metaconsensus_threshold METACONSENSUS_THRESHOLD, -mt METACONSENSUS_THRESHOLD

Threshold(s) used for the Meta-consensus result (default: [60])

-depth DEPTH, -d DEPTH

The depth used in the process (default: max)
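
As an illustration of how the arguments above fit together, here is a minimal Python sketch that assembles a command line. It assumes the executable is reachable as `mc_msa`, as in the usage line, and only uses options that take a single value (`-i`, `-o`, `-r`, `-c`). The helper name `build_command` and the file names are hypothetical and not part of the tool.

```python
import shlex


def build_command(reads, outdir, reference=None, cores=1):
    """Assemble an mc_msa command line as a list of arguments.

    Hypothetical helper: only the option names come from the usage text.
    """
    cmd = ["mc_msa", "-i", reads, "-o", outdir]
    if reference is not None:
        cmd += ["-r", reference]   # optional reference for alignment and statistics
    if cores != 1:
        cmd += ["-c", str(cores)]  # only passed when it differs from the default of 1
    return cmd


# Placeholder file names, for illustration only.
command = build_command("reads.fasta", "results", reference="ref.fasta", cores=4)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the pipeline is installed.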


Input
The input file of reads, in FASTA format.

Output
The output folder will contain 4 folders at the end of a pipeline run:

meta-consensus: the resulting meta-consensus, for each region and for each combination of the specified thresholds and depths.
consensus: the intermediate consensus for every MSA, stored in a folder tree including region/depth/consensus_threshold/metaconsensus_threshold and the consensus alignment.
data: the cut reads, the computed MSAs, and, where applicable, the cut reference.
You can use the pipeline with pre-processed MSAs by adding them to output/data/msa and naming them MSA_TOOL_rSTART_END_dDEPTH.fasta, where TOOL is the tool used, START and END are the limits of the region, and DEPTH is the read depth of the MSA.
logs: all pipeline logs will be stored here in the final version (for now, some logs still end up in the consensus folder).
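
The naming scheme for pre-processed MSAs can be sketched in Python as follows. This is a hedged illustration, not code from the pipeline: the helper names `msa_filename` and `parse_msa_filename` are hypothetical, and DEPTH is assumed to be written as a plain integer.

```python
import re

# Pattern for MSA_TOOL_rSTART_END_dDEPTH.fasta (assumes an integer DEPTH).
MSA_NAME = re.compile(
    r"^MSA_(?P<tool>.+)_r(?P<start>\d+)_(?P<end>\d+)_d(?P<depth>\d+)\.fasta$"
)


def msa_filename(tool, start, end, depth):
    """Build the file name expected in output/data/msa for a pre-processed MSA."""
    return f"MSA_{tool}_r{start}_{end}_d{depth}.fasta"


def parse_msa_filename(name):
    """Split an MSA file name back into (tool, start, end, depth)."""
    match = MSA_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid MSA file name: {name!r}")
    return (
        match["tool"],
        int(match["start"]),
        int(match["end"]),
        int(match["depth"]),
    )


name = msa_filename("mafft", 1952, 3952, 30)
print(name)
print(parse_msa_filename(name))
```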


Region selection
There are two main ways to define the regions.
You can specify the regions manually with -list, which accepts two formats.


-list "[rStart1, rStart2, rStart3, ...]" : the corresponding regions will be from Start1 to Start2, then from Start2 to Start3, and so on.

-list "[rStart1_End1, rStart2_End2, ...]" : the corresponding regions will be from Start1 to End1, then from Start2 to End2 and so on.
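
The two -list formats can be illustrated with a short Python sketch that turns either form into (start, end) pairs. It is an assumption-laden example rather than the pipeline's own parser: the function name `parse_region_list` is hypothetical, and the leading `r` on each item is treated as an optional literal prefix.

```python
def parse_region_list(text):
    """Convert a -list value into a list of (start, end) tuples.

    Illustrative only; the actual parsing done by the pipeline may differ.
    """
    items = [item.strip() for item in text.strip().strip("[]").split(",")]
    items = [item for item in items if item]
    if not items:
        return []

    with_end = ["_" in item for item in items]
    if all(with_end):
        # Format "[rStart1_End1, rStart2_End2, ...]": explicit boundaries.
        regions = []
        for item in items:
            start, end = item.lstrip("r").split("_")
            regions.append((int(start), int(end)))
        return regions
    if any(with_end):
        raise ValueError("cannot mix the two -list formats")

    # Format "[rStart1, rStart2, ...]": consecutive starts delimit the regions.
    bounds = [int(item.lstrip("r")) for item in items]
    return list(zip(bounds, bounds[1:]))


print(parse_region_list("[r0, r2000, r4000]"))
print(parse_region_list("[r0_2000, r1950_3950]"))
```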

Alternatively, you can set a region size and an overlap, from which the regions are generated automatically.

-size 2000 -overlap 50 : creates regions from the 2nd position to the 2002nd, then from the 1952nd to the 3952nd, and so on. Consecutive regions therefore share OVERLAP bases, which can be used to join them.
Setting the region size to 0 makes the pipeline try to process the whole sequence in one file. This is very slow and can cause some tools to struggle or fail to produce a result.

This comes from limitations of the MSA tools themselves: for example, abPOA and SPOA require a lot of available RAM, and MUSCLE slows down considerably on larger regions.
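
The size and overlap arithmetic can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the function name `make_regions` is hypothetical, the first position defaults to 2 only to mirror the example above, and cutting the last region at the end of the sequence is an assumption.

```python
def make_regions(seq_length, size, overlap, first=2):
    """Return (start, end) regions of `size` bases that share `overlap` bases.

    A size of 0 yields a single region spanning the whole sequence.
    """
    if size <= 0:
        return [(first, seq_length)]
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    regions = []
    start = first
    while start < seq_length:
        end = min(start + size, seq_length)  # assumed: last region is truncated
        regions.append((start, end))
        if end == seq_length:
            break
        start = end - overlap  # the next region begins `overlap` bases earlier
    return regions


for region in make_regions(6000, size=2000, overlap=50):
    print(region)
```

With a sequence length of 6000, the first two regions are (2, 2002) and (1952, 3952), matching the example for -size 2000 -overlap 50.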

Depth

Authors and acknowledgment
Flavien Lihouck
Special thanks to Coralie Rohmer's work on the tool MSA-limit, which inspired and was used in many parts of this project.

License
Probably CC_SA ?