msa-limit
msa-limit is an analysis pipeline that tests how well different multiple sequence alignment (MSA) tools perform on long reads. Given nanopore reads and a reference, it builds a consensus sequence from the alignment produced by each MSA tool and compares it to the reference to check whether the alignment is correct. (See the schematic in the doc file for more details.)
Usable MSA software: muscle,mafft,poa,kalign,spoa,kalign3,clustalo,abpoa,tcoffee
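The idea of building a consensus from an alignment and comparing it to a reference can be illustrated with a minimal sketch. This is not the pipeline's actual code. It assumes that the consensus threshold is the minimum percentage of reads that must agree on a column, that columns below the threshold are reported as N, and that the comparison is a naive position-by-position identity. The function names are hypothetical.

```python
from collections import Counter


def column_consensus(column, threshold=50.0):
    """Return the most frequent symbol of a column, or 'N' if its share
    (in percent) is below the threshold. Assumed rule, for illustration."""
    symbol, count = Counter(column).most_common(1)[0]
    share = 100.0 * count / len(column)
    return symbol if share >= threshold else "N"


def consensus(aligned_reads, threshold=50.0):
    """Build a consensus from equal-length aligned reads ('-' marks a gap)."""
    called = (column_consensus(col, threshold) for col in zip(*aligned_reads))
    # Columns where the gap wins are dropped from the consensus.
    return "".join(symbol for symbol in called if symbol != "-")


def identity(consensus_seq, reference):
    """Naive identity: matching positions divided by the longer length."""
    matches = sum(a == b for a, b in zip(consensus_seq, reference))
    return matches / max(len(consensus_seq), len(reference))


aligned = ["ACG-T", "ACGAT", "ACG-T", "TCG-T"]
print(consensus(aligned, threshold=50))  # ACGT
print(consensus(aligned, threshold=80))  # NCGNT
print(identity(consensus(aligned, threshold=50), "ACGT"))  # 1.0
```

Raising the threshold makes the consensus stricter: columns where the reads disagree become N instead of taking the majority base.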
Usage
Conda (>4.10) must be installed (see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html)
To install msa-limit:
git clone https://gitlab.cristal.univ-lille.fr/crohmer/msa-limit.git
cd msa-limit
Run a test to verify proper operation:
./msa-limit.py test
To start the analysis pipeline:
Usage:
msa-limit.py -i <file_reads> -r <file_ref> [-options]
Arguments:
required:
-i <string>
nanopore long reads file (fasta or fastq)
-r <string>
reference sequence file (fasta, a single sequence)
for a diploid genome, provide an IUPAC consensus sequence
optional:
-n <string>
default: date and time of execution
name of the experiment
-o <int>
default: 10
number of regions to be tested
-b <int>,<int>,...
start position(s) of the region(s) (replaces -o)
-d <int>,<int>,...
default: 10,20,50
sequencing depth(s) (number of reads)
-s <int>,<int>,...
default: 100,200
size(s) of region(s)
-t <int>,<int>,...
default: 50
threshold(s) for the consensus sequence
-m <string>,<string>,...
default: muscle,mafft,poa,kalign,spoa,kalign3,clustalo,abpoa,tcoffee (all)
MSA software(s) to run
-h
help
Ex: ./msa-limit.py -i reads.fastq -r ref.fasta -b 1,150 -n exp -d 10,100 -s 100,200 -t 50,75 -m mafft,poa
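To get a feel for how many alignments such a command implies, the comma-separated options can be expanded into a grid. The sketch below assumes that the pipeline runs every combination of region start, region size, depth and MSA software, which is a plausible reading of the options but is not confirmed here. The helper names are hypothetical.

```python
from itertools import product


def parse_list(value, cast=str):
    """Split a comma-separated option value such as '10,100' into a list."""
    return [cast(item) for item in value.split(",")]


def experiment_grid(starts, sizes, depths, msas):
    """Every (start, size, depth, msa) combination, assuming a full grid."""
    return list(product(starts, sizes, depths, msas))


# Values taken from the example command above.
starts = parse_list("1,150", int)      # -b
sizes = parse_list("100,200", int)     # -s
depths = parse_list("10,100", int)     # -d
msas = parse_list("mafft,poa")         # -m

grid = experiment_grid(starts, sizes, depths, msas)
print(len(grid))  # 16
print(grid[0])    # (1, 100, 10, 'mafft')
```

Under the same assumption, each alignment would then yield one consensus per threshold given with -t, so run time grows quickly as lists get longer.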
Other modes
Besides the basic analysis mode, msa-limit provides several other modes:
Usage:
msa-limit.py -i <file_reads> -r <file_ref> [-options]
Other modes:
test
Launches a pipeline test
list
List of existing experiments
summary
Human-readable summary of the experiments
optional:
-n <string> <string> <string> ...
default: all the names of the experiments
names of the experiments you want to display in the summary.
run_config <string> <string> ...
Launches the pipeline from configuration file(s)
required: path to the configuration file(s).
rulegraph
Displays a graph of the snakemake rules
Configuration file
The basic mode of msa-limit creates a configuration file, which is then used by the pipeline. With the run_config mode (msa-limit.py run_config <config_file>), you can launch the pipeline directly with your own configuration file, which must follow this format:
I: <reads_file> #REQUIRED, absolute path preferred
R: <ref_file> #REQUIRED, absolute path preferred
n: test #OPTIONAL, -n
D: [10,20,50] #OPTIONAL, -d
S: [100,200] #OPTIONAL, -s
T: [50] #OPTIONAL, -t
M: [muscle,mafft] #OPTIONAL, -m
O: 10 #OPTIONAL, -o, can be replaced by -b (B: [1,150])
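When writing a configuration file by hand, the list-valued keys use a bracketed form, whereas the command line uses plain comma-separated values. The following sketch converts one into the other for the D, S, T and M keys shown above. The function name is hypothetical, and the output only mirrors the format listed in this section.

```python
def config_line(key, cli_value):
    """Turn a comma-separated command-line value into a bracketed config line."""
    items = [item.strip() for item in cli_value.split(",") if item.strip()]
    return f"{key}: [{','.join(items)}]"


lines = [
    config_line("D", "10,100"),     # same values as -d 10,100
    config_line("S", "100,200"),    # same values as -s 100,200
    config_line("T", "50,75"),      # same values as -t 50,75
    config_line("M", "mafft,poa"),  # same values as -m mafft,poa
]
print("\n".join(lines))
```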
Snakemake only
This pipeline is built with snakemake. If you are familiar with this tool, you can launch the pipeline directly from snakemake with a configuration file. You will need to install snakemake (6.10+) and enable the option to use conda:
snakemake --configfile <config_file> -c24 --use-conda
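If the pipeline is launched from a script, the same command can be assembled as an argument list. This is a minimal sketch: the function names and the file name my_config.yaml are hypothetical, and the flags are exactly those of the command above.

```python
import subprocess


def snakemake_command(config_file, cores=24, use_conda=True):
    """Build the snakemake call shown above as an argument list."""
    cmd = ["snakemake", "--configfile", config_file, f"-c{cores}"]
    if use_conda:
        cmd.append("--use-conda")
    return cmd


def run_pipeline(config_file, cores=24):
    """Run snakemake and raise CalledProcessError if it fails."""
    subprocess.run(snakemake_command(config_file, cores), check=True)


# Only prints the command; call run_pipeline() to actually launch it.
print(" ".join(snakemake_command("my_config.yaml", cores=4)))
```

Passing a list instead of a single string avoids shell quoting problems when the configuration path contains spaces.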
Dependencies
- conda 4.10.1+
- python 3.7.4+
Add a new MSA software
To add a new MSA software to the pipeline, you have to add a rule to the Snakefile. You also have to either install the software locally or create a conda environment file for it. The output must be in fasta format. In the following commands, replace <new_msa> with the name of the software.
Create a conda environment file:
conda create -n <new_msa>
conda activate <new_msa>
conda install <new_msa>
conda env export >env_conda/<new_msa>.yaml
Add the rule below to the Snakefile. Replace <new_msa> with the name of the software and <command_to_launch_the_software> with the command that runs it. In your command, the input and output files must be written as {input} and {output.out}. (Ex: muscle -in {input} -out {output.out})
rule <new_msa> :
    input :
        os.path.join('{data_set}','selected_read','reads_r{region_size}_d{depth}.fasta')
    output :
        time = os.path.join('{data_set}','time','MSA_<new_msa>_r{region_size}_d{depth}'),
        out = os.path.join('{data_set}','msa','MSA_<new_msa>_r{region_size}_d{depth}.fasta')
    message:
        "<new_msa> for {wildcards.data_set} (Region size={wildcards.region_size} & Depth={wildcards.depth})"
    log:
        os.path.join('{data_set}','logs','6_<new_msa>_r{region_size}_d{depth}.log')
    conda: #Only if you use conda
        "env_conda/<new_msa>.yaml"
    shell :
        './src/run_MSA.sh "<command_to_launch_the_software>" {input} {output.out} {output.time} {log} 1'
Warning: If the software writes its result to standard output instead of an output file, give only the command with the input, and change the 6th parameter of the run_MSA.sh script from 1 to 0 (see the rule for Spoa for an example).
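Filling in the placeholders by hand is error-prone, so the substitution can be scripted. The sketch below reproduces the rule text from this section and replaces <new_msa>, the command, and the 6th parameter of run_MSA.sh. The tool name mytool, its flags, the <flag> placeholder and the function name are hypothetical.

```python
RULE_TEMPLATE = """\
rule <new_msa>:
    input:
        os.path.join('{data_set}','selected_read','reads_r{region_size}_d{depth}.fasta')
    output:
        time = os.path.join('{data_set}','time','MSA_<new_msa>_r{region_size}_d{depth}'),
        out = os.path.join('{data_set}','msa','MSA_<new_msa>_r{region_size}_d{depth}.fasta')
    message:
        "<new_msa> for {wildcards.data_set} (Region size={wildcards.region_size} & Depth={wildcards.depth})"
    log:
        os.path.join('{data_set}','logs','6_<new_msa>_r{region_size}_d{depth}.log')
    conda:
        "env_conda/<new_msa>.yaml"
    shell:
        './src/run_MSA.sh "<command>" {input} {output.out} {output.time} {log} <flag>'
"""


def render_rule(name, command, writes_output_file=True):
    """Fill in the rule template for one MSA software.

    writes_output_file=False corresponds to a tool that prints its result
    to standard output (6th parameter set to 0, as described above).
    """
    flag = "1" if writes_output_file else "0"
    # str.replace is used instead of str.format because the snakemake
    # wildcards ({input}, {depth}, ...) must stay untouched.
    return (RULE_TEMPLATE
            .replace("<new_msa>", name)
            .replace("<command>", command)
            .replace("<flag>", flag))


# Hypothetical tool that writes to a file given with -out.
file_rule = render_rule("mytool", "mytool -in {input} -out {output.out}")
# Hypothetical tool that prints the alignment to standard output.
stdout_rule = render_rule("mytool", "mytool {input}", writes_output_file=False)
print(file_rule)
```

The generated text can then be appended to the Snakefile. Remove the conda directive if the software is installed locally.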
Potential issue
Abpoa doesn't run
abpoa may not launch from conda on some machines. To solve this problem, you will have to install it locally (see https://github.com/yangao07/abPOA) and modify the abpoa rule.