Skip to content
Snippets Groups Projects
Commit 40a0400e authored by Lihouck Flavien's avatar Lihouck Flavien
Browse files

Added new strategy for concatenation and test data

parent 717dc76c
Branches
No related tags found
No related merge requests found
...@@ -12,4 +12,4 @@ logs/ ...@@ -12,4 +12,4 @@ logs/
*.bam *.bam
*.sam *.sam
*.txt *.txt
worflow/scripts/config_conda.sh workflow/scripts/config_conda.sh
# Stage # Meta Consensus MSA
Dépot de travail pour versionner le travail réalisé pendant le stage This tools aims to extract a DNA sequence from reads applying a consensus strategy to several MSA built from the reads, creating a meta-MSA from these consensus, and applying the consensus strategy on the meta-MSA, creating what we call a meta-consensus.
## Requirements ## Requirements
...@@ -13,61 +13,72 @@ To install Conda check : https://docs.conda.io/projects/conda/en/latest/user-gui ...@@ -13,61 +13,72 @@ To install Conda check : https://docs.conda.io/projects/conda/en/latest/user-gui
You only need to download this repository in order to launch the pipeline, the first run will handle the setup for every tools using Conda. You only need to download this repository in order to launch the pipeline, the first run will handle the setup for every tools using Conda.
## Usage ## Usage
usage: mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...] >
> usage: mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]
Creates the config file then runs the Meta-consensus pipeline >
>Creates the config file then runs the Meta-consensus pipeline
Required arguments: >
-input INPUT, -i INPUT > Required arguments:
> -input INPUT, -i INPUT
> Reads file > Reads file
> -output OUTPUT, -o OUTPUT
-output OUTPUT, -o OUTPUT
> Target directory for the pipeline results > Target directory for the pipeline results
>
Standard arguments: >Standard arguments:
-h, --help >-h, --help
> show this help message and exit > show this help message and exit
>
-reference REFERENCE, -r REFERENCE > -reference REFERENCE, -r REFERENCE
> Reference for alignment and statistics > Reference for alignment and statistics
>
-tools TOOLS, -t TOOLS >-tools TOOLS, -t TOOLS
> The list of tools to use in the meta-consensus > The list of tools to use in the meta-consensus
> (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle']) > (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])
>
-cores CORES, -c CORES >-cores CORES, -c CORES
>The amount of cores to use in the pipeline run (default 1) >The amount of cores to use in the pipeline run (default 1)
>
Advanced arguments: >Advanced arguments:
-list LIST > -list LIST
> A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region) > A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)
>
-size SIZE, -s SIZE >-size SIZE, -s SIZE
> The desired region size (default: maximum) > The desired region size (default: maximum)
>
-consensus_threshold CONSENSUS_THRESHOLD, -ct CONSENSUS_THRESHOLD >-consensus_threshold CONSENSUS_THRESHOLD, -ct CONSENSUS_THRESHOLD
> Threshold(s) used for the MSA consensus step (default: [70]) > Threshold(s) used for the MSA consensus step (default: [70])
>
-metaconsensus_threshold METACONSENSUS_THRESHOLD, -mt METACONSENSUS_THRESHOLD > -metaconsensus_threshold METACONSENSUS_THRESHOLD, -mt METACONSENSUS_THRESHOLD
> Threshold(s) used for the Meta-consensus result (default: [60]) > Threshold(s) used for the Meta-consensus result (default: [60])
>
-depth DEPTH, -d DEPTH > -depth DEPTH, -d DEPTH
> The depth used in the process (default: max) > The depth used in the process (default: max)
### Example
To run the tool with our provided dataset, run the command :
> $./mc_msa.py -i test_data/stup_virus_seq.fa -o output/stup_virus -r test_data/Stup_virus.fa
This will setup the pipeline and start a run on the reads in the file `stup_virus_seq.fa` , with the reference file `Stup_virus.fa`
and every produced by the pipeline will be in the folder `output/stup_virus`. The meta-consensus sequence can be found at `output/stup_virus/meta_consensus/meta_consensus_d0_t1_50_t2_50.fasta`.
### Input ### Input
The input reads file, in the fasta format. The input reads file, in the fasta format.
### Output ### Output
The output folder will contain 4 folders at the end of a pipeline run: The output folder will contain 6 folders at the end of a pipeline run with a reference, and 4 folders without a reference:
- meta-consensus : the resulting meta-consensus, for each region and with each specified thresholds and depths combination. - meta-consensus : the resulting meta-consensus, for each region and with each specified thresholds and depths combination.
- consensus: the intermediary consensus for every MSA, stored in a folder tree including *region*/*depth*/*consenus_threshold*/*metaconsensus_threshold* and consensus alignment - consensus: the intermediary consensus for every MSA, stored in a folder tree including *region*/*depth*/*consenus_threshold*/*metaconsensus_threshold* and consensus alignment
- data : the cut reads, calculated MSAs, and possibly cut-reference. - data : the preliminary alignment, the cut reads, calculated MSAs, and possibly cut-reference.
You can use the pipeline with pre-processed MSA by adding the MSA in `output/data/msa`, naming them MSA_*TOOL*_r*START_END*_d*DEPTH*.fasta with *TOOL* the tool used, *START* and *END* the limits of the region, and *DEPTH* the read depth for the MSA. You can use the pipeline with pre-processed MSA by adding the MSA in `output/data/msa`, naming them MSA_*TOOL*_r*START_END*_d*DEPTH*.fasta with *TOOL* the tool used, *START* and *END* the limits of the region, and *DEPTH* the read depth for the MSA.
- logs: all the logs for the pipeline **will** be here in the final version (for now, some logs end up in the consensus folder ...) - logs: all the logs for the pipeline **will** be here in the final version (for now, some logs end up in the consensus folder ...)
- plot [requires reference] :
### Region selection ### Region selection
There are 2 (two) main ways of setting up how the regions are selected. There are 2 (two) main ways of setting up how the regions are selected.
...@@ -85,9 +96,16 @@ This comes from limitations from the MSA tools themself, as for example abPOA an ...@@ -85,9 +96,16 @@ This comes from limitations from the MSA tools themself, as for example abPOA an
### Depth ### Depth
## Authors and acknowledgment ## Authors and acknowledgment
Flavien Lihouck Flavien Lihouck
Special thanks to Coralie Rohmer's work on the tool MSA-limit, which inspired and was used in many parts of this project. Special thanks to Coralie Rohmer's work on the tool MSA-limit, which inspired and was used in many parts of this project.
## License ## License
Conda is released under the 3-clause BSD license (https://docs.conda.io/en/latest/license.html)
Snakemake is licensed under the MIT License (https://snakemake.readthedocs.io/en/stable/project_info/license.html)
Probably CC_SA ? Probably CC_SA ?
...@@ -11,22 +11,22 @@ config_file = "workflow/config.yaml" ...@@ -11,22 +11,22 @@ config_file = "workflow/config.yaml"
def create_config() : def create_config() :
pattern = "" pattern = ""
parser = argparse.ArgumentParser(usage="mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]" ,description="Creates the config file then runs the Meta-consensus pipeline", add_help=False) parser = argparse.ArgumentParser(usage="mc_msa [--help] -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]\n", description="Creates the config file then runs the Meta-consensus pipeline", add_help=False)
required = parser.add_argument_group("Required arguments") required = parser.add_argument_group("Required arguments")
basic = parser.add_argument_group("Standard arguments") basic = parser.add_argument_group("Standard arguments")
advanced = parser.add_argument_group("Advanced arguments") advanced = parser.add_argument_group("Advanced arguments")
basic.add_argument("-h", "--help", action="help", help="show this help message and exit") basic.add_argument("-h", "-help", action="help", help="show this help message and exit")
required.add_argument("-input", "-i", help="Reads file", required=True) required.add_argument("-input", "-i", help="Reads file", required=True)
required.add_argument("-output", "-o", help="Target directory for the pipeline results", required=True) required.add_argument("-output", "-o", help="Target directory for the pipeline results", required=True)
basic.add_argument("-reference", "-r", help="Reference for alignment and statistics") basic.add_argument("-reference", "-r", help="Reference for alignment and statistics")
advanced.add_argument("-list", help="A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)") #advanced.add_argument("-list", help="A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)")
advanced.add_argument("-size", "-s", help="The desired region size (default: maximum)", default='0') advanced.add_argument("-size", "-s", help="The desired region size (default: maximum)", default='0')
basic.add_argument("-tools", "-t", help="The list of tools to use in the meta-consensus (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])", default="['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle']") basic.add_argument("-tools", "-t", help="The list of tools to use in the meta-consensus (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])", default="['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle']")
advanced.add_argument("-consensus_threshold", "-ct", help="Threshold(s) used for the MSA consensus step (default: [70])", default="[70]") advanced.add_argument("-consensus_threshold", "-ct", help="Threshold(s) used for the MSA consensus step (default: [50])", default="[50]")
advanced.add_argument("-metaconsensus_threshold", "-mt", help="Threshold(s) used for the Meta-consensus result (default: [60])", default="[60]") advanced.add_argument("-metaconsensus_threshold", "-mt", help="Threshold(s) used for the Meta-consensus result (default: [50])", default="[50]")
advanced.add_argument("-depth", "-d", help="The depth used in the process (default: max)", default="0") advanced.add_argument("-depth", "-d", help="The depth used in the process (default: max)", default="0")
#parser.add_argument("-plot", help="Analyse the meta-consensus and MSA consensus quality (requires reference)", action='store_true') #parser.add_argument("-plot", help="Analyse the meta-consensus and MSA consensus quality (requires reference)", action='store_true')
advanced.add_argument("-region_overlap", help="The size of the overlap between regions", default="0") #advanced.add_argument("-region_overlap", help="The size of the overlap between regions", default="0")
basic.add_argument("-cores", "-c", help="The amount of cores to use in the pipeline run (default 1)", default=1) basic.add_argument("-cores", "-c", help="The amount of cores to use in the pipeline run (default 1)", default=1)
args = parser.parse_args() args = parser.parse_args()
...@@ -40,29 +40,29 @@ def create_config() : ...@@ -40,29 +40,29 @@ def create_config() :
if args.reference: if args.reference:
c_file.write("ref_file: "+args.reference+"\n") c_file.write("ref_file: "+args.reference+"\n")
region_param = False region_param = False
if args.list: # if args.list:
r = args.list # r = args.list
r = r.replace("[", "") # r = r.replace("[", "")
r = r.replace("]","") # r = r.replace("]","")
regions = r.split(",") # regions = r.split(",")
actual_regions = "[" # actual_regions = "["
last_elem = None # last_elem = None
for elem in regions: # for elem in regions:
qt = "" if "'" in elem else "'" # qt = "" if "'" in elem else "'"
if '_' in elem: # if '_' in elem:
actual_regions+= qt+ elem.replace('"', "'").replace(" ", "") +qt +"," # actual_regions+= qt+ elem.replace('"', "'").replace(" ", "") +qt +","
elif args.size != 0: # elif args.size != 0:
actual_regions+= qt+ elem.replace('"', "'").replace(" ", "")+"_"+str(int(elem)+int(args.size))+ qt+ "," # actual_regions+= qt+ elem.replace('"', "'").replace(" ", "")+"_"+str(int(elem)+int(args.size))+ qt+ ","
elif last_elem: # elif last_elem:
actual_regions+= qt+last_elem+'_'+elem.replace('"', "'").replace(" ", "")+qt+"," # actual_regions+= qt+last_elem+'_'+elem.replace('"', "'").replace(" ", "")+qt+","
last_elem = elem # last_elem = elem
else: # else:
last_elem = elem # last_elem = elem
actual_regions = actual_regions[:-1] # actual_regions = actual_regions[:-1]
c_file.write("region: "+actual_regions+"]\n") # c_file.write("region: "+actual_regions+"]\n")
else: # else:
c_file.write("region_size: " + args.size+"\n") c_file.write("region_size: " + args.size+"\n")
c_file.write("region_overlap: "+ args.region_overlap+"\n") # c_file.write("region_overlap: "+ args.region_overlap+"\n")
c_file.write("tool: "+ args.tools+ "\n") c_file.write("tool: "+ args.tools+ "\n")
c_file.write("threshold_1: "+ args.consensus_threshold +"\n") c_file.write("threshold_1: "+ args.consensus_threshold +"\n")
c_file.write("threshold_2: "+ args.metaconsensus_threshold +"\n") c_file.write("threshold_2: "+ args.metaconsensus_threshold +"\n")
......
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
import sys import sys
DEFAULT_DISTANCE = 10 DEFAULT_DISTANCE = 10
LINE_HEADER_SIZE = 9
def main(): def main():
if len(sys.argv) > 3: if len(sys.argv) > 3:
...@@ -15,11 +16,11 @@ def main(): ...@@ -15,11 +16,11 @@ def main():
line_cnt = 1 line_cnt = 1
problems = [] problems = []
while line_cnt < len(lines): while line_cnt < len(lines):
char_cnt = 9 char_cnt = LINE_HEADER_SIZE
while char_cnt < len(lines[line_cnt])-9: while char_cnt < len(lines[line_cnt])-LINE_HEADER_SIZE:
if lines[line_cnt][char_cnt] != "|": if lines[line_cnt][char_cnt] != "|":
print((char_cnt, lines[line_cnt-1][char_cnt], lines[line_cnt][char_cnt], lines[line_cnt+1][char_cnt])) print((char_cnt, lines[line_cnt-1][char_cnt], lines[line_cnt][char_cnt], lines[line_cnt+1][char_cnt]))
problems.append((line_cnt, char_cnt-8)) problems.append((line_cnt, char_cnt-LINE_HEADER_SIZE+1))
char_cnt += distance char_cnt += distance
else: else:
char_cnt += 1 char_cnt += 1
...@@ -38,7 +39,7 @@ def main(): ...@@ -38,7 +39,7 @@ def main():
start = y - distance if y - distance > 0 else 0 start = y - distance if y - distance > 0 else 0
end = y + distance if y + distance < len(target[1]) else len(target[1])-1 end = y + distance if y + distance < len(target[1]) else len(target[1])-1
output.write("Error "+str(pb_cnt) + ":\n" + str(value1 + start)+ ": " + target[1][start:end] + "\n") output.write("Error "+str(pb_cnt) + ":\n" + str(value1 + start)+ ": " + target[1][start:end] + "\n")
output.write(str(value1+start)+": "+lines[x][start+9:end+9]+"\n") output.write(str(value1+start)+": "+lines[x][start+LINE_HEADER_SIZE-1:end+LINE_HEADER_SIZE-1]+"\n")
output.write(str(value2 + start) + ": " + query[1][start:end] + "\n\n") output.write(str(value2 + start) + ": " + query[1][start:end] + "\n\n")
pb_cnt += 1 pb_cnt += 1
......
>NC_001792.2 Porcine circovirus 1, complete genome
GTAGTATTACCAGCGCACTTCGGCAGCGGCAGCACCTCGGCAGCGTCAGTGAAAATGCCAAGCAAGAAAAGCGGCCCGCAACCCCATAAGAGGTGGGTGTTCACCCTTAATAATCCTTCCGAGGAGGAGAAAAACAAAATACGGGAGCTTCCAATCTCCCTTTTTGATTATTTTGTTTGCGGAGAGGAAGGTTTGGAAGAGGGTAGAACTCCTCACCTCCAGGGGTTTGCGAATTTTGCTAAGAAGCAGACTTTTAACAAGGTGAAGTGGTATTTTGGTGCCCGCTGCCACATCGAGAAAGCGAAAGGAACCGACCAGCAGAATAAAGAATACTGCAGTAAAGAAGGTCACATACTTATCGAGTGTGGAGCTCCGCGGAACCAGGGGAAGCGCAGCGACCTGTCTACTGCTGTGAGTACCCTTTTGGAGACGGGGTCTTTGGTGACTGTAGCCGAGCAGTTCCCTGTAACGTATGTGAGAAATTTCCGCGGGCTGGCTGAACTTTTGAAAGTGAGCGGGAAGATGCAGCAGCGTGATTGGAAGACAGCTGTACACGTCATAGTGGGCCCGCCCGGTTGTGGGAAGAGCCAGTGGGCCCGTAATTTTACTGAGCCTAGCGACACCTACTGGAAGCCTAGTAGAAATAAGTGGTGGGATGGATATCATGGAGAAGAAGTTGTTGTTTTGGATGATTTTTATGGCTGGTTACCTTGGGATGATCTACTGAGACTGTGTGACCGGTATCCATTGACTGTAGAGACTAAAGGCGGTACTGTTCCTTTTTTGGCCCGCAGTATTTTGATTACCAGCAATCAGGCCCCCCAGGAATGGTACTCCTCAACTGCTGTCCCAGCTGTAGAAGCTCTCTATCGGAGGATTACTACTTTGCAATTTTGGAAGACTGCTGGAGAACAATCCACGGAGGTACCCGAAGGCCGATTTGAAGCAGTGGACCCACCCTGTGCCCTTTTCCCATATAAAATAAATTACTGAGTCTTTTTTGTTATCACATCGTAATGGTTTTTATTTTTATTTATTTAGAGGTCTTTTAGGATAAATTCTCTGAATTGTACATAAATAGTCAGCCTTACCACATAATTTTGGGCTGTGGCTGCATTTTGGAGCGCATAGCCGAGGCCTGTGTGCTCGACATTGGTGTGGGTATTTAAATGGAGCCACAGCTGGTTTCTTTTATTATTTGGGTGGAACCAATCAATTGTTTGGTCCAGCTCAGGTTTGGGGGTGAAGTACCTGGAGTGGTAGGTAAAGGGCTGCCTTATGGTGTGGCGGGAGGAGTAGTTAATATAGGGGTCATAGGCCAAGTTGGTGGAGGGGGTTACAAAGTTGGCATCCAAGATAACAACAGTGGACCCAACACCTCTTTGATTAGAGGTGATGGGGTCTCTGGGGTAAAATTCATATTTAGCCTTTCTAATACGGTAGTATTGGAAAGGTAGGGGTAGGGGGTTGGTGCCGCCTGAGGGGGGGAGGAACTGGCCGATGTTGAATTTGAGGTAGTTAACATTCCAAGATGGCTGCGAGTATCCTCCTTTTATGGTGAGTACCAATTCTGTAGAAAGGCGGGAATTGAAGATACCCGTCTTTCGGCGCCATCTGTAACGGTTTCTGAAGGCGGGGTGTGCCAAATATGGTCTTCTCCGGAGGATGTTTCCAAGATGGCTGCGGGGGCGGGTCCTTCTTCTGCGGTAACGCCTCCTTGGCCACGTCATCCTATAAAAGTGAAAGAAGTGCGCTGCT
This diff is collapsed.
File added
This diff is collapsed.
read_file: ../Lambda_phage/lambda_1000x_10p.fa read_file: ../data_yeast_diploid/diploid_reads_shuffle.fastq
output_folder: Lambda_10 output_folder: ../TESTS/Yeast_Diplo
ref_file: ../Lambda_phage/lambda_virus.fa ref_file: ../data_yeast_diploid/ref_diploid.fasta
region_size: 5000 region_size: 2000
region_overlap: 0
tool: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'] tool: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle']
threshold_1: [70] threshold_1: [50]
threshold_2: [60] threshold_2: [50]
depth: 30 depth: [30,40]
...@@ -3,12 +3,20 @@ ...@@ -3,12 +3,20 @@
import sys import sys
def find_nth(string, substring, n):
if (n == 1):
return string.find(substring)
else:
return string.find(substring, find_nth(string, substring, n - 1) + 1)
def main(): def main():
args = sys.argv args = sys.argv
with open(args[1], "w") as output: with open(args[1], "w") as output:
with open(args[2], 'r') as file: with open(args[2], 'r') as file:
lines = file.readlines() lines = file.readlines()
output.write(">meta_consensus\n"+ lines[1][:-1]) output.write(">"+lines[0][:find_nth(lines[0], "_", 3)]+"\n"+ lines[1][:-1])
file_cnt = 3 file_cnt = 3
while file_cnt < len(args): while file_cnt < len(args):
with open(args[file_cnt], "r") as file: with open(args[file_cnt], "r") as file:
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment