Added new strategy for concatenation and test data

40a0400e · Lihouck Flavien · 717dc76c · 40a0400e · 40a0400e · 40a0400e
Commit 40a0400e authored Jul 19, 2022 by Lihouck Flavien
--- a/.gitignore
+++ b/.gitignore
@@ -12,4 +12,4 @@ logs/
 *.bam
 *.sam
 *.txt
-worflow/scripts/config_conda.sh
+workflow/scripts/config_conda.sh
--- a/README.md
+++ b/README.md
-# Stage
+# Meta Consensus MSA
-Dépot de travail pour versionner le travail réalisé pendant le stage
+This tools aims to extract a DNA sequence from reads applying a consensus strategy to several MSA built from the reads, creating a meta-MSA from these consensus, and applying the consensus strategy on the meta-MSA, creating what we call a meta-consensus.
 ## Requirements
@@ -13,61 +13,72 @@ To install Conda check : https://docs.conda.io/projects/conda/en/latest/user-gui
 You only need to download this repository in order to launch the pipeline, the first run will handle the setup for every tools using Conda.
 ## Usage
-usage: mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]  
+> 
+> usage: mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]  
-Creates the config file then runs the Meta-consensus pipeline  
+>
+>Creates the config file then runs the Meta-consensus pipeline  
-Required arguments:  
+>
- -input INPUT, -i INPUT  
+> Required arguments:  
+> -input INPUT, -i INPUT  
 > Reads file
+> -output OUTPUT, -o OUTPUT  
-output OUTPUT, -o OUTPUT  
 > Target directory for the pipeline results  
+>
-Standard arguments:  
+>Standard arguments:  
-h, --help            
+>-h, --help            
 >  show this help message and exit  
+>
- -reference REFERENCE, -r REFERENCE  
+> -reference REFERENCE, -r REFERENCE  
 >  Reference for alignment and statistics  
+>
-tools TOOLS, -t TOOLS  
+>-tools TOOLS, -t TOOLS  
 > The list of tools to use in the meta-consensus  
 > (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])  
+>
-cores CORES, -c CORES  
+>-cores CORES, -c CORES  
 >The amount of cores to use in the pipeline run (default 1)  
+>
-Advanced arguments:  
+>Advanced arguments:  
-  -list LIST            
+>  -list LIST            
 >  A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)  
+>
-size SIZE, -s SIZE
+>-size SIZE, -s SIZE
 > The desired region size (default: maximum)  
+>
-consensus_threshold CONSENSUS_THRESHOLD, -ct CONSENSUS_THRESHOLD  
+>-consensus_threshold CONSENSUS_THRESHOLD, -ct CONSENSUS_THRESHOLD  
 > Threshold(s) used for the MSA consensus step (default: [70])  
+>
-  -metaconsensus_threshold METACONSENSUS_THRESHOLD, -mt METACONSENSUS_THRESHOLD  
+>  -metaconsensus_threshold METACONSENSUS_THRESHOLD, -mt METACONSENSUS_THRESHOLD  
 > Threshold(s) used for the Meta-consensus result (default: [60])  
+>
-  -depth DEPTH, -d DEPTH  
+>  -depth DEPTH, -d DEPTH  
 > The depth used in the process (default: max)  
+### Example
+To run the tool with our provided dataset, run the command :
+> $./mc_msa.py -i test_data/stup_virus_seq.fa -o output/stup_virus -r test_data/Stup_virus.fa
+This will setup the pipeline and start a run on the reads in the file `stup_virus_seq.fa` , with the reference file `Stup_virus.fa`
+and every produced by the pipeline will be in the folder `output/stup_virus`. The meta-consensus sequence can be found at `output/stup_virus/meta_consensus/meta_consensus_d0_t1_50_t2_50.fasta`.
 ### Input
 The input reads file, in the fasta format.
 ### Output
-The output folder will contain 4 folders at the end of a pipeline run:
+The output folder will contain 6 folders at the end of a pipeline run with a reference, and 4 folders without a reference:
 - meta-consensus : the resulting meta-consensus, for each region and with each specified thresholds and depths combination.
 - consensus: the intermediary consensus for every MSA, stored in a folder tree including *region*/*depth*/*consenus_threshold*/*metaconsensus_threshold* and consensus alignment
- data : the cut reads, calculated MSAs, and possibly cut-reference. 
+- data : the preliminary alignment, the cut reads, calculated MSAs, and possibly cut-reference. 
 You can use the pipeline with pre-processed MSA by adding the MSA in `output/data/msa`, naming them MSA_*TOOL*_r*START_END*_d*DEPTH*.fasta with *TOOL* the tool used, *START* and *END* the limits of the region, and *DEPTH* the read depth for the MSA.
 - logs: all the logs for the pipeline **will** be here in the final version (for now, some logs end up in the consensus folder ...)
+- plot [requires reference] : 
 ### Region selection
 There are 2 (two) main ways of setting up how the regions are selected.     
@@ -85,9 +96,16 @@ This comes from limitations from the MSA tools themself, as for example abPOA an
 ### Depth
 ## Authors and acknowledgment
 Flavien Lihouck
 Special thanks to Coralie Rohmer's work on the tool MSA-limit, which inspired and was used in many parts of this project.
 ## License
+Conda is released under the 3-clause BSD license (https://docs.conda.io/en/latest/license.html)
+Snakemake is licensed under the MIT License (https://snakemake.readthedocs.io/en/stable/project_info/license.html)
 Probably CC_SA ?
--- a/mc_msa.py
+++ b/mc_msa.py
@@ -11,22 +11,22 @@ config_file = "workflow/config.yaml"
 def create_config() :
    pattern = ""
-    parser = argparse.ArgumentParser(usage="mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]" ,description="Creates the config file then runs the Meta-consensus pipeline", add_help=False)
+    parser = argparse.ArgumentParser(usage="mc_msa [--help] -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]\n", description="Creates the config file then runs the Meta-consensus pipeline", add_help=False)
    required = parser.add_argument_group("Required arguments")
    basic = parser.add_argument_group("Standard arguments")
    advanced = parser.add_argument_group("Advanced arguments")
-    basic.add_argument("-h", "--help", action="help", help="show this help message and exit")
+    basic.add_argument("-h", "-help", action="help", help="show this help message and exit")
    required.add_argument("-input", "-i", help="Reads file", required=True)
    required.add_argument("-output", "-o", help="Target directory for the pipeline results", required=True)
    basic.add_argument("-reference", "-r", help="Reference for alignment and statistics")
-    advanced.add_argument("-list", help="A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)")
+    #advanced.add_argument("-list", help="A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)")
    advanced.add_argument("-size", "-s", help="The desired region size (default: maximum)", default='0')
    basic.add_argument("-tools", "-t", help="The list of tools to use in the meta-consensus (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])", default="['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle']")
-    advanced.add_argument("-consensus_threshold", "-ct", help="Threshold(s) used for the MSA consensus step (default:  [70])", default="[70]")
+    advanced.add_argument("-consensus_threshold", "-ct", help="Threshold(s) used for the MSA consensus step (default:  [50])", default="[50]")
-    advanced.add_argument("-metaconsensus_threshold", "-mt",  help="Threshold(s) used for the Meta-consensus result (default: [60])", default="[60]")
+    advanced.add_argument("-metaconsensus_threshold", "-mt",  help="Threshold(s) used for the Meta-consensus result (default: [50])", default="[50]")
    advanced.add_argument("-depth", "-d", help="The depth used in the process (default: max)", default="0")
    #parser.add_argument("-plot", help="Analyse the meta-consensus and MSA consensus quality (requires reference)", action='store_true')
-    advanced.add_argument("-region_overlap", help="The size of the overlap between regions", default="0")
+    #advanced.add_argument("-region_overlap", help="The size of the overlap between regions", default="0")
    basic.add_argument("-cores", "-c", help="The amount of cores to use in the pipeline run (default 1)", default=1)
    args = parser.parse_args()
@@ -40,29 +40,29 @@ def create_config() :
        if args.reference:
            c_file.write("ref_file: "+args.reference+"\n")
        region_param = False
-        if args.list:
+ #       if args.list:
-            r = args.list
+ #           r = args.list
-            r = r.replace("[", "")
+ #           r = r.replace("[", "")
-            r = r.replace("]","")
+ #           r = r.replace("]","")
-            regions = r.split(",")
+ #           regions = r.split(",")
-            actual_regions = "["
+ #           actual_regions = "["
-            last_elem = None
+ #           last_elem = None
-            for elem in regions:
+ #           for elem in regions:
-                qt = "" if "'" in elem else "'"
+ #               qt = "" if "'" in elem else "'"
-                if '_' in elem:
+ #               if '_' in elem:
-                    actual_regions+= qt+ elem.replace('"', "'").replace(" ", "") +qt +","
+ #                   actual_regions+= qt+ elem.replace('"', "'").replace(" ", "") +qt +","
-                elif args.size != 0:
+ #               elif args.size != 0:
-                    actual_regions+= qt+ elem.replace('"', "'").replace(" ", "")+"_"+str(int(elem)+int(args.size))+ qt+ ","
+ #                   actual_regions+= qt+ elem.replace('"', "'").replace(" ", "")+"_"+str(int(elem)+int(args.size))+ qt+ ","
-                elif last_elem:
+ #               elif last_elem:
-                    actual_regions+= qt+last_elem+'_'+elem.replace('"', "'").replace(" ", "")+qt+","
+ #                   actual_regions+= qt+last_elem+'_'+elem.replace('"', "'").replace(" ", "")+qt+","
-                    last_elem = elem
+ #                   last_elem = elem
-                else:
+ #               else:
-                    last_elem = elem
+  #                  last_elem = elem
-            actual_regions = actual_regions[:-1]
+   #         actual_regions = actual_regions[:-1]
-            c_file.write("region: "+actual_regions+"]\n")
+    #        c_file.write("region: "+actual_regions+"]\n")
-        else:
+     #   else:
        c_file.write("region_size: " + args.size+"\n")
-            c_file.write("region_overlap: "+ args.region_overlap+"\n")
+    #    c_file.write("region_overlap: "+ args.region_overlap+"\n")
        c_file.write("tool: "+ args.tools+ "\n")
        c_file.write("threshold_1: "+ args.consensus_threshold +"\n")
        c_file.write("threshold_2: "+ args.metaconsensus_threshold +"\n")

--- a/src/util/exonerate_parser.py
+++ b/src/util/exonerate_parser.py
@@ -3,6 +3,7 @@
 import sys
 DEFAULT_DISTANCE = 10
+LINE_HEADER_SIZE = 9
 def main():
    if len(sys.argv) > 3:
@@ -15,11 +16,11 @@ def main():
        line_cnt = 1
        problems = []
        while line_cnt < len(lines):
-            char_cnt = 9
+            char_cnt = LINE_HEADER_SIZE
-            while char_cnt < len(lines[line_cnt])-9:
+            while char_cnt < len(lines[line_cnt])-LINE_HEADER_SIZE:
                if lines[line_cnt][char_cnt] != "|":
                    print((char_cnt, lines[line_cnt-1][char_cnt], lines[line_cnt][char_cnt], lines[line_cnt+1][char_cnt]))
-                    problems.append((line_cnt, char_cnt-8))
+                    problems.append((line_cnt, char_cnt-LINE_HEADER_SIZE+1))
                    char_cnt += distance
                else:
                    char_cnt += 1
@@ -38,7 +39,7 @@ def main():
                start = y - distance if y - distance > 0 else 0
                end = y + distance if y + distance < len(target[1]) else len(target[1])-1
                output.write("Error "+str(pb_cnt) + ":\n" + str(value1 + start)+ ": " + target[1][start:end] + "\n")
-                output.write(str(value1+start)+": "+lines[x][start+9:end+9]+"\n")
+                output.write(str(value1+start)+": "+lines[x][start+LINE_HEADER_SIZE-1:end+LINE_HEADER_SIZE-1]+"\n")
                output.write(str(value2 + start) + ": " + query[1][start:end] + "\n\n")
                pb_cnt += 1

--- a/test_data/Stup_virus.fa
+++ b/test_data/Stup_virus.fa
+>NC_001792.2 Porcine circovirus 1, complete genome
+GTAGTATTACCAGCGCACTTCGGCAGCGGCAGCACCTCGGCAGCGTCAGTGAAAATGCCAAGCAAGAAAAGCGGCCCGCAACCCCATAAGAGGTGGGTGTTCACCCTTAATAATCCTTCCGAGGAGGAGAAAAACAAAATACGGGAGCTTCCAATCTCCCTTTTTGATTATTTTGTTTGCGGAGAGGAAGGTTTGGAAGAGGGTAGAACTCCTCACCTCCAGGGGTTTGCGAATTTTGCTAAGAAGCAGACTTTTAACAAGGTGAAGTGGTATTTTGGTGCCCGCTGCCACATCGAGAAAGCGAAAGGAACCGACCAGCAGAATAAAGAATACTGCAGTAAAGAAGGTCACATACTTATCGAGTGTGGAGCTCCGCGGAACCAGGGGAAGCGCAGCGACCTGTCTACTGCTGTGAGTACCCTTTTGGAGACGGGGTCTTTGGTGACTGTAGCCGAGCAGTTCCCTGTAACGTATGTGAGAAATTTCCGCGGGCTGGCTGAACTTTTGAAAGTGAGCGGGAAGATGCAGCAGCGTGATTGGAAGACAGCTGTACACGTCATAGTGGGCCCGCCCGGTTGTGGGAAGAGCCAGTGGGCCCGTAATTTTACTGAGCCTAGCGACACCTACTGGAAGCCTAGTAGAAATAAGTGGTGGGATGGATATCATGGAGAAGAAGTTGTTGTTTTGGATGATTTTTATGGCTGGTTACCTTGGGATGATCTACTGAGACTGTGTGACCGGTATCCATTGACTGTAGAGACTAAAGGCGGTACTGTTCCTTTTTTGGCCCGCAGTATTTTGATTACCAGCAATCAGGCCCCCCAGGAATGGTACTCCTCAACTGCTGTCCCAGCTGTAGAAGCTCTCTATCGGAGGATTACTACTTTGCAATTTTGGAAGACTGCTGGAGAACAATCCACGGAGGTACCCGAAGGCCGATTTGAAGCAGTGGACCCACCCTGTGCCCTTTTCCCATATAAAATAAATTACTGAGTCTTTTTTGTTATCACATCGTAATGGTTTTTATTTTTATTTATTTAGAGGTCTTTTAGGATAAATTCTCTGAATTGTACATAAATAGTCAGCCTTACCACATAATTTTGGGCTGTGGCTGCATTTTGGAGCGCATAGCCGAGGCCTGTGTGCTCGACATTGGTGTGGGTATTTAAATGGAGCCACAGCTGGTTTCTTTTATTATTTGGGTGGAACCAATCAATTGTTTGGTCCAGCTCAGGTTTGGGGGTGAAGTACCTGGAGTGGTAGGTAAAGGGCTGCCTTATGGTGTGGCGGGAGGAGTAGTTAATATAGGGGTCATAGGCCAAGTTGGTGGAGGGGGTTACAAAGTTGGCATCCAAGATAACAACAGTGGACCCAACACCTCTTTGATTAGAGGTGATGGGGTCTCTGGGGTAAAATTCATATTTAGCCTTTCTAATACGGTAGTATTGGAAAGGTAGGGGTAGGGGGTTGGTGCCGCCTGAGGGGGGGAGGAACTGGCCGATGTTGAATTTGAGGTAGTTAACATTCCAAGATGGCTGCGAGTATCCTCCTTTTATGGTGAGTACCAATTCTGTAGAAAGGCGGGAATTGAAGATACCCGTCTTTCGGCGCCATCTGTAACGGTTTCTGAAGGCGGGGTGTGCCAAATATGGTCTTCTCCGGAGGATGTTTCCAAGATGGCTGCGGGGGCGGGTCCTTCTTCTGCGGTAACGCCTCCTTGGCCACGTCATCCTATAAAAGTGAAAGAAGTGCGCTGCT
--- a/test_data/stup_virus_seq.fa
+++ b/test_data/stup_virus_seq.fa
--- a/test_data/stup_virus_seq.fa.gz
+++ b/test_data/stup_virus_seq.fa.gz
--- a/workflow/Snakefile
+++ b/workflow/Snakefile
--- a/workflow/config.yaml
+++ b/workflow/config.yaml
-read_file: ../Lambda_phage/lambda_1000x_10p.fa
+read_file: ../data_yeast_diploid/diploid_reads_shuffle.fastq
-output_folder: Lambda_10
+output_folder: ../TESTS/Yeast_Diplo
-ref_file: ../Lambda_phage/lambda_virus.fa
+ref_file: ../data_yeast_diploid/ref_diploid.fasta
-region_size: 5000
+region_size: 2000
-region_overlap: 0
 tool: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle']
-threshold_1: [70]
+threshold_1: [50]
-threshold_2: [60]
+threshold_2: [50]
-depth: 30 
+depth: [30,40] 
--- a/workflow/scripts/concat_regions.py
+++ b/workflow/scripts/concat_regions.py
@@ -3,12 +3,20 @@
 import sys
+def find_nth(string, substring, n):
+    if (n == 1):
+        return string.find(substring)
+    else:
+        return string.find(substring, find_nth(string, substring, n - 1) + 1)
 def main():
    args = sys.argv
    with open(args[1], "w") as output:
        with open(args[2], 'r') as file:
            lines = file.readlines()
-            output.write(">meta_consensus\n"+ lines[1][:-1])
+            output.write(">"+lines[0][:find_nth(lines[0], "_", 3)]+"\n"+ lines[1][:-1])
        file_cnt = 3
        while file_cnt < len(args):
            with open(args[file_cnt], "r") as file: