diff --git a/README.md b/README.md index d65a500..6b0f17a 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,6 @@ This program is currently being developed and in an alpha state. You are welcome * [How it works](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/how_varvamp_works.md) * [FAQ](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/FAQ.md) -Buy Me A Coffee --- **Important disclaimer:** diff --git a/docs/FAQ.md b/docs/FAQ.md index a4a9d2b..8ed1048 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -18,7 +18,7 @@ In your case varVAMP could not find suitable replacement primers in the TILED mo 4. **I have multiple hits after SANGER/QPCR mode. Which should I use?** -varVAMP sorts all amplicons and qpcr designs by score and always assigns the lowest number to the best one of non-overlapping amplicons/qpcr schemes. If you are not interested in a specific gene region, amplicon_0 or qpcr_scheme_0 are your best candidates! +varVAMP sorts all amplicons and qpcr designs by their penalty and always assigns the lowest number to the non-overlapping amplicon/qpcr scheme with the lowest penalty. If you are not interested in a specific gene region, amplicon_0 or qpcr_scheme_0 are your best candidates! 5. **What is deltaG reported for the QPCR mode?** diff --git a/docs/how_varvamp_works.md b/docs/how_varvamp_works.md index 610ea36..8db0b79 100644 --- a/docs/how_varvamp_works.md +++ b/docs/how_varvamp_works.md @@ -23,28 +23,28 @@ varVAMP searches for potential primer regions as defined by a user-defined numbe varVAMP uses [`primer3-py`](https://pypi.org/project/primer3-py/) to search for potential primers. Some of the evaluation process, determining if primers match certain criteria, was adapted from [`primalscheme`](www.github.com/aresti/primalscheme). The primer search contains multiple steps: 1. Digest the primer regions into kmers with the min and max length of primers. This is performed on a consensus sequence that does not contain ambiguous characters but is just the majority consensus of the alignment. Therefore, primer parameters will be later calculated for the best fitting primer. 2. Evaluate if these kmers are potential primers independent of their orientation (temperature, GC, size, poly-x repeats and poly dinucleotide repeats) and dependent on their orientation (secondary structure, GC clamp, number of GCs in the last 5 bases of the 3' end and min 3' nucleotides without an ambiguous base). Filter for kmers that satisfy all constraints and calculate their penalties (explained in the last section). -3. Sanger and tiled mode: Find lowest scoring primer. varVAMP sorts the primers by their score and always takes the best scoring if middle third of the primer has not been covered by a better primer. This greatly reduces the complexity of the later amplicon search while only retaining the best scoring primer of a set of overlapping primers. +3. Sanger and tiled mode: Find the primers with the lowest penalties. varVAMP sorts the primers by their penalty and always takes the one with the lowest penalty if the middle third of the primer has not been covered by a primer with a lower penalty. This greatly reduces the complexity of the later amplicon search while only retaining the best primer of a set of overlapping primers (see the sketch below).
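+
+The selection in step 3 can be pictured with a minimal sketch (hypothetical primer tuples of the form `(seq, start, stop, penalty)`; the actual logic lives in `find_best_primers()` in `varvamp/scripts/primers.py`):
+
+```python
+primer_candidates = [
+    ("atgc...", 100, 121, 0.8),
+    ("tgca...", 103, 124, 1.2),
+    ("gatc...", 110, 131, 0.9),
+]
+# sort by penalty first, then by start position
+primer_candidates.sort(key=lambda x: (x[3], x[1]))
+retained, covered = [], set()
+for seq, start, stop, penalty in primer_candidates:
+    third = (stop - start) // 3
+    middle_third = set(range(start + third, stop - third))
+    # keep a primer only if its middle third is not yet covered
+    if not middle_third & covered:
+        retained.append((seq, start, stop, penalty))
+        covered.update(range(start, stop + 1))
+```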
### Amplicon search #### Amplicon-tiling -To search for the best scoring amplicon, varVAMP uses a graph based approach. +To search for the best amplicon, varVAMP uses a graph-based approach (sketched after the list). 1. Create all possible amplicons with the given length constraints and ensure that primer pairs are not forming dimers. -2. Create a graph containing all amplicons and their potential neighboring amplicons. To design a good scheme, the next primer has to lie within the second half of the current primer and satisfy the overlap constraint. The cost to go to a neighboring amplicon is determined by the amplicon score. +2. Create a graph containing all amplicons and their potential neighboring amplicons. To design a good scheme, the next primer has to lie within the second half of the current primer and satisfy the overlap constraint. The cost to go to a neighboring amplicon is determined by the amplicon penalty. 3. Use the [Dijkstra algorithm](https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm) to find the path with the lowest costs from a given start node. 4. Determine potential stop nodes as amplicons with the furthest stop in the alignment. -5. Determine shortest paths between the start and all potential stop nodes. Get the lowest scoring. -6. Repeat steps 3-5 for each start node until the best coverage is reached. Voila! We have the best scoring amplicon scheme! -7. Lastly, the best scoring scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates for each primer their overlapping and previously not considered primers (primer search step 2) and again minimizes the score. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers. +5. Determine shortest paths between the start and all potential stop nodes. Keep the one with the lowest cost. +6. Repeat steps 3-5 for each start node until the best coverage is reached. Voila! We have the best amplicon scheme with the lowest cumulative primer penalties with respect to the amplicon length! +7. Lastly, the best scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates, for each primer, the overlapping and previously not considered primers (primer search step 2) and again minimizes the penalty. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers.
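+
+A minimal, self-contained sketch of the graph traversal in steps 3-6 (toy costs and hypothetical amplicon names; varVAMP's actual implementation lives in `varvamp/scripts/scheme.py`):
+
+```python
+import heapq
+
+def dijkstra(graph, start):
+    """Return the lowest cost to reach every node from the start node."""
+    costs = {node: float("infinity") for node in graph}
+    costs[start] = 0
+    queue = [(0, start)]
+    while queue:
+        cost, node = heapq.heappop(queue)
+        if cost > costs[node]:
+            continue  # stale queue entry
+        for neighbor, weight in graph[node].items():
+            new_cost = cost + weight
+            if new_cost < costs[neighbor]:
+                costs[neighbor] = new_cost
+                heapq.heappush(queue, (new_cost, neighbor))
+    return costs
+
+# edges connect an amplicon to its valid overlapping neighbors;
+# the edge weight is the neighboring amplicon's penalty
+graph = {
+    "amplicon_0": {"amplicon_1": 2.1, "amplicon_2": 3.4},
+    "amplicon_1": {"amplicon_3": 1.8},
+    "amplicon_2": {"amplicon_3": 0.9},
+    "amplicon_3": {},
+}
+print(dijkstra(graph, "amplicon_0"))  # amplicon_3 is reached cheapest via amplicon_1 (3.9)
+```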
#### Sanger sequencing -1. varVAMP sorts all amplicons by their score and takes the non-overlapping amplicon with the lowest score! -2. As varVAMP gives a size penalty to amplicons, varVAMP automatically finds amplicons with low primer scores close to your optimal length (if possible). +1. varVAMP sorts all amplicons by their penalties and takes the non-overlapping amplicon with the lowest penalty! +2. As varVAMP gives a size penalty to amplicons, varVAMP automatically finds amplicons with low primer penalties close to your optimal length (if possible). #### primer BLAST module 1. varVAMP generates a fasta query and searches for possible hits with the same settings as [primer blast](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-134). 2. For each amplicon varVAMP searches for off-targets, defined as hits in the db that are the maximum amplicons length multiplied by `BLAST_SIZE_MULTI` apart from each other. -3. varVAMP appends a high penalty score to the amplicons score if these producing off-targets. This ensures that all other available amplicons are preferentially used. +3. varVAMP appends a high penalty to the amplicons if they produce off-targets. This ensures that all other available amplicons are preferentially used. 4. Reports if amplicons with off-targets are in the final scheme. #### qPCR @@ -62,24 +62,24 @@ To search for the best scoring amplicon, varVAMP uses a graph based approach. ```python PRIMER_MAX_BASE_PENALTY ``` -Each primer is scored for its deviation from the optimal size, GC content and temperature. Base penalty is the sum of these deviations. Primer base penalties higher than the max base penalty are excluded. +Each primer is penalized for its deviation from the optimal size, GC content and temperature. The base penalty is the sum of these deviations. Primers with a base penalty higher than the max base penalty are excluded. ```python PRIMER_3_PENALTY ``` -Each position in the primer is scored for mismatches in all sequences. If a 3' penalty is given the first in the tuple is multiplied with the frequency mismatch at the very 3' end. The next is multiplied with the -1 freq and so on. Increase penalty if you want to shift amplicons towards best 3' matching. Set the `PRIMER_3_PENALTY` to 0 if you do not care about 3' mismatches. +Each position in the primer is penalized for mismatches in all sequences. If a 3' penalty is given, the first value of the tuple is multiplied with the mismatch frequency at the very 3' end, the next value with the frequency at the -1 position, and so on. Increase the penalty if you want to shift amplicons towards the best 3' matching. Set `PRIMER_3_PENALTY` to 0 if you do not care about 3' mismatches. ```python3 PRIMER_PERMUTATION_PENALTY ``` The number of permutations of a primer is multiplied by the penalty. For example 24 permutations and a penalty of 0.1 will yield a penalty of 2.4. Set `PRIMER_PERMUTATION_PENALTY` to 0 if you do not care about the number of permutations. -All scores of a primer are summed up and yield a final score. The score for each amplicon is then the score of its LEFT + RIGHT primers multiplied by the fold increase of the amplicon length compared to the optional length. This insures that in the final scheme not only large amplicons are used. +All penalties of a primer are summed up and yield a final penalty. The penalty for each amplicon is then the penalty of its LEFT + RIGHT primers multiplied by the fold increase of the amplicon length compared to the optimal length. This ensures that in the final scheme not only large amplicons are used. ```python3 BLAST_PENALTY ``` -If the `-db` argument is used, varVAMP will perform a BLAST search and evaluate off-targets against this database for each amplicon. If an off-target effect is predicted varVAMP will add this penalty to the amplicon score. This insures that this amplicon is only considered if no other amplicons are in this alignment region. +If the `-db` argument is used, varVAMP will perform a BLAST search and evaluate off-targets against this database for each amplicon. If an off-target effect is predicted, varVAMP will add this penalty to the amplicon penalty. This ensures that this amplicon is only considered if no other amplicons are in this alignment region.
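+
+A toy calculation of how these penalties combine (assumed deviations and weights; the length scaling mirrors the cost formula in `varvamp/scripts/scheme.py`):
+
+```python
+import math
+
+# assumed deviations of a primer from the optimum and assumed weights
+# (in varVAMP the weights are PRIMER_TM_PENALTY, PRIMER_GC_PENALTY and
+# PRIMER_SIZE_PENALTY from config.py)
+base_penalty = 1.0 * 1 + 5.0 * 0.2 + 1.0 * 0.5       # temp, gc, size -> 2.5
+permutation_penalty = 24 * 0.1                        # 24 permutations x 0.1 -> 2.4
+primer_penalty = base_penalty + permutation_penalty   # 3' penalty omitted here
+
+# amplicon cost: LEFT + RIGHT primer penalties scaled by the amplicon
+# length relative to the optimal length
+left_penalty, right_penalty = primer_penalty, 4.2     # assumed primer pair
+amplicon_cost = (left_penalty + right_penalty) * math.exp(1200 / 1000)
+```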
#### [Previous: Wet lab protocol](./wet_lab_protocol.md)  [Next: FAQ](./FAQ.md) diff --git a/docs/output.md b/docs/output.md index 505354d..c779d38 100644 --- a/docs/output.md +++ b/docs/output.md @@ -3,22 +3,25 @@ varVAMP produces multiple main output files: -| Mode | Output | Description | -| --- | --- | --- | -| ALL | ambiguous_consensus.fasta | The consensus sequence containing ambiguous nucleotides. | -| ALL | amplicon_plot.pdf | A nice overview for your final amplicon design. | -| ALL| amplicons.bed | A bed file showing the amplicon location compared to the consensus sequence. | -| ALL| per_base_mismatches.pdf | Barplot of the percent mismatches at each nucleotide position of the primer. | -| ALL | primers.bed | A bed file with the primer locations. Includes the primer score. The lower, the better. | -| ALL | varvamp_log.txt | Log file. | -| TILED | unsolvable_primer_dimers.tsv | Only produced if there are primer dimers without replacements. Tells which primers form dimers and at which temperature. -| TILED/SANGER | primer_to_amplicon_assignments.tabular | Simple tab separated file, which primers belong together. Useful for bioinformatic workflows that include primer trimming | -| TILED/SANGER | primer.tsv | A tab separated file with important parameters for the primers including the sequence with ambiguous nucleotides (already in the right strand) and the gc and temperature of the best fitting primer as well as for the mean for all permutations of the primer. | -| QPCR | qpcr_design.tsv | A tab separated file with important parameters for the qPCR amplicon including the deltaG. | -| QPCR | qpcr_primers.tsv | A tab separated file with important parameters for the primers and probes including the sequence with ambiguous nucleotides (already in the right strand) and the gc and temperature of the best fitting primer and probe as well as for the mean for all permutations. | +| Mode         | Output                                 | Description | +|--------------|----------------------------------------|-------------| +| ALL          | ambiguous_consensus.fasta              | The consensus sequence containing ambiguous nucleotides. | +| ALL          | amplicon_plot.pdf                      | A nice overview for your final amplicon design. | +| ALL          | amplicons.bed                          | A bed file showing the amplicon location compared to the consensus sequence. | +| ALL          | per_base_mismatches.pdf                | Barplot of the percent mismatches at each nucleotide position of the primer. | +| ALL          | primers.bed                            | A bed file with the primer locations. Includes the primer penalty. The lower, the better. | +| ALL          | varvamp_log.txt                        | Log file. | +| TILED        | unsolvable_primer_dimers.tsv           | Only produced if there are primer dimers without replacements. Tells which primers form dimers and at which temperature. | +| TILED        | primers_pool_0/1.fasta                 | Primer sequences per pool in fasta format. | +| SANGER       | primers.fasta                          | Primer sequences in fasta format. | +| TILED/SANGER | primer_to_amplicon_assignments.tabular | Simple tab separated file showing which primers belong together. Useful for bioinformatic workflows that include primer trimming. | +| TILED/SANGER | primer.tsv                             | A tab separated file with important parameters for the primers including the sequence with ambiguous nucleotides (already in the right strand) and the gc and temperature of the best fitting primer as well as for the mean for all permutations of the primer. | +| QPCR         | qpcr_design.tsv                        | A tab separated file with important parameters for the qPCR amplicon including the deltaG. | +| QPCR         | qpcr_primers.tsv                       | A tab separated file with important parameters for the primers and probes including the sequence with ambiguous nucleotides (already in the right strand) and the gc and temperature of the best fitting primer and probe as well as for the mean for all permutations. | +| QPCR         | oligos.fasta                           | Oligo sequences in fasta format. |
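+
+For illustration, a single line of `primers.bed` might look like this (made-up values; the columns are the reference name, start, stop, primer name, primer penalty and strand):
+
+```
+ambiguous_consensus	201	224	amplicon_0_FW	3.4	+
+```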
-It also produces some secondary output files [*data/*]: +It also produces some secondary output files in `data`: | Mode | Output | Description | | --- | --- | --- | diff --git a/docs/usage.md b/docs/usage.md index 0c56fc9..0f1d29f 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -149,7 +149,7 @@ BLAST_SETTINGS = { # blast settings for query search } BLAST_MAX_DIFF = 0.8 # allowed % differences between primer and BLAST hit BLAST_SIZE_MULTI = 2 # multiplier for the max_amp size of off targets (in relation to max amp size) -BLAST_PENALTY = 50 # amplicon score increase -> considered only if no other possibilities +BLAST_PENALTY = 50 # amplicon penalty increase -> considered only if no other possibilities ``` To apply these new settings just repeat the installation procedure in the varVAMP dir: ```shell diff --git a/docs/wet_lab_protocol.md b/docs/wet_lab_protocol.md index 9669eac..f6800f5 100644 --- a/docs/wet_lab_protocol.md +++ b/docs/wet_lab_protocol.md @@ -10,7 +10,7 @@ The QIAmp Viral RNA Kit was used to isolate RNA out of viral supernatant followi • Elute in 60 µl Buffer AVE -Thanks to Mathias Schmmerer for providing HepE infected cell cultures. +Thanks to Mathias Schemmerer for providing HepE infected cell cultures. ## Amplicon Generation by SuperScript-IV One-Step-PCR The RNA amplification was performed with the following approach and PCR protocol for every single amplicon. diff --git a/docs/workflow.png b/docs/workflow.png index db35b72..ab554e1 100644 Binary files a/docs/workflow.png and b/docs/workflow.png differ diff --git a/varvamp/__init__.py b/varvamp/__init__.py index 768fe84..489c29d 100644 --- a/varvamp/__init__.py +++ b/varvamp/__init__.py @@ -1,3 +1,3 @@ """Tool to design amplicons for highly variable virusgenomes""" _program = "varvamp" -__version__ = "0.9.5" +__version__ = "1.0" diff --git a/varvamp/command.py b/varvamp/command.py index 7005751..f01c39c 100644 --- a/varvamp/command.py +++ b/varvamp/command.py @@ -83,7 +83,7 @@ def get_args(sysargs): ) par.add_argument( "-th", - "--n-threads", + "--threads", help="number of threads", metavar="1", type=int, @@ -201,7 +201,7 @@ def shared_workflow(args, log_file): alignment_cleaned, gaps_to_mask = alignment.process_alignment( preprocessed_alignment, args.threshold, - args.n_threads + args.threads ) logging.varvamp_progress( log_file, @@ -285,7 +285,7 @@ def sanger_and_tiled_shared_workflow(args, left_primer_candidates, right_primer_ logging.varvamp_progress( log_file, progress=0.7, - job="Considering low scoring primers.", + job="Considering primers with low penalties.", progress_text=f"{len(all_primers['+'])} fw and {len(all_primers['-'])} rw primers" ) @@ -319,7 +319,7 @@ def sanger_and_tiled_shared_workflow(args, left_primer_candidates, right_primer_ query_path, amplicons, args.max_length, - args.n_threads, + args.threads, log_file, mode="sanger_tiled" ) @@ -338,8 +338,8 @@ def sanger_workflow(args, amplicons, all_primers, log_file): logging.varvamp_progress( log_file, progress=0.9, - job="Finding low scoring amplicons.", - progress_text=f"{len(amplicon_scheme[0])} low scoring amplicons." + job="Finding amplicons with low penalties.", + progress_text=f"{len(amplicon_scheme[0])} amplicons."
) return amplicon_scheme @@ -430,7 +430,7 @@ def qpcr_workflow(args, data_dir, alignment_cleaned, ambiguous_consensus, majori progress_text=f"{len(qpcr_probes)} potential qPCR probes" ) - # find unique high scoring amplicons with internal probe + # find unique amplicons with a low penalty and an internal probe qpcr_scheme_candidates = qpcr.find_qcr_schemes(qpcr_probes, left_primer_candidates, right_primer_candidates, majority_consensus, ambiguous_consensus) if not qpcr_scheme_candidates: logging.raise_error( @@ -455,12 +455,12 @@ def qpcr_workflow(args, data_dir, alignment_cleaned, ambiguous_consensus, majori query_path, qpcr_scheme_candidates, config.QAMPLICON_LENGTH[1], - args.n_threads, + args.threads, log_file, mode="qpcr" ) # test amplicons for deltaG - final_schemes = qpcr.test_amplicon_deltaG_parallel(qpcr_scheme_candidates, majority_consensus, args.test_n, args.deltaG, args.n_threads) + final_schemes = qpcr.test_amplicon_deltaG_parallel(qpcr_scheme_candidates, majority_consensus, args.test_n, args.deltaG, args.threads) if not final_schemes: logging.raise_error( "no qPCR amplicon passed the deltaG threshold\n", diff --git a/varvamp/scripts/alignment.py b/varvamp/scripts/alignment.py index 1bbda28..1eab2d8 100644 --- a/varvamp/scripts/alignment.py +++ b/varvamp/scripts/alignment.py @@ -30,7 +30,7 @@ def read_alignment(alignment_path): def preprocess(alignment_path): """ force nucleotides to lower and - back transcripe if its RNA + back transcribe if its RNA """ preprocessed_alignment = [] @@ -68,7 +68,7 @@ def find_internal_gaps(unique_gaps, gap): intersection = unique_set.intersection(current_range) if not intersection: continue - if min(intersection) == unique_gap[0] and max(intersection)+1 == unique_gap[1]: + if min(intersection) == unique_gap[0] and max(intersection) + 1 == unique_gap[1]: overlapping_gaps.append(unique_gap) return overlapping_gaps @@ -110,7 +110,6 @@ def create_gap_dictionary(unique_gaps, all_gaps, n_threads): return gap_dict - def find_gaps_to_mask(gap_dict, cutoff): """ filters gaps for their freq cutoff. 
@@ -136,7 +135,7 @@ def find_gaps_to_mask(gap_dict, cutoff): if opened_region: # write the opened region if the start of the current region # > opened_region[stop] and the last still opened region - if region[0] > opened_region[1] or i == len(potential_gaps)-1: + if region[0] > opened_region[1] or i == len(potential_gaps) - 1: gaps_to_mask.append(opened_region) opened_region = region else: @@ -163,7 +162,7 @@ def clean_gaps(alignment, gaps_to_mask): masked_seq = str() for region in gaps_to_mask: # check if it is three bases or more and mask with 2 Ns - if region[1]-region[0] >= config.QAMPLICON_DEL_CUTOFF: + if region[1] - region[0] >= config.QAMPLICON_DEL_CUTOFF: mask = "NN" # or mask with one N (small deletion) else: @@ -179,12 +178,12 @@ def clean_gaps(alignment, gaps_to_mask): # else we are in the middle of the alignment else: masked_seq = masked_seq + mask + masked_seq_temp - start = region[1]+1 - if max(gaps_to_mask)[1] < len(sequence[1])-1: + start = region[1] + 1 + if max(gaps_to_mask)[1] < len(sequence[1]) - 1: # append the last seq if no gap is at # the end of the sequence start = max(gaps_to_mask)[1] - stop = len(sequence[1])-1 + stop = len(sequence[1]) - 1 masked_seq_temp = sequence[1][start:stop] masked_seq = masked_seq + mask + masked_seq_temp else: @@ -202,7 +201,7 @@ def process_alignment(preprocessed_alignment, threshold, n_threads): """ all_gaps = [] - gap_cutoff = len(preprocessed_alignment)*(1-threshold) + gap_cutoff = len(preprocessed_alignment) * (1 - threshold) for seq in preprocessed_alignment: # find all gaps for all sequences with regular expression -{min} all_gaps.append( diff --git a/varvamp/scripts/blast.py b/varvamp/scripts/blast.py index f76f2b8..a0ce341 100644 --- a/varvamp/scripts/blast.py +++ b/varvamp/scripts/blast.py @@ -114,7 +114,7 @@ def parse_and_filter_BLAST_output(blast_out): # removes tabular output os.remove(blast_out) - return(blast_df) + return blast_df def check_off_targets(df_amp_primers_sorted, max_length, primers): @@ -201,7 +201,7 @@ def predict_non_specific_amplicons(amplicons, blast_df, max_length, mode, n_thre if mode == "sanger_tiled": amplicons[off_target][5] = amplicons[off_target][5] + config.BLAST_PENALTY elif mode == "qpcr": - amplicons[off_target]["score"][0] = amplicons[off_target]["score"][0] + config.BLAST_PENALTY + amplicons[off_target]["penalty"][0] = amplicons[off_target]["penalty"][0] + config.BLAST_PENALTY return off_targets, amplicons @@ -227,7 +227,7 @@ def primer_blast(data_dir, db, query_path, amplicons, max_length, n_threads, log mode, n_threads ) - success_text = f"varVAMP successfully predicted non-specific amplicons:\n\t> {len(off_target_amplicons)}/{len(amplicons)} amplicons could produce amplicons with the blast db.\n\t> raised their amplicon score by {config.BLAST_PENALTY}" + success_text = f"varVAMP successfully predicted non-specific amplicons:\n\t> {len(off_target_amplicons)}/{len(amplicons)} amplicons could produce amplicons with the blast db.\n\t> raised their amplicon penalty by {config.BLAST_PENALTY}" print(success_text) with open(log_file, 'a') as f: print( diff --git a/varvamp/scripts/config.py b/varvamp/scripts/config.py index 3c7b9ec..ef64f22 100644 --- a/varvamp/scripts/config.py +++ b/varvamp/scripts/config.py @@ -8,8 +8,8 @@ PRIMER_TMP = (56, 63, 60) # melting temperatur (min, max, opt) PRIMER_GC_RANGE = (35, 65, 50) # gc (min, max, opt) PRIMER_SIZES = (18, 24, 21) # size (min, max, opt) -PRIMER_MAX_POLYX = 3 # max number of polyx repeats -PRIMER_MAX_DINUC_REPEATS = 3 # max number of dinucleotide 
repeats +PRIMER_MAX_POLYX = 4 # max number of polyX +PRIMER_MAX_DINUC_REPEATS = 4 # max number of polyXY PRIMER_HAIRPIN = 47 # max melting temp for secondary structure PRIMER_GC_END = (1, 3) # min/max GCs in the last 5 bases of the 3' end PRIMER_MIN_3_WITHOUT_AMB = 3 # min len of 3' without ambiguous charaters @@ -57,7 +57,7 @@ } BLAST_MAX_DIFF = 0.5 # min percent match between primer and BLAST hit (coverage and/or mismatches) BLAST_SIZE_MULTI = 2 # multiplier for the max_amp size of off targets (in relation to max amp size) -BLAST_PENALTY = 50 # amplicon score increase -> considered only if no other possibilities +BLAST_PENALTY = 50 # amplicon penalty increase -> considered only if no other possibilities # nucleotide definitions, do NOT change NUCS = set("atcg") diff --git a/varvamp/scripts/consensus.py b/varvamp/scripts/consensus.py index 59eb43a..9235482 100644 --- a/varvamp/scripts/consensus.py +++ b/varvamp/scripts/consensus.py @@ -13,8 +13,8 @@ def determine_nucleotide_counts(alignment, idx): """ count the number of each nucleotides at an idx of the alignment. return sorted dic. - handels ambiguous nucleotides in sequences. - also handels gaps. + handles ambiguous nucleotides in sequences. + also handles gaps. """ nucleotide_list = [] @@ -39,7 +39,7 @@ def determine_nucleotide_counts(alignment, idx): if nucleotide == "-": to_delete.append(nucleotide) - # drop ambiguous entrys and add adjusted freqs to + # drop ambiguous entries and add adjusted freqs to if to_delete: for i in to_delete: counter.pop(i) diff --git a/varvamp/scripts/logging.py b/varvamp/scripts/logging.py index a0b16b5..dfc6ede 100644 --- a/varvamp/scripts/logging.py +++ b/varvamp/scripts/logging.py @@ -40,7 +40,7 @@ def varvamp_progress(log_file, mode=None, start_time=None, progress=0, job="", p progress bar, main progress logging and folder creation """ barLength = 40 - block = int(round(barLength*progress)) + block = int(round(barLength * progress)) if progress == 0: print( @@ -55,7 +55,8 @@ def varvamp_progress(log_file, mode=None, start_time=None, progress=0, job="", p progress_text = f"all done \n\n\rvarVAMP {__version__} finished in {stop_time.total_seconds()} sec!\n{datetime.datetime.now()}" job = "Finalizing output." print( - "\rJob:\t\t " + job + "\nProgress: \t [{0}] {1}%".format("█"*block + "-"*(barLength-block), progress*100) + "\t" + progress_text, + "\rJob:\t\t " + job + "\nProgress: \t [{0}] {1}%".format("█" * block + "-" * (barLength - block), + progress * 100) + "\t" + progress_text, flush=True ) with open(log_file, 'a') as f: @@ -151,7 +152,7 @@ def raise_arg_errors(args, log_file): "small overlaps might hinder downstream analyses. Consider increasing.", log_file ) - if args.overlap > args.max_length/2 - config.PRIMER_SIZES[1]: + if args.overlap > args.max_length / 2 - config.PRIMER_SIZES[1]: raise_error( "min overlap must be lower than half of your maximum length - maximum primer length. To achieve optimal results reduce it to at least half of your optimal length", log_file, @@ -163,7 +164,7 @@ def raise_arg_errors(args, log_file): log_file, exit=True ) - if args.overlap > args.opt_length/2: + if args.overlap > args.opt_length / 2: raise_error( "your intended overlap is higher than half of your optimal length. This reduces how well varvamps will find overlapping amplicons. 
Consider decreasing.", log_file @@ -226,9 +227,9 @@ def check_alignment_length(preprocessed_alignment, log_file): smaller_warning, larger_warning = [], [] for name, length in zip(all_names, all_seq_length): # consider variation of sequence lengths and check if it is at least 2 % deviation - if length <= mean_len-3*mean_std and length <= mean_len-mean_len*0.02: + if length <= mean_len - 3 * mean_std and length <= mean_len - mean_len * 0.02: smaller_warning.append(f"{name} ({length} nt)\n") - elif length >= mean_len+3*mean_std and length >= mean_len+mean_len*0.02: + elif length >= mean_len + 3 * mean_std and length >= mean_len + mean_len * 0.02: larger_warning.append(f"{name} ({length} nt)\n") # raise warning for non-empty lists for warning, length_type in zip([larger_warning, smaller_warning], ["larger", "smaller"]): @@ -309,66 +310,70 @@ def confirm_config(args, log_file): exit=True ) # confirm tuples with 3 elements - for type, tup in [("primer temp", config.PRIMER_TMP), ("primer gc", config.PRIMER_GC_RANGE), ("primer size", config.PRIMER_SIZES), ("probe temp", config.QPROBE_TMP), ("probe size", config.QPROBE_SIZES), ("probe gc", config.QPROBE_GC_RANGE)]: + for tup_type, tup in [("primer temp", config.PRIMER_TMP), ("primer gc", config.PRIMER_GC_RANGE), + ("primer size", config.PRIMER_SIZES), ("probe temp", config.QPROBE_TMP), + ("probe size", config.QPROBE_SIZES), ("probe gc", config.QPROBE_GC_RANGE)]: if len(tup) != 3: raise_error( - f"{type} tuple has to have the form (min, max, opt)!", + f"{tup_type} tuple has to have the form (min, max, opt)!", log_file ) error = True if tup[0] > tup[1]: raise_error( - f"min {type} should not exeed max {type}!", + f"min {tup_type} should not exceed max {tup_type}!", log_file ) error = True if tup[0] > tup[2]: raise_error( - f"min {type} should not exeed opt {type}!", + f"min {tup_type} should not exceed opt {tup_type}!", log_file ) error = True if tup[2] > tup[1]: raise_error( - f"opt {type} should not exeed max {type}!", + f"opt {tup_type} should not exeed max {tup_type}!", log_file ) error = True if any(map(lambda var: var < 0, tup)): raise_error( - f"{type} can not contain negative values!", + f"{tup_type} can not contain negative values!", log_file ) error = True # confirm tuples with two elements - for type, tup in [("primer gc at the 3'", config.PRIMER_GC_END), ("probe and primer temp diff", config.QPROBE_TEMP_DIFF), ("distance probe to primer", config.QPROBE_DISTANCE), ("qpcr amplicon length", config.QAMPLICON_LENGTH), ("qpcr amplicon gc", config.QAMPLICON_GC), ("probe gc at the 3'", config.PRIMER_GC_END)]: + for tup_type, tup in [("primer gc at the 3'", config.PRIMER_GC_END), + ("probe and primer temp diff", config.QPROBE_TEMP_DIFF), + ("distance probe to primer", config.QPROBE_DISTANCE), + ("qpcr amplicon length", config.QAMPLICON_LENGTH), ("qpcr amplicon gc", config.QAMPLICON_GC), + ("probe gc at the 3'", config.PRIMER_GC_END)]: if len(tup) != 2: raise_error( - f"{type} tuple has to have the form (min, max)!", + f"{tup_type} tuple has to have the form (min, max)!", log_file ) error = True if tup[0] > tup[1]: raise_error( - f"min {type} should not exeed max {type}!", + f"min {tup_type} should not exceed max {tup_type}!", log_file ) error = True if any(map(lambda var: var < 0, tup)): raise_error( - f"{type} can not contain negative values!", + f"{tup_type} can not contain negative values!", log_file ) error = True # check single values that cannot be negative non_negative_var = [ - ("max polyx repeats", config.PRIMER_MAX_POLYX), - ("max 
dinucleotide repeats", config.PRIMER_MAX_DINUC_REPEATS), ("min number of 3 prime nucleotides without ambiguous nucleotides", config.PRIMER_MIN_3_WITHOUT_AMB), ("monovalent cation concentration", config.PCR_MV_CONC), ("divalent cation concentration", config.PCR_DV_CONC), ("dNTP concentration", config.PCR_DNTP_CONC), - ("primer temperatur penalty", config.PRIMER_TM_PENALTY), + ("primer temperature penalty", config.PRIMER_TM_PENALTY), ("primer gc penalty", config.PRIMER_GC_PENALTY), ("primer size penalty", config.PRIMER_SIZE_PENALTY), ("max base penalty", config.PRIMER_MAX_BASE_PENALTY), @@ -379,10 +384,10 @@ def confirm_config(args, log_file): ("multiplier of the maximum length for non-specific amplicons", config.BLAST_SIZE_MULTI), ("blast penalty for off targets", config.BLAST_PENALTY) ] - for type, var in non_negative_var: + for var_type, var in non_negative_var: if var < 0: raise_error( - f"{type} can not be negative!", + f"{var_type} can not be negative!", log_file ) error = True @@ -400,6 +405,16 @@ def confirm_config(args, log_file): exit=True ) # specific warnings + if config.PRIMER_MAX_POLYX > 5: + raise_error( + "max polyx > 5 is not recommended.", + log_file, + ) + if config.PRIMER_MAX_DINUC_REPEATS > 5: + raise_error( + "max di-nucleotide repeats > 5 is not recommended.", + log_file, + ) if config.PRIMER_HAIRPIN < 0: raise_error( "decreasing hairpin melting temp to negative values " @@ -426,6 +441,19 @@ def confirm_config(args, log_file): "only the last 5 nucleotides of the 3' end are considered for GC 3'end calculation.", log_file ) + # specific errors + if config.PRIMER_MAX_POLYX < 1: + raise_error( + "max polyx cannot be lower than 1.", + log_file, + exit=True + ) + if config.PRIMER_MAX_DINUC_REPEATS < 1: + raise_error( + "max di-nucleotide repeats cannot be lower than 1.", + log_file, + exit=True + ) if config.BLAST_MAX_DIFF > 1: raise_error( "max difference to the blast db should be between 0-1.", @@ -534,6 +562,7 @@ def confirm_config(args, log_file): print(f"{var} = {var_dic[var]}", file=f) print("\nPROGRESS", file=f) + def goodbye_message(): """ This is what happens when I am bored... @@ -543,6 +572,7 @@ def goodbye_message(): "Thank you. Come again.", ">Placeholder for your advertisement<", "Make primers great again!", + "Ciao cacao!" 
"And now lets pray to the PCR gods.", "**bibobibobop** task finished", "Thank you for traveling with varVAMP.", @@ -557,4 +587,4 @@ def goodbye_message(): "Task failed successfully.", "Never gonna give you up, never gonna let you down.", ] - print(f"\n{random.choice(messages)}") \ No newline at end of file + print(f"\n{random.choice(messages)}") diff --git a/varvamp/scripts/param_estimation.py b/varvamp/scripts/param_estimation.py index 78ca4a5..e62bb49 100644 --- a/varvamp/scripts/param_estimation.py +++ b/varvamp/scripts/param_estimation.py @@ -6,7 +6,7 @@ from varvamp.scripts import config -def calculate_frequencys(preprocessed_alignment): +def calculate_frequencies(preprocessed_alignment): """ calculate the max nucleotide freq at each pos """ @@ -58,7 +58,7 @@ def get_parameters(preprocessed_alignment, args, log_file): # set coverage to max coverage = 1 # read in the alignment and calc freqs - frequencys = calculate_frequencys(preprocessed_alignment) + frequencies = calculate_frequencies(preprocessed_alignment) # if no args for both threshold and n_ambig are given # set the n_ambig to 2 and optimize threshold if args.threshold is None: @@ -78,7 +78,7 @@ def get_parameters(preprocessed_alignment, args, log_file): # optimize until less than 50 % is covered while coverage >= 0.5 and args.threshold < 1: - distances = calculate_distances(frequencys, args.threshold) + distances = calculate_distances(frequencies, args.threshold) # calculate the cummulative sum of the sum of n conseq. streches # that are together larger than the min primer length covered_pos = sum( diff --git a/varvamp/scripts/primers.py b/varvamp/scripts/primers.py index 4e12e7a..8c62c75 100644 --- a/varvamp/scripts/primers.py +++ b/varvamp/scripts/primers.py @@ -46,7 +46,7 @@ def calc_hairpin(seq): def calc_dimer(seq1, seq2, structure=False): """ - Calculate the heterodimerization thermodynamics of two DNA sequences. + Calculate the hetero-dimerization thermodynamics of two DNA sequences. Return primer3 thermo object. """ return p3.calc_heterodimer( @@ -64,37 +64,37 @@ def calc_max_polyx(seq): """ calculate maximum polyx of a seq """ - previous_nuc = seq[0] - counter = 0 - max_polyx = 0 + previous_nuc, counter, max_polyx = seq[0], 1, 1 + for nuc in seq[1:]: if nuc == previous_nuc: counter += 1 else: - counter = 0 + counter = 1 previous_nuc = nuc if counter > max_polyx: max_polyx = counter - return(max_polyx) + + return max_polyx def calc_max_dinuc_repeats(seq): """ - calculate the amount of repeating - dinucleotides in a sequence + calculate maximum polyxy of a seq """ for s in [seq, seq[1:]]: previous_dinuc = s[0:2] - max_dinuc = 0 - counter = 0 + max_dinuc = 1 + counter = 1 for i in range(2, len(s), 2): if s[i:i+2] == previous_dinuc: counter += 1 else: if counter > max_dinuc: max_dinuc = counter - counter = 0 + counter = 1 previous_dinuc = s[i:i+2] + return max_dinuc @@ -111,17 +111,13 @@ def is_three_prime_ambiguous(amb_seq): """ determine if a sequence contains an ambiguous char at the 3'prime """ - len_3_prime = config.PRIMER_MIN_3_WITHOUT_AMB + len_3_prime, is_amb = config.PRIMER_MIN_3_WITHOUT_AMB, False if len_3_prime != 0: for nuc in amb_seq[len(amb_seq)-len_3_prime:]: if nuc not in config.NUCS: is_amb = True break - else: - is_amb = False - else: - is_amb = False return is_amb @@ -249,15 +245,16 @@ def calc_3_prime_penalty(direction, mismatches): the higher the penalty. uses the previously calculated mismatch list. 
""" + + penalty = 0 + if config.PRIMER_3_PENALTY: if direction == "-": penalty = sum([m * p for m, p in zip(mismatches[0:len(config.PRIMER_3_PENALTY)], config.PRIMER_3_PENALTY)]) elif direction == "+": penalty = sum([m * p for m, p in zip(mismatches[::-1][0:len(config.PRIMER_3_PENALTY)], config.PRIMER_3_PENALTY)]) - else: - penalty = 0 - return(penalty) + return penalty def filter_kmer_direction_independent(seq, primer_temps=config.PRIMER_TMP, gc_range=config.PRIMER_GC_RANGE, primer_sizes=config.PRIMER_SIZES): @@ -288,7 +285,7 @@ def filter_kmer_direction_dependend(direction, kmer, ambiguous_consensus): kmer_seq = rev_complement(kmer[0]) amb_kmer_seq = rev_complement(ambiguous_consensus[kmer[1]:kmer[2]]) # filter kmer - return( + return ( (calc_hairpin(kmer_seq).tm <= config.PRIMER_HAIRPIN) and (config.PRIMER_GC_END[0] <= calc_end_gc(kmer_seq) <= config.PRIMER_GC_END[1]) and not is_three_prime_ambiguous(amb_kmer_seq) @@ -309,7 +306,7 @@ def find_primers(kmers, ambiguous_consensus, alignment): continue # calc base penalty base_penalty = calc_base_penalty(kmer[0],config.PRIMER_TMP, config.PRIMER_GC_RANGE, config.PRIMER_SIZES) - # calcualte per base mismatches + # calculate per base mismatches per_base_mismatches = calc_per_base_mismatches( kmer, alignment, @@ -366,17 +363,17 @@ def create_primer_dictionary(primer_candidates, direction): def find_best_primers(left_primer_candidates, right_primer_candidates): """ Primer candidates are likely overlapping. Here, the list of primers - is sorted for the best to worst scoring. Then, the next best scoring + is sorted for the lowest to highest penalty. Then, the next lowest is retained if it does not have any nucleotides that have already - been covered by the middle third of a better scoring primer candidate. + been covered by the middle third of a better primer candidate. This significantly reduces the amount of primers while retaining - the best scoring ones. + the ones with the lowest penalties. Example: - -------- (score 1) 1 - ------------- (score 1) 2 - ------------ (score 0.8) 3 - ----------(score 0.9) 4 + -------- (penalty 1) 1 + ------------- (penalty 1) 2 + ------------ (penalty 0.8) 3 + ----------(penalty 0.9) 4 --> primer 3 would be retained and primer 2 excluded, primer 4 and 1 will however be considered in the next set of overlapping primers. 
@@ -385,9 +382,9 @@ def find_best_primers(left_primer_candidates, right_primer_candidates): all_primers = {} for direction, primer_candidates in [("+", left_primer_candidates), ("-", right_primer_candidates)]: - # sort the primers for the best scoring, and if same score by start + # sort the primers by penalty, and if same penalty by start primer_candidates.sort(key=lambda x: (x[3], x[1])) - # ini everything with the top scoring primer + # ini everything with the primer with the lowest penalty to_retain = [primer_candidates[0]] primer_ranges = list(range(primer_candidates[0][1], primer_candidates[0][2]+1)) primer_set = set(primer_ranges) @@ -401,7 +398,7 @@ def find_best_primers(left_primer_candidates, right_primer_candidates): if not any(x in primer_positions for x in primer_set): # update the primer set primer_set.update(primer_positions) - # append this primer as it is well scoring and not overlapping + # append this primer as it has a low penalty and is not overlapping # with another already retained primer to_retain.append(primer) diff --git a/varvamp/scripts/qpcr.py b/varvamp/scripts/qpcr.py index 5cdc796..e24519a 100644 --- a/varvamp/scripts/qpcr.py +++ b/varvamp/scripts/qpcr.py @@ -13,6 +13,7 @@ from varvamp.scripts import primers from varvamp.scripts import reporting + def choose_probe_direction(seq): """ choose the direction of the probe so @@ -43,10 +44,10 @@ def filter_probe_direction_dependent(seq): """ filter probes for their direction """ - return( - (seq[:1] != "g") # no 5' g as this leads to quenching - and (primers.calc_hairpin(seq).tm <= config.PRIMER_HAIRPIN) - and (config.QPROBE_GC_END[0] <= primers.calc_end_gc(seq) <= config.QPROBE_GC_END[1]) + return ( + (seq[:1] != "g") # no 5' g as this leads to quenching + and (primers.calc_hairpin(seq).tm <= config.PRIMER_HAIRPIN) + and (config.QPROBE_GC_END[0] <= primers.calc_end_gc(seq) <= config.QPROBE_GC_END[1]) ) @@ -59,21 +60,23 @@ def get_qpcr_probes(kmers, ambiguous_consensus, alignment_cleaned): for kmer in kmers: # filter probe for base params - if not primers.filter_kmer_direction_independent(kmer[0], config.QPROBE_TMP, config.QPROBE_GC_RANGE, config.QPROBE_SIZES): + if not primers.filter_kmer_direction_independent(kmer[0], config.QPROBE_TMP, config.QPROBE_GC_RANGE, + config.QPROBE_SIZES): continue # do not allow ambiguous chars at both ends if ambiguous_ends(ambiguous_consensus[kmer[1]:kmer[2]]): continue # calc penalties analogous to primer search - base_penalty = primers.calc_base_penalty(kmer[0], config.QPROBE_TMP, config.QPROBE_GC_RANGE, config.QPROBE_SIZES) + base_penalty = primers.calc_base_penalty(kmer[0], config.QPROBE_TMP, config.QPROBE_GC_RANGE, + config.QPROBE_SIZES) per_base_mismatches = primers.calc_per_base_mismatches( - kmer, - alignment_cleaned, - ambiguous_consensus - ) + kmer, + alignment_cleaned, + ambiguous_consensus + ) permutation_penalty = primers.calc_permutation_penalty( - ambiguous_consensus[kmer[1]:kmer[2]] - ) + ambiguous_consensus[kmer[1]:kmer[2]] + ) # determine the direction with more cytosine or set both if 50 % direction = choose_probe_direction(kmer[0]) # create probe dictionary @@ -81,15 +84,19 @@ def get_qpcr_probes(kmers, ambiguous_consensus, alignment_cleaned): if filter_probe_direction_dependent(kmer[0]): probe_name = f"PROBE_{probe_idx}_FW" three_prime_penalty = primers.calc_3_prime_penalty("+", per_base_mismatches) - probe_candidates[probe_name] = [kmer[0], kmer[1], kmer[2], base_penalty + permutation_penalty + three_prime_penalty, per_base_mismatches, direction] + 
probe_candidates[probe_name] = [kmer[0], kmer[1], kmer[2], + base_penalty + permutation_penalty + three_prime_penalty, + per_base_mismatches, direction] probe_idx += 1 if "-" in direction: if filter_probe_direction_dependent(primers.rev_complement(kmer[0])): probe_name = f"PROBE_{probe_idx}_RW" three_prime_penalty = primers.calc_3_prime_penalty("-", per_base_mismatches) - probe_candidates[probe_name] = [primers.rev_complement(kmer[0]), kmer[1], kmer[2], base_penalty + permutation_penalty + three_prime_penalty, per_base_mismatches, direction] + probe_candidates[probe_name] = [primers.rev_complement(kmer[0]), kmer[1], kmer[2], + base_penalty + permutation_penalty + three_prime_penalty, + per_base_mismatches, direction] probe_idx += 1 - # sort by score + # sort by penalty probe_candidates = dict(sorted(probe_candidates.items(), key=lambda x: x[1][3])) return probe_candidates @@ -97,7 +104,7 @@ def get_qpcr_probes(kmers, ambiguous_consensus, alignment_cleaned): def flanking_primer_subset(primer_list, primer_direction, probe): """ - subset for primers flanking the probe and sort by score + subset for primers flanking the probe and sort by penalty """ subset = [] @@ -111,7 +118,7 @@ def flanking_primer_subset(primer_list, primer_direction, probe): if window_start < primer[1] and primer[2] < window_stop: subset.append(primer) - # sort by score and start if same score + # sort by penalty and start if same penalty subset.sort(key=lambda x: (x[3], x[1])) return subset @@ -126,9 +133,9 @@ def hardfilter_amplicon(majority_consensus, left_primer, right_primer): amplicon_seq = majority_consensus[left_primer[1]:right_primer[2]] # check for the right amplicon length return ( - (config.QAMPLICON_LENGTH[0] <= amplicon_length <= config.QAMPLICON_LENGTH[1]) - and (config.QAMPLICON_GC[0] <= primers.calc_gc(amplicon_seq) <= config.QAMPLICON_GC[1]) - and not "NN" in amplicon_seq + (config.QAMPLICON_LENGTH[0] <= amplicon_length <= config.QAMPLICON_LENGTH[1]) + and (config.QAMPLICON_GC[0] <= primers.calc_gc(amplicon_seq) <= config.QAMPLICON_GC[1]) + and "NN" not in amplicon_seq ) @@ -148,8 +155,8 @@ def check_end_overlap(dimer_result): # check for overlaps at the ends and the min overlap (allows for some amount of mismatches) if overlap > config.END_OVERLAP and nt_count <= len(structure[0]) + overlap + 1 and " " not in structure[1].lstrip(" "): return True - else: - return False + + return False def forms_dimer_or_overhangs(right_primer, left_primer, probe, ambiguous_consensus): @@ -188,7 +195,7 @@ def forms_dimer_or_overhangs(right_primer, left_primer, probe, ambiguous_consens def assess_amplicons(left_subset, right_subset, qpcr_probes, probe, majority_consensus, ambiguous_consensus): """ - assess if a potential amplicon is a qPCR scheme for a specific probe and return the best scoring + assess if a potential amplicon is a qPCR scheme for a specific probe and return the one with the lowest penalty """ primer_combinations = () @@ -203,14 +210,14 @@ def assess_amplicons(left_subset, right_subset, qpcr_probes, probe, majority_con # ... 
the probe is close enough to the primer on the same strand if "FW" in probe: if not qpcr_probes[probe][1] in range( - left_primer[2] + config.QPROBE_DISTANCE[0], - left_primer[2] + config.QPROBE_DISTANCE[1] + 1 + left_primer[2] + config.QPROBE_DISTANCE[0], + left_primer[2] + config.QPROBE_DISTANCE[1] + 1 ): continue elif "RW" in probe: if not right_primer[1] in range( - qpcr_probes[probe][2] + config.QPROBE_DISTANCE[0], - qpcr_probes[probe][2] + config.QPROBE_DISTANCE[1] + 1 + qpcr_probes[probe][2] + config.QPROBE_DISTANCE[0], + qpcr_probes[probe][2] + config.QPROBE_DISTANCE[1] + 1 ): continue @@ -220,12 +227,14 @@ def assess_amplicons(left_subset, right_subset, qpcr_probes, probe, majority_con continue # ... the probe has a higher temp than the primers and ... probe_temp = primers.calc_temp(qpcr_probes[probe][0]) - if not all([config.QPROBE_TEMP_DIFF[0] <= probe_temp-x <= config.QPROBE_TEMP_DIFF[1] for x in primer_temps]): + if not all( + [config.QPROBE_TEMP_DIFF[0] <= probe_temp - x <= config.QPROBE_TEMP_DIFF[1] for x in primer_temps]): continue # .... all combination of oligos do not form dimers or overhangs. if forms_dimer_or_overhangs(right_primer, left_primer, qpcr_probes[probe], ambiguous_consensus): continue - # append to list and break as this is the lowest scoring primer combi (primers are sorted by score) + # append to list and break as this is the primer combi + # with the lowest penalty (primers are sorted by penalty) amplicon_found = True break # break also the outer loop @@ -236,14 +245,15 @@ def assess_amplicons(left_subset, right_subset, qpcr_probes, probe, majority_con return primer_combinations -def find_qcr_schemes(qpcr_probes, left_primer_candidates, right_primer_candidates, majority_consensus, ambiguous_consensus): +def find_qcr_schemes(qpcr_probes, left_primer_candidates, right_primer_candidates, majority_consensus, + ambiguous_consensus): """ this finds the final qPCR schemes. it slices for primers flanking a probe and test all left/right combinations whether they are potential amplicons. as primers - are sorted by score, only the very first match is considered as this has the - lowest score. however, probes are overlapping and there is a high chance that - left and right primers are found multiple times. to consider only one primer- - probe combination the probes are also sorted by score. therefore, if a primer + are sorted by penalty, only the very first match is considered as this has the + lowest penalty. however, probes are overlapping and there is a high chance that + left and right primers are found multiple times. to consider only one primer-probe + combination the probes are also sorted by penalty. therefore, if a primer combination has been found already the optimal probe was already selected and there is no need to consider this primer probe combination. """ @@ -258,24 +268,25 @@ def find_qcr_schemes(qpcr_probes, left_primer_candidates, right_primer_candidate # consider if there are primers flanking the probe ... if not left_subset or not right_subset: continue - primer_combination = assess_amplicons(left_subset, right_subset, qpcr_probes, probe, majority_consensus, ambiguous_consensus) + primer_combination = assess_amplicons(left_subset, right_subset, qpcr_probes, probe, majority_consensus, + ambiguous_consensus) # ... a combi has been found, ... if not primer_combination: continue - # ...and this combi is not already present for a probe with a better score. + # ...and this combi is not already present for a probe with a better penalty. 
if primer_combination in found_amplicons: continue # populate the primer dictionary: amplicon_nr += 1 found_amplicons.append(primer_combination) qpcr_scheme_candidates[f"AMPLICON_{amplicon_nr}"] = { - "score": qpcr_probes[probe][3]+primer_combination[0][3]+primer_combination[1][3], + "penalty": qpcr_probes[probe][3] + primer_combination[0][3] + primer_combination[1][3], "probe": qpcr_probes[probe], "left": primer_combination[0], "right": primer_combination[1] } - # and again sort by total score (left + right + probe) - qpcr_scheme_candidates = dict(sorted(qpcr_scheme_candidates.items(), key=lambda x: x[1]["score"])) + # and again sort by total penalty (left + right + probe) + qpcr_scheme_candidates = dict(sorted(qpcr_scheme_candidates.items(), key=lambda x: x[1]["penalty"])) return qpcr_scheme_candidates @@ -325,7 +336,7 @@ def test_amplicon_deltaG_parallel(qpcr_schemes_candidates, majority_consensus, n if any(x in amp_positions for x in amplicon_set): continue # and if this passes cutoff make a dict entry and do not allow further - # amplicons in that region (they will have a lower score) + # amplicons in that region (they will have a lower penalty) if deltaG > deltaG_cutoff: new_name = f"QPCR_SCHEME_{passed_counter}" final_schemes[new_name] = qpcr_schemes_candidates[amp_name] diff --git a/varvamp/scripts/regions.py b/varvamp/scripts/regions.py index c1b891e..ffc449c 100644 --- a/varvamp/scripts/regions.py +++ b/varvamp/scripts/regions.py @@ -13,9 +13,7 @@ def find_regions(consensus_amb, allowed_ambiguous): sequence length """ - current_window = [] - primer_regions = [] - last_amb = 0 + current_window, primer_regions, all_previous_ambiguous_pos, last_amb = [], [], [], 0 # append one N so the last window is closed seq = str(consensus_amb) + "N" @@ -59,9 +57,11 @@ def mean(primer_regions, consensus): that cover the consensus sequence """ covered_set = set() + for region in primer_regions: pos_list = list(range(region[0], region[1])) [covered_set.add(x) for x in pos_list] + return round(len(covered_set)/len(consensus)*100, 1) diff --git a/varvamp/scripts/reporting.py b/varvamp/scripts/reporting.py index 3cd7be6..ba83d54 100644 --- a/varvamp/scripts/reporting.py +++ b/varvamp/scripts/reporting.py @@ -73,7 +73,7 @@ def write_primers_to_bed(outfile, primer_name, primer_properties, direction): primer_properties[1], # start primer_properties[2], # stop primer_name, - round(primer_properties[3], 1), # score + round(primer_properties[3], 1), # penalty direction, sep="\t", file=o @@ -130,14 +130,15 @@ def write_qpcr_to_files(path, final_schemes, ambiguous_consensus): tsv_file_2 = os.path.join(path, "qpcr_primers.tsv") primer_bed_file = os.path.join(path, "primers.bed") amplicon_bed_file = os.path.join(path, "amplicons.bed") + primer_fasta_file = os.path.join(path, "oligos.fasta") - with open(tsv_file, "w") as tsv, open(tsv_file_2, "w") as tsv2, open(amplicon_bed_file, "w") as bed: + with open(tsv_file, "w") as tsv, open(tsv_file_2, "w") as tsv2, open(amplicon_bed_file, "w") as bed, open(primer_fasta_file, "w") as fasta: print( - "qpcr_scheme\toligo_type\tstart\tstop\tseq\tsize\tgc_best\ttemp_best\tmean_gc\tmean_temp\tscore", + "qpcr_scheme\toligo_type\tstart\tstop\tseq\tsize\tgc_best\ttemp_best\tmean_gc\tmean_temp\tpenalty", file=tsv2 ) print( - "qpcr_scheme\tscore\tdeltaG\tlength\tstart\tstop\tseq", + "qpcr_scheme\tpenalty\tdeltaG\tlength\tstart\tstop\tseq", file=tsv ) for scheme in final_schemes: @@ -147,7 +148,7 @@ def write_qpcr_to_files(path, final_schemes, ambiguous_consensus): 
final_schemes[scheme]["left"][1], final_schemes[scheme]["right"][2], scheme, - round(final_schemes[scheme]["score"], 1), + round(final_schemes[scheme]["penalty"], 1), sep="\t", file=bed ) @@ -157,7 +158,7 @@ def write_qpcr_to_files(path, final_schemes, ambiguous_consensus): amplicon_seq = ambiguous_consensus[amplicon_start:amplicon_stop] print( scheme, - round(final_schemes[scheme]["score"], 1), + round(final_schemes[scheme]["penalty"], 1), final_schemes[scheme]["deltaG"], len(amplicon_seq), amplicon_start + 1, @@ -168,7 +169,7 @@ def write_qpcr_to_files(path, final_schemes, ambiguous_consensus): ) # write tsv2 for type in final_schemes[scheme]: - if type == "score" or type == "deltaG": + if type == "penalty" or type == "deltaG": continue seq = ambiguous_consensus[final_schemes[scheme][type][1]:final_schemes[scheme][type][2]] if type == "right" or all([type == "probe", final_schemes[scheme]["probe"][5] == "-"]): @@ -185,7 +186,7 @@ def write_qpcr_to_files(path, final_schemes, ambiguous_consensus): type, final_schemes[scheme][type][1] + 1, final_schemes[scheme][type][2], - seq, + seq.upper(), len(seq), round(primers.calc_gc(final_schemes[scheme][type][0]), 1), round(primers.calc_temp(final_schemes[scheme][type][0]), 1), @@ -202,6 +203,8 @@ def write_qpcr_to_files(path, final_schemes, ambiguous_consensus): final_schemes[scheme][type], direction ) + # write fasta + print(f">{scheme}_{type}\n{seq.upper()}", file=fasta) def write_scheme_to_files(path, amplicon_scheme, ambiguous_consensus, mode): @@ -219,75 +222,85 @@ def write_scheme_to_files(path, amplicon_scheme, ambiguous_consensus, mode): with open(tsv_file, "w") as tsv, open(amplicon_bed_file, "w") as bed, open(tabular_file, "w") as tabular: # write header for primer tsv print( - "amlicon_name\tamplicon_length\tprimer_name\tpool\tstart\tstop\tseq\tsize\tgc_best\ttemp_best\tmean_gc\tmean_temp\tscore", + "amlicon_name\tamplicon_length\tprimer_name\talternate_primer_name\tpool\tstart\tstop\tseq\tsize\tgc_best\ttemp_best\tmean_gc\tmean_temp\tpenalty", file=tsv ) - counter = 0 for pool in amplicon_scheme: - for amp in amplicon_scheme[pool]: - # give a new amplicon name - new_name = f"amplicon_{str(counter)}" - counter += 1 - # get left and right primers and their names - primer_names = list(amplicon_scheme[pool][amp].keys()) - left = (primer_names[0], amplicon_scheme[pool][amp][primer_names[0]]) - right = (primer_names[1], amplicon_scheme[pool][amp][primer_names[1]]) - amp_length = right[1][2] - left[1][1] - # write amplicon bed - if mode == "tiled": - bed_score = pool - elif mode == "sanger": - bed_score = round(left[1][3] + right[1][3], 1) - print( - "ambiguous_consensus", - left[1][1], - right[1][2], - new_name, - bed_score, - sep="\t", - file=bed - ) - # write primer assignments tabular file - print( - left[0], - right[0], - sep="\t", - file=tabular - ) - # write primer tsv and primer bed - for direction, primer in [("+", left), ("-", right)]: - seq = ambiguous_consensus[primer[1][1]:primer[1][2]] - if direction == "-": - seq = primers.rev_complement(seq) - # calc primer parameters for all permutations - permutations = get_permutations(seq) - gc, temp = calc_mean_stats(permutations) - # write tsv file + if mode == "sanger": + primer_fasta_file = os.path.join(path, "primers.fasta") + else: + primer_fasta_file = os.path.join(path, f"primers_pool_{pool}.fasta") + with open(primer_fasta_file, "w") as primer_fasta: + for amp in amplicon_scheme[pool]: + # give a new amplicon name + new_name = f"amplicon_{str(counter)}" + counter += 1 + # get left 
and right primers and their names + primer_names = list(amplicon_scheme[pool][amp].keys()) + left = (primer_names[0], amplicon_scheme[pool][amp][primer_names[0]]) + right = (primer_names[1], amplicon_scheme[pool][amp][primer_names[1]]) + amp_length = right[1][2] - left[1][1] + # write amplicon bed + if mode == "tiled": + bed_score = pool + elif mode == "sanger": + bed_score = round(left[1][3] + right[1][3], 1) print( + "ambiguous_consensus", + left[1][1], + right[1][2], new_name, - amp_length, - primer[0], - pool, - primer[1][1] + 1, - primer[1][2], - seq, - len(primer[1][0]), - round(primers.calc_gc(primer[1][0]), 1), - round(primers.calc_temp(primer[1][0]), 1), - gc, - temp, - round(primer[1][3], 1), + bed_score, sep="\t", - file=tsv + file=bed ) - # write primer bed file - write_primers_to_bed( - primer_bed_file, - primer[0], - primer[1], - direction + # write primer assignments tabular file + print( + left[0], + right[0], + sep="\t", + file=tabular ) + # write primer tsv and primer bed + for direction, primer in [("+", left), ("-", right)]: + seq = ambiguous_consensus[primer[1][1]:primer[1][2]] + if direction == "-": + seq = primers.rev_complement(seq) + primer_name = f"{new_name}_RW" + else: + primer_name = f"{new_name}_FW" + # write primers to fasta pool file + print(f">{primer_name}\n{seq.upper()}", file=primer_fasta) + # calc primer parameters for all permutations + permutations = get_permutations(seq) + gc, temp = calc_mean_stats(permutations) + # write tsv file + print( + new_name, + amp_length, + primer_name, + primer[0], + pool, + primer[1][1] + 1, + primer[1][2], + seq.upper(), + len(primer[1][0]), + round(primers.calc_gc(primer[1][0]), 1), + round(primers.calc_temp(primer[1][0]), 1), + gc, + temp, + round(primer[1][3], 1), + sep="\t", + file=tsv + ) + # write primer bed file + write_primers_to_bed( + primer_bed_file, + primer[0], + primer[1], + direction + ) def write_dimers(path, primer_dimers): @@ -521,7 +534,7 @@ def get_QPCR_primers_for_plot(amplicon_schemes): for scheme in amplicon_schemes: for type in amplicon_schemes[scheme]: - if type == "score" or type == "deltaG": + if type == "penalty" or type == "deltaG": continue primer_name = f"{scheme}_{type}" amplicon_primers.append((primer_name, amplicon_schemes[scheme][type])) diff --git a/varvamp/scripts/scheme.py b/varvamp/scripts/scheme.py index 0dc3ba4..e741087 100644 --- a/varvamp/scripts/scheme.py +++ b/varvamp/scripts/scheme.py @@ -10,33 +10,34 @@ from varvamp.scripts import config, primers -class Graph(object): +def construct_graph(nodes, init_graph): """ - a graph class + This method makes sure that the graph is symmetrical, but sets the costs + for nodes in the reverse direction to infinity to make sure dijkstra + never goes to an amplicon that is in the wrong direction. """ + graph = {} + for node in nodes: + graph[node] = {} - def __init__(self, nodes, init_graph): - self.nodes = nodes - self.graph = self.construct_graph(nodes, init_graph) + graph.update(init_graph) - def construct_graph(self, nodes, init_graph): - """ - This method makes sure that the graph is symmetrical, but sets the score - for nodes in the reverse direction to infinity to make sure dijkstra - never goes to an amplicon that is in the wrong direction. 
- """ - graph = {} - for node in nodes: - graph[node] = {} + for node, neighbors in graph.items(): + for neighbor in neighbors.keys(): + if graph[neighbor].get(node, False) is False: + graph[neighbor][node] = float("infinity") + + return graph - graph.update(init_graph) - for node, neighbors in graph.items(): - for neighbor in neighbors.keys(): - if graph[neighbor].get(node, False) is False: - graph[neighbor][node] = float("infinity") +class Graph(object): + """ + a graph class + """ - return graph + def __init__(self, nodes, init_graph): + self.nodes = nodes + self.graph = construct_graph(nodes, init_graph) def get_nodes(self): """ @@ -78,7 +79,7 @@ def find_amplicons(all_primers, opt_len, max_len): if primers.calc_dimer(left_primer[0], right_primer[0]).tm > config.PRIMER_MAX_DIMER_TMP: continue # calculate length dependend amplicon costs as the cumulative primer - # score multiplied by the e^(fold length of the optimal length). + # penalty multiplied by the e^(fold length of the optimal length). amplicon_costs = (right_primer[3] + left_primer[3])*math.exp(amplicon_length/opt_len) amplicon_name = "amplicon_"+str(amplicon_number) amplicon_dict[amplicon_name] = [ @@ -158,23 +159,22 @@ def dijkstra_algorithm(graph, start_node): def get_end_node(previous_nodes, shortest_path, amplicons): """ - get the target node with the lowest score out of all + get the target node with the lowest penalty costs out of all nodes that have the same maximum end position """ - stop_nucleotide = 0 + stop_nucleotide, possible_end_nodes = 0, {} for node in previous_nodes.keys(): - # check if an node has a larger stop -> empty dict and set new + # check if a node has a larger stop -> empty dict and set new # best stop nucleotide if amplicons[node][1] > stop_nucleotide: - possible_end_nodes = {} - possible_end_nodes[node] = shortest_path[node] + possible_end_nodes = {node: shortest_path[node]} stop_nucleotide = amplicons[node][1] # if nodes have the same stop nucleotide, add to dictionary elif amplicons[node][1] == stop_nucleotide: possible_end_nodes[node] = shortest_path[node] - # return the end node with the lowest score + # return the end node with the lowest penalty costs return min(possible_end_nodes.items(), key=lambda x: x[1]) @@ -209,9 +209,9 @@ def create_scheme_dic(amplicon_scheme, amplicons, all_primers): for pool in (0, 1): for amp in amplicon_scheme[pool::2]: scheme_dictionary[pool][amp] = {} - primers = [amplicons[amp][2], amplicons[amp][3]] - scheme_dictionary[pool][amp][primers[0]] = all_primers["+"][primers[0]] - scheme_dictionary[pool][amp][primers[1]] = all_primers["-"][primers[1]] + primer_pair = [amplicons[amp][2], amplicons[amp][3]] + scheme_dictionary[pool][amp][primer_pair[0]] = all_primers["+"][primer_pair[0]] + scheme_dictionary[pool][amp][primer_pair[1]] = all_primers["-"][primer_pair[1]] return scheme_dictionary @@ -222,10 +222,9 @@ def find_best_covering_scheme(amplicons, amplicon_graph, all_primers): coverage with the minimal costs is achieved. 
""" # ini - coverage = 0 best_coverage = 0 max_stop = max(amplicons.items(), key=lambda x: x[1])[1][1] - best_score = float('infinity') + lowest_costs = float('infinity') for start_node in amplicons: # if the currently best coverage + start nucleotide of the currently tested amplicon @@ -235,24 +234,22 @@ def find_best_covering_scheme(amplicons, amplicon_graph, all_primers): previous_nodes, shortest_path = dijkstra_algorithm(amplicon_graph, start_node) # only continue if there are previous_nodes if previous_nodes: - target_node, score = get_end_node(previous_nodes, shortest_path, amplicons) + target_node, costs = get_end_node(previous_nodes, shortest_path, amplicons) coverage = amplicons[target_node][1] - amplicons[start_node][0] # if the new coverage is larger, go for the larger coverage if coverage > best_coverage: best_start_node = start_node best_target_node = target_node best_previous_nodes = previous_nodes - best_shortest_path = shortest_path - best_score = score + lowest_costs = costs best_coverage = coverage # if the coverages are identical, go for the lowest costs elif coverage == best_coverage: - if score < best_score: + if costs < lowest_costs: best_start_node = start_node best_target_node = target_node best_previous_nodes = previous_nodes - best_shortest_path = shortest_path - best_score = score + lowest_costs = costs best_coverage = coverage else: # check if the single amplicon has the largest coverage so far @@ -260,11 +257,10 @@ def find_best_covering_scheme(amplicons, amplicon_graph, all_primers): if coverage > best_coverage: best_start_node = start_node best_previous_nodes = previous_nodes - best_shortest_path = shortest_path - best_score = amplicons[start_node][5] + lowest_costs = amplicons[start_node][5] best_coverage = coverage # no need to check more, the best covering amplicon scheme was found and - # has the minimal score compared to the schemes with the same coverage + # has the minimal costs compared to the schemes with the same coverage else: break @@ -280,7 +276,7 @@ def find_best_covering_scheme(amplicons, amplicon_graph, all_primers): def test_scheme_for_dimers(amplicon_scheme): """ - test the best scoring scheme for primer dimers + test the lowest-cost scheme for primer dimers """ primer_dimers = [] @@ -341,8 +337,8 @@ def test_overlaps_for_dimers(overlapping_primers): """ for first_overlap in overlapping_primers[0]: for second_overlap in overlapping_primers[1]: - # return the first match. primers are sorted by score. - # first pair that makes it has the lowest score + # return the first match. primers are sorted by penalty. + # first pair that makes it has the lowest penalty if primers.calc_dimer(first_overlap[3][0], second_overlap[3][0]).tm <= config.PRIMER_MAX_DIMER_TMP: return [first_overlap, second_overlap] @@ -387,7 +383,7 @@ def check_and_solve_heterodimers(amplicon_scheme, left_primer_candidates, right_ def find_sanger_amplicons(amplicons, all_primers, n): """ - find the best scoring non-overlapping amplicons + find non-overlapping amplicons with low penalties from all found amplicons. only for the SANGER mode. """ # sort amplicons