Release 3.0.0 #23

Merged
merged 215 commits into from
Jan 2, 2025
Changes from 1 commit
Commits
6022168
first change to new format
luissian Nov 2, 2023
5c7e3e3
First changes to the new refactorization
luissian Dec 19, 2023
dc10a7a
Implemented analyze_schema in multiple cpus
luissian Dec 26, 2023
dacde99
Liting analyze_schema
luissian Dec 26, 2023
beb0413
Added code to test alfaclust results
luissian Dec 28, 2023
a25365c
Implemented Analyze schema
luissian Jan 4, 2024
d212ab4
Merge pull request #11 from luissian/develop
luissian Jan 5, 2024
0874569
Update analyze schema with Comments in previous PR. Added liting work…
luissian Jan 8, 2024
a3b02f5
fixiing some liting
luissian Jan 8, 2024
1335166
fixiing more liting errors
luissian Jan 8, 2024
4a21e14
changed file extension for old python files
luissian Jan 8, 2024
7b09ba2
fixing latest liting
luissian Jan 8, 2024
fdae124
fixing latest liting 2
luissian Jan 8, 2024
ff975a0
fixing latest liting 3
luissian Jan 8, 2024
c5792fe
checking liting in functions which are defined the type of variable
luissian Jan 16, 2024
a67d30f
Checking testing file
luissian Jan 17, 2024
d1412cf
first draft to run test
luissian Jan 17, 2024
ad0a2cf
first draft to run test
luissian Jan 17, 2024
38ed823
added deps to github action test
luissian Jan 17, 2024
50011e3
Updated test and environment for conda installation
luissian Jan 17, 2024
38c69c2
Removed python packages from conda and move to pip
luissian Jan 17, 2024
a04af8e
remove Self from annotation
luissian Jan 17, 2024
ee8f6d6
liting
luissian Jan 17, 2024
9deea44
Updated with comments in PR#13. Adding testing analyze schema with te…
luissian Jan 22, 2024
e6a81fb
fixing liting and error testing
luissian Jan 22, 2024
ccbe7c6
Again trying to fix liting and testing
luissian Jan 22, 2024
844a2ea
modified schema input parameter
luissian Jan 22, 2024
7235f22
correcting wrong path of schema
luissian Jan 22, 2024
70f451e
including echo ls to know which is the working path
luissian Jan 22, 2024
dd2429c
including ls to know which is the working path
luissian Jan 22, 2024
d4002f6
removing variable
luissian Jan 22, 2024
a77d859
activate conda environment
luissian Jan 22, 2024
85c579a
added conda init before activate conda base
luissian Jan 22, 2024
aa893ae
activte conda env with the source command and the activate
luissian Jan 22, 2024
38bb761
testing how to run prokka
luissian Jan 22, 2024
c568162
testing how to run prokka_2
luissian Jan 22, 2024
cfdf752
testing 1
luissian Jan 22, 2024
32242b1
test2
luissian Jan 22, 2024
10c21dd
test3
luissian Jan 22, 2024
c280e14
test4
luissian Jan 22, 2024
2924816
test5
luissian Jan 22, 2024
2ae8222
test5
luissian Jan 22, 2024
c171635
test6
luissian Jan 22, 2024
97003ea
added kaleido package
luissian Jan 22, 2024
a81f27d
Udpated code with latest comment in PR
luissian Jan 23, 2024
7a0aea5
replace deprecate setup.py for pyproject.toml
luissian Jan 24, 2024
96eb877
reduce the 2 loop for checking the duplicated and sub allele
luissian Jan 25, 2024
30844af
remove unnecesary comments
luissian Jan 25, 2024
fe8f691
testing new pyproject.toml file
luissian Jan 25, 2024
792e3d4
fixing liting and update test
luissian Jan 25, 2024
b6d5841
update installation taranis for testing
luissian Jan 25, 2024
09732f6
include all command in the same run
luissian Jan 25, 2024
742aed3
Split the number of cpus used for prokka and app
luissian Jan 25, 2024
999f80b
commit to start with reference allele feature
luissian Jan 28, 2024
d2b4c2a
Including poetry.lock in gitignore
luissian Jan 28, 2024
6a8a78a
Fixing liting
luissian Jan 28, 2024
54fc96f
Fixing liting
luissian Jan 28, 2024
ee9d3c2
created distance matrix class
luissian Feb 4, 2024
72d2395
saving work. Clustering sequences is done, pending convert to allele …
luissian Feb 4, 2024
bdb1b3a
commit unsaved changes
luissian Feb 4, 2024
e03691d
formating files
luissian Feb 4, 2024
fe8fb8e
fixing liting
luissian Feb 4, 2024
fd12a42
fixing liting 2
luissian Feb 4, 2024
aa591c0
fixing liting 3
luissian Feb 4, 2024
48b954f
updating package files
luissian Feb 5, 2024
8e1c65d
Checking if conflicts are gone
luissian Feb 5, 2024
e079dbc
Added closest distance
luissian Feb 5, 2024
bf974c4
Implemented cluster center
luissian Feb 6, 2024
d4d8df2
Completed saved reference alleles to file
luissian Feb 6, 2024
e568594
Modified project.toml to run script
luissian Feb 6, 2024
a34912b
remove poetry when installed taranis
luissian Feb 6, 2024
f57b881
added documentation for each function
luissian Feb 7, 2024
019ae42
Fixed liting
luissian Feb 7, 2024
a8a8b94
implemented evaluation cluster
luissian Feb 10, 2024
0faf7da
Implemented parallel and dinamic clustering
luissian Feb 17, 2024
7138e25
fixed liting
luissian Feb 17, 2024
5f2785f
liting for eval_cluster
luissian Feb 17, 2024
c917d04
liting for eval_cluster 2
luissian Feb 17, 2024
9efcc58
liting for eval_cluster 3
luissian Feb 17, 2024
ba544c6
adding files for docs and testing with pytest
luissian Mar 1, 2024
ac191ae
working on allele calling, exact match
luissian Mar 6, 2024
cf06950
fixed issue when reading on pandas long files
luissian Mar 9, 2024
3150088
added classification
luissian Mar 9, 2024
b9361ee
liting
luissian Mar 11, 2024
f6a3ea1
correcting liting
luissian Mar 11, 2024
7503ea9
Fixing issues described in PR's comments
luissian Mar 11, 2024
7cab1d3
Added conda env before execute test
luissian Mar 11, 2024
362dd9f
fixing error in command line of reference-alleles
luissian Mar 11, 2024
41538fa
adding table of graphic data
luissian Mar 12, 2024
950845b
trying to fix too many opened files
luissian Mar 13, 2024
7838992
removed pdb tag
luissian Mar 13, 2024
432d76c
delete old utils
saramonzon Mar 28, 2024
b8fa603
reduced default resolution to 0.75
saramonzon Mar 28, 2024
7ec2356
reduced default resolution to 0.75
saramonzon Mar 28, 2024
897b238
added seqCluster class from alphaclust
saramonzon Mar 28, 2024
cf73541
updated default resolution to 0.75
saramonzon Mar 28, 2024
6726c06
changed class init for seqCluster
saramonzon Mar 28, 2024
ffc7bff
updated default kmer size for mash
saramonzon Mar 28, 2024
68da1d0
sorted imports
saramonzon Mar 28, 2024
2b91303
changed cluster generation to use alphaclust SeqCluster class
saramonzon Mar 28, 2024
8cf03e5
updated default blast params
saramonzon Mar 29, 2024
c82a4a8
fixed variable name
saramonzon Apr 1, 2024
f9c9b0b
changed cluster center to maximum number of alleles with more than 0.…
saramonzon Apr 1, 2024
d7867bd
changet value to param
saramonzon Apr 1, 2024
23b725b
linting
saramonzon Apr 1, 2024
80f0198
linting new version black
saramonzon Apr 1, 2024
7687db3
removed trailing whitespaces
saramonzon Apr 1, 2024
571dcf0
removed bug, sometimes alleles not float
saramonzon Apr 1, 2024
e5c903a
reduced id threshold for reference allele evaluation from 0.90 to 0.85
saramonzon Apr 2, 2024
40fd747
create commit for testing
luissian Mar 12, 2024
23a0a38
added file for checking allele type
luissian Mar 12, 2024
7078111
creating collect_data function
luissian Mar 12, 2024
b73a704
liting
luissian Mar 12, 2024
afe70ff
change to use node.js 20
luissian Mar 12, 2024
b1b06c9
Adding product annotation
luissian Mar 13, 2024
00c3693
Include product in annotation file missing in previous commit
luissian Mar 13, 2024
aac7cd6
added annotation information to output files
luissian Mar 13, 2024
b710b08
created inferred class to track inferred alleles
luissian Mar 16, 2024
683b246
litting
luissian Mar 16, 2024
169acc0
solving liting
luissian Mar 16, 2024
1cf8528
create finally at try when searching for distance
luissian Mar 17, 2024
f36a2c5
fixed bug in saving annotation per allele
luissian Mar 17, 2024
ba108b8
Check req programs before starting
luissian Mar 18, 2024
f0291a1
save code before using valid result from blast
luissian Mar 18, 2024
8ad751c
implemented NIPH and NIPHEM clasification
luissian Mar 18, 2024
01e6a03
fixing liting and error in program parameter
luissian Mar 18, 2024
4ee8bd8
Implemented SNP file
luissian Mar 19, 2024
fbc76fa
added graphics per allele classification
luissian Mar 19, 2024
f0d7921
implemented alignment and parallel
luissian Mar 21, 2024
a93d300
correcting litin
luissian Mar 22, 2024
a03848e
removing comma at the end onf line in allele_calling_match file
luissian Mar 23, 2024
a2f3511
added comment changes at PR 17
luissian Mar 24, 2024
c41824c
adding docstring and include threshold parameter
luissian Mar 24, 2024
d660401
remove the fix value and assign it to threshold parameter
luissian Mar 24, 2024
2ab64e7
included reference allele sequence
luissian Mar 25, 2024
03fed19
prevent that 2 instances call the method at the same time
luissian Mar 25, 2024
334db03
add threshold parameter
luissian Mar 25, 2024
6692e20
re-writing the classification alleles
luissian Mar 27, 2024
a2e779d
implemented percentage identity as parameter, default is 90
luissian Mar 27, 2024
e3996b8
update the snp implementation including new fields in the snp output …
luissian Mar 27, 2024
1c6e8cf
solving liting
luissian Mar 27, 2024
72cf5c8
Partial implementation of multi alignment feature
luissian Mar 29, 2024
04cb801
add function to extend sequence to find stop codon
luissian Apr 7, 2024
8ef97dd
Added extension sequences when start codon is not found because is trunk
luissian Apr 7, 2024
41804bc
fixing litin
luissian Apr 7, 2024
c69a8d2
fixing litin on allele_calling
luissian Apr 7, 2024
83a03e4
Fixed issue on not start codon
luissian Apr 9, 2024
32c87ec
fixed ouput data when LNF
luissian Apr 9, 2024
5b0bc00
liting
luissian Apr 9, 2024
bcda972
liting 2
luissian Apr 9, 2024
4b0377e
Implementing parallel at multi alignment
luissian Apr 10, 2024
d9269ce
litin
luissian Apr 10, 2024
042dcb3
check if mafft is installed
luissian Apr 11, 2024
a2ffba4
litin
luissian Apr 11, 2024
003447f
implemented distance matrix
luissian Apr 13, 2024
adc9992
liting
luissian Apr 13, 2024
aa3b354
Update information with input parameters and outfiles description
luissian Apr 15, 2024
2e50f88
fixed comments, added eval id as parameter for reference alleles
saramonzon Apr 11, 2024
bec96c6
added eval_identity to parallel execution function
saramonzon Apr 15, 2024
2761905
added left eval_id to EvaluateCluster call
saramonzon Apr 15, 2024
ef96e75
variable renaming and psudocode
saramonzon Apr 16, 2024
930ddf0
first draft code for prot conversion and extend sequence fix
saramonzon Apr 16, 2024
dee3b6a
fixed bug when niph/niphem, removed checking allele match as not impo…
saramonzon Apr 16, 2024
97b17f8
fixed update classification for niph/niphem, fixed wrong indent in fi…
saramonzon Apr 16, 2024
8f11892
changed exact match detection from grep to biopython
saramonzon Apr 16, 2024
26bf2af
removed grep execution
saramonzon Apr 16, 2024
b150297
TPR when any protein translation error, fixed bug when b_split_data i…
saramonzon Apr 16, 2024
ac5e4f5
added some twicks when strand is -, and some linting
saramonzon Apr 16, 2024
f498fb5
fixed LNF in allele_match.tsv output, variable renaming, comment for …
saramonzon Apr 17, 2024
fe73f6b
renaming, changed output of allele details from list to dict, moved e…
saramonzon Apr 17, 2024
4c1671d
fixed wrong unpack in extend seq find function call
saramonzon Apr 17, 2024
15213b1
sort locus_names when printing results. Added search for EXC match af…
saramonzon Apr 18, 2024
a365dc9
function organization, linting and added function when sequence is no…
saramonzon Apr 19, 2024
1e18108
added comments
saramonzon Apr 19, 2024
2a6e353
linting
saramonzon Apr 19, 2024
68a5f3c
variable renaming from allele to locus for clarity
saramonzon Apr 19, 2024
6de10d9
added mash blast correlation script
saramonzon Apr 27, 2024
48f48db
adapted mash blast correlation script
saramonzon Apr 29, 2024
f3a6476
added requiremens for assets
saramonzon Apr 29, 2024
c084830
a couple of fixes in mash_blast_correlation script
saramonzon Apr 29, 2024
26bc4a6
fix result df creation in mash blat correlation script
saramonzon Apr 29, 2024
62786a1
linting mash blast correlation script
saramonzon Apr 29, 2024
8163d80
fix typo in requirements.txt
saramonzon Apr 30, 2024
578a838
renaming and cleaning main
saramonzon Apr 30, 2024
3f1a489
rewriten hamming distance, tested and fixed bugs
saramonzon Apr 30, 2024
3c48772
renaming def in utils
saramonzon Apr 30, 2024
c4b9110
rewritten filter_df function, now filtering both rows and columns, ma…
saramonzon Apr 30, 2024
8f46223
rename variable for clarity, add dtype to read_csv
saramonzon May 1, 2024
4d4607c
clarified help messages, fixed default value for keeping a locus in d…
saramonzon May 1, 2024
dbfda28
removed missing pdb
saramonzon May 1, 2024
317a99a
fixed big fingers typo
saramonzon May 1, 2024
2e7ce82
variable renaming and fixed bounds for condition
saramonzon May 1, 2024
f805814
minor modifications in graph for mash blast corr script
saramonzon May 1, 2024
769758f
added filter with regex
saramonzon May 1, 2024
418c4f2
added notebook for benchmarking analysis
saramonzon May 1, 2024
bd2f495
added masking after filtering per row in df_filter
saramonzon May 1, 2024
7e381ca
black linting
saramonzon May 1, 2024
df4e0f6
fix snp output printing
saramonzon May 3, 2024
7b6729a
minor modifications adding seqsphere for notebook, added blast_id_lis…
saramonzon May 7, 2024
455e7dc
modifications in jup notebook
saramonzon Dec 30, 2024
ae50002
change name from taranis to taranys
saramonzon Jan 2, 2025
c55e33e
rename folder and delete unused files
saramonzon Jan 2, 2025
da77627
changed permissions to file
saramonzon Jan 2, 2025
0d1099e
deleted setup.py, added pip publish workflow to github actions
saramonzon Jan 2, 2025
332ddd8
modified pyproject.toml
saramonzon Jan 2, 2025
a3e29c3
moved to correct path pypi_publish workflow
saramonzon Jan 2, 2025
7b6bd65
added contributing guidelines file
saramonzon Jan 2, 2025
cfb5db6
fix path in pypi workflow
saramonzon Jan 2, 2025
bdf21a9
updated and fixed dependencies
saramonzon Jan 2, 2025
c511058
renamed param in analyze schema
saramonzon Jan 2, 2025
e0f7f25
fixes in params help, added show default when appropiate
saramonzon Jan 2, 2025
4901b30
readme docs are updated
saramonzon Jan 2, 2025
cc713b8
added changelog
saramonzon Jan 2, 2025
fb2c7ab
updated gitignore
saramonzon Jan 2, 2025
1de6d7a
fix tests workflow gact
saramonzon Jan 2, 2025
Liting analyze_schema
luissian committed Dec 26, 2023
commit dacde9975f3709ca16f18164e74d2f751d8f9e7a
19 changes: 16 additions & 3 deletions taranis/__main__.py
@@ -129,6 +129,7 @@ def taranis_cli(verbose, log_file):
# testing data for analyze schema
# taranis analyze-schema -i /media/lchapado/Reference_data/proyectos_isciii/taranis/taranis_testing_data/listeria_testing_schema -o /media/lchapado/Reference_data/proyectos_isciii/taranis/test/analyze_schema


@taranis_cli.command(help_priority=1)
@click.option(
"-i",
@@ -193,8 +194,7 @@ def analyze_schema(
usegenus,
):
schema_files = taranis.utils.get_files_in_folder(inputdir, "fasta")



"""
schema_analyze = {}
for schema_file in schema_files:
@@ -205,7 +205,20 @@ def analyze_schema(
# for schema_file in schema_files:
results = []
with concurrent.futures.ProcessPoolExecutor() as executor:
futures = [executor.submit(taranis.analyze_schema.prueba_paralelizacion, schema_file, output, remove_subset, remove_duplicated, remove_no_cds, genus, species, usegenus) for schema_file in schema_files]
futures = [
executor.submit(
taranis.analyze_schema.prueba_paralelizacion,
schema_file,
output,
remove_subset,
remove_duplicated,
remove_no_cds,
genus,
species,
usegenus,
)
for schema_file in schema_files
]
# Collect results as they complete
for future in concurrent.futures.as_completed(futures):
results.append(future.result())
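
Note on the hunk above: the change only reflows the `executor.submit(...)` call, but the pattern it uses (submit one task per schema file, then gather with `as_completed`) can be sketched in isolation. This is a minimal sketch, not the project's code: `analyze_one` is a hypothetical stand-in for `taranis.analyze_schema.prueba_paralelizacion`, its argument list is shortened, and the `executor_cls` parameter is an assumption added here so the same function can run with a thread pool.

```python
import concurrent.futures


def analyze_one(schema_file, output, remove_duplicated):
    """Hypothetical per-file worker; the real one runs the schema analysis."""
    return {
        "allele_name": schema_file,
        "output": output,
        "remove_duplicated": remove_duplicated,
    }


def run_all(
    schema_files,
    output,
    remove_duplicated,
    executor_cls=concurrent.futures.ProcessPoolExecutor,  # assumption: injectable
):
    """Submit one task per schema file and collect results as they finish."""
    results = []
    with executor_cls() as executor:
        futures = [
            executor.submit(analyze_one, schema_file, output, remove_duplicated)
            for schema_file in schema_files
        ]
        # as_completed yields futures in completion order, not submission order
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    return results
```

Because results arrive in completion order, any consumer that needs a stable order would have to sort them (for example by `allele_name`). With a process pool the worker must be a module-level function so it can be pickled, and on platforms that spawn new interpreters the call should sit under an `if __name__ == "__main__":` guard.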
105 changes: 62 additions & 43 deletions taranis/analyze_schema.py
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@
import taranis.utils

import pdb

log = logging.getLogger(__name__)
stderr = rich.console.Console(
stderr=True,
@@ -19,36 +20,36 @@
force_terminal=taranis.utils.rich_force_colors(),
)


class AnalyzeSchema:
def __init__(
self,
schema_allele,
output,
remove_subset,
remove_duplicated,
remove_no_cds,
genus,
species,
usegenus
):
self.schema_allele = schema_allele
self.allele_name = Path(self.schema_allele).stem
self.output = output
self.remove_subset = remove_subset
self.remove_duplicated = remove_duplicated
self.remove_no_cds = remove_no_cds
self.genus = genus
self.species = species
self.usegenus = usegenus

self,
schema_allele,
output,
remove_subset,
remove_duplicated,
remove_no_cds,
genus,
species,
usegenus,
):
self.schema_allele = schema_allele
self.allele_name = Path(self.schema_allele).stem
self.output = output
self.remove_subset = remove_subset
self.remove_duplicated = remove_duplicated
self.remove_no_cds = remove_no_cds
self.genus = genus
self.species = species
self.usegenus = usegenus

def check_allele_quality (self):
def check_allele_quality(self):
a_quality = {}
allele_seq = {}
bad_quality_record = []
with open(self.schema_allele) as fh:
for record in SeqIO.parse(self.schema_allele, "fasta"):
a_quality[record.id] = {"quality": "Good quality", "reason": "-" }
a_quality[record.id] = {"quality": "Good quality", "reason": "-"}
allele_seq[record.id] = str(record.seq)
a_quality[record.id]["length"] = len(str(record.seq))
if len(record.seq) % 3 != 0:
@@ -67,27 +68,30 @@ def check_allele_quality (self):
else:
record_sequence = str(record.seq)
a_quality[record.id]["order"] = sequence_order
if record_sequence[0:3] not in taranis.utils.START_CODON_FORWARD :
if record_sequence[0:3] not in taranis.utils.START_CODON_FORWARD:
a_quality[record.id]["quality"] = "Bad quality"
a_quality[record.id]["reason"] = "Start codon not found"
continue
if record_sequence[-3:] not in taranis.utils.STOP_CODON_FORWARD :
if record_sequence[-3:] not in taranis.utils.STOP_CODON_FORWARD:
a_quality[record.id]["quality"] = "Bad quality"
a_quality[record.id]["reason"] = "Stop codon not found"
continue
if taranis.utils.find_multiple_stop_codons(record_sequence):
a_quality[record.id]["quality"] = "Bad quality"
a_quality[record.id]["reason"] = "Multiple stop codons found"
continue
if self.remove_no_cds and a_quality[record.id]["quality"] == "Bad quality":
bad_quality_record.append(record.id)

if (
self.remove_no_cds
and a_quality[record.id]["quality"] == "Bad quality"
):
bad_quality_record.append(record.id)

if self.remove_duplicated:
# get the unique sequences and compare the length with all sequences
unique_seq = list(set(list(allele_seq.values())))
if len(unique_seq) < len(allele_seq):
tmp_dict = {}
for rec_id , seq_value in allele_seq.items():
for rec_id, seq_value in allele_seq.items():
if seq_value not in tmp_dict:
tmp_dict[seq_value] = 0
else:
@@ -116,50 +120,65 @@ def check_allele_quality (self):
def fetch_statistics_from_alleles(self, a_quality):
record_data = {}
bad_quality_reason = {}
a_length = []
a_length = []
bad_quality_counter = 0
for record_id in a_quality.keys():
record_data["allele_name"] = self.allele_name
a_length.append(a_quality[record_id]["length"])
if a_quality[record_id]["quality"] == "Bad quality":
bad_quality_counter += 1
bad_quality_reason[a_quality[record_id]["reason"]] = bad_quality_reason.get(a_quality[record_id]["reason"], 0 ) +1
bad_quality_reason[a_quality[record_id]["reason"]] = (
bad_quality_reason.get(a_quality[record_id]["reason"], 0) + 1
)
total_alleles = len(a_length)
record_data["min_length"] = min(a_length)
record_data["max_length"] = max(a_length)
record_data["num_alleles"] = total_alleles
record_data["mean_length"] = round(statistics.mean(a_length),2)
record_data["good_percent"] = round(100*(total_alleles - bad_quality_counter) / total_alleles, 2)
record_data["mean_length"] = round(statistics.mean(a_length), 2)
record_data["good_percent"] = round(
100 * (total_alleles - bad_quality_counter) / total_alleles, 2
)
return record_data


def analyze_allele_in_schema(self):
allele_data = {}
# Perform quality
a_quality = self.check_allele_quality()
# run annotations
prokka_folder = os.path.join(self.output, "prokka", self.allele_name)
anotation_files = taranis.utils.create_annotation_files(self.schema_allele, prokka_folder, self.allele_name)
allele_data["annotation_gene"] = taranis.utils.read_annotation_file(anotation_files+ ".tsv", self.allele_name).get(self.allele_name)
anotation_files = taranis.utils.create_annotation_files(
self.schema_allele, prokka_folder, self.allele_name
)
allele_data["annotation_gene"] = taranis.utils.read_annotation_file(
anotation_files + ".tsv", self.allele_name
).get(self.allele_name)
allele_data.update(self.fetch_statistics_from_alleles(a_quality))
return allele_data

def prueba_paralelizacion(schema_allele,


def prueba_paralelizacion(
schema_allele,
output,
remove_subset,
remove_duplicated,
remove_no_cds,
genus,
species,
usegenus,
):
schema_obj = AnalyzeSchema(
schema_allele,
output,
remove_subset,
remove_duplicated,
remove_no_cds,
genus,
species,
usegenus
):
schema_obj = AnalyzeSchema(schema_allele, output, remove_subset, remove_duplicated, remove_no_cds, genus, species, usegenus)
usegenus,
)
return schema_obj.analyze_allele_in_schema()


def collect_statistics(stat_data):


stats_df = pd.DataFrame(stat_data)
print(stats_df)
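
Note on `check_allele_quality` and `fetch_statistics_from_alleles`: the checks visible in this diff (start codon, stop codon, multiple stop codons) and the summary fields can be sketched as standalone functions. This is a rough sketch under stated assumptions: the codon tuples below are guesses, since the contents of `taranis.utils.START_CODON_FORWARD` and `STOP_CODON_FORWARD` are not shown; `count_in_frame_stops` is a hypothetical replacement for `taranis.utils.find_multiple_stop_codons`; the reason string for the length check is invented because that branch is elided in the diff; and only the forward strand is handled.

```python
import statistics

START_CODONS = ("ATG", "GTG", "TTG")  # assumption, not taken from taranis.utils
STOP_CODONS = ("TAA", "TAG", "TGA")  # assumption, not taken from taranis.utils


def count_in_frame_stops(seq):
    """Count stop codons in reading frame 0 (hypothetical helper)."""
    return sum(1 for i in range(0, len(seq) - 2, 3) if seq[i : i + 3] in STOP_CODONS)


def classify_allele(seq):
    """Return (quality, reason) for a forward-strand allele sequence."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        # Reason text is an assumption; the diff elides this branch.
        return "Bad quality", "Length not multiple of 3"
    if seq[:3] not in START_CODONS:
        return "Bad quality", "Start codon not found"
    if seq[-3:] not in STOP_CODONS:
        return "Bad quality", "Stop codon not found"
    if count_in_frame_stops(seq) > 1:
        return "Bad quality", "Multiple stop codons found"
    return "Good quality", "-"


def summarize(alleles):
    """Build the length and quality summary for a dict of id -> sequence."""
    lengths = [len(seq) for seq in alleles.values()]
    bad = sum(
        1 for seq in alleles.values() if classify_allele(seq)[0] == "Bad quality"
    )
    total = len(lengths)
    return {
        "num_alleles": total,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": round(statistics.mean(lengths), 2),
        "good_percent": round(100 * (total - bad) / total, 2),
    }
```

Returning early from each check mirrors the `continue` statements in the diff: an allele is reported with the first failing reason only.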