Plan for Re-organizing the WGS-HGT codebase
Rationale
The original WGS-HGT was a workflow to benchmark multiple currently available HGT detection tools. Since then, the goal of the project has evolved and expanded considerably, and some coding techniques and standards have been updated as well. I therefore plan to re-organize the codebase to bring it up to date.
Here we define WGS-HGT as a loose repository that hosts all relevant code under the larger framework of the "Web of Life" project. Code lives here until it is migrated to a more suitable repository.
Naming
The phrase "WGS-HGT" isn't easy to pronounce, and it doesn't precisely describe the current plan of the whole project. People have suggested two candidates:
"weboflife": the same as the project name and the least confusing, but it looks a bit awkward when lowercased and merged.
"horizomer": the complete set of horizontally acquired genes in a genome would be called a "horizome", and the goal of the software package is to identify them.
What do you think? Any new ideas are welcome!
Structure
The codebase shall be divided into the following second-level directories under wgshgt:
wrapper: Code for running third-party programs, reformatting inputs and parsing outputs.
Each program occupies one subdirectory, for modularity. The subdirectory should contain one Python script that provides a programming interface for communicating with the program, plus Bash scripts if necessary.
Code that automates the installation of the programs should be included too, but its actual content can be migrated to conda recipes, leaving only the interface here.
data: Code for constructing or retrieving data (e.g., random gene shuffler, genome evolution simulator, genome downloader), actual datasets (if small), and descriptions of large, external test datasets (if not automatically retrievable).
reference: Code for building reference databases, including the genome pool, gene family pool, species tree, gene trees, etc., or just descriptions of reference databases.
If the scripts have to call external programs to fulfill their function, they should call the wrappers rather than launching the programs themselves, unless the programs are very generic (e.g., a GNU tool).
predict: Code for inferring HGT and other evolutionary events on individual input genomes. This is for end users to analyze their own datasets.
render: Code for visualizing trees, networks and other forms of display items.
benchmark: Code for benchmarking HGT-prediction methods and other tools.
misc: Code that does not fit into the existing categories, or that has not been sufficiently engineered to live in other directories.
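To make the wrapper convention above more concrete, here is a minimal sketch of what one wrapper module might look like. Everything in it is an assumption for illustration only: the function names (build_command, parse_output, run), the tab-delimited "gene, score" output format, and the idea that the program takes the input file as its last argument are hypothetical and not taken from any existing tool.

```python
import subprocess
import sys
from typing import Dict, List


def build_command(program: List[str], input_fp: str) -> List[str]:
    """Assemble the argument list for the external program.

    Hypothetical convention: the input file is the last argument.
    """
    return program + [input_fp]


def parse_output(text: str) -> Dict[str, float]:
    """Parse tab-delimited 'gene<TAB>score' lines into a dict.

    Blank lines and lines starting with '#' are skipped.
    The format itself is an assumption made for this sketch.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        gene, score = line.split('\t')
        result[gene] = float(score)
    return result


def run(program: List[str], input_fp: str) -> Dict[str, float]:
    """Run the external program and return its parsed output.

    check=True raises CalledProcessError if the program fails.
    """
    cmd = build_command(program, input_fp)
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_output(proc.stdout)


# Stand-in for a real third-party program: a Python one-liner that
# prints two fake predictions and ignores its input argument.
FAKE_PROGRAM = [sys.executable, '-c',
                "print('geneA\\t0.9'); print('geneB\\t0.1')"]
```

With this layout, scripts in reference, predict or benchmark would call run(FAKE_PROGRAM, 'genome.fna') (or its real counterpart) instead of launching the program themselves, so the command line and the output format are handled in one place.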
Each directory may contain a tests directory to host unit test scripts. Each tests directory may contain a data directory to store small data files for unit tests. The unit tests may also access datasets in the first-level data directory.
Because individual steps for predicting, rendering and benchmarking may have to be executed in different work environments, most scripts should have a command-line interface (via click).
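As a rough illustration of the click-based interface mentioned above, a script might look like the following sketch. The command name, the option names and the placeholder task (counting FASTA records) are assumptions made for the example; a real prediction step would replace the body.

```python
import click


@click.command()
@click.option('--input', '-i', 'input_fp', required=True,
              type=click.Path(exists=True, dir_okay=False),
              help='Input genome file (FASTA).')
@click.option('--output', '-o', 'output_fp', required=True,
              type=click.Path(dir_okay=False),
              help='Output file for the result table.')
def predict(input_fp, output_fp):
    """Count sequences in a FASTA file (placeholder for a real step)."""
    # Count header lines, i.e. lines starting with '>'.
    with open(input_fp) as f:
        n = sum(1 for line in f if line.startswith('>'))
    # Write a one-line, tab-delimited summary.
    with open(output_fp, 'w') as f:
        f.write(f'sequences\t{n}\n')
    click.echo(f'Wrote {output_fp}')


# In a standalone script, end the file with:
#     if __name__ == '__main__':
#         predict()
```

Saved as, say, predict.py (a hypothetical file name) with the guard above added, the step could then be run in any environment as: python predict.py -i genome.fna -o result.tsv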
Please share your thoughts with everyone. Thank you!
@ekopylova @wasade @RNAer @mortonjt @sjanssen2 @antgonza @tkosciol