(WIP) Re-organize the codebase #68

qiyunzhu · 2017-05-27T04:36:25Z

Plan for Re-organizing the WGS-HGT codebase

Rationale

The original WGS-HGT was a workflow for benchmarking multiple currently available HGT detection tools. Since then, the goals of the project have evolved and expanded considerably, and some coding techniques and standards have been updated as well. I am therefore planning to re-organize the codebase to bring it up to date.

Here we define WGS-HGT as a loose repository that hosts all relevant code under the larger framework of the "Web of Life" project. Code lives here until it is migrated to a more suitable repository.

Naming

The phrase "WGS-HGT" isn't easy to pronounce, and doesn't precisely describe the current plan for the whole project. People have suggested two candidates:

"weboflife", the same as the project name and the least confusing, but it looks a bit awkward when lowercased and merged.
"horizomer", which indicates that the complete set of horizontally acquired genes in a genome should be called a "horizome", and the goal of the software package is to identify them.

What do you think? Any new ideas are welcome!

Structure

The codebase shall be divided into the following second-level directories under wgshgt:

wrapper: Code for running third-party programs, reformatting inputs and parsing outputs.
- Each program occupies its own subdirectory, for modularity. The directory should contain one Python script that provides a programming interface for communicating with the program, plus Bash scripts if necessary.
- Code that automates the installation of the programs should be included too, but its actual content can be migrated to conda recipes, leaving only the interface.
data: Code for constructing or retrieving data (e.g., random gene shuffler, genome evolution simulator, genome downloader), actual datasets (if small), and descriptions of large, external test datasets (if not automatically retrievable).
reference: Code for building reference databases, including the genome pool, gene family pool, species tree, gene trees, etc., or just descriptions of reference databases.
- If a script has to call external programs to fulfill its function, it should call the wrappers rather than launching the programs itself, unless the programs are very generic (e.g., a GNU tool).
predict: Code for inferring HGT and other evolutionary events on individual input genomes. This is for end users to analyze their own datasets.
render: Code for visualizing trees, networks and other display items.
benchmark: Code for benchmarking HGT-prediction methods and other tools.
misc: Code that does not fit into the existing categories, or that has not been sufficiently engineered to live in other directories.
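
To make the wrapper idea more concrete, here is a minimal sketch of what the single Python script in a wrapper subdirectory might look like. Everything in it is an assumption for illustration: the function names (`build_command`, `run_program`, `parse_output`), the argument order, and the tab-delimited "gene, score" output format are hypothetical and do not describe any existing tool or the current codebase.

```python
import subprocess


def build_command(executable, input_fp, extra_args=()):
    """Assemble the argument list for a third-party program.

    The argument order (options first, input file last) is an assumption;
    a real wrapper would follow the conventions of the wrapped tool.
    """
    return [executable, *extra_args, input_fp]


def run_program(command):
    """Run the program and return its standard output as text.

    check=True raises CalledProcessError on a non-zero exit status,
    so failures of the external tool are not silently ignored.
    """
    completed = subprocess.run(
        command, capture_output=True, text=True, check=True)
    return completed.stdout


def parse_output(text):
    """Parse hypothetical tab-delimited 'gene<TAB>score' lines into a dict.

    Blank lines and lines starting with '#' are skipped.
    """
    scores = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        gene, score = line.split("\t")
        scores[gene] = float(score)
    return scores
```

Keeping the three steps (build, run, parse) as separate functions would let scripts in reference, predict and benchmark call the wrapper without launching the program themselves, and would let the parsing be unit-tested without the third-party program installed.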

Each directory may contain a tests directory to host unit test scripts. Each tests directory may contain a data directory to store small data files for unit tests. Unit tests may also access datasets in the first-level data directory.
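
One possible way for a unit test to locate its data under this layout is sketched below. The helper name `find_test_data` and its parameters are hypothetical; the lookup order (local tests/data first, then the first-level data directory) is an assumption, not something decided in this plan.

```python
import os


def find_test_data(filename, tests_dir, shared_data_dir):
    """Return the path of a test data file.

    Looks in <tests_dir>/data first, then in the shared first-level
    data directory. Raises FileNotFoundError if neither has the file.
    """
    candidates = (os.path.join(tests_dir, "data"), shared_data_dir)
    for base in candidates:
        path = os.path.join(base, filename)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(filename)
```

A test module could call it with `os.path.dirname(os.path.abspath(__file__))` as `tests_dir`, so that tests do not depend on the current working directory.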

Because individual steps for predicting, rendering and benchmarking may have to be executed in different work environments, most scripts should provide a command-line interface (via click).
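
As a rough illustration of the command-line requirement, a script could expose one step as a click command along the following lines. The command name `predict`, its options, and the tab-delimited input format are hypothetical placeholders, not part of the existing codebase.

```python
import click


@click.command()
@click.option("--input", "-i", "input_fp", required=True,
              type=click.Path(exists=True, dir_okay=False),
              help="Tab-delimited file of gene IDs and scores.")
@click.option("--min-score", default=0.5, show_default=True, type=float,
              help="Report genes with a score at or above this value.")
def predict(input_fp, min_score):
    """Print the IDs of genes whose score passes the threshold.

    This stands in for a real prediction step; only the CLI shape matters.
    """
    with open(input_fp) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            gene, score = line.split("\t")
            if float(score) >= min_score:
                click.echo(gene)
```

In a real script the command would be invoked by calling `predict()` under an `if __name__ == "__main__":` guard, and could be exercised in unit tests through `click.testing.CliRunner` without spawning a subprocess.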

Please share your thoughts. Thank you!

@ekopylova @wasade @RNAer @mortonjt @sjanssen2 @antgonza @tkosciol


qiyunzhu changed the title ~~(WIP) Reorganize WGS-HGT codebase~~ (WIP) Re-organize the codebase May 27, 2017
