diff --git a/docs_manual/index.rst b/docs_manual/index.rst index c5bfe1a9..5019cdf3 100644 --- a/docs_manual/index.rst +++ b/docs_manual/index.rst @@ -18,14 +18,12 @@ Manual | :ref:`Upload fastq files to SODAR ` | :ref:`Upload results of the Seasnap pipeline to SODAR ` | :ref:`Create a sample info file for Sea-snap ` - | :ref:`Tools for archiving old projects ` Use cases Use cases for common processing tasks. | :ref:`Exome sequencing ` | :ref:`Clinical single cell pipeline ` - | :ref:`Archiving projects ` Project Info More information on the project, including the changelog, list of contributing authors, and @@ -56,7 +54,6 @@ Project Info man_ingest_fastq man_itransfer_results man_write_sample_info - man_archive .. toctree:: :caption: Use Cases @@ -66,7 +63,6 @@ Project Info usecase_exomes usecase_single_cell - usecase_archive_project .. toctree:: :caption: Project Info diff --git a/docs_manual/man_archive.rst b/docs_manual/man_archive.rst deleted file mode 100644 index a1cfd1fb..00000000 --- a/docs_manual/man_archive.rst +++ /dev/null @@ -1,334 +0,0 @@ -.. _man_archive: - -====================== -Manual for ``archive`` -====================== - -The ``cubi-tk archive`` is designed to facilitate the archival of older projects away from the cluster's fast file system. -This document provides an overview of these commands, and how they can be adapted to meet specific needs. - --------- -Glossary --------- - -Hot storage: Fast and expensive, therefore usually size restricted. For example: - -- GPFS by DDN (currently at ``/fast``) -- Ceph with SSDs - -Warm storage: Slower, but with more space and possibly mirroring. For example: - -- SODAR with irods -- Ceph with HDDs (``/data/cephfs-2/``) - -Cold storage: For data that needs to be accessed only rarely. For example: - -- Tape archive - ---------------------------------- -Background: the archiving process ---------------------------------- - -CUBI archive resources are three-fold: - -- SODAR and associated irods storage should contain raw data generated for the project. SODAR also contains important results (mapping, variants, differential expression, ...). -- Gitlab contains small files required to generate the results, typically scripts, configuration files, READMEs, meeting notes, ..., but also knock-in gene sequence, list of papers, gene lists, etc. -- The rest should be stored in CEPH (warm storage). - -For older projects or intermediate results produced by older pipelines the effort of uploading the data to SODAR & gitlab may not be warranted. In this case, the bulk of the archive might be stored in the CEPH file system. - -**The module aims to facilitate this last step, i.e. the archival of old projects to move them away from the hot storage.** - ------------------------------- -Archiving process requirements ------------------------------- - -Archived projects should contain all **important** files, but not data already stored elsewhere. In particular, the following files should **not** be archived: - -- raw data (``*.fastq.gz`` files) saved in SODAR or in the ``STORE``, -- data from public repositories (SRA, GDC portal, ...) that can easily be downloaded again, -- static data such as genome sequence & annotations, variant databases from gnomAD, ... that can also be easily retrieved, -- indices files for mapping that can be re-generated. 
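-For orientation, the four subcommands described under "Basic usage" below are typically run in this order; a minimal sketch (directory names, the summary filename and the rules file are placeholders, not defaults):
-
-.. code-block:: bash
-
-   # 1. List files that need attention (large files, symlinks, ...)
-   $ cubi-tk archive summary PROJECT_DIRECTORY summary.tsv
-
-   # 2. Create the README.md required for archival, then edit it by hand
-   $ cubi-tk archive readme PROJECT_DIRECTORY README.md
-
-   # 3. Build the temporary symlink copy, applying project-specific rules
-   $ cubi-tk archive prepare --rules my_rules.yaml --readme README.md PROJECT_DIRECTORY TEMP_DEST
-
-   # 4. Copy to the warm storage, audit the copy, and protect it
-   $ cubi-tk archive copy --read-only TEMP_DEST FINAL_DEST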
- -**Importantly, a README file should be present in the archive, briefly describing the project, listing contacts to the client & within CUBI and providing links to SODAR & Gitlab when appropriate.** - - -**The purpose of the module is:** - -- to provide a summary of files that require special attention, for example symlinks which targets lie outside of the project, or large files (``*.fastq.gz`` or ``*.bam`` especially) -- to create a temporary directory that mimicks the archived files with symlinks, -- to use this temporary directory as template to copy files on the CEPH filesystem, and -- to compute checksums on the originals and copies, to ensure accuracy of the copy process. - - ------------ -Basic usage ------------ - - -Summary of files in project -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: bash - - $ cubi-tk archive summary PROJECT_DIRECTORY DESTINATION - -Unlike other ``cubi-tk`` commands, here ``DESTINATION`` is not a landing zone, but a local filename for the summary of files that require attention. - -By default, the summary reports: - -- dangling symlinks (also dangling because of permission), -- symlinks pointing outside of the project directory, -- large (greater than 256MB) ``*.fastq.gz``, ``*.fq.gz`` & ``*.bam`` files, -- large static data files with extension ``*.gtf``, ``*.gff``, ``*.fasta`` & ``*.fa`` (possibly gzipped), that can potentially be publicly available. -- large files from SRA with prefix ``SRR``. - -The summary file is a table with the following columns: - -- **Class**: the name(s) of the pattern(s) that match the file. When the file matches several patterns, all are listed, separated by ``|``. -- **Filename**: the relative path of the file (from the project's root). -- **Target**: the symlink's target (when applicable) -- **ResolvedName**: the resolved (absolute, symlinks removed) path of the target. When the target doesn't exist or is inaccessible because of permissions, the likely path of the target. -- **Size**: file size (target file size for symlinks). When the file doesn't exist, it is set to 0. -- **Dangling**: ``True`` when the file cannot be read (missing or inaccessible), ``False`` otherwise. -- **Outside**: ``True`` when the target path is outside of the project directory, ``False`` otherwise. It is always ``False`` for real files (_i.e._ not symlinks). - -The summary step also reports an overview of the results, with the total number of files, the total size of the project, and the number of links to files. Number of dangling links and links inaccessible because of permission issues are listed separately. Likewise, the number of files outside of the projects, which are linked to from within the project by symlinks is also quoted. Finally, for each of the "important files" classes, the number of files, the number of files outside of the project directory and the number of files lost because of symlink failures are reported. - - -Archive preparation: README.md file creation -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: bash - - $ cubi-tk archive readme PROJECT_DIRECTORY README_FILE - -``README_FILE`` is here the path to the README file that will be created. It must not exist. - -The README file will be created by filling contact information interactively. Command-line options are also available, but interactive confirmation is needed. - -It is possible to test if a generated README file is valid for project archival, using - -.. 
code-block:: bash - - $ cubi-tk archive readme --is-valid PROJECT_DIRECTORY README_FILE - -The module will highlight mandatory records that could not be found in the current file. These mandatory records are lines following the patterns below:: - - - P.I.: [Name of the PI, any string](mailto:) - - Client contact: [Name of our contact in the PI's group](mailto:) - - CUBI project leader: [Name of the CUBI member leading the project] - - CUBI contact: [Name of the archiver](mailto:) - - Project name: - - Start date: YYYY-MM-DD - - Current status: - - - -Archive preparation: temporary copy -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: bash - - $ cubi-tk archive prepare --readme README PROJECT_DIRECTORY TEMPORARY_DESTINATION - -``TEMPORARY_DESTINATION`` is the path to the temporary directory that will be created. It must not exist. - -For each file that must be archived, the module creates a symlink to that file's absolute path. The module also reproduces the project's directory hierarchy, so that each symlink sits in the same relative position in the temporary directory as in the original project. - -The module deals with symlinks in the project differently depending on whether their target is inside the project or not. For symlinks pointing outside of the project, a symlink to the target's absolute path is created. For symlinks pointing inside the project, a relative path symlink is created. This makes it possible to store all files (even those outside of the project) without duplicating symlinks inside the project. - -Additional transformations of the original files are carried out during the preparation step: - -- The contents of the ``.snakemake``, ``sge_log``, ``cubi-wrappers`` & ``snappy-pipeline`` directories are processed differently: the directories are tarred & compressed in the temporary destination, to reduce the number of inodes in the archive. -- Core dump files are not copied to the temporary destination, and therefore won't be copied to the final archive. -- The ``README.md`` file created by the ``readme`` subcommand must also be included; it is put at the temporary destination's top level. - If the original project already contains a ``README.md`` file, it will be appended to the generated one, since the latter is the valid one (it contains all mandatory information). - - -Copy to archive & verification -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: bash - - $ cubi-tk archive copy TEMPORARY_DESTINATION FINAL_DESTINATION - -``FINAL_DESTINATION`` is the path to the final destination of the archive, on the warm storage. It must not exist. - - - -------------- -Configuration -------------- - -The files reported in the summary are under user control, through the ``--classes`` option, which must point to a yaml file describing the regular expression pattern & minimum size for each class. For example, raw data files can be identified as follows: - -.. code-block:: yaml - - fastq: - min_size: 268435456 - pattern: "^(.*/)?[^/]+(\\.f(ast)?q(\\.gz)?)$" - - -Files larger than 256MB, with extension ``*.fastq``, ``*.fq``, ``*.fastq.gz`` or ``*.fq.gz``, will be reported with the class ``fastq``. -Any number of file classes can be defined. The default classes configuration is in ``cubi_tk/archive/classes.yaml``. - -The behaviour of the archive preparation can also be changed using the ``--rules`` option. The rules are likewise described in a yaml file, using regular expression patterns.
- -Three different archiving options are implemented: - -- **ignore**: the files or directories matching the pattern are simply omitted from the temporary destination. This is useful to ignore remaining temporary files, core dumps or directories containing lists of input symlinks, for example. -- **compress**: the files or directories matching the pattern will be replaced in the temporary destination by a compressed (gzipped) tar file. This is how ``.snakemake`` or ``sge_log`` directories are treated by default, but patterns for other directories may be added, for example for the Slurm log directories. -- **squash**: the files matching the pattern will be replaced by zero-length placeholders in the temporary destination. A md5 checksum file will be added next to the original file, to enable verification. - -When the user doesn't specify her own set using the ``--rules`` option, the rules applied are the following: core dumps are ignored, ``.snakemake``, ``sge_log``, ``.git``, ``snappy-pipeline`` and ``cubi_wrappers`` directories are compressed, and nothing is squashed. The exact definitions are: - -.. code-block:: yaml - - ignore: # Patterns for files or directories to skip - - "^(.*/)?core\\.[0-9]+$" - - "^(.*/)?\\.venv$" - - compress: # Patterns for files or directories to tar-gzip - - "^(.*/)?\\.snakemake$" - - "^(.*/)?sge_log$" - - "^(.*/)?\\.git$" - - "^(.*/)?snappy-pipeline$" - - "^(.*/)?cubi_wrappers$" - - squash: [] # Patterns for files to squash (compute MD5 checksum, and replace by zero-length placeholder) - - --------- -Examples --------- - -Consider an example project. It contains: - -- raw data in a ``raw_data`` directory, some of which is stored outside of the project's directory, -- processing results in the ``pipeline`` directory, -- additional data files & scripts in ``extra_data``, -- a ``.snakemake`` directory that can potentially contain many files in conda environments, for example, and -- a bunch on temporary & obsolete files that shouldn't be archived, conveniently grouped into the ``ignored_dir`` directory. - -The architecture of this toy project is displayed below:: - - - project/ - ├── extra_data - │   ├── dangling_symlink -> ../../outside/inexistent_data - │   ├── file.public - │   ├── to_ignored_dir -> ../ignored_dir - │   └── to_ignored_file -> ../ignored_dir/ignored_file - ├── ignored_dir - │   └── ignored_file - ├── pipeline - │   ├── output - │   │   ├── sample1 - │   │   │   └── results -> ../../work/sample1/results - │   │   └── sample2 -> ../work/sample2 - │   └── work - │   ├── sample1 - │   │   └── results - │   └── sample2 - │   └── results - ├── raw_data - │   ├── batch1 -> ../../outside/batch1 - │   ├── batch2 - │   │   ├── sample2.fastq.gz -> ../../../outside/batch2/sample2.fastq.gz - │   │   └── sample2.fastq.gz.md5 -> ../../../outside/batch2/sample2.fastq.gz.md5 - │   └── batch3 - │   ├── sample3.fastq.gz - │   └── sample3.fastq.gz.md5 - └── .snakemake - └── snakemake - - -Prepare the copy on the temporary destination -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Imagine now that the raw data is already safely archived in SODAR. We don't want to save these files in duplicate, so we decide ito _squash_ the raw data files so that their size is set to 0, and their md5 checksum is added. We also do the same for the publicly downloadable file ``file.public``. We also want to ignore the junk in ``ignored_dir``, and to compress the ``.snakemake`` directory. So we have the following rules: - - -.. 
code-block: yaml - - ignore: - - ignored_dir - - compress: - - "^(.*/)?\\.snakemake$" - - squash: - - "^(.*/)?file\\.public$" - - "^(.*/)?raw_data/(.*/)?[^/]+\\.fastq\\.gz$" - - -After running the preparation command ``cubi-tk archive prepare --rules my_rules.yaml project temp_dest``, the temporary destination contains the following files:: - - temp_dest - ├── _hashdeep_report.txt - ├── extra_data - │   ├── file.public - │   ├── file.public.md5 - │   ├── to_ignored_dir -> ../ignored_dir - │   └── to_ignored_file -> ../ignored_dir/ignored_file - ├── pipeline - │   ├── output - │   │   ├── sample1 - │   │   │   └── results -> ../../work/sample1/results - │   │   └── sample2 -> ../work/sample2 - │   └── work - │   ├── sample1 - │   │   └── results -> /absolute_path/project/pipeline/work/sample1/results - │   └── sample2 - │   └── results -> /absolute_path/project/pipeline/work/sample2/results - ├── raw_data - │   ├── batch1 - │   │   ├── sample1.fastq.gz - │   │   └── sample1.fastq.gz.md5 -> /absolute_path/outside/batch1/sample1.fastq.gz.md5 - │   ├── batch2 - │   │   ├── sample2.fastq.gz - │   │   └── sample2.fastq.gz.md5 -> /absolute_path/outside/batch2/sample2.fastq.gz.md5 - │   └── batch3 - │   ├── sample3.fastq.gz - │   └── sample3.fastq.gz.md5 -> /absolute_path/project/raw_data/batch3/sample3.fastq.gz.md5 - ├── README.md - └── .snakemake.tar.gz - - -The inaccessible file ``project/extra_data/dangling_symlink`` & the contents of the ``project/ignored_dir`` are not present in the temporary destination, either because they are not accessible, or because they have been conscientiously ignored by the preparation step. - -The ``.snakemake`` directory is replaced by the the gzipped tar file ``.snakemake.tar.gz`` in the temporary destination. - -The ``file.public`` & the 3 ``*.fastq.gz`` files have been replaced by placeholder files of size 0. For ``file.public``, the md5 checksum has been computed by the preparing step, but for the ``*.fastq.gz`` files, the existing checksums are used. - -All other files are kept for archiving: symlinks for real files point to their target's absolute path, symlinks are absolute for paths outside of the project, and relative for paths inside the project. - -Finally, the hashdeep report of the original project directory is written to the temporary destination, and a ``README.md`` file is created. **At this point, we edit the ``README.md`` file to add a meaningful description of the project.** If a ``README.md`` file was already present in the orginial project directory, its content will be added to the newly created file. - -Note that the symlinks ``temp_dest/extra_data/to_ignored_dir`` & ``temp_dest/extra_data/to_ignored_file`` are dangling, because the link themselves were not omitted, but their targets were. **This is the expected, but perhaps unwanted behaviour**: symlinks pointing to files or directories within compressed or ignored directories will be dangling in the temporary destination, as the original file exists, but is not part of the temporary destination. - - -Copy to the final destination -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -When the ``README.md`` editing is complete, the copy to the final destination on the warm file system can be done. It is matter of ``cubi-tk archive copy temp_dest final_dest``. - -The copy step writes in the final destination the hashdeep audit of the copy against the original project. This audit is expected to fail, because files & directories are ignored, compressed or squashed. 
With the option ``--keep-workdir-hashdeep``, the program also outputs the hashdeep report of the temporary destination, and the audit of the final copy against the temporary destination. Both the report and the audit are also stored in the final copy directory. The audit of the copy against the temporary destination should be successful, as the copy doesn't re-process files; it only follows symlinks. - -If all steps have been completed successfully (including checking the ``README.md`` for validity), then a marker file named ``archive_copy_complete`` is created. The final step is to remove write permissions if the ``--read-only`` option was selected. - - ----------------------------- -Additional notes and caveats ----------------------------- - -- Generally, the module doesn't like circular symlinks. It is wise to fix them before any operation, or use the rules facility to ignore them during preparation. The ``--dont-follow-links`` option in the summary step protects against such problems, at the expense of missing some files in the report. -- The module is untested for symlink corner cases (for example, where a symlink points to a symlink outside of the project, which in turn points to another file in the project). -- In the archive, relative symlinks within the project are resolved. For example, in the original project one might have ``variants.vcf -> ../work/variants.vcf -> variants.somatic.vcf``. In the archive, the link will be ``variants.vcf -> ../work/variants.somatic.vcf``. - ----------------- -More Information ----------------- - -Also see ``cubi-tk archive --help``, ``cubi-tk archive summary --help``, ``cubi-tk archive prepare --help`` & ``cubi-tk archive copy --help`` for more information. diff --git a/docs_manual/usecase_archive_project.rst b/docs_manual/usecase_archive_project.rst deleted file mode 100644 index f5aa8508..00000000 --- a/docs_manual/usecase_archive_project.rst +++ /dev/null @@ -1,265 +0,0 @@ -.. _usecase_archive: - -============================= -Use Case: Archiving a project -============================= - -This section describes the process of archiving a project using ``cubi-tk``. -It provides an example of how ``cubi-tk`` can be used for this common task. - --------- -Overview --------- - -The general process to archive projects is: - -1. Get acquainted with the contents of the project's directory. - The command ``cubi-tk archive summary`` provides a basic facility to identify several important aspects for the archival process. - It does not, however, check whether files are already stored on SODAR. This must be done independently. -2. Archives **must** be accompanied by a ``README.md`` file, which provides important contact information for the project's scientific P.I., - the e-mail address of the post-doc in charge, the individuals in CUBI that processed the data, and the person in charge of the archive. - URLs for SODAR & Gitlab are also important. - The command ``cubi-tk archive readme`` creates a valid README file that contains this information. -3. In many cases, not all files should be archived: there is no need to duplicate large sequencing files (fastq or bam) if they are already safely stored on SODAR. - Likewise, whole genome sequences, annotations & indices should not be archived in most cases. - The command ``cubi-tk archive prepare`` identifies files that must be copied, and those which shouldn't. - (It can do a bit more; see below.) -4.
Once these preparation steps have been carried out, the command ``cubi-tk archive copy`` performs the copy of the project to its final archive destination. - This command creates checksums for all files in the project, and in the archive copy. It provides an audit of the comparison between these two sets of checksums, - to ensure trhat the archival was successful. - -Each of these steps descibed above are discussed below, to give practical examples, and to suggest good practice. - -------- -Summary -------- - -The summarisation step aims to report several cases of files that may require attention for archiving. -In particular, symbolic links to destinations outside of the project's directory should be reported. -Dangling symbolic links (either because the target is missing, or because of permissions) are also listed. - -The module also lists specific files of interest. By default, large bam or fastq files (larger than 256MB) -are reported, as well as large fasta files, annotations (with ``.gtf`` or ``.gff`` extensions), and -short-read-archive sequencing data. - -It is possible for the user to change the reporting criteria, using a ``yaml`` file & the ``--classes`` option. -For example: - - .. code-block:: bash - - $ cubi-tk archive summary \ - --classes reporting_classes.yaml \ # Use your own reporting selection - \ - - -The default summary classes can be found in ``/cubi_tk/archive/classes.yaml``. -Its content reads: - - .. code-block:: yaml - - fastq: - min_size: 268435456 - pattern: "^(.*/)?[^/]+(\\.f(ast)?q(\\.gz)?)$" - bam: - min_size: 268435456 - pattern: "^(.*/)?[^/]+(\\.bam(\\.bai)?)$" - public: - min_size: 268435456 - pattern: "^(.*/)?(SRR[0-9]+[^/]*|[^/]+\\.(fa(sta)?|gtf|gff[23]?)(\\.gz)?)$" - -The output of the summarization is a table, with the reason why the file is reported in the first column, -the file name, the symlink target if the file is a symlink, the file's normalised path, its size, -and, in case of symlinks, if the target is accessible, and if it is inside the project or not. - - --------------------- -Readme file creation --------------------- - -The module creates README files that **must** contain contact information to - -- The project's scientific P.I. (Name & email address), -- The contact to the person in charge of the project, very often a post-doc in the P.I.'s group (name & e-mail address), -- The contact to the person who is archiving the project (name & e-mail address). This person will be the project's contact in CUBI. -- The name of the person who actually did the data processing & analysis in CUBI. - It is generally the same person who is archiving the project, unless he or she has left CUBI. - -The SODAR & Gitlab's URLs should also be present in the README file, when applicable. -But this information is not mandatory, unlike the contact information. - -**Important notes** - -The creation of the README file is a frequent source of errors and frustrations. -To minimize the inconveniences, please heed these wise words. - -- E-mail addresses must be present, valid & cannot contain uppercase letters (don't ask why...) -- Generally, the module is quite fussy about the format. Spaces, justification, ... may be important. -- Upon README creation, the project directory is quickly scanned to generate an overview of the - project's size and number of inodes. For large projects, it is possible to disable this behaviour - using the ``--skip-collect`` option. -- Because of these problems, the module offers a possibility to check README file validity. 
The command is - `cubi-tk archive readme --is-valid project_dir readme_file`. -- If a README file is already present in the project, it will be appended at the bottom of the - README file generated by the module. - -Most importantly, please edit your README file after generation by the module. The module generates -no description of the aims & results of the project, even though it is very useful and important to have. - - ------------------------ -Preparation of the copy ------------------------ - -During preparation, the user can select the files that will be archived, those that will be discarded, -and those that must be processed differently. - -The file selection is achieved by creating a temporary copy of the project's directory structure, -using symbolic links. The location of this temporary copy is called *temporary destination*. - -When copying a file to this temporary destination, its fate is decided based on its filename & path, -using regular expression pattern matching. There are 4 types of operations: - -- The files are selected for copy. This is the default behaviour. -- Files can be omitted (or *ignored*) from the copy. -- Directories with many (smallish) files can be tarred & compressed to reduce the total number of inodes (which is very file-system friendly). -- Finally, files can be *squashed*. In this case, a file will have its md5 checksum computed and seved in a companion files next to it, and - the file will finally be replaced with a placeholder with the same name, but with a size of 0. - This is useful for large files that can easily be downloaded again from the internet. - Public sequencing datasets, genome sequences & annotations are typical examples. - -The user can impose its own rules, based on the content of the project. -The selection rules are defined in a yaml file accessed through the module's ``--rules`` option. -The default rules file is in ``/cubi_tk/archive/default_rules.yaml``, -and its content reads: - - .. code-block:: yaml - - ignore: # Patterns for files or directories to skip - - "^(.*/)?core\\.[0-9]+$" # Ignore core dumps - - "^(.*/)?\\.venv$" # Ignore virtual environment .venv directories - - compress: # Patterns for files or directories to tar-gzip - - "^(.*/)?\\.snakemake$" # Created by snakemake process - - "^(.*/)?sge_log$" # Snappy SGE log directories - - "^(.*/)?\\.git$" # Git internals - - "^(.*/)?snappy-pipeline$" # Copy of snappy - - "^(.*/)?cubi_wrappers$" # Copy of snappy's ancestor - - squash: [] # Patterns for files to squash (compute MD5 checksum, and replace by zero-length placeholder) - - -**Important notes** - -- The temporary destination is typically chosen as ``/fast/scratch/users//Archive/``. -- The README file generated in the previous step is copied to the temporary destination using the module's ``--readme`` option. -- When the temporary destination is complete, the module creates a complete list of all files accessible from the original project directory, - and computes md5 & sh256 checksums, using ``hashdeep``. - This is done **for all files accessible from the project's directory**, including all symbolic links. -- The computation of checksums can be extremely time-consuming. Multiple threads can be used with the ``--num-threads`` option. - Nevertheless, in most cases, it is advisable to submit the preparation as a slurm job, rather than interactively. - - -Example of usage: - -.. 
code-block:: bash - - $ cubi-tk archive prepare \ - --rules \ # Project-specific rules - --readme \ # README.md file generated in the previous step - --ignore-tar-errors \ # Useful only in cases of inaccessible files to compress - \ - - - -------------------------- -Copy to final destination -------------------------- - -The last step consist in copying all files in the temporary destination to the archiving location. -This is done internally using ``rsync``, having previously removed all symbolic links connecting files wihtin the project directory. -These *local* symbolic links are restored after the copy is complete, in both the temporary & final destinations. -After the copy is complete, the archiving directory can be protected against writing with the ``--read-only`` option. - -A verification based on md5 checksums is automatically done between the original project directory and the final copy. -In most cases, differences between the directories are expected, because of the files ignored, compressed and squashed. -However, it is good practice to examine the audit file to make sure that all files missing from the copy are missing for the right reasons. -The report of checksums of all files in the original project, and the audit result are both present in the final destination, -as files called ``_hashdeep_report.txt`` and ``_hashdeep_audit.txt`` respectively. - -For additional verification, it is also possible to request (using the ``--keep-workdir-hashdeep`` option) a hashdeep report of the -temporary destination, and the corresponding audit of the final copy. These contents of these two directories -are expected to be identical, and any discrepancy should be looked at carefully. -The report & audit files relative to the temporary destination are called ``_workdir_report.txt`` & ``_workdir_audit.txt``. - -Finally, the copy and hasdeep steps are quite time-consuming, and it is good practice to submit the copy as a slurm job -rather than interactively, even when multiple threads are used (through the ``--num-threads`` option). - -An example of a copy script that can be submitted to slurm is: - -.. code-block:: bash - - #!/bin/bash - - #SBATCH --job-name=copy - #SBATCH --output=slurm_log/copy.%j.out - #SBATCH --error=slurm_log/copy.%j.err - #SBATCH --partition=medium - #SBATCH --mem=4000 - #SBATCH --time=72:00:00 - #SBATCH --ntasks=1 - #SBATCH --cpus-per-task=8 - - # ------------------ Command-line options ----------------------------- - - # Taken from https://stackoverflow.com/questions/402377/using-getopts-to-process-long-and-short-command-line-options - TEMP=$(getopt -o ts:d: --long dryrun,source:,destination: -- "$@") - - if [ $? != 0 ] ; then echo "Terminating..." >&2 ; exit 1 ; fi - - # Note the quotes around '$TEMP': they are essential! - eval set -- "$TEMP" - - dryrun=0 - src="" - dest="" - while true; do - case "$1" in - -t | --dryrun ) dryrun=1; shift ;; - -s | --source ) src="$2"; shift 2 ;; - -d | --destination ) dest="$2"; shift 2 ;; - -- ) shift; break ;; - * ) break ;; - esac - done - - if [[ "X$src" == "X" ]] ; then echo "No project directory defined" >&2 ; exit 1 ; fi - if [[ ! 
-d "$src" ]] ; then echo "Can't find project directory $src" >&2 ; exit 1 ; fi - if [[ "X$dest" == "X" ]] ; then echo "No temporary directory defined" >&2 ; exit 1 ; fi - if [[ -e "$dest" ]] ; then echo "Temporary directory $dest already exists" >&2 ; exit 1 ; fi - - if [[ dryrun -eq 1 ]] ; then - echo "cubi-tk archive copy " - echo "--read-only --keep-workdir-hashdeep --num-threads 8 " - echo "\"$src\" \"$dest\"" - exit 0 - fi - - # ---------------------- Subtmit to slurm ----------------------------- - - export LC_ALL=en_US - unset DRMAA_LIBRARY_PATH - - test -z "${SLURM_JOB_ID}" && SLURM_JOB_ID=$(date +%Y-%m-%d_%H-%M) - mkdir -p slurm_log/${SLURM_JOB_ID} - - CONDA_PATH=$HOME/work/miniconda3 - set +euo pipefail - conda deactivate &>/dev/null || true # disable any existing - source $CONDA_PATH/etc/profile.d/conda.sh - conda activate cubi_tk # enable found - set -euo pipefail - - cubi-tk archive copy \ - --read-only --keep-workdir-hashdeep --num-threads 8 \ - "$src" "$dest" - diff --git a/environment.yaml b/environment.yaml index 90dc0426..836630a7 100644 --- a/environment.yaml +++ b/environment.yaml @@ -9,7 +9,6 @@ dependencies: - python ~=3.12 - pip - uv >=0.5 - - hashdeep - pysam >=0.22 - vcfpy >=0.13.8 - gcc_linux-64 >=13,<14 diff --git a/src/cubi_tk/__main__.py b/src/cubi_tk/__main__.py index 3f33f060..3582c230 100644 --- a/src/cubi_tk/__main__.py +++ b/src/cubi_tk/__main__.py @@ -12,8 +12,6 @@ from cubi_tk import __version__ -from .archive import run as run_archive -from .archive import setup_argparse as setup_argparse_archive from .common import run_nocmd from .irods import run as run_irods from .irods import setup_argparse as setup_argparse_irods @@ -83,7 +81,6 @@ def setup_argparse(): setup_argparse_sea_snap( subparsers.add_parser("sea-snap", help="Tools for supporting the RNA-SeASnaP pipeline.") ) - setup_argparse_archive(subparsers.add_parser("archive", help="helper for archiving projects.")) return parser, subparsers @@ -118,7 +115,6 @@ def main(argv=None): "sodar": run_sodar, "irods": run_irods, "org-raw": run_org_raw, - "archive": run_archive, } res = cmds[args.cmd](args, parser, subparsers.choices[args.cmd] if args.cmd else None) diff --git a/src/cubi_tk/archive/__init__.py b/src/cubi_tk/archive/__init__.py deleted file mode 100644 index b4fc1743..00000000 --- a/src/cubi_tk/archive/__init__.py +++ /dev/null @@ -1,53 +0,0 @@ -"""``cubi-tk archive``: tools for archive projects (to the CEPH system, for example) - -Available Commands ------------------- - -``summary`` - Lists files that might be problematic for archival (symlinks & large files) -``prepare`` - prepare archive: checks presence of README, compress .snakemake & others -``readme`` - prepare README.md file: creates a valid README.md file with necessary contacts & URLs -``copy`` - perform archival: copies the prepared output to its final destination, with hashdeep audit - -More Information ----------------- - -- Also see ``cubi-tk archive`` :ref:`cli_main ` and ``cubi-tk archive --help`` for more information. 
- -""" - -import argparse - -from ..common import run_nocmd -from .copy import setup_argparse as setup_argparse_copy -from .prepare import setup_argparse as setup_argparse_prepare -from .readme import setup_argparse as setup_argparse_readme -from .summary import setup_argparse as setup_argparse_summary - - -def setup_argparse(parser: argparse.ArgumentParser) -> None: - """Main entry point for archive command.""" - subparsers = parser.add_subparsers(dest="archive_cmd") - - setup_argparse_copy(subparsers.add_parser("copy", help="Perform archival (copy and audit)")) - setup_argparse_prepare( - subparsers.add_parser("prepare", help="Prepare the project directory for archival") - ) - setup_argparse_readme(subparsers.add_parser("readme", help="Prepare a valid README.md")) - setup_argparse_summary( - subparsers.add_parser( - "summary", - help="Collects a summary of files in the project directory. The summary can be saved to a file for further inspection", - ) - ) - - -def run(args, parser, subparser): - """Main entry point for archive command.""" - if not args.archive_cmd: # pragma: nocover - return run_nocmd(args, parser, subparser) - else: - return args.archive_cmd(args, parser, subparser) diff --git a/src/cubi_tk/archive/classes.yaml b/src/cubi_tk/archive/classes.yaml deleted file mode 100644 index 7f390c00..00000000 --- a/src/cubi_tk/archive/classes.yaml +++ /dev/null @@ -1,10 +0,0 @@ -fastq: - min_size: 268435456 - pattern: "^(.*/)?[^/]+(\\.f(ast)?q(\\.gz)?)$" -bam: - min_size: 268435456 - pattern: "^(.*/)?[^/]+(\\.bam(\\.bai)?)$" -public: - min_size: 268435456 - pattern: "^(.*/)?(SRR[0-9]+[^/]*|[^/]+\\.(fa(sta)?|gtf|gff[23]?)(\\.gz)?)$" - diff --git a/src/cubi_tk/archive/common.py b/src/cubi_tk/archive/common.py deleted file mode 100644 index 19ae0707..00000000 --- a/src/cubi_tk/archive/common.py +++ /dev/null @@ -1,157 +0,0 @@ -"""``cubi-tk archive``: common features""" - -import argparse -import json -import os -from pathlib import Path -import subprocess -import sys -import typing - -import attr - - -@attr.s(frozen=True, auto_attribs=True) -class Config: - """Configuration for common archive subcommands.""" - - verbose: bool - config: str - sodar_url: str - sodar_api_token: str = attr.ib(repr=lambda value: "***") # type: ignore - project: str - - -@attr.s(frozen=True, auto_attribs=True) -class FileAttributes: - """Attributes for files & symlinks""" - - relative_path: str - resolved: Path - symlink: bool - dangling: bool - outside: bool - target: str - size: int - - -class ArchiveCommandBase: - """Implementation of archive subcommands.""" - - command_name = "" - - def __init__(self, config: Config): - self.config = config - self.project = None - - @classmethod - def setup_argparse(cls, parser: argparse.ArgumentParser) -> None: - """Setup argument parser.""" - parser.add_argument( - "--hidden-cmd", dest="archive_cmd", default=cls.run, help=argparse.SUPPRESS - ) - - parser.add_argument("project", help="Path of project directory") - - @classmethod - def run( - cls, args, _parser: argparse.ArgumentParser, _subparser: argparse.ArgumentParser - ) -> typing.Optional[int]: - """Entry point into the command.""" - raise NotImplementedError("Must be implemented in derived classes") - - def check_args(self, args): - """Called for checking arguments, override to change behaviour.""" - raise NotImplementedError("Must be implemented in derived classes") - - def execute(self) -> typing.Optional[int]: - raise NotImplementedError("Must be implemented in derived classes") - - -def 
get_file_attributes(filename, relative_to): - """Returns attributes of the file named `filename`. - - The attributes are: - - relative_path: the file path relative to directory `relative_to` - - resolved: the resolved path (i.e. normalised absolute path to the file) - - symlink: True if the file is a symlink, False otherwise - - dangling: True if the symlink's target cannot be read (missing or permissions), - False the filename is not a symlink, or if the target can be read - - outside: True if the file is not in the `relative_to` directory - - target: the symlink target, or None if filename isn't a symlink - - size: the size of the file, or of its target if the file is a symlink. - If the file is a dangling symlink, the size is set to 0 - """ - resolved = Path(filename).resolve(strict=False) - symlink = os.path.islink(filename) - if symlink: - target = os.readlink(filename) - try: - dangling = not resolved.exists() - except PermissionError: - dangling = None - if dangling is None or dangling: - size = 0 - else: - size = resolved.stat().st_size - else: - dangling = False - outside = False - target = None - size = resolved.stat().st_size - outside = os.path.relpath(resolved, start=relative_to).startswith("../") - return FileAttributes( - relative_path=os.path.relpath(filename, start=relative_to), - resolved=resolved, - symlink=symlink, - dangling=dangling, - outside=outside, - target=target, - size=size, - ) - - -def traverse_project_files(directory, followlinks=True): - root = Path(directory).resolve(strict=True) - for path, _, files in os.walk(root, followlinks=followlinks): - for filename in files: - yield get_file_attributes(os.path.join(path, filename), root) - - -def load_variables(template_dir): - """ - :param template_dir: Path to cookiecutter directory. - :type template_dir: str - - :return: Returns load variables found in the cokiecutter template directory. - """ - config_path = os.path.join(template_dir, "cookiecutter.json") - with open(config_path, "rt", encoding="utf8") as inputf: - result = json.load(inputf) - return result - - -def run_hashdeep(directory, out_file=None, num_threads=4, ref_file=None): - """Run hashdeep recursively on directory, following symlinks, stores the result in out_file. - Hashdeep can be run in normal or audit mode, when ref_file is provided.""" - # Output of out_file of stdout - if out_file: - f = open(out_file, "wt") - else: - f = sys.stdout - # hashdeep command for x or for audit - cmd = ["hashdeep", "-j", str(num_threads), "-l", "-r"] - if ref_file: - cmd += ["-vvv", "-a", "-k", ref_file, "."] - else: - cmd += ["-o", "fl", "."] - # Run hashdeep from the directory, storing the output in f - p = subprocess.Popen(cmd, cwd=directory, encoding="utf-8", stdout=f, stderr=None) - p.wait() - # Return hashdeep return value - return p.returncode - - -def setup_argparse(parser: argparse.ArgumentParser) -> None: - """Setup argument parser for ``cubi-tk archive``.""" - return ArchiveCommandBase.setup_argparse(parser) diff --git a/src/cubi_tk/archive/copy.py b/src/cubi_tk/archive/copy.py deleted file mode 100644 index 6523101e..00000000 --- a/src/cubi_tk/archive/copy.py +++ /dev/null @@ -1,277 +0,0 @@ -"""``cubi-tk archive prepare``: Prepare a project for archival""" - -import argparse -import atexit -import datetime -import os -import re -import shutil -import subprocess -import tempfile -import typing - -import attr -from logzero import logger - -from . 
import common, readme -from ..common import execute_shell_commands -from ..exceptions import InvalidReadmeException, MissingFileException - - -@attr.s(frozen=True, auto_attribs=True) -class Config(common.Config): - """Configuration for prepare.""" - - skip: typing.List[str] - num_threads: int - check_work_dest: bool - read_only: bool - destination: str - - -HASHDEEP_REPORT_PATTERN = re.compile( - "^(([0-9]{4})-([0-9]{2})-([0-9]{2}))_hashdeep_(report|audit).txt$" -) - - -class ArchiveCopyCommand(common.ArchiveCommandBase): - """Implementation of archive copy command.""" - - command_name = "copy" - - def __init__(self, config: Config): - super().__init__(config) - self.project_dir = None - self.dest_dir = None - - @classmethod - def setup_argparse(cls, parser: argparse.ArgumentParser) -> None: - """Setup argument parser.""" - super().setup_argparse(parser) - - parser.add_argument("--num-threads", type=int, default=4, help="Number of parallel threads") - parser.add_argument( - "--skip", type=str, nargs="*", help="Step to skip (hashdeep, rsync, audit)" - ) - parser.add_argument( - "--keep-workdir-hashdeep", - default=False, - action="store_true", - help="Save hashdeep report & audit of the temporary destination", - ) - parser.add_argument( - "--read-only", - default=False, - action="store_true", - help="Change destination files to read-only", - ) - parser.add_argument( - "destination", help="Final destination directory for archive, must not exist" - ) - - @classmethod - def run( - cls, args, _parser: argparse.ArgumentParser, _subparser: argparse.ArgumentParser - ) -> typing.Optional[int]: - """Entry point into the command.""" - return cls(args).execute() - - def check_args(self, args): - """Called for checking arguments, override to change behaviour.""" - res = 0 - - if os.path.exists(self.config.destination): - logger.error("Destination directory {} already exists".format(self.config.destination)) - res = 1 - - return res - - def execute(self) -> typing.Optional[int]: - """Copies the contents of the input directory to the output path, following symlinks. - The accuracy of the copy is verified by running hashdeep on the original files, and - (in audit mode) on the copy. - - The copy module is meant to be executed after `cubi-tk archive prepare`. The prepare - steps creates a temporary directory, with symlinks pointing to absolute paths of the - files that must be copied. The copy is done using the `rsync` command, in a mode - which follows symlinks. - - After the preparation step, relative symlinks pointing inside the project are retained. - Those should not be copied by `rsync`, to avoid duplication of potentially large files. - Therefore, the symlinks are deleted from the temporary directory before copy, and - re-created after the copy is finished in both the original temporary directory and the - final archive copy. 
- """ - res = self.check_args(self.config) - if res: # pragma: nocover - return res - - logger.info("Starting cubi-tk archive copy") - logger.info(" args: %s", self.config) - - self.project_dir = os.path.realpath(self.config.project) - self.dest_dir = os.path.realpath(self.config.destination) - - # Find relative symlinks that point inside the project directory - rel_symlinks = [] - rel_symlinks = self._find_relative_symlinks(self.project_dir, rel_symlinks) - logger.info("Set {} relative symlinks aside".format(len(rel_symlinks))) - - # Make sure to restore relative symlinks - atexit.register( - self._restore_relative_symlinks, root=self.project_dir, rel_symlinks=rel_symlinks - ) - - tmpdir = tempfile.TemporaryDirectory() - - status = 0 - try: - if not readme.is_readme_valid(os.path.join(self.project_dir, "README.md")): - raise InvalidReadmeException("README.md file missing or invalid") - if not self.config.skip or "check_work" not in self.config.skip: - work_report = os.path.join( - tmpdir.name, datetime.date.today().strftime("%Y-%m-%d_workdir_report.txt") - ) - logger.info( - "Preparing hashdeep report of {} to {}".format(self.project_dir, work_report) - ) - self._hashdeep_report(self.project_dir, work_report) - - if not self.config.skip or "rsync" not in self.config.skip: - # Remove relative symlinks that point within the project to avoid file copy duplication - self._remove_relative_symlinks(rel_symlinks) - - self._rsync(self.project_dir, self.dest_dir) - - # Add relative symlinks to the copy - self._restore_relative_symlinks(self.dest_dir, rel_symlinks) - - if not self.config.skip or "check_work" not in self.config.skip: - work_audit = os.path.join( - tmpdir.name, datetime.date.today().strftime("%Y-%m-%d_workdir_audit.txt") - ) - self._hashdeep_audit(self.dest_dir, work_report, work_audit) - - if not self.config.skip or "audit" not in self.config.skip: - report = self._find_hashdeep_report(self.project_dir) - audit = os.path.join( - tmpdir.name, datetime.date.today().strftime("%Y-%m-%d_hashdeep_audit.txt") - ) - self._hashdeep_audit(self.dest_dir, report, audit) - shutil.move(audit, os.path.join(self.dest_dir, os.path.basename(audit))) - - if res != 0 or self.config.keep_workdir_hashdeep: - shutil.move(work_report, os.path.join(self.dest_dir, os.path.basename(work_report))) - shutil.move(work_audit, os.path.join(self.dest_dir, os.path.basename(work_audit))) - - if readme.is_readme_valid(os.path.join(self.dest_dir, "README.md")): - open(os.path.join(self.dest_dir, "archive_copy_complete"), "w").close() - if self.config.read_only: - execute_shell_commands([["chmod", "-R", "ogu-w", self.dest_dir]]) - else: - raise MissingFileException("Missing or illegal README.md file") - except Exception as e: - status = 1 - logger.error(e) - - return status - - def _rsync(self, origin, destination): - # rsync -a without copy symlinks as symlinks, devices & special files - logger.info("Copy files from {} to {}".format(origin, destination)) - cmd = ["rsync", "-rptgo", "--copy-links", origin + "/", destination] - subprocess.run(cmd, check=True) - - def _find_hashdeep_report(self, directory): - ref_files = list( - filter( - lambda x: HASHDEEP_REPORT_PATTERN.match(x) - and HASHDEEP_REPORT_PATTERN.match(x).group(5) == "report", - os.listdir(self.project_dir), - ) - ) - if not ref_files: - raise MissingFileException("Cannot find hashdeep report to perform audit") - ref_files.sort(reverse=True) - return ref_files[0] - - def _hashdeep_report(self, directory, report): - """Runs hashdeep in report mode, raising an 
exception in case of error""" - res = common.run_hashdeep( - directory=directory, out_file=report, num_threads=self.config.num_threads - ) - if res != 0: - raise subprocess.SubprocessError("Hashdeep report failed") - - def _hashdeep_audit(self, directory, report, audit): - """Runs hashdeep in audit mode. Missing or added files are ignored""" - logger.info("Audit of {} from {}".format(directory, report)) - res = common.run_hashdeep( - directory=directory, - out_file=audit, - num_threads=self.config.num_threads, - ref_file=report, - ) - if res == 0: - logger.info("Audit passed without errors") - elif res == 1 or res == 2: - logger.warning( - "Audit found missing or changed files, check {} for errors".format( - os.path.basename(audit) - ) - ) - else: - raise subprocess.SubprocessError("Audit failed, check {} for errors".format(audit)) - - def _find_relative_symlinks(self, path, rel_symlinks): - """Recursively traverse a directory (path) to find all relative symbolic links. - The relative symlinks (symlink name & relative target) are stored in a list. - """ - if os.path.islink(path): - if not os.path.isabs(os.readlink(path)): - relative = os.path.relpath(path, start=self.project_dir) - target = os.readlink(path) - rel_symlinks.append((relative, target)) - return rel_symlinks - - if os.path.isdir(path): - for child in os.listdir(path): - rel_symlinks = self._find_relative_symlinks(os.path.join(path, child), rel_symlinks) - return rel_symlinks - - def _remove_relative_symlinks(self, rel_symlinks): - """Remove relative symlinks from the original directory""" - for relative, _ in rel_symlinks: - os.remove(os.path.join(self.project_dir, relative)) - - def _restore_relative_symlinks(self, root, rel_symlinks, add_dangling=True): - """Relative symlinks from list are added to the destination directory. - - root: str - path to the root directory from which the symlinks are created - rel_symlinks: List[Tuple[str, str]] - list of symlinks. - The first tuple element is the symlink path, relative to root. - The second tuple element is the symlink target. - add_dangling: bool - controls if dangling symlinks must be created or not. - It can happen that some relative symlinks point to missing file, - for example if it belonged within a directory which has been squashed or - ignored during the preparation step. - The symlink should be there, even though the target file is not accessible - in the archived copy. 
- """ - for relative, target in rel_symlinks: - symlink_path = os.path.join(root, relative) - if os.path.exists(symlink_path) or os.path.islink(symlink_path): - continue - symlink_dir = os.path.dirname(symlink_path) - if add_dangling or os.path.exists(os.path.join(symlink_dir, target)): - os.makedirs(symlink_dir, mode=488, exist_ok=True) # 488 is 750 in octal - os.symlink(target, symlink_path) - logger.info("Restored relative symlinks in {}".format(root)) - - -def setup_argparse(parser: argparse.ArgumentParser) -> None: - """Setup argument parser for ``cubi-tk archive copy``.""" - return ArchiveCopyCommand.setup_argparse(parser) diff --git a/src/cubi_tk/archive/default_rules.yaml b/src/cubi_tk/archive/default_rules.yaml deleted file mode 100644 index 652bac0d..00000000 --- a/src/cubi_tk/archive/default_rules.yaml +++ /dev/null @@ -1,12 +0,0 @@ -ignore: # Patterns for files or directories to skip - - "^(.*/)?core\\.[0-9]+$" - - "^(.*/)?\\.venv$" - -compress: # Patterns for files or directories to tar-gzip - - "^(.*/)?\\.snakemake$" - - "^(.*/)?sge_log$" - - "^(.*/)?\\.git$" - - "^(.*/)?snappy-pipeline$" - - "^(.*/)?cubi_wrappers$" - -squash: [] # Patterns for files to squash (compute MD5 checksum, and replace by zero-length placeholder) diff --git a/src/cubi_tk/archive/prepare.py b/src/cubi_tk/archive/prepare.py deleted file mode 100644 index a10d0e0e..00000000 --- a/src/cubi_tk/archive/prepare.py +++ /dev/null @@ -1,286 +0,0 @@ -"""``cubi-tk archive prepare``: Prepare a project for archival""" - -import argparse -import datetime -import os -import re -import sys -import time -import typing - -import attr -from logzero import logger -import yaml - -from . import common -from ..common import compute_md5_checksum, execute_shell_commands -from .readme import is_readme_valid - -MSG = "**Contents of original `README.md` file**" - - -@attr.s(frozen=True, auto_attribs=True) -class Config(common.Config): - """Configuration for prepare.""" - - rules: typing.Dict[ - str, typing.Any - ] # The regular expression string read from the yaml file in compiled into a re.Pattern - skip: bool - num_threads: int - no_readme: bool - destination: str - - -class ArchivePrepareCommand(common.ArchiveCommandBase): - """Implementation of archive prepare command.""" - - command_name = "prepare" - - def __init__(self, config: Config): - super().__init__(config) - self.project_dir = None - self.dest_dir = None - - self.start = time.time() - self.inode = 0 - - @classmethod - def setup_argparse(cls, parser: argparse.ArgumentParser) -> None: - """Setup argument parser.""" - super().setup_argparse(parser) - - parser.add_argument("--num-threads", type=int, default=4, help="Number of parallel threads") - parser.add_argument( - "--rules", "-r", default=os.path.join(os.path.dirname(__file__), "default_rules.yaml") - ) - parser.add_argument("--readme", help="Path to README.md created with cubi-tk") - parser.add_argument( - "--ignore-tar-errors", - action="store_true", - help="Ignore errors due to access permissions in when compressind folders", - ) - - parser.add_argument( - "destination", help="Destination directory (for symlinks and later archival)" - ) - - @classmethod - def run( - cls, args, _parser: argparse.ArgumentParser, _subparser: argparse.ArgumentParser - ) -> typing.Optional[int]: - """Entry point into the command.""" - return cls(args).execute() - - def check_args(self, args): - """Called for checking arguments, override to change behaviour.""" - res = 0 - - if os.path.exists(self.config.destination): - 
logger.error("Destination directory {} already exists".format(self.config.destination)) - res = 1 - - return res - - def execute(self) -> typing.Optional[int]: - """Execute the upload to sodar.""" - res = self.check_args(self.config) - if res: # pragma: nocover - return res - - logger.info("Starting cubi-tk archive prepare") - logger.info(" args: %s", self.config) - - # Remove all symlinks to absolute paths - self.project_dir = os.path.realpath(self.config.project) - self.dest_dir = os.path.realpath(self.config.destination) - - os.makedirs(self.dest_dir, mode=488, exist_ok=False) - - rules = self._get_rules(self.config.rules) - - # Recursively traverse the project and create archived files & links - self._archive_path(self.project_dir, rules) - - sys.stdout.write(" " * 80 + "\r") - sys.stdout.flush() - - # Copy README.md - if self.config.readme: - ArchivePrepareCommand._copy_readme( - os.path.realpath(self.config.readme), os.path.join(self.dest_dir, "README.md") - ) - else: - logger.warning("No READ.md file supplied, it may cause problems during copy") - - # Run hashdeep on original project directory - logger.info("Preparing the hashdeep report of {}".format(self.project_dir)) - res = common.run_hashdeep( - directory=self.project_dir, - out_file=os.path.join( - self.dest_dir, datetime.date.today().strftime("%Y-%m-%d_hashdeep_report.txt") - ), - num_threads=self.config.num_threads, - ) - if res: - logger.error("hashdeep command has failed with return code {}".format(res)) - return res - - return 0 - - def _archive_path(self, path, rules): - """Recursively archive files in the path, according to the rules""" - self._progress() - - # Dangling link - if not os.path.exists(path): - logger.warning("File or directory cannot be read, not archived : '{}'".format(path)) - return - - # Check how the path should be processed by regular expression matching - status = "archive" - for rule, patterns in rules.items(): - for pattern in patterns: - if pattern.match(path): - status = rule - - if status == "ignore": - return - if status == "compress": - self._compress(path) - return - if status == "squash": - self._squash(path) - return - assert status == "archive" - - # Archive files - if not os.path.isdir(path): - self._archive(path) - else: - # Process only true directories (not symlinks) or symlinks pointing outside of project - if not os.path.islink(path) or self._is_outside( - os.path.realpath(path), self.project_dir - ): - for child in os.listdir(path): - self._archive_path(os.path.join(path, child), rules) - else: - self._archive(path) - - def _progress(self): - self.inode += 1 - if self.inode % 1000 == 0: - delta = int(time.time() - self.start) - sys.stdout.write( - "\rElapsed time: %02d:%02d:%02d, number of files processed: %d, rate: %.1f [files/sec]\r" - % ( - delta // 3600, - (delta % 3600) // 60, - delta % 60, - self.inode, - self.inode / delta if delta > 0 else 0, - ) - ) - sys.stdout.flush() - - def _compress(self, path): - if os.path.exists(path + ".tar.gz"): - raise ValueError( - "File or directory cannot be compressed, compressed file already exists : '{}'".format( - path - ) - ) - - relative = os.path.relpath(path, start=self.project_dir) - destination = os.path.join(self.dest_dir, relative) - - os.makedirs(os.path.dirname(destination), mode=488, exist_ok=True) - cmd = [ - "tar", - "-zcvf", - destination + ".tar.gz", - "--transform=s/^{}/{}/".format(os.path.basename(path), os.path.basename(destination)), - "-C", - os.path.dirname(path), - os.path.basename(path), - ] - if 
self.config.ignore_tar_errors: - cmd.insert(len(cmd) - 1, "--ignore-failed-read") - execute_shell_commands([cmd], verbose=self.config.verbose) - - def _squash(self, path): - if os.path.isdir(path): - raise ValueError("Path is a directory and cannot be squashed : '{}'".format(path)) - - relative = os.path.relpath(path, start=self.project_dir) - destination = os.path.join(self.dest_dir, relative) - - # Create empty placeholder - os.makedirs(os.path.dirname(destination), mode=488, exist_ok=True) - open(destination, "w").close() - - # Create checksum if missing - if not os.path.exists(path + ".md5"): - md5 = compute_md5_checksum(os.path.realpath(path), verbose=self.config.verbose) - with open(destination + ".md5", "w") as f: - f.write(md5 + " " + os.path.basename(destination)) - - def _archive(self, path): - relative = os.path.relpath(path, start=self.project_dir) - destination = os.path.join(self.dest_dir, relative) - - os.makedirs(os.path.dirname(destination), mode=488, exist_ok=True) - if os.path.islink(path): - target = os.path.realpath(path) - relative_link = os.path.relpath(target, start=self.project_dir) - if relative_link.startswith("../") or relative_link.startswith("/"): - os.symlink(target, destination) - else: - os.symlink(os.readlink(path), destination) - else: - os.symlink(os.path.realpath(path), destination) - - @staticmethod - def _copy_readme(src, target): - logger.info("Using README file {}".format(src)) - os.makedirs(os.path.realpath(os.path.dirname(target)), mode=488, exist_ok=True) - with open(src, "rt") as f: - lines = [x.rstrip() for x in f.readlines()] - - if os.path.exists(target): - lines.extend(["", "", "-" * 80, "", "", MSG, "", "", "-" * 80, "", ""]) - with open(target, "rt") as f: - lines.extend([x.rstrip() for x in f.readlines()]) - os.remove(target) - - with open(os.path.realpath(target), "wt") as f: - f.write("\n".join(lines)) - - if not is_readme_valid(os.path.realpath(target), verbose=True): - logger.warning("Invalid README.md, it may cause problems upon copy") - - @staticmethod - def _get_rules(filename): - logger.info("Obtaining archive rules from {}".format(filename)) - with open(filename, "rt") as f: - rules = yaml.safe_load(f) - - for rule, patterns in rules.items(): - compiled = [] - for pattern in patterns: - compiled.append(re.compile(pattern)) - rules[rule] = compiled - - return rules - - @staticmethod - def _is_outside(path, directory): - path = os.path.realpath(path) - directory = os.path.realpath(directory) - relative = os.path.relpath(path, start=directory) - return relative.startswith("../") - - -def setup_argparse(parser: argparse.ArgumentParser) -> None: - """Setup argument parser for ``cubi-tk archive prepare``.""" - return ArchivePrepareCommand.setup_argparse(parser) diff --git a/src/cubi_tk/archive/readme.py b/src/cubi_tk/archive/readme.py deleted file mode 100644 index 2be38c93..00000000 --- a/src/cubi_tk/archive/readme.py +++ /dev/null @@ -1,317 +0,0 @@ -"""``cubi-tk archive prepare``: Prepare a project for archival""" - -import argparse -import errno -import os -import re -import shutil -import sys -import tempfile -import time -import typing - -import attr -from cookiecutter.main import cookiecutter -from cubi_isa_templates import IsaTabTemplate -from logzero import logger - -from . 
import common -from ..common import execute_shell_commands - -_TEMPLATE_DIR = os.path.join(os.path.dirname(__file__), "templates") - -TEMPLATE = IsaTabTemplate( - name="archive", - path=_TEMPLATE_DIR, - description="Prepare project for archival", - configuration=common.load_variables(template_dir=_TEMPLATE_DIR), -) - -DU = re.compile("^ *([0-9]+)[ \t]+[^ \t]+.*$") -DATE = re.compile("^(20[0-9][0-9]-[01][0-9]-[0-3][0-9])[_-].+$") - -MAIL = ( - "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*" - '|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]' - '|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")' - "@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?" - "|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}" - "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:" - "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]" - "|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)" - "\\])" -) - -PATTERNS = { - "project_name": re.compile("^ *- *Project name: *.+$"), - "date": re.compile("^ *- *Start date: *[0-9]{4}-[0-9]{2}-[0-9]{2}.*$"), - "status": re.compile("^ *- *Current status: *(Active|Inactive|Finished|Archived) *$"), - "PI": re.compile("^ *- P.I.: \\[([A-z '-]+)\\]\\(mailto:(" + MAIL + ")\\) *$"), - "client": re.compile("^ *- *Client contact: \\[([A-z '-]+)\\]\\(mailto:(" + MAIL + ")\\) *$"), - "archiver": re.compile("^ *- *CUBI contact: \\[([A-z '-]+)\\]\\(mailto:(" + MAIL + ")\\) *$"), - "CUBI": re.compile("^ *- *CUBI project leader: ([A-z '-]+) *$"), -} - -COMMANDS = { - "size": ["du", "--bytes", "--max-depth=0"], - "inodes": ["du", "--inodes", "--max-depth=0"], - "size_follow": ["du", "--dereference", "--bytes", "--max-depth=0"], - "inodes_follow": ["du", "--dereference", "--inodes", "--max-depth=0"], -} - - -@attr.s(frozen=True, auto_attribs=True) -class Config(common.Config): - """Configuration for prepare.""" - - filename: str - skip_collect: bool - is_valid: bool - no_input: bool - - -class ArchiveReadmeCommand(common.ArchiveCommandBase): - """Implementation of archive readme command.""" - - command_name = "readme" - - def __init__(self, config: Config): - super().__init__(config) - self.project_dir = None - self.readme_file = None - - self.start = time.time() - self.inode = 0 - - @classmethod - def setup_argparse(cls, parser: argparse.ArgumentParser) -> None: - """Setup argument parser.""" - super().setup_argparse(parser) - - parser.add_argument( - "--skip-collect", - "-s", - action="store_true", - help="Skip the collection of file size & inodes", - ) - parser.add_argument( - "--is-valid", "-t", action="store_true", help="Test validity of existing README file" - ) - # Enable pytest - parser.add_argument("--no-input", action="store_true", help=argparse.SUPPRESS) - add_readme_parameters(parser) - - parser.add_argument("filename", help="README.md path & filename") - - @classmethod - def run( - cls, args, _parser: argparse.ArgumentParser, _subparser: argparse.ArgumentParser - ) -> typing.Optional[int]: - """Entry point into the command.""" - return cls(args).execute() - - def check_args(self, args): - """Called for checking arguments, override to change behaviour.""" - res = 0 - - if not self.config.is_valid and os.path.exists(self.config.filename): - logger.error("Readme file {} already exists".format(self.config.filename)) - res = 1 - if self.config.is_valid and not os.path.exists(self.config.filename): - logger.error("Missing readme file {}, can't test validity".format(self.config.filename)) - res = 1 - - return res - - def execute(self) -> 
typing.Optional[int]: - """Execute the upload to sodar.""" - res = self.check_args(self.config) - if res: # pragma: nocover - return res - - logger.info("Starting cubi-tk archive readme") - logger.info(" args: %s", self.config) - - # Remove all symlinks to absolute paths - self.project_dir = os.path.realpath(self.config.project) - self.readme_file = os.path.realpath(self.config.filename) - - # Check existing README file validity if requested - if self.config.is_valid: - res = not is_readme_valid(self.readme_file, verbose=True) - if res == 0: - logger.info("README file is valid: {}".format(self.readme_file)) - return res - - logger.info("Preparing README.md") - extra_context = self._create_extra_context(self.project_dir) - - self.create_readme(self.readme_file, extra_context=extra_context) - - if not is_readme_valid(self.readme_file, verbose=True): - res = 1 - return res - - def create_readme(self, readme_file, extra_context=None): - try: - tmp = tempfile.mkdtemp() - - # Create the readme file in temp directory - cookiecutter( - template=TEMPLATE.path, - extra_context=extra_context, - output_dir=tmp, - no_input=self.config.no_input, - ) - - # Copy it back to destination, including contents of former incomplete README.md - os.makedirs(os.path.dirname(readme_file), mode=488, exist_ok=True) - shutil.copyfile( - os.path.join(tmp, extra_context["project_name"], "README.md"), readme_file - ) - finally: - try: - shutil.rmtree(tmp) - except OSError as e: - if e.errno != errno.ENOENT: - raise - - def _extra_context_from_config(self): - extra_context = {} - if self.config: - for name in TEMPLATE.configuration: - var_name = "var_%s" % name - if getattr(self.config, var_name, None) is not None: - extra_context[name] = getattr(self.config, var_name) - continue - if isinstance(self.config, dict) and var_name in self.config: - extra_context[name] = self.config[var_name] - return extra_context - - def _create_extra_context(self, project_dir): - extra_context = self._extra_context_from_config() - - if self.config.skip_collect: - for context_name, _ in COMMANDS.items(): - extra_context[context_name] = "NA" - extra_context["snakemake_nb"] = "NA" - else: - logger.info("Collecting size & inodes numbers") - for context_name, cmd in COMMANDS.items(): - if context_name not in extra_context.keys(): - cmd.append(project_dir) - extra_context[context_name] = DU.match( - execute_shell_commands([cmd], check=False, verbose=False) - ).group(1) - - if "snakemake_nb" not in extra_context.keys(): - extra_context["snakemake_nb"] = ArchiveReadmeCommand._get_snakemake_nb(project_dir) - - if "archiver_name" not in extra_context.keys(): - extra_context["archiver_name"] = ArchiveReadmeCommand._get_archiver_name() - - if "archiver_email" not in extra_context.keys(): - extra_context["archiver_email"] = ( - "{}@bih-charite.de".format(extra_context["archiver_name"]).lower().replace(" ", ".") - ) - if "CUBI_name" not in extra_context.keys(): - extra_context["CUBI_name"] = extra_context["archiver_name"] - - if "PI_name" in extra_context.keys() and "PI_email" not in extra_context.keys(): - extra_context["PI_email"] = ( - "{}@charite.de".format(extra_context["PI_name"]).lower().replace(" ", ".") - ) - if "client_name" in extra_context.keys() and "client_email" not in extra_context.keys(): - extra_context["client_email"] = ( - "{}@charite.de".format(extra_context["client_name"]).lower().replace(" ", ".") - ) - - if "SODAR_UUID" in extra_context.keys() and "SODAR_URL" not in extra_context.keys(): - if getattr(self.config, 
"sodar_server_url", None) is not None: - extra_context["SODAR_URL"] = "{}/projects/{}".format( - self.config.sodar_server_url, extra_context["SODAR_UUID"] - ) - elif "sodar_server_url" in self.config: - extra_context["SODAR_URL"] = "{}/projects/{}".format( - self.config["sodar_server_url"], extra_context["SODAR_UUID"] - ) - - if "directory" not in extra_context.keys(): - extra_context["directory"] = project_dir - if "project_name" not in extra_context.keys(): - extra_context["project_name"] = os.path.basename(project_dir) - if "start_date" not in extra_context.keys() and DATE.match(extra_context["project_name"]): - extra_context["start_date"] = DATE.match(extra_context["project_name"]).group(1) - if "current_status" not in extra_context.keys(): - extra_context["current_status"] = "Finished" - - return extra_context - - @staticmethod - def _get_snakemake_nb(project_dir): - cmds = [ - [ - "find", - project_dir, - "-type", - "d", - "-name", - ".snakemake", - "-exec", - "du", - "--inodes", - "--max-depth=0", - "{}", - ";", - ], - ["cut", "-f", "1"], - ["paste", "-sd+"], - ["bc"], - ] - return execute_shell_commands(cmds, check=False, verbose=False) - - @staticmethod - def _get_archiver_name(): - cmds = [ - ["pinky", "-l", os.getenv("USER")], - ["grep", "In real life:"], - ["sed", "-e", "s/.*In real life: *//"], - ] - output = execute_shell_commands(cmds, check=False, verbose=False) - return output.rstrip() - - -def add_readme_parameters(parser): - for name in TEMPLATE.configuration: - key = name.replace("_", "-") - parser.add_argument( - "--var-%s" % key, help="template variable %s" % repr(name), default=None - ) - - -def is_readme_valid(filename=None, verbose=False): - if filename is None: - f = sys.stdin - else: - if not os.path.exists(filename): - if verbose: - logger.error("No README file {}".format(filename)) - return False - f = open(filename, "rt") - matching = set() - for line in f: - line = line.rstrip() - for name, pattern in PATTERNS.items(): - if pattern.match(line): - matching.add(name) - f.close() - if verbose: - for name, _ in PATTERNS.items(): - if name not in matching: - logger.warning("Entry {} missing from README.md file".format(name)) - return set(PATTERNS.keys()).issubset(matching) - - -def setup_argparse(parser: argparse.ArgumentParser) -> None: - """Setup argument parser for ``cubi-tk archive readme``.""" - return ArchiveReadmeCommand.setup_argparse(parser) diff --git a/src/cubi_tk/archive/summary.py b/src/cubi_tk/archive/summary.py deleted file mode 100644 index 21d7108d..00000000 --- a/src/cubi_tk/archive/summary.py +++ /dev/null @@ -1,250 +0,0 @@ -"""``cubi-tk archive summary``: Creates a summary table of problematic files and files of interest""" - -import argparse -import os -from pathlib import Path -import re -import sys -import time -import typing - -import attr -from logzero import logger -import yaml - -from . 
import common -from .common import traverse_project_files - - -@attr.s(frozen=True, auto_attribs=True) -class Config(common.Config): - """Configuration for find-file.""" - - classes: str - table: str - - -class ArchiveSummaryCommand(common.ArchiveCommandBase): - """Implementation of archive summary command.""" - - command_name = "summary" - - @classmethod - def setup_argparse(cls, parser: argparse.ArgumentParser) -> None: - """Setup argument parser.""" - super().setup_argparse(parser) - - parser.add_argument( - "--classes", - default=os.path.join(os.path.dirname(__file__), "classes.yaml"), - help="Location of the file describing files of interest", - ) - parser.add_argument( - "--dont-follow-links", - action="store_true", - help="Do not follow symlinks to directories. Required when the project contains circular symlinks", - ) - parser.add_argument("table", help="Location of the summary output table") - - @classmethod - def run( - cls, args, _parser: argparse.ArgumentParser, _subparser: argparse.ArgumentParser - ) -> typing.Optional[int]: - """Entry point into the command.""" - return cls(args).execute() - - def check_args(self, args): - """Called for checking arguments, override to change behaviour.""" - res = 0 - - if not os.path.exists(self.config.project) or not os.path.isdir(self.config.project): - logger.error("Illegal project path : '{}'".format(self.config.project)) - res = 1 - - return res - - def execute(self) -> typing.Optional[int]: - """Traverse all project files for summary""" - res = self.check_args(self.config) - if res: # pragma: nocover - return res - - logger.info("Starting cubi-tk archive summary") - logger.info(" args: %s", self.config) - - stats = self._init_stats(os.path.normpath(os.path.realpath(self.config.classes))) - - f = open(self.config.table, "wt") if self.config.table else sys.stdout - - # Print output table title lines - resolved = Path(self.config.project) - title = "# Files in {}".format(self.config.project) - if self.config.project != str(resolved): - title += " (resolved to {})".format(str(resolved)) - print(title, file=f) - print( - "\t".join( - ["Class", "FileName", "Target", "ResolvedName", "Size", "Dangling", "Outside"] - ), - file=f, - ) - - # Traverse the project tree to accumulate statistics and populate the output table - self.start = time.time() - for file_attr in traverse_project_files( - self.config.project, followlinks=not self.config.dont_follow_links - ): - self._aggregate_stats(file_attr, stats, f) - f.close() - - # Clear the progress line - if self.config.table: - sys.stdout.write(" " * 80 + "\r") - sys.stdout.flush() - - # Print general overview on the screen - self._report_stats(stats) - - return 0 - - def _report_stats(self, stats): - logger.info( - "Number of files in {}: {} ({} outside of project directory)".format( - self.config.project, stats["nFile"], stats["nOutside"] - ) - ) - logger.info( - "Number of links: {} ({} dangling, {} inaccessible (permissions))".format( - stats["nLink"], stats["nDangling"], stats["nInaccessible"] - ) - ) - logger.info( - "Total size: {} ({} in files outside of the directory)".format( - stats["size"], stats["size_outside"] - ) - ) - for name, the_stat in stats["classes"].items(): - logger.info( - "Number of {} files: {} (total size: {})".format( - name, the_stat["nFile"], the_stat["size"] - ) - ) - logger.info( - "Number of {} files outside the projects directory: {} (total size: {})".format( - name, the_stat["nOutside"], the_stat["size_outside"] - ) - ) - logger.info( - "Number of {} files lost 
(dangling or inaccessible): {}".format( - name, the_stat["nLost"] - ) - ) - - @staticmethod - def _init_stats(f=None): - if not f: - f = sys.stdout - if isinstance(f, str): - f = open(f, "rt") - - stats = { - "nFile": 0, - "size": 0, - "nLink": 0, - "nDangling": 0, - "nInaccessible": 0, - "nOutside": 0, - "size_outside": 0, - "classes": {}, - } - for name, params in yaml.safe_load(f).items(): - stats["classes"][name] = { - "min_size": int(params["min_size"]), - "pattern": re.compile(params["pattern"]), - "nFile": 0, - "size": 0, - "nLost": 0, - "nOutside": 0, - "size_outside": 0, - } - - return stats - - def _aggregate_stats(self, file_attr, stats, f): - """Aggregate statistics for one file""" - save = [] - - stats["nFile"] += 1 - stats["size"] += file_attr.size - - if file_attr.outside: - stats["nOutside"] += 1 - stats["size_outside"] += file_attr.size - save.append("outside") - - # symlinks - if file_attr.target: - stats["nLink"] += 1 - if file_attr.dangling: - stats["nDangling"] += 1 - save.append("dangling") - if file_attr.dangling is None: - stats["nInaccessible"] += 1 - save.append("inaccessible") - - # File classes - for name, the_class in stats["classes"].items(): - if not the_class["pattern"].match(file_attr.relative_path): - continue - is_lost = file_attr.target and (file_attr.dangling is None or file_attr.dangling) - if file_attr.size < the_class["min_size"] and not is_lost: - continue - save.append(name) - the_class["nFile"] += 1 - if is_lost: - the_class["nLost"] += 1 - else: - if file_attr.outside: - the_class["nOutside"] += 1 - the_class["size_outside"] += file_attr.size - the_class["size"] += file_attr.size - - if save: - self._print_file_attr("|".join(save), file_attr, f) - - # Report progress - if self.config.table and stats["nFile"] % 1000 == 0: - delta = int(time.time() - self.start) - sys.stdout.write( - "\rElapsed time: %02d:%02d:%02d, number of files processed: %d, rate: %.1f [files/sec]\r" - % ( - delta // 3600, - (delta % 3600) // 60, - delta % 60, - stats["nFile"], - stats["nFile"] / delta if delta > 0 else 0, - ) - ) - sys.stdout.flush() - - def _print_file_attr(self, the_class, fn, f): - """Print one row of the summary table""" - print( - "\t".join( - [ - the_class, - fn.relative_path, - fn.target if fn.target else "", - str(fn.resolved), - str(fn.size), - str(fn.dangling), - str(fn.outside), - ] - ), - file=f, - ) - - -def setup_argparse(parser: argparse.ArgumentParser) -> None: - """Setup argument parser for ``cubi-tk archive summary``.""" - return ArchiveSummaryCommand.setup_argparse(parser) diff --git a/src/cubi_tk/archive/templates/cookiecutter.json b/src/cubi_tk/archive/templates/cookiecutter.json deleted file mode 100644 index 5a89f70d..00000000 --- a/src/cubi_tk/archive/templates/cookiecutter.json +++ /dev/null @@ -1,60 +0,0 @@ -{ - "directory": "", - - "PI_name": "", - "PI_email": "", - "archiver_name": [ - "Eric Blanc", - "Manuela Benary", - "Dieter Beule", - "Manuel Holtgrewe", - "Andranik Ivanov", - "Mathias Kuhring", - "Mikko Nieminen", - "Benedikt Obermayer-Wasserscheid", - "Oliver Stolpe", - "Nina Thiessen", - "Eudes Viera Barbosa", - "January Weiner" - ], - "archiver_email": "", - "CUBI_name": [ - "Kajetan Bentele", - "Eric Blanc", - "Manuela Benary", - "Dieter Beule", - "Oliver Drechsel", - "Dermot Harnett", - "Manuel Holtgrewe", - "Andranik Ivanov", - "Mathias Kuhring", - "Clemens Messerschmidt", - "Jose Muino Acuna", - "Mikko Nieminen", - "Benedikt Obermayer-Wasserscheid", - "Patrick Pett", - "Oliver Stolpe", - "Nina Thiessen", - "Eudes Viera 
Barbosa", - "January Weiner" - ], - "client_name": "", - "client_email": "", - "SODAR_UUID": "", - "SODAR_URL": "", - "Gitlab_URL": "", - "project_name": "{{cookiecutter.directory}}", - "start_date": "", - "current_status": [ - "Active", - "Inactive", - "Finished", - "Archived" - ], - - "size": "0", - "inodes": "0", - "size_follow": "0", - "inodes_follow": "0", - "snakemake_nb": "0" -} diff --git a/src/cubi_tk/archive/templates/{{cookiecutter.project_name}}/README.md b/src/cubi_tk/archive/templates/{{cookiecutter.project_name}}/README.md deleted file mode 100644 index d7adbac4..00000000 --- a/src/cubi_tk/archive/templates/{{cookiecutter.project_name}}/README.md +++ /dev/null @@ -1,29 +0,0 @@ -# Project description - -Terse project description - -## Contacts and links - -- P.I.: [{{cookiecutter.PI_name}}](mailto:{{cookiecutter.PI_email}}) -- Client contact: [{{cookiecutter.client_name}}](mailto:{{cookiecutter.client_email}}) -- CUBI contact: [{{cookiecutter.archiver_name}}](mailto:{{cookiecutter.archiver_email}}) -- CUBI project leader: {{cookiecutter.CUBI_name}} -- SODAR project UUID: {{cookiecutter.SODAR_UUID}} -- SODAR URL: {{cookiecutter.SODAR_URL}} -- CUBI gitlab URL: {{cookiecutter.Gitlab_URL}} -- HPCC directory: {{cookiecutter.directory}} - -## Project status - -- Project name: {{cookiecutter.project_name}} -- Start date: {{cookiecutter.start_date}} -- Current status: {{cookiecutter.current_status}} -- Total size: {{cookiecutter.size}} (following links: {{cookiecutter.size_follow}}) -- Total number of files (inodes): {{cookiecutter.inodes}} (following links: {{cookiecutter.inodes_follow}}) -- Total number of files in `.snakemake` directories: {{cookiecutter.snakemake_nb}} - -## Public datasets files - -List here the provenance of public files that were used during the project life cycle, -but that should *NOT* be archived with the rest of the project. 
- diff --git a/tests/data/archive/classes.yaml b/tests/data/archive/classes.yaml deleted file mode 100644 index ab8cdb84..00000000 --- a/tests/data/archive/classes.yaml +++ /dev/null @@ -1,3 +0,0 @@ -fastq: - min_size: 1 - pattern: "^(.*/)?[^/]+(\\.f(ast)?q(\\.gz)?)$" diff --git a/tests/data/archive/final_dest_verif.tar.gz b/tests/data/archive/final_dest_verif.tar.gz deleted file mode 100644 index 283e1113..00000000 Binary files a/tests/data/archive/final_dest_verif.tar.gz and /dev/null differ diff --git a/tests/data/archive/outside/batch1/sample1.fastq.gz b/tests/data/archive/outside/batch1/sample1.fastq.gz deleted file mode 100644 index faa74c06..00000000 --- a/tests/data/archive/outside/batch1/sample1.fastq.gz +++ /dev/null @@ -1 +0,0 @@ -Raw data for sample 1 diff --git a/tests/data/archive/outside/batch1/sample1.fastq.gz.md5 b/tests/data/archive/outside/batch1/sample1.fastq.gz.md5 deleted file mode 100644 index 0ca6788f..00000000 --- a/tests/data/archive/outside/batch1/sample1.fastq.gz.md5 +++ /dev/null @@ -1 +0,0 @@ -253c6617ff7ac497959c9c8259030b7b sample1.fastq.gz diff --git a/tests/data/archive/outside/batch2/sample2.fastq.gz b/tests/data/archive/outside/batch2/sample2.fastq.gz deleted file mode 100644 index 80680a66..00000000 --- a/tests/data/archive/outside/batch2/sample2.fastq.gz +++ /dev/null @@ -1 +0,0 @@ -Raw data for sample 2 diff --git a/tests/data/archive/outside/batch2/sample2.fastq.gz.md5 b/tests/data/archive/outside/batch2/sample2.fastq.gz.md5 deleted file mode 100644 index 2c5d9c54..00000000 --- a/tests/data/archive/outside/batch2/sample2.fastq.gz.md5 +++ /dev/null @@ -1 +0,0 @@ -692e06599e32fa9e837c9c78a2956df4 sample2.fastq.gz diff --git a/tests/data/archive/outside/dir/outside_dir_file b/tests/data/archive/outside/dir/outside_dir_file deleted file mode 100644 index 2a9b623a..00000000 --- a/tests/data/archive/outside/dir/outside_dir_file +++ /dev/null @@ -1 +0,0 @@ -File outside of project dir accessed within the project dir via a symlink to the parent directory diff --git a/tests/data/archive/outside/files/outside_file b/tests/data/archive/outside/files/outside_file deleted file mode 100644 index c2ba9a62..00000000 --- a/tests/data/archive/outside/files/outside_file +++ /dev/null @@ -1 +0,0 @@ -File outside of project dir accessed within the project dir via a symlink diff --git a/tests/data/archive/project/.snakemake/snakemake b/tests/data/archive/project/.snakemake/snakemake deleted file mode 100644 index ba3b7bff..00000000 --- a/tests/data/archive/project/.snakemake/snakemake +++ /dev/null @@ -1 +0,0 @@ -snakemake diff --git a/tests/data/archive/project/extra_data/dangling_symlink b/tests/data/archive/project/extra_data/dangling_symlink deleted file mode 120000 index cf3c471b..00000000 --- a/tests/data/archive/project/extra_data/dangling_symlink +++ /dev/null @@ -1 +0,0 @@ -../../outside/inexistent_data \ No newline at end of file diff --git a/tests/data/archive/project/extra_data/file.public b/tests/data/archive/project/extra_data/file.public deleted file mode 100644 index e95681fa..00000000 --- a/tests/data/archive/project/extra_data/file.public +++ /dev/null @@ -1 +0,0 @@ -file.public diff --git a/tests/data/archive/project/extra_data/to_ignored_dir b/tests/data/archive/project/extra_data/to_ignored_dir deleted file mode 120000 index 3766ebc6..00000000 --- a/tests/data/archive/project/extra_data/to_ignored_dir +++ /dev/null @@ -1 +0,0 @@ -../ignored_dir \ No newline at end of file diff --git a/tests/data/archive/project/extra_data/to_ignored_file 
b/tests/data/archive/project/extra_data/to_ignored_file deleted file mode 120000 index a520ba90..00000000 --- a/tests/data/archive/project/extra_data/to_ignored_file +++ /dev/null @@ -1 +0,0 @@ -../ignored_dir/ignored_file \ No newline at end of file diff --git a/tests/data/archive/project/ignored_dir/ignored_file b/tests/data/archive/project/ignored_dir/ignored_file deleted file mode 100644 index cc254cdd..00000000 --- a/tests/data/archive/project/ignored_dir/ignored_file +++ /dev/null @@ -1 +0,0 @@ -file in the project directory that must NOT be archive, as it is located in a ignored directory diff --git a/tests/data/archive/project/pipeline/output/sample1/results b/tests/data/archive/project/pipeline/output/sample1/results deleted file mode 120000 index a01e22fd..00000000 --- a/tests/data/archive/project/pipeline/output/sample1/results +++ /dev/null @@ -1 +0,0 @@ -../../work/sample1/results \ No newline at end of file diff --git a/tests/data/archive/project/pipeline/output/sample2 b/tests/data/archive/project/pipeline/output/sample2 deleted file mode 120000 index a6751205..00000000 --- a/tests/data/archive/project/pipeline/output/sample2 +++ /dev/null @@ -1 +0,0 @@ -../work/sample2 \ No newline at end of file diff --git a/tests/data/archive/project/pipeline/work/sample1/results b/tests/data/archive/project/pipeline/work/sample1/results deleted file mode 100644 index 01dc7cb1..00000000 --- a/tests/data/archive/project/pipeline/work/sample1/results +++ /dev/null @@ -1 +0,0 @@ -Results for sample 1 diff --git a/tests/data/archive/project/pipeline/work/sample2/results b/tests/data/archive/project/pipeline/work/sample2/results deleted file mode 100644 index c6cc2f95..00000000 --- a/tests/data/archive/project/pipeline/work/sample2/results +++ /dev/null @@ -1 +0,0 @@ -Results for sample 2 diff --git a/tests/data/archive/project/raw_data/batch1 b/tests/data/archive/project/raw_data/batch1 deleted file mode 120000 index 9f50f751..00000000 --- a/tests/data/archive/project/raw_data/batch1 +++ /dev/null @@ -1 +0,0 @@ -../../outside/batch1 \ No newline at end of file diff --git a/tests/data/archive/project/raw_data/batch2/sample2.fastq.gz b/tests/data/archive/project/raw_data/batch2/sample2.fastq.gz deleted file mode 120000 index bb9053f8..00000000 --- a/tests/data/archive/project/raw_data/batch2/sample2.fastq.gz +++ /dev/null @@ -1 +0,0 @@ -../../../outside/batch2/sample2.fastq.gz \ No newline at end of file diff --git a/tests/data/archive/project/raw_data/batch2/sample2.fastq.gz.md5 b/tests/data/archive/project/raw_data/batch2/sample2.fastq.gz.md5 deleted file mode 120000 index 48976105..00000000 --- a/tests/data/archive/project/raw_data/batch2/sample2.fastq.gz.md5 +++ /dev/null @@ -1 +0,0 @@ -../../../outside/batch2/sample2.fastq.gz.md5 \ No newline at end of file diff --git a/tests/data/archive/project/raw_data/batch3/sample3.fastq.gz b/tests/data/archive/project/raw_data/batch3/sample3.fastq.gz deleted file mode 100644 index 173d6149..00000000 --- a/tests/data/archive/project/raw_data/batch3/sample3.fastq.gz +++ /dev/null @@ -1 +0,0 @@ -Raw data for sample 3 diff --git a/tests/data/archive/project/raw_data/batch3/sample3.fastq.gz.md5 b/tests/data/archive/project/raw_data/batch3/sample3.fastq.gz.md5 deleted file mode 100644 index 73f320f7..00000000 --- a/tests/data/archive/project/raw_data/batch3/sample3.fastq.gz.md5 +++ /dev/null @@ -1 +0,0 @@ -33f02a5e7d74950fd073707450e2fb7d sample3.fastq.gz diff --git a/tests/data/archive/project/raw_data/batch4/sample4.fastq.gz 
b/tests/data/archive/project/raw_data/batch4/sample4.fastq.gz deleted file mode 120000 index 527c105d..00000000 --- a/tests/data/archive/project/raw_data/batch4/sample4.fastq.gz +++ /dev/null @@ -1 +0,0 @@ -../../../outside/batch4/sample4.fastq.gz \ No newline at end of file diff --git a/tests/data/archive/rules.yaml b/tests/data/archive/rules.yaml deleted file mode 100644 index a973f9f2..00000000 --- a/tests/data/archive/rules.yaml +++ /dev/null @@ -1,10 +0,0 @@ -ignore: - - "^(.*/)?ignored_dir$" - -squash: - - "^(.*/)?[^/]+\\.public$" - - "^(.*/)?raw_data/(.*/)?[^/]+\\.fastq.gz$" - -compress: - - "^(.*/)?.snakemake$" - diff --git a/tests/data/archive/summary.tbl b/tests/data/archive/summary.tbl deleted file mode 100644 index 1d2acbbf..00000000 --- a/tests/data/archive/summary.tbl +++ /dev/null @@ -1,9 +0,0 @@ -# Files in project -Class FileName Target ResolvedName Size Dangling Outside -outside|dangling extra_data/dangling_symlink ../../outside/inexistent_data /filesystem/outside/inexistent_data 0 True True -outside|fastq raw_data/batch2/sample2.fastq.gz ../../../outside/batch2/sample2.fastq.gz /filesystem/outside/batch2/sample2.fastq.gz 22 False True -outside raw_data/batch2/sample2.fastq.gz.md5 ../../../outside/batch2/sample2.fastq.gz.md5 /filesystem/outside/batch2/sample2.fastq.gz.md5 51 False True -fastq raw_data/batch3/sample3.fastq.gz /filesystem/project/raw_data/batch3/sample3.fastq.gz 22 False False -outside|fastq raw_data/batch1/sample1.fastq.gz /filesystem/outside/batch1/sample1.fastq.gz 22 False True -outside raw_data/batch1/sample1.fastq.gz.md5 /filesystem/outside/batch1/sample1.fastq.gz.md5 51 False True -outside|dangling|fastq raw_data/batch4/sample4.fastq.gz ../../../outside/batch4/sample4.fastq.gz /filesystem/outside/batch4/sample4.fastq.gz 0 True True diff --git a/tests/data/archive/temp_dest_verif.tar.gz b/tests/data/archive/temp_dest_verif.tar.gz deleted file mode 100644 index 185c0f97..00000000 Binary files a/tests/data/archive/temp_dest_verif.tar.gz and /dev/null differ diff --git a/tests/test_archive_common.py b/tests/test_archive_common.py deleted file mode 100644 index d761112c..00000000 --- a/tests/test_archive_common.py +++ /dev/null @@ -1,26 +0,0 @@ -"""Tests for ``cubi_tk.archive.common``. - -We only run some smoke tests here. -""" - -import os -from pathlib import Path - -import cubi_tk.archive.common - - -def test_run_archive_get_file_attributes(): - project = os.path.join(os.path.dirname(__file__), "data", "archive", "project") - - relative_path = os.path.join("raw_data", "batch2", "sample2.fastq.gz") - filename = os.path.join(project, relative_path) - attributes = cubi_tk.archive.common.FileAttributes( - relative_path=relative_path, - resolved=Path(filename).resolve(), - symlink=True, - dangling=False, - outside=True, - target=os.path.join("..", "..", "..", "outside", "batch2", "sample2.fastq.gz"), - size=22, - ) - assert cubi_tk.archive.common.get_file_attributes(filename, project) == attributes diff --git a/tests/test_archive_copy.py b/tests/test_archive_copy.py deleted file mode 100644 index 1bfca73d..00000000 --- a/tests/test_archive_copy.py +++ /dev/null @@ -1,158 +0,0 @@ -"""Tests for ``cubi_tk.archive.copy``. - -We only run some smoke tests here. 
-""" - -import datetime -import filecmp -import glob -import os -import re -import tempfile - -import pytest - -from cubi_tk.__main__ import main, setup_argparse -from cubi_tk.common import execute_shell_commands - -HASHDEEP_TITLES_PATTERN = re.compile("^(%|#).*$") -IGNORE_FILES_PATTERN = re.compile("^(.*/)?(hashdeep|workdir)_(report|audit)\\.txt$") -IGNORE_LINES_PATTERN = re.compile( - "^.+,(.*/)?(\\.snakemake\\.tar\\.gz|1970-01-01_hashdeep_report\\.txt)$" -) - - -def test_run_archive_copy_help(capsys): - parser, _subparsers = setup_argparse() - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "copy", "--help"]) - - assert e.value.code == 0 - - res = capsys.readouterr() - assert res.out - assert not res.err - - -def test_run_archive_copy_nothing(capsys): - parser, _subparsers = setup_argparse() - - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "copy"]) - - assert e.value.code == 2 - - res = capsys.readouterr() - assert not res.out - assert res.err - - -def sort_hashdeep_title_and_body(filename): - titles = [] - body = [] - with open(filename, "rt") as f: - lines = [x.rstrip() for x in f.readlines()] - for line in lines: - line.rstrip() - if HASHDEEP_TITLES_PATTERN.match(line): - titles.append(line) - else: - if not IGNORE_LINES_PATTERN.match(line): - body.append(line) - return (sorted(titles), sorted(body)) - - -def test_run_archive_copy_smoke_test(mocker): - base_path = os.path.join(os.path.dirname(__file__), "data", "archive") - with tempfile.TemporaryDirectory() as tmp_dir: - execute_shell_commands( - [ - [ - "tar", - "-zxf", - os.path.join(base_path, "temp_dest_verif.tar.gz"), - "--directory", - tmp_dir, - ] - ] - ) - execute_shell_commands( - [ - [ - "tar", - "-zxf", - os.path.join(base_path, "final_dest_verif.tar.gz"), - "--directory", - tmp_dir, - ] - ] - ) - - argv = [ - "archive", - "copy", - "--keep-workdir-hashdeep", - os.path.join(tmp_dir, "temp_dest_verif"), - os.path.join(tmp_dir, "final_dest"), - ] - setup_argparse() - - # --- run tests - res = main(argv) - assert res == 0 - - # --- remove timestamps on all hashdeep reports & audits - now = datetime.date.today().strftime("%Y-%m-%d") - prefix = os.path.join(tmp_dir, "final_dest") - for fn in ["hashdeep_audit", "workdir_report", "workdir_audit"]: - from_fn = "{}_{}.txt".format(now, fn) - to_fn = "{}.txt".format(fn) - os.rename(os.path.join(prefix, from_fn), os.path.join(prefix, to_fn)) - - # --- check report - (repo_titles, repo_body) = sort_hashdeep_title_and_body( - os.path.join(tmp_dir, "final_dest_verif", "workdir_report.txt") - ) - (tmp_titles, tmp_body) = sort_hashdeep_title_and_body( - os.path.join(tmp_dir, "final_dest", "workdir_report.txt") - ) - - # --- check audits - for fn in ["hashdeep_audit", "workdir_audit"]: - with open(os.path.join(tmp_dir, "final_dest_verif", fn + ".txt"), "r") as f: - repo = sorted(f.readlines()) - with open(os.path.join(tmp_dir, "final_dest", fn + ".txt"), "r") as f: - tmp = sorted(f.readlines()) - assert repo == tmp - - # --- test all copied files, except the hashdeep report & audit, that can differ by line order - prefix = os.path.join(tmp_dir, "final_dest_verif") - ref_fns = [ - os.path.relpath(x, start=prefix) - for x in filter( - lambda x: os.path.isfile(x) or os.path.islink(x), - glob.glob(prefix + "/**/*", recursive=True), - ) - ] - ref_fns = filter(lambda x: not IGNORE_FILES_PATTERN.match(x), ref_fns) - prefix = os.path.join(tmp_dir, "final_dest") - test_fns = [ - os.path.relpath(x, start=prefix) - for x in filter( - lambda x: 
os.path.isfile(x) or os.path.islink(x), - glob.glob(prefix + "/**/*", recursive=True), - ) - ] - test_fns = filter(lambda x: not IGNORE_FILES_PATTERN.match(x), test_fns) - - matches, mismatches, errors = filecmp.cmpfiles( - os.path.join(tmp_dir, "final_dest_verif"), - os.path.join(tmp_dir, "final_dest"), - common=ref_fns, - shallow=False, - ) - assert len(matches) > 0 - assert sorted(errors) == ["extra_data/to_ignored_dir", "extra_data/to_ignored_file"] - assert sorted(mismatches) == ["pipeline/output/sample2"] - - assert os.path.exists(os.path.join(tmp_dir, "final_dest", "archive_copy_complete")) diff --git a/tests/test_archive_prepare.py b/tests/test_archive_prepare.py deleted file mode 100644 index 0756bfef..00000000 --- a/tests/test_archive_prepare.py +++ /dev/null @@ -1,129 +0,0 @@ -"""Tests for ``cubi_tk.archive.prepare``. - -We only run some smoke tests here. -""" - -import datetime -import filecmp -import glob -import os -import re -import tempfile - -import pytest - -from cubi_tk.__main__ import main, setup_argparse -from cubi_tk.common import execute_shell_commands - -from .test_archive_copy import sort_hashdeep_title_and_body - -SNAKEMAKE = re.compile("^.*\\.snakemake\\.tar\\.gz$") -HASHDEEP = re.compile("^(([0-9]{4})-([0-9]{2})-([0-9]{2}))_hashdeep_report\\.txt$") - - -def test_run_archive_prepare_help(capsys): - parser, _subparsers = setup_argparse() - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "prepare", "--help"]) - - assert e.value.code == 0 - - res = capsys.readouterr() - assert res.out - assert not res.err - - -def test_run_archive_prepare_nothing(capsys): - parser, _subparsers = setup_argparse() - - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "prepare"]) - - assert e.value.code == 2 - - res = capsys.readouterr() - assert not res.out - assert res.err - - -def test_run_archive_prepare_smoke_test(): - repo_dir = os.path.join(os.path.dirname(__file__), "data", "archive") - with tempfile.TemporaryDirectory() as tmp_dir: - execute_shell_commands( - [ - [ - "tar", - "-zxf", - os.path.join(repo_dir, "temp_dest_verif.tar.gz"), - "--directory", - tmp_dir, - ] - ] - ) - project_name = "project" - - argv = [ - "archive", - "prepare", - "--rules", - os.path.join(repo_dir, "rules.yaml"), - "--readme", - os.path.join(tmp_dir, "temp_dest_verif", "README.md"), - os.path.join(repo_dir, project_name), - os.path.join(tmp_dir, "temp_dest"), - ] - setup_argparse() - - # --- run tests - res = main(argv) - assert not res - - # --- remove hashdeep report filename timestamp - os.rename( - os.path.join( - tmp_dir, "temp_dest", datetime.date.today().strftime("%Y-%m-%d_hashdeep_report.txt") - ), - os.path.join(tmp_dir, "temp_dest", "1970-01-01_hashdeep_report.txt"), - ) - - # --- compare hashdeep report with reference - (repo_titles, repo_body) = sort_hashdeep_title_and_body( - os.path.join(tmp_dir, "temp_dest_verif", "1970-01-01_hashdeep_report.txt") - ) - (tmp_titles, tmp_body) = sort_hashdeep_title_and_body( - os.path.join(tmp_dir, "temp_dest", "1970-01-01_hashdeep_report.txt") - ) - # No test on gzipped files, timestamp stored on gzip format could be different - assert repo_body == tmp_body - - prefix = os.path.join(tmp_dir, "temp_dest_verif") - ref_fns = [ - os.path.relpath(x, start=prefix) - for x in filter( - lambda x: os.path.isfile(x) or os.path.islink(x), - glob.glob(prefix + "/**/*", recursive=True), - ) - ] - prefix = os.path.join(tmp_dir, "temp_dest") - test_fns = [ - os.path.relpath(x, start=prefix) - for x in filter( - lambda x: 
os.path.isfile(x) or os.path.islink(x), - glob.glob(prefix + "/**/*", recursive=True), - ) - ] - assert sorted(ref_fns) == sorted(test_fns) - - matches, mismatches, errors = filecmp.cmpfiles( - os.path.join(tmp_dir, "temp_dest_verif"), - os.path.join(tmp_dir, "temp_dest"), - common=ref_fns, - shallow=False, - ) - assert len(matches) > 0 - assert sorted(errors) == ["extra_data/to_ignored_dir", "extra_data/to_ignored_file"] - assert sorted(mismatches) == [ - "1970-01-01_hashdeep_report.txt", - "README.md", - "pipeline/output/sample2", - ] diff --git a/tests/test_archive_readme.py b/tests/test_archive_readme.py deleted file mode 100644 index 0ed78a9a..00000000 --- a/tests/test_archive_readme.py +++ /dev/null @@ -1,76 +0,0 @@ -"""Tests for ``cubi_tk.archive.prepare``. - -We only run some smoke tests here. -""" - -import os -import tempfile - -import pytest - -from cubi_tk.__main__ import main, setup_argparse -import cubi_tk.archive.readme - - -def test_run_archive_readme_help(capsys): - parser, _subparsers = setup_argparse() - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "readme", "--help"]) - - assert e.value.code == 0 - - res = capsys.readouterr() - assert res.out - assert not res.err - - -def test_run_archive_readme_nothing(capsys): - parser, _subparsers = setup_argparse() - - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "readme"]) - - assert e.value.code == 2 - - res = capsys.readouterr() - assert not res.out - assert res.err - - -# TODO: Fix test -@pytest.mark.skip() -def test_run_archive_readme_smoke_test(): - with tempfile.TemporaryDirectory() as tmp_dir: - project_name = "project" - project_dir = os.path.join(os.path.dirname(__file__), "data", "archive", project_name) - - readme_path = os.path.join(tmp_dir, project_name, "README.md") - - argv = [ - "--sodar-server-url", - "https://sodar.bihealth.org", - "archive", - "readme", - "--var-PI-name", - "Maxene Musterfrau", - "--var-archiver-name", - "Eric Blanc", - "--var-client-name", - "Max Mustermann", - "--var-SODAR-UUID", - "00000000-0000-0000-0000-000000000000", - "--var-Gitlab-URL", - "https://cubi-gitlab.bihealth.org", - "--var-start-date", - "1970-01-01", - "--no-input", - project_dir, - readme_path, - ] - setup_argparse() - - # --- run tests - res = main(argv) - assert not res - - assert cubi_tk.archive.readme.is_readme_valid(readme_path) diff --git a/tests/test_archive_summary.py b/tests/test_archive_summary.py deleted file mode 100644 index f35f15cd..00000000 --- a/tests/test_archive_summary.py +++ /dev/null @@ -1,73 +0,0 @@ -"""Tests for ``cubi_tk.archive.summary``. - -We only run some smoke tests here. 
-""" - -import os -import tempfile - -import pytest - -from cubi_tk.__main__ import main, setup_argparse - - -def test_run_archive_summary_help(capsys): - parser, _subparsers = setup_argparse() - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "summary", "--help"]) - - assert e.value.code == 0 - - res = capsys.readouterr() - assert res.out - assert not res.err - - -def test_run_archive_summary_nothing(capsys): - parser, _subparsers = setup_argparse() - - with pytest.raises(SystemExit) as e: - parser.parse_args(["archive", "summary"]) - - assert e.value.code == 2 - - res = capsys.readouterr() - assert not res.out - assert res.err - - -def test_run_archive_summary_smoke_test(): - filename = "summary.tbl" - with tempfile.TemporaryDirectory() as tmp_dir: - repo_dir = os.path.join(os.path.dirname(__file__), "data", "archive") - target_file = os.path.join(repo_dir, filename) - mocked_file = os.path.join(tmp_dir, filename) - - argv = [ - "archive", - "summary", - "--class", - os.path.join(repo_dir, "classes.yaml"), - os.path.join(repo_dir, "project"), - mocked_file, - ] - setup_argparse() - - # --- run tests - res = main(argv) - assert not res - - mocked = [line.rstrip().split("\t") for line in open(mocked_file, "rt")][1:] - target = [line.rstrip().split("\t") for line in open(target_file, "rt")][1:] - assert len(mocked) == len(target) - j = target[0].index("ResolvedName") - failed = [] - for value in target: - found = False - for v in mocked: - if v[-j] == value[-j]: - found = True - break - if not found: - failed.append(value) - assert len(failed) == 0