03.05-host_transcriptomics.Rmd

# Host transcriptomics (HT) data processing {#host-transcriptomics-data-processing}

The analysis of host transcriptomic data can be conducted following two main strategies, which depend on whether a well-annotated reference genome that contain all gene sequences of the studied transcripts is available or not. This is seldom the case in non-model organisms that lack complete reference genomes with high-quality annotation of genetic information, although many reference genome generation initiatives are rapidly increasing the number of available reference genomes. The pros and cons of using either approach have been addressed in the literature [@Lee2021-th].

In the following two chapters, you will find example pipelines to process host transcriptomic data through both strategies:

*  **[Reference-based host transcriptomics (HT) data processing](#host-transcriptomics-data-processing-reference-based)**
*  **[Reference-free host transcriptomics (HT) data processing](#host-transcriptomics-data-processing-reference-free)**

## Reference-based host transcriptomics (HT) data processing {#host-transcriptomics-data-processing-reference-based}

### Quality-filtering {-}
Fastp is a high-performance FASTQ preprocessor that can be used to clean up raw sequencing reads from Illumina platforms. It provides various quality control, filtering, and trimming options to remove low-quality bases, contaminants, and adapter sequences. The code provided performs a number of these steps, including trimming of poly-G and poly-X tails, which are commonly observed in Illumina reads, filtering reads based on quality and length, and removing adapter sequences. The resulting cleaned reads are then written to the specified output files. Additionally, fastp provides comprehensive quality control reports in both HTML and JSON formats, which can be used to assess the quality of the input reads and the impact of the processing steps. Overall, pre-processing raw sequencing reads with fastp is a critical step in ensuring the accuracy and reliability of downstream bioinformatics analyses.

\small
```{sh ht-fastp, eval=FALSE}
fastp \
      --in1 {input.read1} --in2 {input.read2} \
      --out1 {output.read1} --out2 {output.read2} \
      --trim_poly_g \
      --trim_poly_x \
      --low_complexity_filter \
      --n_base_limit 5 \
      --qualified_quality_phred 20 \
      --length_required 60 \
      --thread {threads} \
      --html {output.fastp_html} \
      --json {output.fastp_json} \
      --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
      --adapter_sequence_r2  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
```
\normalsize

### Ribosomal RNA removal {-}
Ribodetector can be used for efficient removal of rRNA sequences from host transcriptomics data, improving accuracy and reducing computational time. This tool helps to better identify transcripts and understand gene expression in complex microbiomes.

\small
```{sh ht-ribodetector, eval=FALSE}
ribodetector_cpu \
      -t 24 \
      -l 150 \
      -i {input.r1} {input.r2} \
      -e rrna \
      -o {output.non_rna_r1} {output.non_rna_r2}
```
\normalsize

### Reference genome indexing {-}
In order to use STAR for host transcriptomics, it is neccesary to first generates a genome index, which  can be used for multiple RNA-seq experiments.

\small
```{sh ht-star-index, eval=FALSE}
STAR \
      --runMode genomeGenerate \
      --runThreadN {threads} \
      --genomeDir {input} \
      --genomeFastaFiles {input}/*.fna \
      --sjdbGTFfile {input}/*.gtf \
      --sjdbOverhang {params.readlength}
```
\normalsize

### Read mapping against reference genome {-}
STAR (Spliced Transcripts Alignment to a Reference) is a fast and efficient method for aligning RNA-seq reads to a reference genome. It uses a two-pass alignment approach to detect spliced transcripts, improve accuracy and speed up the alignment process.

\small
```{sh ht-star-mapping, eval=FALSE}
STAR \
      --runMode alignReads \
      --runThreadN {threads} \
      --genomeDir {params.genome} \
      --readFilesIn {input.read1} {input.read2} \
      --outFileNamePrefix {wildcards.sample} \
      --outSAMtype BAM SortedByCoordinate \
      --outReadsUnmapped Fastx \
      --readFilesCommand zcat \
      --quantMode GeneCounts
```
\normalsize

:::: {.authorbox}
Contents of this section were created by [Antton Alberdi](#antton-alberdi) and [Raphael Eisenhofer](#raphael-eisenhofer).
:::

## Reference-free host transcriptomics (HT) data processing {#host-transcriptomics-data-processing-reference-free}

Contents will be added shortly.