Build branch use-biobox-modules with version use-biobox-modules (68139c6)
Build pipeline: viash-hub.rnaseq.use-biobox-modules-qh9pg
Source commit: 68139c6a1e
Source message: remove migrated components
This commit is contained in:
5
.gitignore
vendored
Normal file
5
.gitignore
vendored
Normal file
@@ -0,0 +1,5 @@
|
||||
.nextflow*
|
||||
work
|
||||
testData
|
||||
test_results
|
||||
target
|
||||
136
README.md
Normal file
136
README.md
Normal file
@@ -0,0 +1,136 @@
|
||||
# RNAseq.vsh
|
||||
|
||||
<!-- README.md is generated by running 'quarto render README.qmd' -->
|
||||
|
||||
A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq)
|
||||
pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
|
||||
|
||||
## Rationale
|
||||
|
||||
We stick to the original nf-core pipeline as much as possible. This also
|
||||
means that we create a subworkflow for the 5 main stages of the pipeline
|
||||
as depicted in the [README](https://github.com/nf-core/rnaseq).
|
||||
|
||||
## Getting started
|
||||
|
||||
As test data, we can use the small dataset nf-core provided with [their
|
||||
`test`
|
||||
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
|
||||
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
|
||||
|
||||
A simple script has been provided to fetch those files from the github
|
||||
repository and store them under `testData/minimal_test` (the
|
||||
subdirectory is created to support `full_test` later as well):
|
||||
`bin/get_minimal_test_data.sh`.
|
||||
|
||||
Additionally, a script has been provided to fetch some additional
|
||||
resources for unit testing the components. Thes will be stored under
|
||||
`testData/unit_test_resources`: `bin/get_unit test_data.sh`
|
||||
|
||||
To get started, we need to:
|
||||
|
||||
1. [Install
|
||||
`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
|
||||
system-wide
|
||||
|
||||
2. Fetch the test data:
|
||||
|
||||
``` bash
|
||||
bin/minimal_test.sh
|
||||
bin/get_minimal_test_data.sh
|
||||
```
|
||||
|
||||
## Running the pipeline
|
||||
|
||||
To actually run the pipeline, we first need to build the components and
|
||||
pipeline:
|
||||
|
||||
``` bash
|
||||
viash ns build --setup cb --parallel
|
||||
```
|
||||
|
||||
Now we can run the pipeline using the command:
|
||||
|
||||
``` bash
|
||||
nextflow run target/nextflow/workflows/pre_processing/main.nf \
|
||||
-profile docker \
|
||||
--id test \
|
||||
--input testData/minimal_test/SRR6357070_1.fastq.gz \
|
||||
--publish_dir testData/test_output/
|
||||
```
|
||||
|
||||
Alternatively, we can run the pipeline with a sample sheet using the
|
||||
built-in `--param_list` functionality: (Read file paths must be
|
||||
specified relative to the sample sheet’s path)
|
||||
|
||||
``` bash
|
||||
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
|
||||
id,fastq_1,fastq_2,strandedness
|
||||
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
|
||||
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
|
||||
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
|
||||
HERE
|
||||
|
||||
nextflow run target/nextflow/workflows/rnaseq/main.nf \
|
||||
--param_list testData/minimal_test/input_fastq/sample_sheet.csv \
|
||||
--publish_dir "test_results/full_pipeline_test" \
|
||||
--fasta testData/minimal_test/reference/genome.fasta \
|
||||
--gtf testData/minimal_test/reference/genes.gtf.gz \
|
||||
--transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
|
||||
-profile docker
|
||||
```
|
||||
|
||||
## Pipeline sub-workflows and components
|
||||
|
||||
The pipeline has 5 sub-workflows that can be run separately.
|
||||
|
||||
1. Prepare genome: This is a workflow for preparing all the reference
|
||||
data required for downstream analysis, i.e., uncompress provided
|
||||
reference data or generate the required index files (for STAR,
|
||||
Salmon, Kallisto, RSEM, BBSplit).
|
||||
|
||||
2. Pre-processing: This is a workflow for performing quality control on
|
||||
the input reads It performs FastQC, extracts UMIs, trims adapters,
|
||||
and removes ribosomal RNA reads. Adapters can be trimmed using
|
||||
either Trim galore! or fastp (work in progress).
|
||||
|
||||
3. Genome alignment and quantification: This is a workflow for
|
||||
performing genome alignment using STAR and transcript quantification
|
||||
using Salmon or RSEM (using RSEM’s built-in support for STAR) (work
|
||||
in progress). Alignment sorting and indexing, as well as computation
|
||||
of statistics from the BAM files is performed using Samtools.
|
||||
UMI-based deduplication is also performed.
|
||||
|
||||
4. Post-processing: This is a workflow for duplicate read marking
|
||||
(picard MarkDuplicates), transcript assembly and quantification
|
||||
(StringTie), and creation of bigWig coverage files.
|
||||
|
||||
5. Pseudo alignment and quantification: This is a workflow for
|
||||
performing pseudo alignment and transcript quantification using
|
||||
Salmon or Kallisto.
|
||||
|
||||
6. Final QC: This is a workflow for performing extensive quality
|
||||
control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts).
|
||||
It presents QC for raw reads, alignments, gene biotype, sample
|
||||
similarity, and strand specificity (MultiQC).
|
||||
|
||||
## Reusing components from biobox
|
||||
|
||||
At the moment, this pipeline makes use of the following components from
|
||||
[biobox](https://github.com/viash-hub/biobox):
|
||||
|
||||
- `gffread`
|
||||
- `star/star_genome_generate`
|
||||
- `star/star_align_reads`
|
||||
- `salmon/salmon_index`
|
||||
- `salmon/salmon_quant`
|
||||
- `featurecounts`
|
||||
- `samtools/samtools_sort`
|
||||
- `samtools/samtools_index`
|
||||
- `samtools/samtools_stats`
|
||||
- `samtools/samtools_flagstat`
|
||||
- `samtools/samtools_idxstats`
|
||||
- `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
|
||||
- `fastp` (work in progress)
|
||||
- `rsem/rsem_prepare_reference` (work in progress)
|
||||
- `rsem/rsem_calculate_expression` (work in progress)
|
||||
107
README.qmd
Normal file
107
README.qmd
Normal file
@@ -0,0 +1,107 @@
|
||||
---
|
||||
title: RNAseq.vsh
|
||||
format: gfm
|
||||
---
|
||||
|
||||
<!-- README.md is generated by running 'quarto render README.qmd' -->
|
||||
|
||||
```{r, echo = FALSE, message = FALSE, error = FALSE, warning = FALSE}
|
||||
library(tidyverse)
|
||||
```
|
||||
|
||||
A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
|
||||
|
||||
## Rationale
|
||||
|
||||
We stick to the original nf-core pipeline as much as possible. This also means that we create a subworkflow for the 5 main stages of the pipeline as depicted in the [README](https://github.com/nf-core/rnaseq).
|
||||
|
||||
## Getting started
|
||||
|
||||
As test data, we can use the small dataset nf-core provided with [their `test` profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv): <https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
|
||||
|
||||
A simple script has been provided to fetch those files from the github repository and store them under `testData/minimal_test` (the subdirectory is created to support `full_test` later as well): `bin/get_minimal_test_data.sh`.
|
||||
|
||||
Additionally, a script has been provided to fetch some additional resources for unit testing the components. Thes will be stored under `testData/unit_test_resources`: `bin/get_unit test_data.sh`
|
||||
|
||||
To get started, we need to:
|
||||
|
||||
1. [Install `nextflow`](https://www.nextflow.io/docs/latest/getstarted.html) system-wide
|
||||
|
||||
2. Fetch the test data:
|
||||
|
||||
``` bash
|
||||
bin/minimal_test.sh
|
||||
bin/get_minimal_test_data.sh
|
||||
```
|
||||
|
||||
## Running the pipeline
|
||||
|
||||
To actually run the pipeline, we first need to build the components and pipeline:
|
||||
|
||||
``` bash
|
||||
viash ns build --setup cb --parallel
|
||||
```
|
||||
|
||||
Now we can run the pipeline using the command:
|
||||
|
||||
``` bash
|
||||
nextflow run target/nextflow/workflows/pre_processing/main.nf \
|
||||
-profile docker \
|
||||
--id test \
|
||||
--fastq_1 testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz \
|
||||
--publish_dir testData/test_output/
|
||||
```
|
||||
|
||||
Alternatively, we can run the pipeline with a sample sheet using the built-in `--param_list` functionality: (Read file paths must be specified relative to the sample sheet's path)
|
||||
|
||||
``` bash
|
||||
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
|
||||
id,fastq_1,fastq_2,strandedness
|
||||
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
|
||||
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
|
||||
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
|
||||
HERE
|
||||
|
||||
nextflow run target/nextflow/workflows/rnaseq/main.nf \
|
||||
--param_list testData/minimal_test/input_fastq/sample_sheet.csv \
|
||||
--publish_dir "test_results/full_pipeline_test" \
|
||||
--fasta testData/minimal_test/reference/genome.fasta \
|
||||
--gtf testData/minimal_test/reference/genes.gtf.gz \
|
||||
--transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
|
||||
-profile docker
|
||||
```
|
||||
|
||||
## Pipeline sub-workflows and components
|
||||
|
||||
The pipeline has 5 sub-workflows that can be run separately.
|
||||
|
||||
1. Prepare genome: This is a workflow for preparing all the reference data required for downstream analysis, i.e., uncompress provided reference data or generate the required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
|
||||
|
||||
2. Pre-processing: This is a workflow for performing quality control on the input reads It performs FastQC, extracts UMIs, trims adapters, and removes ribosomal RNA reads. Adapters can be trimmed using either Trim galore! or fastp (work in progress).
|
||||
|
||||
3. Genome alignment and quantification: This is a workflow for performing genome alignment using STAR and transcript quantification using Salmon or RSEM (using RSEM's built-in support for STAR) (work in progress). Alignment sorting and indexing, as well as computation of statistics from the BAM files is performed using Samtools. UMI-based deduplication is also performed.
|
||||
|
||||
4. Post-processing: This is a workflow for duplicate read marking (picard MarkDuplicates), transcript assembly and quantification (StringTie), and creation of bigWig coverage files.
|
||||
|
||||
5. Pseudo alignment and quantification: This is a workflow for performing pseudo alignment and transcript quantification using Salmon or Kallisto.
|
||||
|
||||
6. Final QC: This is a workflow for performing extensive quality control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts). It presents QC for raw reads, alignments, gene biotype, sample similarity, and strand specificity (MultiQC).
|
||||
|
||||
## Reusing components from biobox
|
||||
At the moment, this pipeline makes use of the following components from [biobox](https://github.com/viash-hub/biobox):
|
||||
|
||||
* `gffread`
|
||||
* `star/star_genome_generate`
|
||||
* `star/star_align_reads`
|
||||
* `salmon/salmon_index`
|
||||
* `salmon/salmon_quant`
|
||||
* `featurecounts`
|
||||
* `samtools/samtools_sort`
|
||||
* `samtools/samtools_index`
|
||||
* `samtools/samtools_stats`
|
||||
* `samtools/samtools_flagstat`
|
||||
* `samtools/samtools_idxstats`
|
||||
* `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
|
||||
* `fastp` (work in progress)
|
||||
* `rsem/rsem_prepare_reference` (work in progress)
|
||||
* `rsem/rsem_calculate_expression` (work in progress)
|
||||
25
_viash.yaml
Normal file
25
_viash.yaml
Normal file
@@ -0,0 +1,25 @@
|
||||
name: rnaseq
|
||||
|
||||
viash_version: 0.9.0
|
||||
|
||||
source: src
|
||||
target: target
|
||||
|
||||
info:
|
||||
test_resources:
|
||||
- path: gs://viash-hub-test-data/rnaseq/v1
|
||||
dest: testData
|
||||
|
||||
config_mods: |
|
||||
.requirements.commands := ['ps']
|
||||
.runners[.type == 'nextflow'].directives.tag := '$id'
|
||||
|
||||
repositories:
|
||||
- name: biobox
|
||||
type: vsh
|
||||
repo: biobox
|
||||
tag: main
|
||||
- name: craftbox
|
||||
type: vsh
|
||||
repo: craftbox
|
||||
tag: v0.1.0
|
||||
105
bin/get_minimal_test_data.sh
Executable file
105
bin/get_minimal_test_data.sh
Executable file
@@ -0,0 +1,105 @@
|
||||
#!/bin/bash
|
||||
|
||||
CURR=`pwd`
|
||||
|
||||
### Get input fastq files for the minimal test
|
||||
|
||||
DEST_FASTQ="testData/minimal_test/input_fastq"
|
||||
mkdir -p $DEST_FASTQ
|
||||
cd $DEST_FASTQ
|
||||
|
||||
echo "Fetching FastQ files..."
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_2.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_2.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_2.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357073_1.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357074_1.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357075_1.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
|
||||
|
||||
cd $CURR
|
||||
|
||||
### Get reference files for the minimal test
|
||||
|
||||
DEST_REF="testData/minimal_test/reference"
|
||||
mkdir -p $DEST_REF
|
||||
cd $DEST_REF
|
||||
|
||||
echo "Fetching reference data..."
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/bbsplit_fasta_list.txt
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gff.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gtf.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genome.fasta
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/gfp.fa.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/hisat2.tar.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/rsem.tar.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/salmon.tar.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/transcriptome.fasta
|
||||
|
||||
wget https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0/assets/rrna-db-defaults.txt
|
||||
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genome.fasta
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gtf.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gff.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/transcriptome.fasta
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/gfp.fa.gz
|
||||
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/bbsplit_fasta_list.txt
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/hisat2.tar.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/salmon.tar.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/rsem.tar.gz
|
||||
|
||||
cd $CURR
|
||||
|
||||
NEWDEST1_REF="$CURR/testData/minimal_test/reference/rRNA"
|
||||
mkdir -p $NEWDEST1_REF
|
||||
cd $NEWDEST1_REF
|
||||
for LINE in `cat ../rrna-db-defaults.txt`
|
||||
do
|
||||
wget $LINE
|
||||
done
|
||||
cd $CURR
|
||||
find $NEWDEST1_REF -type f > $DEST_REF/rrna-db-defaults.txt
|
||||
|
||||
NEWDEST2_REF="$CURR/testData/minimal_test/reference/bbsplit_fasta"
|
||||
mkdir -p $NEWDEST2_REF
|
||||
while IFS=, read -r -a line; do
|
||||
url="${line[1]}"
|
||||
name="$NEWDEST2_REF/${line[0]}.fa"
|
||||
wget $url -O "$name"
|
||||
line+=("$name")
|
||||
IFS=','
|
||||
echo "${line[*]}" >> "$NEWDEST2_REF/tmp.txt"
|
||||
done < "$DEST_REF/bbsplit_fasta_list.txt"
|
||||
cut -d',' -f1,3 "$NEWDEST2_REF/tmp.txt" > "$DEST_REF/bbsplit_fasta_list.txt"
|
||||
rm "$NEWDEST2_REF/tmp.txt"
|
||||
50
bin/get_unit_test_data.sh
Executable file
50
bin/get_unit_test_data.sh
Executable file
@@ -0,0 +1,50 @@
|
||||
#!/bin/bash
|
||||
|
||||
CURR=`pwd`
|
||||
DEST="testData/unit_test_resources"
|
||||
mkdir -p $DEST
|
||||
cd $DEST
|
||||
|
||||
echo "Fetching unit test resources..."
|
||||
|
||||
## UMI_TOOLS
|
||||
# extract
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/slim.fastq.gz
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.1.gz
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.2.gz
|
||||
# dedup
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam.bai
|
||||
|
||||
# MultiQC
|
||||
wget https://multiqc.info/examples/rna-seq/data.zip
|
||||
|
||||
# dupRadar
|
||||
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/genes.gtf
|
||||
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
|
||||
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam.bai
|
||||
|
||||
|
||||
### Resources from https://github.com/snakemake/snakemake-wrappers/tree/master/bio
|
||||
# DESeq2
|
||||
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/deseq2/deseqdataset/test/dataset/counts.tsv
|
||||
|
||||
# preseq lc_extrap
|
||||
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/preseq/lc_extrap/test/samples/a.sorted.bed
|
||||
wget https://github.com/smithlabcode/preseq/raw/master/data/SRR1106616_5M_subset.bam
|
||||
|
||||
|
||||
### nf-core test datasets
|
||||
# sarscov2
|
||||
mkdir -p sarscov2
|
||||
wget -O sarscov2/genome.sizes https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.sizes
|
||||
wget -O sarscov2/test.bedgraph https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/bedgraph/test.bedgraph
|
||||
wget -O sarscov2/genome.fasta https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta
|
||||
wget -O sarscov2/genome.fasta.fai https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta.fai
|
||||
wget -O sarscov2/test.paired_end.sorted.bam https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam
|
||||
wget -O sarscov2/test.paired_end.sorted.bam.bai https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam.bai
|
||||
wget -O sarscov2/test.bed https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed
|
||||
wget -O sarscov2/test.bed12 https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed12
|
||||
wget -O sarscov2/genome.gtf https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.gtf
|
||||
|
||||
cd $CURR
|
||||
40
examples/minimal.yaml
Normal file
40
examples/minimal.yaml
Normal file
@@ -0,0 +1,40 @@
|
||||
param_list:
|
||||
- id: WT_REP1
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
|
||||
fastq_2:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: WT_REP2
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
|
||||
fastq_2:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: RAP1_IAA_30M_REP1
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
|
||||
fastq_2:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: RAP1_UNINDUCED_REP1
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: RAP1_UNINDUCED_REP2
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
|
||||
strandedness: reverse
|
||||
|
||||
fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genome.fasta
|
||||
gtf: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genes.gtf.gz
|
||||
additional_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/gfp.fa.gz
|
||||
transcript_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/transcriptome.fasta
|
||||
bbsplit_fasta_list: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/bbsplit_fasta_list.txt
|
||||
salmon_index: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/salmon.tar.gz
|
||||
skip_bbsplit: true
|
||||
multiqc_custom_config: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/multiqc_config.yml
|
||||
multiqc_methods_description: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/methods_description_template.yml
|
||||
3
main.nf
Normal file
3
main.nf
Normal file
@@ -0,0 +1,3 @@
|
||||
workflow {
|
||||
print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
|
||||
}
|
||||
27
nextflow.config
Normal file
27
nextflow.config
Normal file
@@ -0,0 +1,27 @@
|
||||
// template nextflow.config for nested workflows
|
||||
|
||||
manifest {
|
||||
nextflowVersion = '!>=20.12.1-edge'
|
||||
}
|
||||
|
||||
docker {
|
||||
fixOwnership = true
|
||||
}
|
||||
|
||||
|
||||
// TODO 1: unquote and adapt `rootDir` according to relative path within project
|
||||
// params {
|
||||
// rootDir = "$projectDir/../.."
|
||||
// }
|
||||
//
|
||||
// workflowDir = "${params.rootDir}/workflows"
|
||||
// targetDir = "${params.rootDir}/target/nextflow"
|
||||
|
||||
// TODO 2: insert custom imports here
|
||||
|
||||
// TODO 3: unquote
|
||||
// docker {
|
||||
// runOptions = "-v \$(realpath ${params.rootDir}):\$(realpath ${params.rootDir})"
|
||||
// }
|
||||
|
||||
|
||||
25
src/assets/methods_description_template.yml
Normal file
25
src/assets/methods_description_template.yml
Normal file
@@ -0,0 +1,25 @@
|
||||
id: "rnaseq.vsh-methods-description"
|
||||
description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
|
||||
section_name: "nf-core/rnaseq Methods Description"
|
||||
section_href: "https://github.com/nf-core/rnaseq"
|
||||
plot_type: "html"
|
||||
|
||||
data: |
|
||||
<h4>Methods</h4>
|
||||
<p>Data was processed using rnaseq.vsh which is a version of the nf-core/rnaseq (v.3.14.0) workflow wriiten using the Viash framework .</p>
|
||||
<p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
|
||||
<pre><code>${workflow.commandLine}</code></pre>
|
||||
<h4>References</h4>
|
||||
<ul>
|
||||
<li>Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. <a href="https://doi.org/10.1038/nbt.3820">https://doi.org/10.1038/nbt.3820</a></li>
|
||||
<li>Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276-278. <a href="https://doi.org/10.1038/s41587-020-0439-x">https://doi.org/10.1038/s41587-020-0439-x</a></li>
|
||||
<li>VIASH</li>
|
||||
</ul>
|
||||
<div class="alert alert-info">
|
||||
<h5>Notes:</h5>
|
||||
<ul>
|
||||
${nodoi_text}
|
||||
<li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
|
||||
<li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
|
||||
</ul>
|
||||
</div>
|
||||
164
src/assets/multiqc_config.yml
Normal file
164
src/assets/multiqc_config.yml
Normal file
@@ -0,0 +1,164 @@
|
||||
report_comment: >
|
||||
This report has been generated by the <a href="https://github.com/viash-hub/rnaseq" </a>
|
||||
analysis pipeline.
|
||||
report_section_order:
|
||||
"rnaseq-methods-description":
|
||||
order: -1000
|
||||
software_versions:
|
||||
order: -1001
|
||||
"rnaseq.vsh-summary":
|
||||
order: -1002
|
||||
|
||||
export_plots: true
|
||||
disable_version_detection: true
|
||||
|
||||
# Run only these modules
|
||||
run_modules:
|
||||
- custom_content
|
||||
- fastqc
|
||||
- cutadapt
|
||||
- fastp
|
||||
- sortmerna
|
||||
- star
|
||||
- rsem
|
||||
- salmon
|
||||
- kallisto
|
||||
- samtools
|
||||
- picard
|
||||
- preseq
|
||||
- rseqc
|
||||
- qualimap
|
||||
|
||||
# Order of modules
|
||||
top_modules:
|
||||
- "fail_trimming"
|
||||
- "fail_mapping"
|
||||
- "fail_strand"
|
||||
- "star_rsem_deseq2_pca"
|
||||
- "star_rsem_deseq2_clustering"
|
||||
- "star_salmon_deseq2_pca"
|
||||
- "star_salmon_deseq2_clustering"
|
||||
- "salmon_deseq2_pca"
|
||||
- "salmon_deseq2_clustering"
|
||||
- "kallisto_deseq2_pca"
|
||||
- "kallisto_deseq2_clustering"
|
||||
- "biotype_counts"
|
||||
- "dupradar"
|
||||
|
||||
module_order:
|
||||
- fastqc:
|
||||
name: "FastQC (raw)"
|
||||
info: "This section of the report shows FastQC results before adapter trimming."
|
||||
path_filters:
|
||||
- "*.read_*.fastqc.zip"
|
||||
- cutadapt
|
||||
- fastp
|
||||
- fastqc:
|
||||
name: "FastQC (trimmed)"
|
||||
info: "This section of the report shows FastQC results after adapter trimming."
|
||||
path_filters:
|
||||
- "*.trimgalore.read_*.fastqc.zip"
|
||||
|
||||
# Don't show % Dups in the General Stats table (we have this from Picard)
|
||||
table_columns_visible:
|
||||
fastqc:
|
||||
percent_duplicates: False
|
||||
|
||||
extra_fn_clean_extn:
|
||||
# - ".mapping_quality"
|
||||
# - ".MarkDuplicates_flagstat.output.flagstat"
|
||||
# - ".MarkDuplicates_idxstats.output.idxstats"
|
||||
# - ".MarkDuplicates_stats.output.txt"
|
||||
# - ".genome_sorted_MarkDuplicates.output.bam"
|
||||
# - ".genome_sorted_MarkDuplicates"
|
||||
- ".read_1"
|
||||
- ".read_2"
|
||||
|
||||
# See https://github.com/ewels/MultiQC_TestData/blob/master/data/custom_content/with_config/table_headerconfig/multiqc_config.yaml
|
||||
custom_data:
|
||||
fail_trimming:
|
||||
section_name: "WARNING: Fail Trimming Check"
|
||||
description: "List of samples that failed the minimum trimmed reads threshold specified via the '--min_trimmed_reads' parameter, and hence were ignored for the downstream processing steps."
|
||||
plot_type: "table"
|
||||
pconfig:
|
||||
id: "fail_trimmed_samples_table"
|
||||
table_title: "Samples failed trimming threshold"
|
||||
namespace: "Samples failed trimming threshold"
|
||||
format: "{:.0f}"
|
||||
fail_mapping:
|
||||
section_name: "WARNING: Fail Alignment Check"
|
||||
description: "List of samples that failed the STAR minimum mapped reads threshold specified via the '--min_mapped_reads' parameter, and hence were ignored for the downstream processing steps."
|
||||
plot_type: "table"
|
||||
pconfig:
|
||||
id: "fail_mapped_samples_table"
|
||||
table_title: "Samples failed mapping threshold"
|
||||
namespace: "Samples failed mapping threshold"
|
||||
format: "{:.2f}"
|
||||
fail_strand:
|
||||
section_name: "WARNING: Fail Strand Check"
|
||||
description: "List of samples that failed the strandedness check between that provided in the samplesheet and calculated by the <a href='http://rseqc.sourceforge.net/#infer-experiment-py'>RSeQC infer_experiment.py</a> tool."
|
||||
plot_type: "table"
|
||||
pconfig:
|
||||
id: "fail_strand_check_table"
|
||||
table_title: "Samples failed strandedness check"
|
||||
namespace: "Samples failed strandedness check"
|
||||
format: "{:.2f}"
|
||||
|
||||
# Customise the module search patterns to speed up execution time
|
||||
# - Skip module sub-tools that we are not interested in
|
||||
# - Replace file-content searching with filename pattern searching
|
||||
# - Don't add anything that is the same as the MultiQC default
|
||||
# See https://multiqc.info/docs/#optimise-file-search-patterns for details
|
||||
sp:
|
||||
|
||||
fastqc/zip:
|
||||
fn: "*.fastqc.zip"
|
||||
|
||||
cutadapt:
|
||||
fn: "*.trimming_report*.txt"
|
||||
|
||||
fastp:
|
||||
fn: "*.fastp_out.json"
|
||||
|
||||
sortmerna:
|
||||
fn: "*sortmerna*.log"
|
||||
|
||||
star:
|
||||
fn: "*.star_align_reads.log.txt"
|
||||
|
||||
# hisat2:
|
||||
# fn: "*.hisat2.summary.log"
|
||||
# salmon:
|
||||
# fn: "*meta_info.json"
|
||||
|
||||
preseq:
|
||||
fn: "*.lc_extrap.txt"
|
||||
|
||||
samtools/stats:
|
||||
fn: "*_stats.output.txt"
|
||||
samtools/flagstat:
|
||||
fn: "*_flagstat.output.flagstat"
|
||||
samtools/idxstats:
|
||||
fn: "*_idxstats.output.idxstats"
|
||||
|
||||
rseqc/bam_stat:
|
||||
fn: "*.mapping_quality.txt"
|
||||
rseqc/junction_saturation:
|
||||
fn: "*.junction_saturation_plot.r"
|
||||
rseqc/junction_annotation:
|
||||
fn: "*.junction_annotation.log"
|
||||
rseqc/read_duplication_pos:
|
||||
fn: "*.duplication_rate_mapping.xls"
|
||||
rseqc/read_distribution:
|
||||
fn: "*.read_distribution.txt"
|
||||
rseqc/infer_experiment:
|
||||
fn: "*.strandedness.txt"
|
||||
rseqc/inner_distance:
|
||||
fn: "*.inner_distance_freq.txt"
|
||||
rseqc/tin:
|
||||
fn: "*.tin_summary.txt"
|
||||
|
||||
picard/markdups:
|
||||
fn: "*.MarkDuplicates.metrics.txt"
|
||||
|
||||
skip_versions_section: true
|
||||
8
src/assets/rrna-db-defaults.txt
Normal file
8
src/assets/rrna-db-defaults.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5.8s-database-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5s-database-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-16s-id95.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-23s-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-16s-id90.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-23s-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-18s-id95.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta
|
||||
56
src/bedtools_genomecov/config.vsh.yaml
Normal file
56
src/bedtools_genomecov/config.vsh.yaml
Normal file
@@ -0,0 +1,56 @@
|
||||
name: bedtools_genomecov
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/bedtools_genomecov.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: Compute BEDGRAPH (-bg) summaries of feature coverage
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--strandedness"
|
||||
type: string
|
||||
choices: ["unstranded", "forward", "reverse", "auto"]
|
||||
description: Sample strand-specificity.
|
||||
- name: "--bam"
|
||||
type: file
|
||||
description: Genome BAM file
|
||||
- name: "--extra_bedtools_args"
|
||||
type: string
|
||||
default: ''
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--bedgraph_forward"
|
||||
type: file
|
||||
default: $id.forward.bedgraph
|
||||
direction: output
|
||||
- name: "--bedgraph_reverse"
|
||||
type: file
|
||||
default: $id.reverse.bedgraph
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/chr19.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y build-essential wget && \
|
||||
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
|
||||
mv bedtools.static /usr/local/bin/bedtools && \
|
||||
chmod a+x /usr/local/bin/bedtools
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
25
src/bedtools_genomecov/script.sh
Normal file
25
src/bedtools_genomecov/script.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix_forward="forward"
|
||||
prefix_reverse="reverse"
|
||||
if [ $par_strandedness == 'reverse' ]; then
|
||||
prefix_forward="reverse"
|
||||
prefix_reverse="forward"
|
||||
fi
|
||||
|
||||
bedtools genomecov \
|
||||
-ibam $par_bam \
|
||||
-bg \
|
||||
-strand + \
|
||||
$par_extra_bedtools_args | bedtools sort > $prefix_forward.bedGraph
|
||||
|
||||
bedtools genomecov \
|
||||
-ibam $par_bam \
|
||||
-bg \
|
||||
-strand - \
|
||||
$par_extra_bedtools_args | bedtools sort > $prefix_reverse.bedGraph
|
||||
|
||||
mv $prefix_forward.bedGraph $par_bedgraph_forward
|
||||
mv $prefix_reverse.bedGraph $par_bedgraph_reverse
|
||||
22
src/bedtools_genomecov/test.sh
Normal file
22
src/bedtools_genomecov/test.sh
Normal file
@@ -0,0 +1,22 @@
|
||||
#!/bin/bash
|
||||
|
||||
id="SRR6357070"
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--strandedness unstranded \
|
||||
--bam $meta_resources_dir/chr19.bam \
|
||||
--bedgraph_forward chr19_forward.bedgraph \
|
||||
--bedgraph_reverse chr19_reverse.bedgraph
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
# check whether output exists
|
||||
[ ! -f "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' does not exist!" && exit 1
|
||||
[ ! -s "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' is empty!" && exit 1
|
||||
[ ! -f "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' does not exist!" && exit 1
|
||||
[ ! -s "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
54
src/cat_additional_fasta/config.vsh.yaml
Normal file
54
src/cat_additional_fasta/config.vsh.yaml
Normal file
@@ -0,0 +1,54 @@
|
||||
name: "cat_additional_fasta"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/cat_additional_fasta.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: |
|
||||
Concatenate addional fasta file to reference FASTA and GTF files.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
required: true
|
||||
description: Path to FASTA genome file.
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
description: Path to GTF annotation file.
|
||||
- name: "--additional_fasta"
|
||||
type: file
|
||||
description: FASTA file to concatenate to genome FASTA file e.g. containing spike-in sequences.
|
||||
- name: "--biotype"
|
||||
type: string
|
||||
description: Biotype value to use when appending entries to GTF file when additional fasta file is provided.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--fasta_output"
|
||||
type: file
|
||||
direction: output
|
||||
description: Concatenated FASTA file.
|
||||
- name: "--gtf_output"
|
||||
type: file
|
||||
direction: output
|
||||
description: Concatenated GTF file.
|
||||
|
||||
resources:
|
||||
- type: python_script
|
||||
path: script.py
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genome.fasta
|
||||
- path: /testData/minimal_test/reference/genes.gtf.gz
|
||||
- path: /testData/minimal_test/reference/gfp.fa.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: python
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
80
src/cat_additional_fasta/script.py
Normal file
80
src/cat_additional_fasta/script.py
Normal file
@@ -0,0 +1,80 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Read a custom fasta file and create a custom GTF containing each entry
|
||||
"""
|
||||
from itertools import groupby
|
||||
import logging
|
||||
import os
|
||||
import sys
|
||||
|
||||
## VIASH START
|
||||
par = {
|
||||
"fasta": "testData/minimal_test/reference/genome.fasta",
|
||||
"gtf": "testData/minimal_test/reference/genes.gtf",
|
||||
"additional_fasta": "testData/minimal_test/reference/gfp.fa.gz",
|
||||
"biotype": "gene_biotype",
|
||||
"fasta_output": "genome_gfp.fasta",
|
||||
"gtf_output": "genome_gfp.gtf",
|
||||
}
|
||||
meta = {
|
||||
"functionality_name": "cat_additonal_fasta"
|
||||
}
|
||||
## VIASH END
|
||||
|
||||
def fasta_iter(fasta_name):
|
||||
"""
|
||||
modified from Brent Pedersen
|
||||
Correct Way To Parse A Fasta File In Python
|
||||
given a fasta file. yield tuples of header, sequence
|
||||
|
||||
Fasta iterator from https://www.biostars.org/p/710/#120760
|
||||
"""
|
||||
with open(fasta_name) as fh:
|
||||
# ditch the boolean (x[0]) and just keep the header or sequence since
|
||||
# we know they alternate.
|
||||
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
|
||||
for header in faiter:
|
||||
# drop the ">"
|
||||
headerStr = header.__next__()[1:].strip()
|
||||
# join all sequence lines to one.
|
||||
seq = "".join(s.strip() for s in faiter.__next__())
|
||||
yield (headerStr, seq)
|
||||
|
||||
def fasta2gtf(fasta, output, biotype):
|
||||
fiter = fasta_iter(fasta)
|
||||
# GTF output lines
|
||||
lines = []
|
||||
attributes = 'exon_id "{name}.1"; exon_number "1";{biotype} gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
|
||||
line_template = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
|
||||
for ff in fiter:
|
||||
name, seq = ff
|
||||
# Use first ID as separated by spaces as the "sequence name"
|
||||
# (equivalent to "chromosome" in other cases)
|
||||
seqname = name.split()[0]
|
||||
# Remove all spaces
|
||||
name = seqname.replace(" ", "_")
|
||||
length = len(seq)
|
||||
biotype_attr = ""
|
||||
if biotype:
|
||||
biotype_attr = f' {biotype} "transgene";'
|
||||
line = line_template.format(name=name, length=length, biotype=biotype_attr)
|
||||
lines.append(line)
|
||||
with open(output, "w") as f:
|
||||
f.write("".join(lines))
|
||||
|
||||
add_name = os.path.basename(par['additional_fasta'])
|
||||
output = os.path.splitext(add_name)[0] + ".gtf"
|
||||
fasta2gtf(par['additional_fasta'], output, par['biotype'])
|
||||
|
||||
with open(par['fasta'], 'r') as f1:
|
||||
content1 = f1.read()
|
||||
with open(par['additional_fasta'], 'r') as f2:
|
||||
content2 = f2.read()
|
||||
with open(par['fasta_output'], 'w') as f_out:
|
||||
f_out.write(content1 + content2)
|
||||
with open(par['gtf'], 'r') as g1:
|
||||
g_content1 = g1.read()
|
||||
with open(output, 'r') as g2:
|
||||
g_content2 = g2.read()
|
||||
with open(par['gtf_output'], 'w') as g_out:
|
||||
g_out.write(g_content1 + g_content2)
|
||||
26
src/cat_additional_fasta/test.sh
Normal file
26
src/cat_additional_fasta/test.sh
Normal file
@@ -0,0 +1,26 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
gunzip "$meta_resources_dir/genes.gtf"
|
||||
gunzip "$meta_resources_dir/gfp.fa"
|
||||
|
||||
"$meta_executable" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--gtf "$meta_resources_dir/genes.gtf" \
|
||||
--additional_fasta "$meta_resources_dir/gfp.fa" \
|
||||
--biotype gene_biotype \
|
||||
--fasta_output genome_gfp.fasta \
|
||||
--gtf_output genome_gfp.gtf
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' does not exist!" && exit 1
|
||||
[ ! -s "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' is empty!" && exit 1
|
||||
[ ! -f "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' does not exist!" && exit 1
|
||||
[ ! -s "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
54
src/cat_fastq/config.vsh.yaml
Normal file
54
src/cat_fastq/config.vsh.yaml
Normal file
@@ -0,0 +1,54 @@
|
||||
name: "cat_fastq"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/cat/fastq/main.nf, modules/nf-core/cat/fastq/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: Concatenate multiple fastq files
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--read_1"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ";"
|
||||
description: Read 1 fastq files to be concatenated
|
||||
- name: "--read_2"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ";"
|
||||
description: Read 2 fastq files to be concatenated
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--fastq_1"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id_r1.fastq
|
||||
description: Concatenated read 1 fastq
|
||||
- name: "--fastq_2"
|
||||
type: file
|
||||
direction: output
|
||||
must_exist: false
|
||||
default: $id_r2.fastq
|
||||
description: Concatenated read 2 fastq
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357071_1.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357071_2.fastq.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
20
src/cat_fastq/script.sh
Normal file
20
src/cat_fastq/script.sh
Normal file
@@ -0,0 +1,20 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
IFS=";" read -ra read_1 <<< $par_read_1
|
||||
IFS=";" read -ra read_2 <<< $par_read_2
|
||||
|
||||
filename=$(basename -- "${read_1[0]}")
|
||||
if [ ${filename##*.} == "gz" ]; then
|
||||
command="zcat"
|
||||
else
|
||||
command="cat"
|
||||
fi
|
||||
|
||||
if [ ${#read_1[@]} -gt 0 ]; then
|
||||
$command ${read_1[*]} > $par_fastq_1
|
||||
fi
|
||||
if [ ${#read_2[@]} -gt 0 ]; then
|
||||
$command ${read_2[*]} > $par_fastq_2
|
||||
fi
|
||||
44
src/cat_fastq/test.sh
Normal file
44
src/cat_fastq/test.sh
Normal file
@@ -0,0 +1,44 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
echo ">>> Testing paired-end read samples with multiple replicates"
|
||||
"$meta_executable" \
|
||||
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
|
||||
--read_2 $meta_resources_dir/SRR6357070_2.fastq.gz\;$meta_resources_dir/SRR6357071_2.fastq.gz \
|
||||
--fastq_1 read_1.merged.fastq \
|
||||
--fastq_2 read_2.merged.fastq
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
|
||||
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
|
||||
[ ! -f "read_2.merged.fastq" ] && echo "Merged read 2 file does not exist!" && exit 1
|
||||
[ ! -s "read_2.merged.fastq" ] && echo "Merged read 2 file is empty!" && exit 1
|
||||
|
||||
echo ">>> Check number of reads"
|
||||
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep1_2=$(zcat $meta_resources_dir/SRR6357070_2.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep2_2=$(zcat $meta_resources_dir/SRR6357071_2.fastq.gz | echo $((`wc -l`/4)))
|
||||
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
|
||||
merged_2=$(cat read_2.merged.fastq | echo $((`wc -l`/4)))
|
||||
[[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ]] || [[ $(( $rep1_2 + $rep2_2 )) != $merged_2 ]] && echo "Concatenation unsuccessful!" && exit 1
|
||||
|
||||
rm read_1.merged.fastq read_2.merged.fastq
|
||||
|
||||
echo ">>> Testing single-end read samples with multiple replicates"
|
||||
"$meta_executable" \
|
||||
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
|
||||
--fastq_1 read_1.merged.fastq
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
|
||||
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
|
||||
|
||||
echo ">>> Check number of reads"
|
||||
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
|
||||
[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ] && echo "Concatenation unsuccessful!" && exit 1
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
78
src/deseq2_qc/config.vsh.yaml
Normal file
78
src/deseq2_qc/config.vsh.yaml
Normal file
@@ -0,0 +1,78 @@
|
||||
name: deseq2_qc
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/deseq2_qc.nf]
|
||||
last_sha: 92b2a7857de1dda9d1c19a088941fc81e2976ff7
|
||||
description: |
|
||||
run deseq2, perform pca, generate heatmaps and scatterplots for samples in the counts files
|
||||
|
||||
argument_groups:
|
||||
- name: "input"
|
||||
arguments:
|
||||
- name: "--counts"
|
||||
type: file
|
||||
description: Count file matrix where rows are genes and columns are samples.
|
||||
required: true
|
||||
- name: "--vst"
|
||||
type: boolean
|
||||
default: false
|
||||
description: Use vst transformation instead of rlog with .DESeq2
|
||||
- name: "--count_col"
|
||||
type: integer
|
||||
default: 3
|
||||
description: First column containing sample count data.
|
||||
- name: "--id_col"
|
||||
type: integer
|
||||
default: 1
|
||||
description: Column containing identifiers to be used.
|
||||
- name: "--sample_suffix"
|
||||
type: string
|
||||
description: Suffix to remove after sample name in columns e.g. '.rmDup.bam' if 'DRUG_R1.rmDup.bam'.
|
||||
default: ""
|
||||
- name: "--outprefix"
|
||||
type: string
|
||||
default: deseq2
|
||||
description: Output prefix
|
||||
- name: "--label"
|
||||
type: string
|
||||
description: Label to used in MultiQC report
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--outdir"
|
||||
type: file
|
||||
direction: output
|
||||
default: deseq2
|
||||
- name: "--pca_multiqc"
|
||||
type: file
|
||||
direction: output
|
||||
default: deseq2.pca.vals_mqc.tsv
|
||||
- name: "--sample_dists_multiqc"
|
||||
type: file
|
||||
direction: output
|
||||
default: deseq2.sample.dists_mqc.tsv
|
||||
|
||||
resources:
|
||||
# adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/deseq2_qc.r
|
||||
- type: r_script
|
||||
path: script.r
|
||||
# Add proper default headers as part of the component
|
||||
- path: deseq2_pca_header.txt
|
||||
- path: deseq2_clustering_header.txt
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/counts.tsv
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: rocker/r2u:22.04
|
||||
setup:
|
||||
- type: r
|
||||
cran: [ optparse, ggplot2, RColorBrewer, pheatmap, stringr, matrixStats ]
|
||||
bioc: [ DESeq2 ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
12
src/deseq2_qc/deseq2_clustering_header.txt
Normal file
12
src/deseq2_qc/deseq2_clustering_header.txt
Normal file
@@ -0,0 +1,12 @@
|
||||
#id: 'deseq2_clustering'
|
||||
#section_name: 'DESeq2 sample similarity'
|
||||
#description: "is generated from clustering by Euclidean distances between
|
||||
# <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html' target='_blank'>DESeq2</a>
|
||||
# rlog values for each sample
|
||||
# in the <a href='https://github.com/nf-core/rnaseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
|
||||
#plot_type: 'heatmap'
|
||||
#anchor: 'deseq2_clustering'
|
||||
#pconfig:
|
||||
# title: 'DESeq2: Heatmap of the sample-to-sample distances'
|
||||
# xlab: True
|
||||
# reverseColors: True
|
||||
11
src/deseq2_qc/deseq2_pca_header.txt
Normal file
11
src/deseq2_qc/deseq2_pca_header.txt
Normal file
@@ -0,0 +1,11 @@
|
||||
#id: 'deseq2_pca'
|
||||
#section_name: 'DESeq2 PCA plot'
|
||||
#description: "PCA plot between samples in the experiment.
|
||||
# These values are calculated using <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html'>DESeq2</a>
|
||||
# in the <a href='https://github.com/nf-core/atacseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
|
||||
#plot_type: 'scatter'
|
||||
#anchor: 'deseq2_pca'
|
||||
#pconfig:
|
||||
# title: 'DESeq2: Principal component plot'
|
||||
# xlab: PC1
|
||||
# ylab: PC2
|
||||
232
src/deseq2_qc/script.r
Executable file
232
src/deseq2_qc/script.r
Executable file
@@ -0,0 +1,232 @@
|
||||
## VIASH START
|
||||
par <- list(
|
||||
counts = "testData/unit_test_resources/counts.tsv",
|
||||
id_col = 1,
|
||||
sample_suffix = "",
|
||||
outprefix = "deseq2",
|
||||
count_col = 2,
|
||||
deseq2_output = "deseq2",
|
||||
pca_multiqc = "pca.vals_mqc.tsv",
|
||||
dists_multiqc = "sample.dists_mqc.tsv",
|
||||
vst = FALSE
|
||||
)
|
||||
meta <- list(
|
||||
resources_dir = "src/deseq2_qc"
|
||||
)
|
||||
## VIASH END
|
||||
|
||||
# REQUIREMENTS
|
||||
|
||||
## PCA, HEATMAP AND SCATTERPLOTS FOR SAMPLES IN COUNTS FILE
|
||||
## - SAMPLE NAMES HAVE TO END IN e.g. "_R1" REPRESENTING REPLICATE ID. LAST 3 CHARACTERS OF SAMPLE NAME WILL BE TRIMMED TO OBTAIN GROUP ID FOR DESEQ2 COMPARISONS.
|
||||
|
||||
# LOAD LIBRARIES
|
||||
library(DESeq2)
|
||||
library(ggplot2)
|
||||
library(RColorBrewer)
|
||||
library(pheatmap)
|
||||
library(stringr)
|
||||
|
||||
if (file.exists(par$outdir) == FALSE) {
|
||||
dir.create(par$outdir, recursive = TRUE)
|
||||
}
|
||||
|
||||
# READ IN COUNTS FILE
|
||||
count_table <- read.delim(file = par$counts, header = TRUE, row.names = NULL)
|
||||
rownames(count_table) <- count_table[, par$id_col]
|
||||
count_table <- count_table[, par$count_col:ncol(count_table), drop = FALSE]
|
||||
colnames(count_table) <- gsub(par$sample_suffix, "", colnames(count_table))
|
||||
colnames(count_table) <- gsub(pattern = '\\.$', replacement = '', colnames(count_table))
|
||||
|
||||
# RUN DESEQ2
|
||||
samples_vec <- colnames(count_table)
|
||||
name_components <- strsplit(samples_vec, "_")
|
||||
n_components <- length(name_components[[1]])
|
||||
decompose <- n_components != 1 && all(sapply(name_components, length) == n_components)
|
||||
coldata <- data.frame(samples_vec, sample = samples_vec, row.names = 1)
|
||||
if (decompose) {
|
||||
groupings <- as.data.frame(lapply(1:n_components, function(i) sapply(name_components, "[[", i)))
|
||||
n_distinct <- sapply(groupings, function(grp) length(unique(grp)))
|
||||
groupings <- groupings[n_distinct != 1 & n_distinct != length(samples_vec)]
|
||||
if (ncol(groupings) != 0) {
|
||||
names(groupings) <- paste0("Group", 1:ncol(groupings))
|
||||
coldata <- cbind(coldata, groupings)
|
||||
} else {
|
||||
decompose <- FALSE
|
||||
}
|
||||
}
|
||||
|
||||
DDSFile <- paste(par$outdir, "/", par$outprefix, ".dds.RData", sep = "")
|
||||
|
||||
counts <- count_table[, samples_vec, drop = FALSE]
|
||||
dds <- DESeqDataSetFromMatrix(countData = round(counts), colData = coldata, design = ~1)
|
||||
dds <- estimateSizeFactors(dds)
|
||||
|
||||
# No point if only one sample, or one gene
|
||||
if (min(dim(count_table)) <= 1) {
|
||||
save(dds, file = DDSFile)
|
||||
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
|
||||
warning("Not enough samples or genes in counts file for PCA.", call. = FALSE)
|
||||
quit(save = "no", status = 0, runLast = FALSE)
|
||||
}
|
||||
|
||||
if (!par$vst) {
|
||||
vst_name <- "rlog"
|
||||
rld <- rlog(dds)
|
||||
} else {
|
||||
vst_name <- "vst"
|
||||
rld <- varianceStabilizingTransformation(dds)
|
||||
}
|
||||
|
||||
assay(dds, vst_name) <- assay(rld)
|
||||
save(dds, file = DDSFile)
|
||||
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
|
||||
|
||||
# PLOT QC
|
||||
|
||||
##' PCA pre-processeor
|
||||
##'
|
||||
##' Generate all the necessary information to plot PCA from a DESeq2 object
|
||||
##' in which an assay containing a variance-stabilised matrix of counts is
|
||||
##' stored. Copied from DESeq2::plotPCA, but with additional ability to
|
||||
##' say which assay to run the PCA on.
|
||||
##'
|
||||
##' @param object The DESeq2DataSet object.
|
||||
##' @param ntop number of top genes to use for principla components, selected by highest row variance.
|
||||
##' @param assay the name or index of the assay that stores the variance-stabilised data.
|
||||
##' @return A data.frame containing the projected data alongside the grouping columns.
|
||||
##' A 'percentVar' attribute is set which includes the percentage of variation each PC explains,
|
||||
##' and additionally how much the variation within that PC is explained by the grouping variable.
|
||||
##' @author Gavin Kelly
|
||||
|
||||
plotPCA_vst <- function(object, ntop = 500, assay = length(assays(object))) {
|
||||
rv <- rowVars(assay(object, assay))
|
||||
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
|
||||
pca <- prcomp(t(assay(object, assay)[select, ]), center = TRUE, scale = FALSE)
|
||||
percentVar <- pca$sdev^2 / sum(pca$sdev^2)
|
||||
df <- cbind(as.data.frame(colData(object)), pca$x)
|
||||
# Order points so extreme samples are more likely to get label
|
||||
ord <- order(abs(rank(df$PC1) - median(df$PC1)), abs(rank(df$PC2) - median(df$PC2)))
|
||||
df <- df[ord, ]
|
||||
attr(df, "percentVar") <- data.frame(PC = seq(along = percentVar), percentVar = 100 * percentVar)
|
||||
return(df)
|
||||
}
|
||||
|
||||
PlotFile <- paste(par$outdir, "/", par$outprefix, ".plots.pdf", sep = "")
|
||||
|
||||
pdf(file = PlotFile, onefile = TRUE, width = 7, height = 7)
|
||||
|
||||
## PCA
|
||||
ntop <- c(500, Inf)
|
||||
for (n_top_var in ntop) {
|
||||
pca_data <- plotPCA_vst(dds, assay = vst_name, ntop = n_top_var)
|
||||
percentVar <- round(attr(pca_data, "percentVar")$percentVar)
|
||||
plot_subtitle <- ifelse(n_top_var == Inf, "All genes", paste("Top", n_top_var, "genes"))
|
||||
pl <- ggplot(pca_data, aes(PC1, PC2, label = paste0(" ", sample, " "))) +
|
||||
geom_point() +
|
||||
geom_text(check_overlap = TRUE, vjust = 0.5, hjust="inward") +
|
||||
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
|
||||
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
|
||||
labs(title = paste0("First PCs on ", vst_name, "-transformed data"), subtitle = plot_subtitle) +
|
||||
theme(legend.position = "top",
|
||||
panel.grid.major = element_blank(),
|
||||
panel.grid.minor = element_blank(),
|
||||
panel.background = element_blank(),
|
||||
panel.border = element_rect(colour = "black", fill = NA, size = 1))
|
||||
print(pl)
|
||||
|
||||
if (decompose) {
|
||||
pc_names <- paste0("PC", attr(pca_data, "percentVar")$PC)
|
||||
long_pc <- reshape(pca_data, varying=pc_names, direction="long", sep="", timevar="component", idvar="pcrow")
|
||||
long_pc <- subset(long_pc, component <= 5)
|
||||
long_pc_grp <- reshape(long_pc, varying = names(groupings), direction = "long", sep = "", timevar = "grouper")
|
||||
long_pc_grp <- subset(long_pc_grp, grouper <= 5)
|
||||
long_pc_grp$component <- paste("PC", long_pc_grp$component)
|
||||
long_pc_grp$grouper <- paste0(long_pc_grp$grouper, c("st", "nd", "rd", "th", "th")[long_pc_grp$grouper], " prefix")
|
||||
pl <- ggplot(long_pc_grp, aes(x = Group, y = PC)) +
|
||||
geom_point() +
|
||||
stat_summary(fun = mean, geom = "line", aes(group = 1)) +
|
||||
labs(x = NULL, y = NULL, subtitle = plot_subtitle, title = "PCs split by sample-name prefixes") +
|
||||
facet_grid(component ~ grouper, scales = "free_x") +
|
||||
scale_x_discrete(guide = guide_axis(n.dodge = 3))
|
||||
print(pl)
|
||||
}
|
||||
} # at end of loop, we'll be using the user-defined ntop if any, else all genes
|
||||
|
||||
## WRITE PC1 vs PC2 VALUES TO FILE
|
||||
pca_vals <- pca_data[, c("PC1", "PC2")]
|
||||
colnames(pca_vals) <- paste0(colnames(pca_vals), ": ", percentVar[1:2], '% variance')
|
||||
pca_vals <- cbind(sample = rownames(pca_vals), pca_vals)
|
||||
pca_vals_file <- paste(par$outdir, "/", par$outprefix, ".pca_vals.txt", sep = "")
|
||||
write.table(pca_vals, file = pca_vals_file,
|
||||
row.names = FALSE, col.names = TRUE,
|
||||
sep = "\t", quote = TRUE)
|
||||
|
||||
## SAMPLE CORRELATION HEATMAP
|
||||
sampleDists <- dist(t(assay(dds, vst_name)))
|
||||
sampleDistMatrix <- as.matrix(sampleDists)
|
||||
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
|
||||
pheatmap(
|
||||
sampleDistMatrix,
|
||||
clustering_distance_rows = sampleDists,
|
||||
clustering_distance_cols = sampleDists,
|
||||
col = colors,
|
||||
main = paste("Euclidean distance between", vst_name, "of samples")
|
||||
)
|
||||
|
||||
## WRITE SAMPLE DISTANCES TO FILE
|
||||
sample_dist_file <- paste(par$outdir, "/", par$outprefix, ".sample.dists.txt", sep = "")
|
||||
write.table(cbind(sample = rownames(sampleDistMatrix), sampleDistMatrix),
|
||||
file = sample_dist_file, row.names = FALSE,
|
||||
col.names = TRUE, sep = "\t", quote = FALSE)
|
||||
dev.off()
|
||||
|
||||
# SAVE SIZE FACTORS
|
||||
SizeFactorsDir <- paste0(par$outdir, "/size_factors/")
|
||||
if (file.exists(SizeFactorsDir) == FALSE) {
|
||||
dir.create(SizeFactorsDir, recursive = TRUE)
|
||||
}
|
||||
|
||||
NormFactorsFile <- paste(SizeFactorsDir, par$outprefix, ".size_factors.RData", sep = "")
|
||||
|
||||
normFactors <- sizeFactors(dds)
|
||||
save(normFactors, file = NormFactorsFile)
|
||||
|
||||
for (name in names(sizeFactors(dds))) {
|
||||
sizeFactorFile <- paste(SizeFactorsDir, name, ".txt", sep = "")
|
||||
write(as.numeric(sizeFactors(dds)[name]), file = sizeFactorFile)
|
||||
}
|
||||
|
||||
# R SESSION INFO
|
||||
RLogFile <- "R_sessionInfo.log"
|
||||
|
||||
sink(RLogFile)
|
||||
a <- sessionInfo()
|
||||
print(a)
|
||||
sink()
|
||||
|
||||
# Prepare files for MultiQC
|
||||
|
||||
readLines(paste0(meta$resources_dir, "/deseq2_pca_header.txt")) |>
|
||||
stringr::str_replace(pattern = "#id: 'deseq2_pca'",
|
||||
replace = paste0("#id: '", par$label, "_deseq2_pca'")) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
readLines(paste0("tmp.txt")) |>
|
||||
stringr::str_replace(pattern = "#section_name: 'DESeq2 PCA plot'",
|
||||
replace = paste0("#section_name: 'DESeq2 PCA plot - '", par$label)) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
system2("cat", args = paste0("tmp.txt ", pca_vals_file), stdout = par$pca_multiqc)
|
||||
|
||||
readLines(paste0(meta$resources_dir, "/deseq2_clustering_header.txt")) |>
|
||||
stringr::str_replace(pattern = "#id: 'deseq2_clustering'",
|
||||
replace = paste0("#id: '", par$label, "_deseq2_clustering'")) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
readLines(paste0("tmp.txt")) |>
|
||||
stringr::str_replace(pattern = "#section_name: 'DESeq2 sample similarity'",
|
||||
replace = paste0("#section_name: 'DESeq2 sample similarity - '", par$label)) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
system2("cat", args = paste0("tmp.txt ", sample_dist_file), stdout = par$sample_dists_multiqc)
|
||||
29
src/deseq2_qc/test.sh
Normal file
29
src/deseq2_qc/test.sh
Normal file
@@ -0,0 +1,29 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Run executable
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--counts $meta_resources_dir/counts.tsv \
|
||||
--id_col 1 \
|
||||
--sample_suffix '' \
|
||||
--outprefix deseq2 \
|
||||
--count_col 2 \
|
||||
--deseq2_output "deseq2/" \
|
||||
--pca_multiqc pca.vals_mqc.tsv \
|
||||
--sample_dists_multiqc sample.dists_mqc.tsv \
|
||||
--outdir deseq2
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Check whether output exists"
|
||||
|
||||
[ ! -d "deseq2" ] && echo "deseq2 was not created" && exit 1
|
||||
[ -z "$(ls -A 'deseq2')" ] && echo "deseq2 is empty" && exit 1
|
||||
[ ! -f "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv is empty" && exit 1
|
||||
[ ! -f "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
118
src/dupradar/config.vsh.yaml
Normal file
118
src/dupradar/config.vsh.yaml
Normal file
@@ -0,0 +1,118 @@
|
||||
name: "dupradar"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/dupradar.nf]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Assessment of duplication rates in RNA-Seq datasets
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--id"
|
||||
type: string
|
||||
description: Sample ID
|
||||
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: path to input alignment file in BAM format
|
||||
|
||||
- name: "--gtf_annotation"
|
||||
type: file
|
||||
required: true
|
||||
description: path to GTF annotation file.
|
||||
|
||||
- name: "--paired"
|
||||
type: boolean
|
||||
description: add flag if input alignment file consists of paired reads
|
||||
|
||||
- name: "--strandedness"
|
||||
type: string
|
||||
required: false
|
||||
choices: ["forward", "reverse", "unstranded"]
|
||||
description: strandedness of input bam file reads (forward, reverse or unstranded (default, applicable to paired reads))
|
||||
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_dupmatrix"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.dup_matrix.txt
|
||||
description: path to output file (txt) of duplicate tag counts
|
||||
|
||||
- name: "--output_dup_intercept_mqc"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.dup_intercept_mqc.txt
|
||||
description: path to output file (txt) of multiqc intercept value DupRadar
|
||||
|
||||
- name: "--output_duprate_exp_boxplot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.duprate_exp_boxplot.pdf
|
||||
description: path to output file (pdf) of distribution of expression box plot
|
||||
|
||||
- name: "--output_duprate_exp_densplot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.duprate_exp_densityplot.pdf
|
||||
description: path to output file (pdf) of 2D density scatter plot of duplicate tag counts
|
||||
|
||||
- name: "--output_duprate_exp_denscurve_mqc"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.duprate_exp_density_curve_mqc.txt
|
||||
description: path to output file (pdf) of density curve of gene duplication multiqc
|
||||
|
||||
- name: "--output_expression_histogram"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.expression_hist.pdf
|
||||
description: path to output file (pdf) of distribution of RPK values per gene histogram
|
||||
|
||||
- name: "--output_intercept_slope"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.intercept_slope.txt
|
||||
description: output file (txt) with progression of duplication rate value
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/dupradar.r
|
||||
- path: dupradar.r
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
|
||||
- path: /testData/unit_test_resources/genes.gtf
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ r-base ]
|
||||
- type: r
|
||||
bioc: [ dupRadar ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
154
src/dupradar/dupradar.r
Normal file
154
src/dupradar/dupradar.r
Normal file
@@ -0,0 +1,154 @@
|
||||
#!/usr/bin/env Rscript
|
||||
|
||||
# Command line argument processing
|
||||
args = commandArgs(trailingOnly=TRUE)
|
||||
if (length(args) < 5) {
|
||||
stop("Usage: dupRadar.r <input.bam> <sample_id> <annotation.gtf> <strandDirection:0=unstranded/1=forward/2=reverse> <paired/single> <nbThreads> <R-package-location (optional)>", call.=FALSE)
|
||||
}
|
||||
|
||||
message("paired_end is", args[5])
|
||||
message("the type is is", class(args[5]))
|
||||
|
||||
input_bam <- args[1]
|
||||
output_prefix <- args[2]
|
||||
annotation_gtf <- args[3]
|
||||
stranded <- as.numeric(args[4])
|
||||
paired_end <- ifelse(args[5] == "true", TRUE, FALSE)
|
||||
threads <- as.numeric(args[6])
|
||||
|
||||
bamRegex <- "(.+)\\.bam$"
|
||||
|
||||
if(!(grepl(bamRegex, input_bam) && file.exists(input_bam) && (!file.info(input_bam)$isdir))) stop("First argument '<input.bam>' must be an existing file (not a directory) with '.bam' extension...")
|
||||
if(!(file.exists(annotation_gtf) && (!file.info(annotation_gtf)$isdir))) stop("Third argument '<annotation.gtf>' must be an existing file (and not a directory)...")
|
||||
if(is.na(stranded) || (!(stranded %in% (0:2)))) stop("Fourth argument <strandDirection> must be a numeric value in 0(unstranded)/1(forward)/2(reverse)...")
|
||||
if(is.na(threads) || (threads<=0)) stop("Fifth argument <nbThreads> must be a strictly positive numeric value...")
|
||||
|
||||
# Debug messages (stderr)
|
||||
message("Input bam (Arg 1): ", input_bam)
|
||||
message("Output basename(Arg 2): ", output_prefix)
|
||||
message("Input gtf (Arg 3): ", annotation_gtf)
|
||||
message("Strandness (Arg 4): ", c("unstranded", "forward", "reverse")[stranded+1])
|
||||
message("paired_end (Arg 5): ", paired_end)
|
||||
message("Nb threads (Arg 6): ", threads)
|
||||
message("R package loc. (Arg 7): ", ifelse(length(args) > 4, args[5], "Not specified"))
|
||||
|
||||
|
||||
# Load / install packages
|
||||
if (length(args) > 5) { .libPaths( c( args[6], .libPaths() ) ) }
|
||||
if (!require("dupRadar")){
|
||||
source("http://bioconductor.org/biocLite.R")
|
||||
biocLite("dupRadar", suppressUpdates=TRUE)
|
||||
library("dupRadar")
|
||||
}
|
||||
if (!require("parallel")) {
|
||||
install.packages("parallel", dependencies=TRUE, repos='http://cloud.r-project.org/')
|
||||
library("parallel")
|
||||
}
|
||||
|
||||
# Duplicate stats
|
||||
dm <- analyzeDuprates(input_bam, annotation_gtf, stranded, paired_end, threads)
|
||||
write.table(dm, file=paste(output_prefix, "_dupMatrix.txt", sep=""), quote=F, row.name=F, sep="\t")
|
||||
|
||||
# 2D density scatter plot
|
||||
pdf(paste0(output_prefix, "_duprateExpDens.pdf"))
|
||||
duprateExpDensPlot(DupMat=dm)
|
||||
title("Density scatter plot")
|
||||
mtext(output_prefix, side=3)
|
||||
dev.off()
|
||||
fit <- duprateExpFit(DupMat=dm)
|
||||
cat(
|
||||
paste("- dupRadar Int (duprate at low read counts):", fit$intercept),
|
||||
paste("- dupRadar Sl (progression of the duplication rate):", fit$slope),
|
||||
fill=TRUE, labels=output_prefix,
|
||||
file=paste0(output_prefix, "_intercept_slope.txt"), append=FALSE
|
||||
)
|
||||
|
||||
# Create a multiqc file dupInt
|
||||
sample_name <- gsub("Aligned.sortedByCoord.out.markDups", "", output_prefix)
|
||||
line="#id: DupInt
|
||||
#plot_type: 'generalstats'
|
||||
#pconfig:
|
||||
# dupRadar_intercept:
|
||||
# title: 'dupInt'
|
||||
# namespace: 'DupRadar'
|
||||
# description: 'Intercept value from DupRadar'
|
||||
# max: 100
|
||||
# min: 0
|
||||
# scale: 'RdYlGn-rev'
|
||||
# format: '{:.2f}%'
|
||||
Sample dupRadar_intercept"
|
||||
|
||||
write(line,file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
|
||||
write(paste(sample_name, fit$intercept),file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
|
||||
|
||||
# Get numbers from dupRadar GLM
|
||||
curve_x <- sort(log10(dm$RPK))
|
||||
curve_y = 100*predict(fit$glm, data.frame(x=curve_x), type="response")
|
||||
# Remove all of the infinite values
|
||||
infs = which(curve_x %in% c(-Inf,Inf))
|
||||
curve_x = curve_x[-infs]
|
||||
curve_y = curve_y[-infs]
|
||||
# Reduce number of data points
|
||||
curve_x <- curve_x[seq(1, length(curve_x), 10)]
|
||||
curve_y <- curve_y[seq(1, length(curve_y), 10)]
|
||||
# Convert x values back to real counts
|
||||
curve_x = 10^curve_x
|
||||
# Write to file
|
||||
line="#id: dupradar
|
||||
#section_name: 'DupRadar'
|
||||
#section_href: 'bioconductor.org/packages/release/bioc/html/dupRadar.html'
|
||||
#description: \"provides duplication rate quality control for RNA-Seq datasets. Highly expressed genes can be expected to have a lot of duplicate reads, but high numbers of duplicates at low read counts can indicate low library complexity with technical duplication.
|
||||
# This plot shows the general linear models - a summary of the gene duplication distributions. \"
|
||||
#pconfig:
|
||||
# title: 'DupRadar General Linear Model'
|
||||
# xLog: True
|
||||
# xlab: 'expression (reads/kbp)'
|
||||
# ylab: '% duplicate reads'
|
||||
# ymax: 100
|
||||
# ymin: 0
|
||||
# tt_label: '<b>{point.x:.1f} reads/kbp</b>: {point.y:,.2f}% duplicates'
|
||||
# xPlotLines:
|
||||
# - color: 'green'
|
||||
# dashStyle: 'LongDash'
|
||||
# label:
|
||||
# style: {color: 'green'}
|
||||
# text: '0.5 RPKM'
|
||||
# verticalAlign: 'bottom'
|
||||
# y: -65
|
||||
# value: 0.5
|
||||
# width: 1
|
||||
# - color: 'red'
|
||||
# dashStyle: 'LongDash'
|
||||
# label:
|
||||
# style: {color: 'red'}
|
||||
# text: '1 read/bp'
|
||||
# verticalAlign: 'bottom'
|
||||
# y: -65
|
||||
# value: 1000
|
||||
# width: 1"
|
||||
|
||||
write(line,file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),append=TRUE)
|
||||
write.table(
|
||||
cbind(curve_x, curve_y),
|
||||
file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),
|
||||
quote=FALSE, row.names=FALSE, col.names=FALSE, append=TRUE,
|
||||
)
|
||||
|
||||
# Distribution of expression box plot
|
||||
pdf(paste0(output_prefix, "_duprateExpBoxplot.pdf"))
|
||||
duprateExpBoxplot(DupMat=dm)
|
||||
title("Percent Duplication by Expression")
|
||||
mtext(output_prefix, side=3)
|
||||
dev.off()
|
||||
|
||||
# Distribution of RPK values per gene
|
||||
pdf(paste0(output_prefix, "_expressionHist.pdf"))
|
||||
expressionHist(DupMat=dm)
|
||||
title("Distribution of RPK values per gene")
|
||||
mtext(output_prefix, side=3)
|
||||
dev.off()
|
||||
|
||||
# Print sessioninfo to standard out
|
||||
print(output_prefix)
|
||||
citation("dupRadar")
|
||||
sessionInfo()
|
||||
28
src/dupradar/script.sh
Normal file
28
src/dupradar/script.sh
Normal file
@@ -0,0 +1,28 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -exo pipefail
|
||||
|
||||
function num_strandness {
|
||||
if [ $par_strandedness == 'unstranded' ]; then echo 0
|
||||
elif [ $par_strandedness == 'forward' ]; then echo 1
|
||||
elif [ $par_strandedness == 'reverse' ]; then echo 2
|
||||
else echo "strandedness must be unstranded, forward or reverse." && \
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
Rscript "$meta_resources_dir/dupradar.r" \
|
||||
$par_input \
|
||||
$par_id \
|
||||
$par_gtf_annotation \
|
||||
$(num_strandness) \
|
||||
$par_paired \
|
||||
${meta_cpus:-1}
|
||||
|
||||
mv "$par_id"_dupMatrix.txt $par_output_dupmatrix
|
||||
mv "$par_id"_dup_intercept_mqc.txt $par_output_dup_intercept_mqc
|
||||
mv "$par_id"_duprateExpBoxplot.pdf $par_output_duprate_exp_boxplot
|
||||
mv "$par_id"_duprateExpDens.pdf $par_output_duprate_exp_densplot
|
||||
mv "$par_id"_duprateExpDensCurve_mqc.txt $par_output_duprate_exp_denscurve_mqc
|
||||
mv "$par_id"_expressionHist.pdf $par_output_expression_histogram
|
||||
mv "$par_id"_intercept_slope.txt $par_output_intercept_slope
|
||||
51
src/dupradar/test.sh
Normal file
51
src/dupradar/test.sh
Normal file
@@ -0,0 +1,51 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam"
|
||||
input_gtf="$meta_resources_dir/genes.gtf"
|
||||
|
||||
output_dupmatrix="dup_matrix.txt"
|
||||
output_dup_intercept_mqc="dup_intercept_mqc.txt"
|
||||
output_duprate_exp_boxplot="duprate_exp_boxplot.pdf"
|
||||
output_duprate_exp_densplot="duprate_exp_densityplot.pdf"
|
||||
output_duprate_exp_denscurve_mqc="duprate_exp_density_curve_mqc.pdf"
|
||||
output_expression_histogram="expression_hist.pdf"
|
||||
output_intercept_slope="intercept_slope.txt"
|
||||
|
||||
# Run executable
|
||||
echo "> Running $meta_functionality_name for unpaired reads, writing to tmpdir $tmpdir."
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--id "test" \
|
||||
--gtf_annotation "$input_gtf" \
|
||||
--strandedness "forward" \
|
||||
--paired false \
|
||||
--output_dupmatrix $output_dupmatrix \
|
||||
--output_dup_intercept_mqc $output_dup_intercept_mqc \
|
||||
--output_duprate_exp_boxplot $output_duprate_exp_boxplot \
|
||||
--output_duprate_exp_densplot $output_duprate_exp_densplot \
|
||||
--output_duprate_exp_denscurve_mqc $output_duprate_exp_denscurve_mqc \
|
||||
--output_expression_histogram $output_expression_histogram \
|
||||
--output_intercept_slope $output_intercept_slope
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> asserting output has been created for paired read input"
|
||||
[ ! -f "$output_dupmatrix" ] && echo "$output_dupmatrix was not created" && exit 1
|
||||
[ ! -s "$output_dupmatrix" ] && echo "$output_dupmatrix is empty" && exit 1
|
||||
[ ! -f "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc was not created" && exit 1
|
||||
[ ! -s "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc is empty" && exit 1
|
||||
[ ! -f "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot was not created" && exit 1
|
||||
[ ! -s "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot is empty" && exit 1
|
||||
[ ! -f "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot was not created" && exit 1
|
||||
[ ! -s "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot is empty" && exit 1
|
||||
[ ! -f "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc was not created" && exit 1
|
||||
[ ! -s "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc is empty" && exit 1
|
||||
[ ! -f "$output_expression_histogram" ] && echo "$output_expression_histogram was not created" && exit 1
|
||||
[ ! -s "$output_expression_histogram" ] && echo "$output_expression_histogram is empty" && exit 1
|
||||
[ ! -f "$output_intercept_slope" ] && echo "$output_intercept_slope was not created" && exit 1
|
||||
[ ! -s "$output_intercept_slope" ] && echo "$output_intercept_slope is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
57
src/getchromsizes/config.vsh.yaml
Normal file
57
src/getchromsizes/config.vsh.yaml
Normal file
@@ -0,0 +1,57 @@
|
||||
name: "getchromsizes"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/custom/getchromsizes/main.nf, modules/nf-core/custom/getchromsizes/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Generates a FASTA file of chromosome sizes and a fasta index file.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
description: Genome fasta files
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--sizes"
|
||||
type: file
|
||||
direction: output
|
||||
description: File containing chromosome lengths
|
||||
- name: "--fai"
|
||||
type: file
|
||||
description: FASTA index file
|
||||
direction: output
|
||||
- name: "--gzi" # optional
|
||||
type: file
|
||||
description: Optional gzip index file for compressed inputs
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genome.fasta
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev libncurses5-dev curl bzip2 && \
|
||||
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
|
||||
tar -xjf samtools-1.18.tar.bz2 && \
|
||||
rm samtools-1.18.tar.bz2 && \
|
||||
cd samtools-1.18 && \
|
||||
./configure && \
|
||||
make && \
|
||||
make install
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
9
src/getchromsizes/script.sh
Executable file
9
src/getchromsizes/script.sh
Executable file
@@ -0,0 +1,9 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
filename="$(basename -- $par_fasta)"
|
||||
|
||||
samtools faidx $par_fasta
|
||||
cut -f 1,2 "$par_fasta.fai" > $par_sizes
|
||||
mv "$par_fasta.fai" $par_fai
|
||||
16
src/getchromsizes/test.sh
Normal file
16
src/getchromsizes/test.sh
Normal file
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo "Testing $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--sizes genome.fasta.sizes \
|
||||
--fai genome.fasta.fai
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "genome.fasta.sizes" ] && echo "Chromosome lengths file does not exist!" && exit 1
|
||||
[ ! -s "genome.fasta.sizes" ] && echo "Chromosome lengths file is empty!" && exit 1
|
||||
[ ! -f "genome.fasta.fai" ] && echo "FASTA index file does not exist!" && exit 1
|
||||
[ ! -s "genome.fasta.fai" ] && echo "FASTA index file does is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
45
src/gtf2bed/config.vsh.yaml
Normal file
45
src/gtf2bed/config.vsh.yaml
Normal file
@@ -0,0 +1,45 @@
|
||||
name: "gtf2bed"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/gtf2bed.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: |
|
||||
Create BED annotation file from GTF.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
required: true
|
||||
description: A reference file in GTF format.
|
||||
|
||||
- name: " Output"
|
||||
arguments:
|
||||
- name: "--bed_output"
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: BED file resulting from the conversion of the GTF input file.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/gtf2bed
|
||||
- path: gtf2bed.pl
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genes.gtf.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [perl]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
122
src/gtf2bed/gtf2bed.pl
Executable file
122
src/gtf2bed/gtf2bed.pl
Executable file
@@ -0,0 +1,122 @@
|
||||
#!/usr/bin/env perl
|
||||
|
||||
# Copyright (c) 2011 Erik Aronesty (erik@q32.com)
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
#
|
||||
# The above copyright notice and this permission notice shall be included in
|
||||
# all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
||||
# THE SOFTWARE.
|
||||
#
|
||||
# ALSO, IT WOULD BE NICE IF YOU LET ME KNOW YOU USED IT.
|
||||
|
||||
use Getopt::Long;
|
||||
|
||||
my $extended;
|
||||
GetOptions("x"=>\$extended);
|
||||
|
||||
$in = shift @ARGV;
|
||||
|
||||
my $in_cmd =($in =~ /\.gz$/ ? "gunzip -c $in|" : $in =~ /\.zip$/ ? "unzip -p $in|" : "$in") || die "Can't open $in: $!\n";
|
||||
open IN, $in_cmd;
|
||||
|
||||
while (<IN>) {
|
||||
$gff = 2 if /^##gff-version 2/;
|
||||
$gff = 3 if /^##gff-version 3/;
|
||||
next if /^#/ && $gff;
|
||||
|
||||
s/\s+$//;
|
||||
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
|
||||
my @f = split /\t/;
|
||||
if ($gff) {
|
||||
# most ver 2's stick gene names in the id field
|
||||
($id) = $f[8]=~ /\bID="([^"]+)"/;
|
||||
# most ver 3's stick unquoted names in the name field
|
||||
($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
|
||||
} else {
|
||||
($id) = $f[8]=~ /transcript_id "([^"]+)"/;
|
||||
}
|
||||
|
||||
next unless $id && $f[0];
|
||||
|
||||
if ($f[2] eq 'exon') {
|
||||
die "no position at exon on line $." if ! $f[3];
|
||||
# gff3 puts :\d in exons sometimes
|
||||
$id =~ s/:\d+$// if $gff == 3;
|
||||
push @{$exons{$id}}, \@f;
|
||||
# save lowest start
|
||||
$trans{$id} = \@f if !$trans{$id};
|
||||
} elsif ($f[2] eq 'start_codon') {
|
||||
#optional, output codon start/stop as "thick" region in bed
|
||||
$sc{$id}->[0] = $f[3];
|
||||
} elsif ($f[2] eq 'stop_codon') {
|
||||
$sc{$id}->[1] = $f[4];
|
||||
} elsif ($f[2] eq 'miRNA' ) {
|
||||
$trans{$id} = \@f if !$trans{$id};
|
||||
push @{$exons{$id}}, \@f;
|
||||
}
|
||||
}
|
||||
|
||||
for $id (
|
||||
# sort by chr then pos
|
||||
sort {
|
||||
$trans{$a}->[0] eq $trans{$b}->[0] ?
|
||||
$trans{$a}->[3] <=> $trans{$b}->[3] :
|
||||
$trans{$a}->[0] cmp $trans{$b}->[0]
|
||||
} (keys(%trans)) ) {
|
||||
my ($chr, undef, undef, undef, undef, undef, $dir, undef, $attr, undef, $cds, $cde) = @{$trans{$id}};
|
||||
my ($cds, $cde);
|
||||
($cds, $cde) = @{$sc{$id}} if $sc{$id};
|
||||
|
||||
# sort by pos
|
||||
my @ex = sort {
|
||||
$a->[3] <=> $b->[3]
|
||||
} @{$exons{$id}};
|
||||
|
||||
my $beg = $ex[0][3];
|
||||
my $end = $ex[-1][4];
|
||||
|
||||
if ($dir eq '-') {
|
||||
# swap
|
||||
$tmp=$cds;
|
||||
$cds=$cde;
|
||||
$cde=$tmp;
|
||||
$cds -= 2 if $cds;
|
||||
$cde += 2 if $cde;
|
||||
}
|
||||
|
||||
# not specified, just use exons
|
||||
$cds = $beg if !$cds;
|
||||
$cde = $end if !$cde;
|
||||
|
||||
# adjust start for bed
|
||||
--$beg; --$cds;
|
||||
|
||||
my $exn = @ex; # exon count
|
||||
my $exst = join ",", map {$_->[3]-$beg-1} @ex; # exon start
|
||||
my $exsz = join ",", map {$_->[4]-$_->[3]+1} @ex; # exon size
|
||||
|
||||
my $gene_id;
|
||||
my $extend = "";
|
||||
if ($extended) {
|
||||
($gene_id) = $attr =~ /gene_name "([^"]+)"/;
|
||||
($gene_id) = $attr =~ /gene_id "([^"]+)"/ unless $gene_id;
|
||||
$extend="\t$gene_id";
|
||||
}
|
||||
# added an extra comma to make it look exactly like ucsc's beds
|
||||
print "$chr\t$beg\t$end\t$id\t0\t$dir\t$cds\t$cde\t0\t$exn\t$exsz,\t$exst,$extend\n";
|
||||
}
|
||||
|
||||
close IN;
|
||||
5
src/gtf2bed/script.sh
Executable file
5
src/gtf2bed/script.sh
Executable file
@@ -0,0 +1,5 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
perl "$meta_resources_dir/gtf2bed.pl" $par_gtf > $par_bed_output
|
||||
15
src/gtf2bed/test.sh
Normal file
15
src/gtf2bed/test.sh
Normal file
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/genes.gtf.gz"
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--gtf "$meta_resources_dir/genes.gtf" \
|
||||
--bed_output genes.bed
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "genes.bed" ] && echo "BED output file does not exist!" && exit 1
|
||||
[ ! -s "genes.bed" ] && echo "BED output file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
45
src/gtf_filter/config.vsh.yaml
Normal file
45
src/gtf_filter/config.vsh.yaml
Normal file
@@ -0,0 +1,45 @@
|
||||
name: "gtf_filter"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/gtf_filter.nf]
|
||||
last_sha: 1c6012ecbb087014ea4b8f0f3d39b874850277a8
|
||||
description: |
|
||||
Filters a GTF file based on sequence names in a FASTA file.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
description: Genome fasta file
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
description: GTF file
|
||||
- name: "--skip_transcript_id_check"
|
||||
type: boolean_true
|
||||
description: Skip checking for transcript IDs in the GTF file.
|
||||
|
||||
- name: " Output"
|
||||
arguments:
|
||||
- name: "--filtered_gtf"
|
||||
type: file
|
||||
direction: output
|
||||
description: Filtered GTF file containing only sequences in the FASTA file
|
||||
|
||||
resources:
|
||||
- type: python_script
|
||||
path: script.py
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genome.fasta
|
||||
- path: /testData/minimal_test/reference/genes.gtf.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: python
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
47
src/gtf_filter/script.py
Normal file
47
src/gtf_filter/script.py
Normal file
@@ -0,0 +1,47 @@
|
||||
# Adapted from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/filter_gtf.py
|
||||
|
||||
import os
|
||||
import sys
|
||||
import re
|
||||
import statistics
|
||||
from typing import Set
|
||||
|
||||
def extract_fasta_seq_names(fasta_name: str) -> Set[str]:
|
||||
"""Extracts the sequence names from a FASTA file."""
|
||||
with open(fasta_name) as fasta:
|
||||
return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")}
|
||||
|
||||
def tab_delimited(file: str) -> float:
|
||||
"""Check if file is tab-delimited and return median number of tabs."""
|
||||
with open(file, "r") as f:
|
||||
data = f.read(102400)
|
||||
return statistics.median(line.count("\t") for line in data.split("\n"))
|
||||
|
||||
def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_id_check: bool) -> None:
|
||||
"""Filter GTF file based on FASTA sequence names."""
|
||||
if tab_delimited(gtf_in) != 8:
|
||||
raise ValueError("Invalid GTF file: Expected 9 tab-separated columns.")
|
||||
seq_names_in_genome = extract_fasta_seq_names(fasta)
|
||||
print(f"Extracted chromosome sequence names from {fasta}")
|
||||
print("All sequence IDs from FASTA: " + ", ".join(sorted(seq_names_in_genome)))
|
||||
seq_names_in_gtf = set()
|
||||
try:
|
||||
with open(gtf_in) as gtf, open(filtered_gtf_out, "w") as out:
|
||||
line_count = 0
|
||||
for line in gtf:
|
||||
seq_name = line.split("\t")[0]
|
||||
seq_names_in_gtf.add(seq_name) # Add sequence name to the set
|
||||
if seq_name in seq_names_in_genome:
|
||||
if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
|
||||
out.write(line)
|
||||
line_count += 1
|
||||
if line_count == 0:
|
||||
raise ValueError("All GTF lines removed by filters")
|
||||
except IOError as e:
|
||||
print(f"File operation failed: {e}")
|
||||
return
|
||||
|
||||
print("All sequence IDs from GTF: " + ", ".join(sorted(seq_names_in_gtf)))
|
||||
print(f"Extracted {line_count} matching sequences from {gtf_in} into {filtered_gtf_out}")
|
||||
|
||||
filter_gtf(par["fasta"], par["gtf"], par["filtered_gtf"], par["skip_transcript_id_check"])
|
||||
16
src/gtf_filter/test.sh
Normal file
16
src/gtf_filter/test.sh
Normal file
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/genes.gtf.gz"
|
||||
|
||||
echo ">>>Testing $metat_functionality_name"
|
||||
"$meta_executable" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--gtf "$meta_resources_dir/genes.gtf" \
|
||||
--filtered_gtf filtered_genes.gtf
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "filtered_genes.gtf" ] && echo "Filtered GTF file does not exist!" && exit 1
|
||||
[ ! -s "filtered_genes.gtf" ] && echo "Filtered GTF file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
42
src/gunzip/config.vsh.yaml
Normal file
42
src/gunzip/config.vsh.yaml
Normal file
@@ -0,0 +1,42 @@
|
||||
name: "gunzip"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/gunzip/main.nf, modules/nf-core/gunzip/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Compress or uncompress a file or list of files.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: Path of file to be uncompressed
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: Decompressed file.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genes.gff.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ gzip ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
11
src/gunzip/script.sh
Executable file
11
src/gunzip/script.sh
Executable file
@@ -0,0 +1,11 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
filename="$(basename -- "$par_input")"
|
||||
|
||||
if [ ${filename##*.} == "gz" ]; then
|
||||
gunzip -c $par_input > $par_output
|
||||
else
|
||||
cat $par_input > $par_output
|
||||
fi
|
||||
22
src/gunzip/test.sh
Normal file
22
src/gunzip/test.sh
Normal file
@@ -0,0 +1,22 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input="$meta_resources_dir/genes.gff.gz"
|
||||
output="genes.gff"
|
||||
|
||||
# run executable and tests
|
||||
echo "> Running $meta_functionality_name."
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input" \
|
||||
--output "$output"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Checking whether output can be found and has content"
|
||||
|
||||
[ ! -f "$output" ] && echo "$output file missing" && exit 1
|
||||
[ ! -s "$output" ] && echo "$output file is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
11
src/multiqc_custom_biotype/biotypes_header.txt
Normal file
11
src/multiqc_custom_biotype/biotypes_header.txt
Normal file
@@ -0,0 +1,11 @@
|
||||
# id: 'biotype_counts'
|
||||
# section_name: 'Biotype Counts'
|
||||
# description: "shows reads overlapping genomic features of different biotypes,
|
||||
# counted by <a href='http://bioinf.wehi.edu.au/featureCounts'>featureCounts</a>."
|
||||
# plot_type: 'bargraph'
|
||||
# anchor: 'featurecounts_biotype'
|
||||
# pconfig:
|
||||
# id: "featurecounts_biotype_plot"
|
||||
# title: "featureCounts: Biotypes"
|
||||
# xlab: "# Reads"
|
||||
# cpswitch_counts_label: "Number of Reads"
|
||||
48
src/multiqc_custom_biotype/config.vsh.yaml
Normal file
48
src/multiqc_custom_biotype/config.vsh.yaml
Normal file
@@ -0,0 +1,48 @@
|
||||
name: "multiqc_custom_biotype"
|
||||
info:
|
||||
migration_info:
|
||||
description: Calculate features percentage for biotype counts
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--biocounts"
|
||||
type: file
|
||||
description: File with all biocounts
|
||||
- name: "--id"
|
||||
type: string
|
||||
description: Sample name
|
||||
default: $id
|
||||
- name: "--features"
|
||||
type: string
|
||||
description: Features to count
|
||||
default: rRNA
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: '--featurecounts_multiqc'
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.biotype_counts_mqc.tsv
|
||||
- name: '--featurecounts_rrna_multiqc'
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.biotype_counts_rrna_mqc.tsv
|
||||
|
||||
resources:
|
||||
# Adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/mqc_features_stat.py
|
||||
- type: python_script
|
||||
path: script.py
|
||||
# Include the header for the biotypes in the component
|
||||
- path: biotypes_header.txt
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: python:latest
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
91
src/multiqc_custom_biotype/script.py
Executable file
91
src/multiqc_custom_biotype/script.py
Executable file
@@ -0,0 +1,91 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
|
||||
# Create a logger
|
||||
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
|
||||
logger = logging.getLogger(__file__)
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
mqc_main = """#id: 'biotype-gs'
|
||||
#plot_type: 'generalstats'
|
||||
#pconfig:"""
|
||||
|
||||
mqc_pconf = """# percent_{ft}:
|
||||
# title: '% {ft}'
|
||||
# namespace: 'Biotype Counts'
|
||||
# description: '% reads overlapping {ft} features'
|
||||
# max: 100
|
||||
# min: 0
|
||||
# scale: 'RdYlGn-rev'
|
||||
# format: '{{:.2f}}%'"""
|
||||
|
||||
|
||||
def mqc_feature_stat(bfile, features, outfile, sname=None):
|
||||
# If sample name not given use file name
|
||||
if not sname:
|
||||
sname = os.path.splitext(os.path.basename(bfile))[0]
|
||||
|
||||
# Try to parse and read biocount file
|
||||
fcounts = {}
|
||||
try:
|
||||
with open(bfile, "r") as bfl:
|
||||
for ln in bfl:
|
||||
if ln.startswith("#"):
|
||||
continue
|
||||
ft, cn = ln.strip().split("\t")
|
||||
fcounts[ft] = float(cn)
|
||||
except:
|
||||
logger.error("Trouble reading the biocount file {}".format(bfile))
|
||||
return
|
||||
|
||||
total_count = sum(fcounts.values())
|
||||
if total_count == 0:
|
||||
logger.error("No biocounts found, exiting")
|
||||
return
|
||||
|
||||
# Calculate percentage for each requested feature
|
||||
fpercent = {f: (fcounts[f] / total_count) * 100 if f in fcounts else 0 for f in features}
|
||||
if len(fpercent) == 0:
|
||||
logger.error("Any of given features '{}' not found in the biocount file".format(", ".join(features), bfile))
|
||||
return
|
||||
|
||||
# Prepare the output strings
|
||||
out_head, out_value, out_mqc = ("Sample", "'{}'".format(sname), mqc_main)
|
||||
for ft, pt in fpercent.items():
|
||||
out_head = "{}\tpercent_{}".format(out_head, ft)
|
||||
out_value = "{}\t{}".format(out_value, pt)
|
||||
out_mqc = "{}\n{}".format(out_mqc, mqc_pconf.format(ft=ft))
|
||||
|
||||
# Write the output to a file
|
||||
with open(outfile, "w") as ofl:
|
||||
out_final = "\n".join([out_mqc, out_head, out_value]).strip()
|
||||
ofl.write(out_final + "\n")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
# Read the biotypes_header.txt file
|
||||
biotypes_header_path = os.path.join(meta["resources_dir"], 'biotypes_header.txt')
|
||||
with open(biotypes_header_path, 'r') as header_file:
|
||||
biotypes_header = header_file.read()
|
||||
|
||||
# Extract specific columns (1 and 7) and skip the first two lines
|
||||
filtered_lines = []
|
||||
with open(par["biocounts"], 'r') as biocounts_file:
|
||||
for i, line in enumerate(biocounts_file):
|
||||
if i >= 2: # Skipping first two lines
|
||||
columns = line.strip().split('\t')
|
||||
filtered_line = f"{columns[0]}\t{columns[6]}" # Columns 1 and 7 (0-indexed)
|
||||
filtered_lines.append(filtered_line)
|
||||
|
||||
# Concatenate the header and the processed lines
|
||||
result = biotypes_header + '\n'.join(filtered_lines) + '\n'
|
||||
|
||||
# Write the result to par_featurecounts_multiqc
|
||||
with open(par["featurecounts_multiqc"], 'w') as output_file:
|
||||
output_file.write(result)
|
||||
|
||||
mqc_feature_stat(par["featurecounts_multiqc"], par["features"], par["featurecounts_rrna_multiqc"], par["id"])
|
||||
36
src/multiqc_custom_biotype/test.sh
Normal file
36
src/multiqc_custom_biotype/test.sh
Normal file
@@ -0,0 +1,36 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "> Prepare test data"
|
||||
|
||||
cat > "test.featurecounts.txt" << HERE
|
||||
# Program:featureCounts v2.0.1; Command:"featureCounts" "-t" "CDS" "-T" "2" "-a" "genome.gtf" "-s" "1" "-o" "test.featureCounts.txt" "test.single_end.bam"
|
||||
Geneid Chr Start End Strand Length test.single_end.bam
|
||||
orf1ab MT192765.1;MT192765.1 259;13461 13461;21545 +;+ 21287 38
|
||||
S MT192765.1 21556 25374 + 3819 4
|
||||
ORF3a MT192765.1 25386 26210 + 825 0
|
||||
E MT192765.1 26238 26462 + 225 1
|
||||
M MT192765.1 26516 27181 + 666 1
|
||||
ORF6 MT192765.1 27195 27377 + 183 0
|
||||
ORF7a MT192765.1 27387 27749 + 363 0
|
||||
ORF7b MT192765.1 27749 27877 + 129 0
|
||||
ORF8 MT192765.1 27887 28249 + 363 0
|
||||
N MT192765.1 28267 29523 + 1257 2
|
||||
ORF10 MT192765.1 29551 29664 + 114 0
|
||||
HERE
|
||||
|
||||
echo "> Run test"
|
||||
"$meta_executable" \
|
||||
--biocounts "test.featurecounts.txt" \
|
||||
--id "test" \
|
||||
--featurecounts_multiqc "test.biotype_counts_mqc.tsv" \
|
||||
--featurecounts_rrna_multiqc "test.biotype_counts_rrna_mqc.tsv"
|
||||
|
||||
echo "> Check results"
|
||||
[ ! -f "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv is empty" && exit 1
|
||||
[ ! -f "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
69
src/picard_markduplicates/config.vsh.yaml
Normal file
69
src/picard_markduplicates/config.vsh.yaml
Normal file
@@ -0,0 +1,69 @@
|
||||
name: "picard_markduplicates"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/picard/markduplicates/main.nf, modules/nf-core/picard/markduplicates/meta.yml]
|
||||
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
|
||||
description: |
|
||||
Locate and tag duplicate reads in a BAM file
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--bam"
|
||||
type: file
|
||||
description: Input BAM file
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
description: Reference genome FASTA file
|
||||
- name: "--fai"
|
||||
type: file
|
||||
description: Reference genome FASTA index
|
||||
- name: "--extra_picard_args"
|
||||
type: string
|
||||
description: Additional argument to be passed to Picard MarkDuplicates
|
||||
default: '--ASSUME_SORTED true --REMOVE_DUPLICATES false --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_bam"
|
||||
type: file
|
||||
direction: output
|
||||
description: BAM file with duplicate reads marked/removed
|
||||
default: $id.MarkDuplicates.bam
|
||||
- name: "--bai"
|
||||
type: file
|
||||
direction: output
|
||||
description: An optional BAM index file. If desired, --CREATE_INDEX must be passed as a flag
|
||||
default: $id.MarkDuplicates.bam.bai
|
||||
must_exist: false
|
||||
- name: "--metrics"
|
||||
type: file
|
||||
direction: output
|
||||
description: Duplicate metrics file generated by picard
|
||||
default: $id.MarkDuplicates.metrics.txt
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/genome.fasta
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y build-essential openjdk-17-jdk wget && \
|
||||
wget --no-check-certificate https://github.com/broadinstitute/picard/releases/download/3.1.1/picard.jar && \
|
||||
mv picard.jar /usr/local/bin
|
||||
env: [ PICARD=/usr/local/bin/picard.jar ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
17
src/picard_markduplicates/script.sh
Executable file
17
src/picard_markduplicates/script.sh
Executable file
@@ -0,0 +1,17 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
avail_mem=3072
|
||||
if [ ! $meta_memory_mb ]; then
|
||||
echo '[Picard MarkDuplicates] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.'
|
||||
else
|
||||
avail_mem=$(( $meta_memory_mb*0.8 ))
|
||||
fi
|
||||
|
||||
java -Xmx${avail_mem}M -jar $PICARD MarkDuplicates \
|
||||
$par_extra_picard_args \
|
||||
--INPUT $par_bam \
|
||||
--OUTPUT $par_output_bam \
|
||||
--REFERENCE_SEQUENCE $par_fasta \
|
||||
--METRICS_FILE $par_metrics
|
||||
19
src/picard_markduplicates/test.sh
Normal file
19
src/picard_markduplicates/test.sh
Normal file
@@ -0,0 +1,19 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--bam "$meta_resources_dir/test.paired_end.sorted.bam" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--extra_picard_args "--REMOVE_DUPLICATES false" \
|
||||
--output_bam "test.MarkDuplicates.genome.bam" \
|
||||
--metrics "test.MarkDuplicates.metrics.txt"
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file does not exist!" && exit 1
|
||||
[ ! -s "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file is empty!" && exit 1
|
||||
[ ! -f "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file does not exist!" && exit 1
|
||||
[ ! -s "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
146
src/prepare_multiqc_input/config.vsh.yaml
Normal file
146
src/prepare_multiqc_input/config.vsh.yaml
Normal file
@@ -0,0 +1,146 @@
|
||||
name: "prepare_multiqc_input"
|
||||
description: |
|
||||
Prepare directory with all the input files for MultiQC.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fail_trimming_multiqc"
|
||||
type: string
|
||||
- name: "--fail_mapping_multiqc"
|
||||
type: string
|
||||
- name: "--fail_strand_multiqc"
|
||||
type: string
|
||||
- name: "--fastqc_raw_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--fastqc_trim_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--trim_log_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--sortmerna_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--star_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
# - name: "--hisat2_multiqc"
|
||||
# type: file
|
||||
# - name: "--rsem_multiqc"
|
||||
# type: file
|
||||
- name: "--salmon_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--samtools_stats"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--samtools_flagstat"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--samtools_idxstats"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--markduplicates_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--pseudo_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--featurecounts_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--featurecounts_rrna_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--aligner_pca_multiqc"
|
||||
type: file
|
||||
- name: "--aligner_clustering_multiqc"
|
||||
type: file
|
||||
- name: "--pseudo_aligner_pca_multiqc"
|
||||
type: file
|
||||
- name: "--pseudo_aligner_clustering_multiqc"
|
||||
type: file
|
||||
- name: "--preseq_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--qualimap_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--dupradar_output_dup_intercept_mqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--dupradar_output_duprate_exp_denscurve_mqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--bamstat_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--inferexperiment_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--innerdistance_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--junctionannotation_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--junctionsaturation_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--readdistribution_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--readduplication_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--tin_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--multiqc_config"
|
||||
type: file
|
||||
|
||||
|
||||
- name: "Ouput"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
default: multiqc_input
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
73
src/prepare_multiqc_input/script.sh
Normal file
73
src/prepare_multiqc_input/script.sh
Normal file
@@ -0,0 +1,73 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
mkdir -p $par_output
|
||||
|
||||
echo $par_fail_trimming_multiqc > $par_output/fail_trimming_mqc.tsv
|
||||
echo $par_fail_mapping_multiqc > $par_output/fail_mapping_mqc.tsv
|
||||
echo $par_fail_strand_multiqc > $par_output/fail_strand_mqc.tsv
|
||||
|
||||
IFS="," read -ra fastqc_raw_multiqc <<< $par_fastqc_raw_multiqc && for file in "${fastqc_raw_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra fastqc_trim_multiqc <<< $par_fastqc_trim_multiqc && for file in "${fastqc_trim_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra trim_log_multiqc <<< $par_trim_log_multiqc && for file in "${trim_log_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra sortmerna_multiqc <<< $par_sortmerna_multiqc && for file in "${sortmerna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra star_multiqc <<< $par_star_multiqc && for file in "${star_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
# IFS="," read -ra hisat2_multiqc <<< $par_hisat2_multiqc && for file in "${hisat2_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra rsem_multiqc <<< $par_rsem_multiqc && for file in "${rsem_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra salmon_multiqc <<< $par_salmon_multiqc && for file in "${salmon_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra pseudo_multiqc <<< $par_pseudo_multiqc && for file in "${pseudo_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra samtools_stats <<< $par_samtools_stats && for file in "${samtools_stats[@]}"; do [ -e "$file" ] && cp -r "$file" $par_output/; done
|
||||
|
||||
IFS="," read -ra samtools_flagstat <<< $par_samtools_flagstat && for file in "${samtools_flagstat[@]}"; do [ -e "$file" ] && cp -r "$file" $par_output/; done
|
||||
|
||||
IFS="," read -ra samtools_idxstats <<< $par_samtools_idxstats && for file in "${samtools_idxstats[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra markduplicates_multiqc <<< $par_markduplicates_multiqc && for file in "${markduplicates_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra featurecounts_multiqc <<< $par_featurecounts_multiqc && for file in "${featurecounts_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra featurecounts_rrna_multiqc <<< $par_featurecounts_rrna_multiqc && for file in "${featurecounts_rrna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
[ -e "$par_aligner_pca_multiqc" ] && cp -r "$par_aligner_pca_multiqc" "$par_output/"
|
||||
|
||||
[ -e "$par_aligner_clustering_multiqc" ] && cp -r $par_aligner_clustering_multiqc "$par_output/"
|
||||
|
||||
[ -e "$par_pseudo_aligner_pca_multiqc" ] && cp -r $par_pseudo_aligner_pca_multiqc "$par_output/"
|
||||
|
||||
[ -e "$par_pseudo_aligner_clustering_multiqc" ] && cp -r $par_pseudo_aligner_clustering_multiqc "$par_output/"
|
||||
|
||||
IFS="," read -ra preseq_multiqc <<< $par_preseq_multiqc && for file in "${preseq_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra qualimap_multiqc <<< $par_qualimap_multiqc && for file in "${qualimap_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra dupradar_output_dup_intercept_mqc <<< $par_dupradar_output_dup_intercept_mqc && for file in "${dupradar_output_dup_intercept_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra dupradar_output_duprate_exp_denscurve_mqc <<< $par_dupradar_output_duprate_exp_denscurve_mqc && for file in "${dupradar_output_duprate_exp_denscurve_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra bamstat_multiqc <<< $par_bamstat_multiqc && for file in "${bamstat_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra inferexperiment_multiqc <<< $par_inferexperiment_multiqc && for file in "${inferexperiment_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra innerdistance_multiqc <<< $par_innerdistance_multiqc && for file in "${innerdistance_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra junctionannotation_multiqc <<< $par_junctionannotation_multiqc && for file in "${junctionannotation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra junctionsaturation_multiqc <<< $par_junctionsaturation_multiqc && for file in "${junctionsaturation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra readdistribution_multiqc <<< $par_readdistribution_multiqc && for file in "${readdistribution_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra readduplication_multiqc <<< $par_readduplication_multiqc && for file in "${readduplication_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra tin_multiqc <<< $par_tin_multiqc && for file in "${tin_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
[ -e "$par_multiqc_config" ] && cp -r $par_multiqc_config "$par_output/"
|
||||
40
src/preprocess_transcripts_fasta/config.vsh.yaml
Normal file
40
src/preprocess_transcripts_fasta/config.vsh.yaml
Normal file
@@ -0,0 +1,40 @@
|
||||
name: "preprocess_transcripts_fasta"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/preprocess_transcripts_fasta_gencode.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: |
|
||||
Process transcripts FASTA if GTF file is GENOCODE format
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--transcript_fasta"
|
||||
type: file
|
||||
required: true
|
||||
description: Path of transcripts FASTA file
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: Path of processed output FASTA file.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/transcriptome.fasta
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
11
src/preprocess_transcripts_fasta/script.sh
Normal file
11
src/preprocess_transcripts_fasta/script.sh
Normal file
@@ -0,0 +1,11 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
filename="$(basename -- "$par_transcript_fasta")"
|
||||
|
||||
if [ ${filename##*.} == "gz" ]; then
|
||||
zcat $par_transcript_fasta | cut -d "|" -f1 > $par_output
|
||||
else
|
||||
cat $par_transcript_fasta | cut -d "|" -f1 > $par_output
|
||||
fi
|
||||
14
src/preprocess_transcripts_fasta/test.sh
Normal file
14
src/preprocess_transcripts_fasta/test.sh
Normal file
@@ -0,0 +1,14 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--transcript_fasta "$meta_resources_dir/transcriptome.fasta" \
|
||||
--output "processed_transcriptome.fasta"
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "processed_transcriptome.fasta" ] && echo "Processed FASTA file does not exist!" && exit 1
|
||||
[ ! -s "processed_transcriptome.fasta" ] && echo "Processed FASTA file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
68
src/preseq_lcextrap/config.vsh.yaml
Normal file
68
src/preseq_lcextrap/config.vsh.yaml
Normal file
@@ -0,0 +1,68 @@
|
||||
name: "preseq_lcextrap"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/preseq/lcextrap/main.nf, modules/nf-core/preseq/lcextrap/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: Computing the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
description: Input genome BAM/BED file
|
||||
- name: "--extra_preseq_args"
|
||||
type: string
|
||||
- name: "--paired"
|
||||
type: boolean
|
||||
description: Paired-end reads?
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.lc_extrap.txt
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/a.sorted.bed
|
||||
- path: /testData/unit_test_resources/SRR1106616_5M_subset.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ curl, bzip2, build-essential, wget, gcc, autoconf, automake, make, libz-dev, libbz2-dev, zlib1g-dev, libncurses5-dev, libncursesw5-dev, liblzma-dev, pip ]
|
||||
- type: docker
|
||||
run: |
|
||||
cd /usr/bin && \
|
||||
wget --no-check-certificate https://github.com/smithlabcode/preseq/releases/download/v3.2.0/preseq-3.2.0.tar.gz && \
|
||||
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && \
|
||||
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
|
||||
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
|
||||
tar -xjf samtools-1.18.tar.bz2 && rm samtools-1.18.tar.bz2 && \
|
||||
tar -xzf preseq-3.2.0.tar.gz && rm preseq-3.2.0.tar.gz && \
|
||||
tar -vxjf htslib-1.9.tar.bz2 && rm htslib-1.9.tar.bz2 && \
|
||||
mv bedtools.static /usr/local/bin/bedtools && \
|
||||
chmod a+x /usr/local/bin/bedtools && \
|
||||
cd samtools-1.18 && \
|
||||
./configure && \
|
||||
make && \
|
||||
make install && \
|
||||
cd /usr/bin && cd htslib-1.9 && \
|
||||
make && \
|
||||
cd /usr/bin && cd preseq-3.2.0 && \
|
||||
mkdir build && cd build && \
|
||||
../configure && \
|
||||
make && make install && make HAVE_HTSLIB=1 all
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
29
src/preseq_lcextrap/script.sh
Normal file
29
src/preseq_lcextrap/script.sh
Normal file
@@ -0,0 +1,29 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
file=$(basename -- "$par_input")
|
||||
filename="${file%.*}"
|
||||
|
||||
if [ "${file##*.}" == "bam" ]; then
|
||||
samtools sort -o sorted_$filename.bam -n $par_input
|
||||
bedtools bamtobed -i sorted_$filename.bam > $filename.bed
|
||||
bedtools sort -i $filename.bed > sorted_$filename.bed
|
||||
elif [ "${file##*.}" == "bed" ]; then
|
||||
bedtools sort -i $par_input > sorted_$filename.bed
|
||||
else
|
||||
echo "Invalid input file format!"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if $par_paired; then
|
||||
paired="-pe"
|
||||
else
|
||||
paired=""
|
||||
fi
|
||||
|
||||
preseq lc_extrap \
|
||||
sorted_$filename.bed \
|
||||
$paired \
|
||||
$par_extra_preseq_args \
|
||||
-o $par_output
|
||||
28
src/preseq_lcextrap/test.sh
Normal file
28
src/preseq_lcextrap/test.sh
Normal file
@@ -0,0 +1,28 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
echo ">>> Testing with BAM input"
|
||||
"$meta_executable" \
|
||||
--paired false \
|
||||
--input "$meta_resources_dir/SRR1106616_5M_subset.bam" \
|
||||
--output lc_extrap.txt
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
|
||||
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
|
||||
|
||||
rm lc_extrap.txt
|
||||
|
||||
echo ">>> Testing with BED input"
|
||||
"$meta_executable" \
|
||||
--paired false \
|
||||
--input "$meta_resources_dir/a.sorted.bed" \
|
||||
--output lc_extrap.txt
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
|
||||
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
60
src/rsem_merge_counts/config.vsh.yaml
Normal file
60
src/rsem_merge_counts/config.vsh.yaml
Normal file
@@ -0,0 +1,60 @@
|
||||
name: "rsem_merge_counts"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/rsem_merge_counts/main.nf]
|
||||
last_sha: 311279532694ce7520164ce4d65a388c0cd11f60
|
||||
|
||||
description: |
|
||||
Merge the transcript quantification results obtained from rsem calculate-expression across all samples.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--counts_gene"
|
||||
type: file
|
||||
description: Expression counts on gene level (genes)
|
||||
- name: "--counts_transcripts"
|
||||
type: file
|
||||
description: Expression counts on transcript level (isoforms)
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--merged_gene_counts"
|
||||
type: file
|
||||
description: File containing gene counts across all samples.
|
||||
default: rsem.merged.gene_counts.tsv
|
||||
direction: output
|
||||
- name: "--merged_gene_tpm"
|
||||
type: file
|
||||
description: File containing gene TPM across all samples.
|
||||
default: rsem.merged.gene_tpm.tsv
|
||||
direction: output
|
||||
- name: "--merged_transcript_counts"
|
||||
type: file
|
||||
description: File containing transcript counts across all samples.
|
||||
default: rsem.merged.transcript_counts.tsv
|
||||
direction: output
|
||||
- name: "--merged_transcript_tpm"
|
||||
type: file
|
||||
description: File containing transcript TPM across all samples.
|
||||
default: rsem.merged.transcript_tpm.tsv
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
# test_resources:
|
||||
# - type: bash_script
|
||||
# path: test.sh
|
||||
# - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
|
||||
# - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
28
src/rsem_merge_counts/script.sh
Normal file
28
src/rsem_merge_counts/script.sh
Normal file
@@ -0,0 +1,28 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -ep pipefail
|
||||
|
||||
mkdir -p tmp/genes
|
||||
# cut -f 1,2 `ls $par_count_genes/*` | head -n 1` > gene_ids.txt
|
||||
for file_id in ${par_count_genes[*]}; do
|
||||
samplename=`basename $file_id | sed s/\\.genes.results\$//g`
|
||||
echo $samplename > tmp/genes/${samplename}.counts.txt
|
||||
cut -f 5 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.counts.txt
|
||||
echo $samplename > tmp/genes/${samplename}.tpm.txt
|
||||
cut -f 6 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.tpm.txt
|
||||
done
|
||||
|
||||
mkdir -p tmp/isoforms
|
||||
# cut -f 1,2 `ls $par_counts_transcripts/*` | head -n 1` > transcript_ids.txt
|
||||
for file_id in ${par_counts_transcripts[*]}; do
|
||||
samplename=`basename $file_id | sed s/\\.isoforms.results\$//g`
|
||||
echo $samplename > tmp/isoforms/${samplename}.counts.txt
|
||||
cut -f 5 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.counts.txt
|
||||
echo $samplename > tmp/isoforms/${samplename}.tpm.txt
|
||||
cut -f 6 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.tpm.txt
|
||||
done
|
||||
|
||||
paste gene_ids.txt tmp/genes/*.counts.txt > $par_merged_gene_counts
|
||||
paste gene_ids.txt tmp/genes/*.tpm.txt > $par_merged_gene_tpm
|
||||
paste transcript_ids.txt tmp/isoforms/*.counts.txt > $par_merged_transcript_counts
|
||||
paste transcript_ids.txt tmp/isoforms/*.tpm.txt > $par_merged_transcript_tpm
|
||||
108
src/rseqc/rseqc_junctionannotation/config.vsh.yaml
Normal file
108
src/rseqc/rseqc_junctionannotation/config.vsh.yaml
Normal file
@@ -0,0 +1,108 @@
|
||||
name: "rseqc_junctionannotation"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/junctionannotation/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Compare detected splice junctions to reference gene model.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: Reference gene model in bed format
|
||||
|
||||
- name: "--map_qual"
|
||||
type: integer
|
||||
required: false
|
||||
default: 30
|
||||
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
|
||||
min: 0
|
||||
|
||||
- name: "--min_intron"
|
||||
type: integer
|
||||
required: false
|
||||
default: 50
|
||||
min: 1
|
||||
description: Minimum intron length (bp), default = 50.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_log"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.log
|
||||
description: output log of junction annotation script
|
||||
|
||||
- name: "--output_plot_r"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation_plot.r
|
||||
description: r script to generate splice_junction and splice_events plot
|
||||
|
||||
- name: "--output_junction_bed"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.bed
|
||||
description: junction annotation file (bed format)
|
||||
|
||||
- name: "--output_junction_interact"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.Interact.bed
|
||||
description: interact file (bed format) of junctions. Can be uploaded to UCSC genome browser or converted to bigInteract (using bedToBigBed program) for visualization.
|
||||
|
||||
- name: "--output_junction_sheet"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.xls
|
||||
description: junction annotation file (xls format)
|
||||
|
||||
- name: "--output_splice_events_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.splice_events.pdf
|
||||
description: plot of splice events (pdf)
|
||||
|
||||
- name: "--output_splice_junctions_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.splice_junctions_plot.pdf
|
||||
description: plot of junctions (pdf)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed12
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ python3-pip, r-base]
|
||||
- type: python
|
||||
packages: [ RSeQC ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
20
src/rseqc/rseqc_junctionannotation/script.sh
Normal file
20
src/rseqc/rseqc_junctionannotation/script.sh
Normal file
@@ -0,0 +1,20 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix=$(openssl rand -hex 8)
|
||||
input="testData/unit_test_resources/test.paired_end.sorted.bam"
|
||||
refgene="testData/unit_test_resources/test.bed"
|
||||
junction_annotation.py \
|
||||
-i $par_input \
|
||||
-r $par_refgene \
|
||||
-o $prefix \
|
||||
-m $par_min_intron \
|
||||
-q $par_map_qual > $par_output_log
|
||||
|
||||
[[ -f "$prefix.junction.bed" ]] && mv $prefix.junction.bed $par_output_junction_bed
|
||||
[[ -f "$prefix.junction.Interact.bed" ]] && mv $prefix.junction.Interact.bed $par_output_junction_interact
|
||||
[[ -f "$prefix.junction.xls" ]] && mv $prefix.junction.xls $par_output_junction_sheet
|
||||
[[ -f "$prefix.junction_plot.r" ]] && mv $prefix.junction_plot.r $par_output_plot_r
|
||||
[[ -f "$prefix.splice_events.pdf" ]] && mv $prefix.splice_events.pdf $par_output_splice_events_plot
|
||||
[[ -f "$prefix.splice_junction.pdf" ]] && mv $prefix.splice_junction.pdf $par_output_splice_junctions_plot
|
||||
48
src/rseqc/rseqc_junctionannotation/test.sh
Normal file
48
src/rseqc/rseqc_junctionannotation/test.sh
Normal file
@@ -0,0 +1,48 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bed="$meta_resources_dir/test.bed12"
|
||||
|
||||
output_junction_bed="junction_annotation.bed"
|
||||
output_junction_interact="junction_annotation.Interact.bed"
|
||||
output_junction_sheet="junction_annotation.xls"
|
||||
output_plot_r="junction_annotation_plot.r"
|
||||
output_splice_events_plot="splice_events.pdf"
|
||||
output_splice_junctions_plot="splice_junctions_plot.pdf"
|
||||
output_log="junction_annotation.log"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--refgene "$input_bed" \
|
||||
--output_log "$output_log" \
|
||||
--output_plot_r "$output_plot_r" \
|
||||
--output_junction_bed "$output_junction_bed" \
|
||||
--output_junction_interact "$output_junction_interact" \
|
||||
--output_junction_sheet "$output_junction_sheet" \
|
||||
--output_splice_events_plot "$output_splice_events_plot" \
|
||||
--output_splice_junctions_plot "$output_splice_junctions_plot"
|
||||
|
||||
# exit_code=$?
|
||||
# [[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Check if all output files were created"
|
||||
|
||||
[ ! -f "$output_log" ] && echo "$output_log was not created" && exit 1
|
||||
[ ! -f "$output_junction_sheet" ] && echo "$output_junction_sheet was not created" && exit 1
|
||||
[ -s "$output_junction_sheet" ] && echo "$output_junction_sheet is not empty but should be" && exit 1
|
||||
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
|
||||
[ -s "$output_plot_r" ] && echo "$output_plot_r is not empty but should be" && exit 1
|
||||
# [ ! -f "$output_junction_bed" ] && echo "$output_junction_bed was not created" && exit 1
|
||||
# [ ! -s "$output_junction_bed" ] && echo "$output_junction_bed is empty" && exit 1
|
||||
# [ ! -f "$output_junction_interact" ] && echo "$output_junction_interact was not created" && exit 1
|
||||
# [ ! -s "$output_junction_interact" ] && echo "$output_junction_interact is empty" && exit 1
|
||||
# [ ! -f "$output_splice_events_plot" ] && echo "$output_splice_events_plot was not created" && exit 1
|
||||
# [ ! -s "$output_splice_events_plot" ] && echo "$output_splice_events_plot is empty" && exit 1
|
||||
# [ ! -f "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot was not created" && exit 1
|
||||
# [ ! -s "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
105
src/rseqc/rseqc_junctionsaturation/config.vsh.yaml
Normal file
105
src/rseqc/rseqc_junctionsaturation/config.vsh.yaml
Normal file
@@ -0,0 +1,105 @@
|
||||
name: "rseqc_junctionsaturation"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/junctionsaturation/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Compare detected splice junctions to reference gene model.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: Reference gene model in bed format
|
||||
|
||||
- name: "--sampling_percentile_lower_bound"
|
||||
type: integer
|
||||
required: false
|
||||
default: 5
|
||||
description: Sampling starts from this percentile, must be an integer between 0 and 100, default =5.
|
||||
min: 0
|
||||
max: 100
|
||||
|
||||
- name: "--sampling_percentile_upper_bound"
|
||||
type: integer
|
||||
required: false
|
||||
default: 100
|
||||
description: Sampling ends at this percentile, must be an integer between 0 and 100, default =5.
|
||||
min: 0
|
||||
max: 100
|
||||
|
||||
- name: "--sampling_percentile_step"
|
||||
type: integer
|
||||
required: false
|
||||
default: 5
|
||||
description: Sampling frequency in %. Smaller value means more sampling times. Must be an integer between 0 and 100, default = 5.
|
||||
min: 0
|
||||
max: 100
|
||||
|
||||
- name: "--min_intron"
|
||||
type: integer
|
||||
required: false
|
||||
default: 50
|
||||
min: 1
|
||||
description: Minimum intron length (bp), default = 50.
|
||||
|
||||
- name: "--min_splice_read"
|
||||
type: integer
|
||||
required: false
|
||||
default: 1
|
||||
min: 1
|
||||
description: Minimum number of supporting reads to call a junction, default = 1.
|
||||
|
||||
- name: "--map_qual"
|
||||
type: integer
|
||||
required: false
|
||||
default: 30
|
||||
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
|
||||
min: 0
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_plot_r"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_saturation_plot.r
|
||||
description: r script to generate junction_saturation_plot plot
|
||||
|
||||
- name: "--output_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_saturation_plot.pdf
|
||||
description: plot of junction saturation (pdf)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ python3-pip, r-base]
|
||||
- type: python
|
||||
packages: [ RSeQC ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
19
src/rseqc/rseqc_junctionsaturation/script.sh
Normal file
19
src/rseqc/rseqc_junctionsaturation/script.sh
Normal file
@@ -0,0 +1,19 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix=$(openssl rand -hex 8)
|
||||
|
||||
junction_saturation.py \
|
||||
-i $par_input \
|
||||
-r $par_refgene \
|
||||
-o $prefix \
|
||||
-l $par_sampling_percentile_lower_bound \
|
||||
-u $par_sampling_percentile_upper_bound \
|
||||
-s $par_sampling_percentile_step \
|
||||
-m $par_min_intron \
|
||||
-v $par_min_splice_read \
|
||||
-q $par_map_qual
|
||||
|
||||
[[ -f "$prefix.junctionSaturation_plot.pdf" ]] && mv $prefix.junctionSaturation_plot.pdf $par_output_plot
|
||||
[[ -f "$prefix.junctionSaturation_plot.r" ]] && mv $prefix.junctionSaturation_plot.r $par_output_plot_r
|
||||
30
src/rseqc/rseqc_junctionsaturation/test.sh
Normal file
30
src/rseqc/rseqc_junctionsaturation/test.sh
Normal file
@@ -0,0 +1,30 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/hg19_RefSeq.bed.gz"
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bed="$meta_resources_dir/test.bed"
|
||||
|
||||
output_plot="junction_saturation_plot.pdf"
|
||||
output_plot_r="junction_saturation_plot.r"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--refgene "$input_bed" \
|
||||
--output_plot_r "$output_plot_r" \
|
||||
--output_plot "$output_plot"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> asserting all output files were created"
|
||||
|
||||
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
|
||||
[ ! -s "$output_plot_r" ] && echo "$output_plot_r is empty" && exit 1
|
||||
[ ! -f "$output_plot" ] && echo "$output_plot was not created" && exit 1
|
||||
[ ! -s "$output_plot" ] && echo "$output_plot is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
52
src/rseqc/rseqc_readdistribution/config.vsh.yaml
Normal file
52
src/rseqc/rseqc_readdistribution/config.vsh.yaml
Normal file
@@ -0,0 +1,52 @@
|
||||
name: "rseqc_readdistribution"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/readdistribution/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Calculate how mapped reads are distributed over genomic features.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: Reference gene model in bed format
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.read_distribution.txt
|
||||
description: output file (txt) of read distribution analysis.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed12
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ python3-pip ]
|
||||
- type: python
|
||||
packages: [ RSeQC ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
8
src/rseqc/rseqc_readdistribution/script.sh
Normal file
8
src/rseqc/rseqc_readdistribution/script.sh
Normal file
@@ -0,0 +1,8 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
read_distribution.py \
|
||||
-i $par_input \
|
||||
-r $par_refgene \
|
||||
> $par_output
|
||||
24
src/rseqc/rseqc_readdistribution/test.sh
Normal file
24
src/rseqc/rseqc_readdistribution/test.sh
Normal file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bed="$meta_resources_dir/test.bed12"
|
||||
output="read_distribution.txt"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--refgene "$input_bed" \
|
||||
--output "$output"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Asserting output file was created"
|
||||
[ ! -f "$output" ] && echo "$output was not created" && exit 1
|
||||
[ ! -f "$output" ] && echo "$output is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
82
src/rseqc/rseqc_readduplication/config.vsh.yaml
Normal file
82
src/rseqc/rseqc_readduplication/config.vsh.yaml
Normal file
@@ -0,0 +1,82 @@
|
||||
name: "rseqc_readduplication"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/readduplication/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Calculate read duplication rate.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--read_count_upper_limit"
|
||||
type: integer
|
||||
required: false
|
||||
default: 500
|
||||
description: Upper limit of reads' occurence. Only used for plotting, default = 500 (times).
|
||||
min: 1
|
||||
|
||||
- name: "--map_qual"
|
||||
type: integer
|
||||
required: false
|
||||
default: 30
|
||||
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
|
||||
min: 0
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_duplication_rate_plot_r"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_plot.r
|
||||
description: R script for generating duplication rate plot
|
||||
|
||||
- name: "--output_duplication_rate_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_plot.pdf
|
||||
description: duplication rate plot (pdf)
|
||||
|
||||
- name: "--output_duplication_rate_mapping"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_mapping.xls
|
||||
description: Summary of mapping-based read duplication
|
||||
|
||||
- name: "--output_duplication_rate_sequence"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_sequencing.xls
|
||||
description: Summary of sequencing-based read duplication
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: "apt"
|
||||
packages: [python3-pip, r-base]
|
||||
- type: python
|
||||
packages: [RSeQC]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
16
src/rseqc/rseqc_readduplication/script.sh
Normal file
16
src/rseqc/rseqc_readduplication/script.sh
Normal file
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix=$(openssl rand -hex 8)
|
||||
|
||||
read_duplication.py \
|
||||
-i $par_input \
|
||||
-o $prefix \
|
||||
-u $par_read_count_upper_limit \
|
||||
-q $par_map_qual
|
||||
|
||||
[[ -f "$prefix.DupRate_plot.pdf" ]] && mv $prefix.DupRate_plot.pdf $par_output_duplication_rate_plot
|
||||
[[ -f "$prefix.DupRate_plot.r" ]] && mv $prefix.DupRate_plot.r $par_output_duplication_rate_plot_r
|
||||
[[ -f "$prefix.pos.DupRate.xls" ]] && mv $prefix.pos.DupRate.xls $par_output_duplication_rate_mapping
|
||||
[[ -f "$prefix.seq.DupRate.xls" ]] && mv $prefix.seq.DupRate.xls $par_output_duplication_rate_sequence
|
||||
25
src/rseqc/rseqc_readduplication/test.sh
Normal file
25
src/rseqc/rseqc_readduplication/test.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
|
||||
output_duplication_rate_plot_r="duplication_rate_plot.r"
|
||||
output_duplication_rate_plot="duplication_rate_plot.pdf"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--output_duplication_rate_plot_r "$output_duplication_rate_plot_r" \
|
||||
--output_duplication_rate_plot "$output_duplication_rate_plot"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> asserting all output files were created"
|
||||
[ ! -f "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r was not created" && exit 1
|
||||
[ ! -s "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r is empty" && exit 1
|
||||
[ ! -f "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot was not created" && exit 1
|
||||
[ ! -s "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
85
src/rseqc/tin/config.vsh.yaml
Normal file
85
src/rseqc/tin/config.vsh.yaml
Normal file
@@ -0,0 +1,85 @@
|
||||
name: "rseqc_tin"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/tin/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Calculte TIN (transcript integrity number) from RNA-seq reads
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--bam_input"
|
||||
type: file
|
||||
required: true
|
||||
description: Path to input alignment file in BAM or SAM format.
|
||||
|
||||
- name: "--bai_input"
|
||||
type: file
|
||||
required: true
|
||||
description: Path to bam index file in bai format.
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: BED file containing the reference gene model
|
||||
|
||||
- name: "--minimum_coverage"
|
||||
type: integer
|
||||
required: false
|
||||
default: 10
|
||||
min: 1
|
||||
description: Minimum number of reads mapped to a transcript, default = 10.
|
||||
|
||||
- name: "--sample_size"
|
||||
type: integer
|
||||
required: false
|
||||
default: 100
|
||||
min: 1
|
||||
description: Number of equal-spaced nucleotide positions picked from mRNA. Note, if this number is larger than the length of mRNA (L), it will be halved until it's smaller than L (default = 100)
|
||||
|
||||
- name: "--subtract_background"
|
||||
type: boolean_true
|
||||
description: Set flag to subtract background noise (estimated from intronic reads). Only use this option if there are substantial intronic reads.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_tin_summary"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.tin_summary.txt
|
||||
description: summary statistics (txt) of calculated TIN metrics
|
||||
|
||||
- name: "--output_tin"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.tin.xls
|
||||
description: file with TIN metrics (xls)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam.bai
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed12
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: "apt"
|
||||
packages: [python3-pip]
|
||||
- type: docker
|
||||
run: |
|
||||
pip3 install RSeQC
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
29
src/rseqc/tin/script.sh
Normal file
29
src/rseqc/tin/script.sh
Normal file
@@ -0,0 +1,29 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
bam_file="$(basename -- $par_bam_input)"
|
||||
bai_file="$(basename -- $par_bai_input)"
|
||||
|
||||
echo "$bam_file"
|
||||
echo "$bai_file"
|
||||
|
||||
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXXXX")
|
||||
function clean_up {
|
||||
rm -rf "$tmpdir"
|
||||
}
|
||||
|
||||
cp $par_bam_input $tmpdir/$bam_file
|
||||
cp $par_bai_input $tmpdir/$bai_file
|
||||
|
||||
tin.py \
|
||||
-i $tmpdir/$bam_file \
|
||||
-r $par_refgene \
|
||||
-c $par_minimum_coverage \
|
||||
-n $par_sample_size \
|
||||
-s $par_subtract_background
|
||||
|
||||
[[ -f "${bam_file%.*}.summary.txt" ]] && mv ${bam_file%.*}.summary.txt $par_output_tin_summary
|
||||
[[ -f "${bam_file%.*}.tin.xls" ]] && mv ${bam_file%.*}.tin.xls $par_output_tin
|
||||
|
||||
clean_up
|
||||
33
src/rseqc/tin/test.sh
Normal file
33
src/rseqc/tin/test.sh
Normal file
@@ -0,0 +1,33 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/hg19_RefSeq.bed.gz"
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bai="$meta_resources_dir/test.paired_end.sorted.bam.bai"
|
||||
input_bed="$meta_resources_dir/test.bed12"
|
||||
|
||||
output_tin="tin.xls"
|
||||
output_tin_summary="tin_summary.txt"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--bam_input "$input_bam" \
|
||||
--bai_input "$input_bai" \
|
||||
--refgene "$input_bed" \
|
||||
--output_tin "$output_tin" \
|
||||
--output_tin_summary "$output_tin_summary"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Check if all output files were created"
|
||||
|
||||
[ ! -f $output_tin ] && echo "$output_tin was not created" && exit 1
|
||||
[ ! -s $output_tin ] && echo "$output_tin is empty" && exit 1
|
||||
[ ! -f $output_tin_summary ] && echo "$output_tin_summary was not created" && exit 1
|
||||
[ ! -s $output_tin_summary ] && echo "$output_tin_summary is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
65
src/sortmerna/config.vsh.yaml
Normal file
65
src/sortmerna/config.vsh.yaml
Normal file
@@ -0,0 +1,65 @@
|
||||
name: "sortmerna"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/sortmerna/main.nf, modules/nf-core/sortmerna/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Local sequence alignment tool for filtering, mapping and clustering. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data. SortMeRNA takes as input files of reads (fasta, fastq, fasta.gz, fastq.gz) and one or multiple rRNA database file(s), and sorts apart aligned and rejected reads into two files.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--paired"
|
||||
type: boolean
|
||||
description: Are the reads single-end or paired-end
|
||||
- name: "--input"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
description: Input fastq
|
||||
- name: "--ribo_database_manifest"
|
||||
type: file
|
||||
multiple: true
|
||||
description: Text file containing paths to fasta files (one per line) that will be used to create the database for SortMeRNA.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--sortmerna_log"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.sortmerna.log
|
||||
required: false
|
||||
must_exist: false
|
||||
description: Sortmerna log file.
|
||||
- name: "--fastq_1"
|
||||
type: file
|
||||
required: true
|
||||
description: Output file for read 1.
|
||||
direction: output
|
||||
default: $id.$key.read_1.fastq.gz
|
||||
- name: "--fastq_2"
|
||||
type: file
|
||||
required: false
|
||||
must_exist: false
|
||||
description: Output file for read 2.
|
||||
direction: output
|
||||
default: $id.$key.read_2.fastq.gz
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
|
||||
- path: /testData/minimal_test/reference/rRNA
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/sortmerna:4.3.6--h9ee0642_0
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
42
src/sortmerna/script.sh
Executable file
42
src/sortmerna/script.sh
Executable file
@@ -0,0 +1,42 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
IFS="," read -ra input <<< "$par_input"
|
||||
IFS=";" read -ra paths <<< "$par_ribo_database_manifest"
|
||||
refs=""
|
||||
for i in "${paths[@]}"
|
||||
do
|
||||
refs+="-ref $i "
|
||||
done
|
||||
|
||||
if [ "$par_paired" == "false" ]; then
|
||||
sortmerna \
|
||||
$refs \
|
||||
-reads ${input[0]} \
|
||||
--threads ${meta_cpus:-1} \
|
||||
--workdir . \
|
||||
--aligned rRNA_reads \
|
||||
--fastx \
|
||||
-num_alignments 1 \
|
||||
--other non_rRNA_reads
|
||||
mv non_rRNA_reads.f*q.gz "$par_fastq_1"
|
||||
else
|
||||
sortmerna \
|
||||
$refs \
|
||||
-reads ${input[0]} \
|
||||
--reads ${input[1]} \
|
||||
--threads ${meta_cpus:-1} \
|
||||
--workdir . \
|
||||
--aligned rRNA_reads \
|
||||
--fastx \
|
||||
--num_alignments 1 \
|
||||
--other non_rRNA_reads \
|
||||
--paired_in \
|
||||
--out2
|
||||
mv non_rRNA_reads_fwd.f*q.gz $par_fastq_1
|
||||
mv non_rRNA_reads_rev.f*q.gz $par_fastq_2
|
||||
fi
|
||||
|
||||
mv rRNA_reads.log $par_sortmerna_log
|
||||
|
||||
41
src/sortmerna/test.sh
Normal file
41
src/sortmerna/test.sh
Normal file
@@ -0,0 +1,41 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
# find $meta_resources_dir/rRNA -type f > rrna-db-defaults.txt
|
||||
|
||||
echo ">>> Testing for paired-end reads"
|
||||
"$meta_executable" \
|
||||
--paired true \
|
||||
--input $meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz \
|
||||
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
|
||||
--sortmerna_log SRR6357070_sortmerna.log \
|
||||
--fastq_1 SRR6357070_read_1.fastq.gz \
|
||||
--fastq_2 SRR6357070_read_2.fastq.gz
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[[ ! -f "SRR6357070_read_1.fastq.gz" ]] || [[ ! -f "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is missing!" && exit 1
|
||||
[[ ! -s "SRR6357070_read_1.fastq.gz" ]] || [[ ! -s "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is empty!" && exit 1
|
||||
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
|
||||
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
|
||||
|
||||
rm SRR6357070_read_1.fastq.gz SRR6357070_read_2.fastq.gz SRR6357070_sortmerna.log
|
||||
rm -rf kvdb/
|
||||
|
||||
echo ">>> Testing for single-end reads"
|
||||
"$meta_executable" \
|
||||
--paired false \
|
||||
--input $meta_resources_dir/SRR6357070_1.fastq.gz \
|
||||
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
|
||||
--sortmerna_log SRR6357070_sortmerna.log \
|
||||
--fastq_1 SRR6357070_read_1.fastq.gz
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[ ! -f "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is missing!" && exit 1
|
||||
[ ! -s "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is empty!" && exit 1
|
||||
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
|
||||
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
|
||||
|
||||
echo ">>> Test finished successfully"
|
||||
exit 0
|
||||
|
||||
70
src/stringtie/config.vsh.yaml
Normal file
70
src/stringtie/config.vsh.yaml
Normal file
@@ -0,0 +1,70 @@
|
||||
name: stringtie
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/stringtie/stringtie/main.nf, modules/nf-core/stringtie/stringtie/meta.yml]
|
||||
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
|
||||
description: |
|
||||
Transcript assembly and quantification for RNA-Seq
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--strandedness"
|
||||
type: string
|
||||
description: Forward or reverse strand?
|
||||
- name: "--bam"
|
||||
type: file
|
||||
- name: "--annotation_gtf"
|
||||
type: file
|
||||
- name: "--extra_stringtie_args"
|
||||
type: string
|
||||
description: Extra arguments for running StringTie
|
||||
- name: "--stringtie_ignore_gtf"
|
||||
type: boolean
|
||||
description: Perform reference-guided de novo assembly of transcripts using StringTie, i.e. don't restrict to those in GTF file.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--transcript_gtf"
|
||||
type: file
|
||||
default: $id.$key.transcripts.gtf
|
||||
direction: output
|
||||
- name: "--coverage_gtf"
|
||||
type: file
|
||||
default: $id.$key.coverage.gtf
|
||||
direction: output
|
||||
- name: "--abundance"
|
||||
type: file
|
||||
default: $id.$key.abundance.txt
|
||||
direction: output
|
||||
- name: "--ballgown"
|
||||
type: file
|
||||
description: for running ballgown
|
||||
default: $id.$key.ballgown
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/genes.gtf
|
||||
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y build-essential zlib1g wget && \
|
||||
wget --no-check-certificate https://github.com/gpertea/stringtie/releases/download/v2.2.1/stringtie-2.2.1.Linux_x86_64.tar.gz && \
|
||||
tar -xzf stringtie-2.2.1.Linux_x86_64.tar.gz && \
|
||||
cp stringtie-2.2.1.Linux_x86_64/stringtie /usr/local/bin/
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
25
src/stringtie/script.sh
Normal file
25
src/stringtie/script.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
if [ $par_strandedness=='forward' ]; then
|
||||
strand='--fr'
|
||||
elif [ $par_strandedness=='reverse' ]; then
|
||||
strand='--rf'
|
||||
fi
|
||||
|
||||
stringtie \
|
||||
$par_bam \
|
||||
$strand \
|
||||
${par_annotation_gtf:+-G $par_annotation_gtf} \
|
||||
-o $par_transcript_gtf \
|
||||
-A "abundance.txt" \
|
||||
${par_annotation_gtf:+-C "coverage.gtf"} \
|
||||
${par_annotation_gtf:+-b "ballgown"} \
|
||||
-p ${meta_cpus:-1} \
|
||||
$par_extra_stringtie_args \
|
||||
${par_stringtie_ignore_gtf:+-e}
|
||||
|
||||
mv coverage.gtf $par_coverage_gtf
|
||||
mv ballgown $par_ballgown
|
||||
mv abundance.txt $par_abundance
|
||||
26
src/stringtie/test.sh
Normal file
26
src/stringtie/test.sh
Normal file
@@ -0,0 +1,26 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--strandedness reverse \
|
||||
--bam $meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam \
|
||||
--annotation_gtf $meta_resources_dir/genes.gtf \
|
||||
--extra_stringtie_args "-v" \
|
||||
--transcript_gtf test.transcripts.gtf \
|
||||
--coverage_gtf test.coverage.gtf \
|
||||
--abundance test.abundance.txt \
|
||||
--ballgown test.ballgown
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[ ! -d "test.ballgown" ] && echo "Directory 'test.ballgown' does not exist!" && exit 1
|
||||
[ -z "$(ls -A 'test.ballgown')" ] && echo "Directory 'test.ballgown' is empty!" && exit 1
|
||||
[ ! -f "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' does not exist!" && exit 1
|
||||
[ ! -s "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' is empty!" && exit 1
|
||||
[ ! -f "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' does not exist!" && exit 1
|
||||
[ ! -s "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' is empty!" && exit 1
|
||||
[ ! -f "test.abundance.txt" ] && echo "File 'test.abundance.txt' does not exist!" && exit 1
|
||||
[ ! -s "test.abundance.txt" ] && echo "File 'test.abundance.txt' is empty!" && exit 1
|
||||
|
||||
echo ">>> Test finished successfully"
|
||||
exit 0
|
||||
50
src/summarizedexperiment/config.vsh.yaml
Normal file
50
src/summarizedexperiment/config.vsh.yaml
Normal file
@@ -0,0 +1,50 @@
|
||||
name: "summarizedexperiment"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/summarizedexperiment/main.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: Create SummarizedExperiment object from Salmon counts
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--tpm_gene"
|
||||
type: file
|
||||
- name: "--counts_gene"
|
||||
type: file
|
||||
- name: "--counts_gene_length_scaled"
|
||||
type: file
|
||||
- name: "--counts_gene_scaled"
|
||||
type: file
|
||||
- name: "--tpm_transcript"
|
||||
type: file
|
||||
- name: "--counts_transcript"
|
||||
type: file
|
||||
- name: "--tx2gene_tsv"
|
||||
type: file
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged_summarizedexperiment
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/salmon_summarizedexperiment.r
|
||||
- path: summarizedexperiment.r
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ r-base, libcurl4-openssl-dev ]
|
||||
- type: r
|
||||
bioc: [ SummarizedExperiment, tximeta ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
18
src/summarizedexperiment/script.sh
Executable file
18
src/summarizedexperiment/script.sh
Executable file
@@ -0,0 +1,18 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
mkdir -p $par_output
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene $par_tpm_gene $par_tx2gene_tsv
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_length_scaled $par_tpm_gene $par_tx2gene_tsv
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_scaled $par_tpm_gene $par_tx2gene_tsv
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_transcript $par_tpm_transcript $par_tx2gene_tsv
|
||||
|
||||
mv ${par_counts_gene%.*}.rds $par_output/
|
||||
mv ${par_counts_gene_length_scaled%.*}.rds $par_output/
|
||||
mv ${par_counts_gene_scaled%.*}.rds $par_output/
|
||||
mv ${par_counts_transcript%.*}.rds $par_output/
|
||||
68
src/summarizedexperiment/summarizedexperiment.r
Executable file
68
src/summarizedexperiment/summarizedexperiment.r
Executable file
@@ -0,0 +1,68 @@
|
||||
#!/usr/bin/env Rscript
|
||||
|
||||
library(SummarizedExperiment)
|
||||
|
||||
## Create SummarizedExperiment (se) object from Salmon counts
|
||||
|
||||
args <- commandArgs(trailingOnly = TRUE)
|
||||
if (length(args) < 2) {
|
||||
stop("Usage: salmon_se.r <coldata> <counts> <tpm>", call. = FALSE)
|
||||
}
|
||||
|
||||
coldata <- args[1]
|
||||
counts_fn <- args[2]
|
||||
tpm_fn <- args[3]
|
||||
tx2gene <- args[4]
|
||||
|
||||
info <- file.info(tx2gene)
|
||||
if (info$size == 0) {
|
||||
tx2gene <- NULL
|
||||
} else {
|
||||
rowdata <- read.csv(tx2gene, sep = "\t", header = FALSE)
|
||||
colnames(rowdata) <- c("tx", "gene_id", "gene_name")
|
||||
tx2gene <- rowdata[, 1:2]
|
||||
}
|
||||
|
||||
counts <- read.csv(counts_fn, row.names = 1, sep = "\t")
|
||||
counts <- counts[, 2:ncol(counts), drop = FALSE] # remove gene_name column
|
||||
tpm <- read.csv(tpm_fn, row.names = 1, sep = "\t")
|
||||
tpm <- tpm[, 2:ncol(tpm), drop = FALSE] # remove gene_name column
|
||||
|
||||
if (length(intersect(rownames(counts), rowdata[["tx"]])) > length(intersect(rownames(counts), rowdata[["gene_id"]]))) {
|
||||
by_what <- "tx"
|
||||
} else {
|
||||
by_what <- "gene_id"
|
||||
rowdata <- unique(rowdata[, 2:3])
|
||||
}
|
||||
|
||||
if (file.exists(coldata)) {
|
||||
coldata <- read.csv(coldata, sep = "\t")
|
||||
coldata <- coldata[match(colnames(counts), coldata[, 1]), ]
|
||||
coldata <- cbind(files = fns, coldata)
|
||||
} else {
|
||||
message("ColData not avaliable ", coldata)
|
||||
coldata <- data.frame(files = colnames(counts), names = colnames(counts))
|
||||
}
|
||||
|
||||
rownames(coldata) <- coldata[["names"]]
|
||||
extra <- setdiff(rownames(counts), as.character(rowdata[[by_what]]))
|
||||
if (length(extra) > 0) {
|
||||
rowdata <- rbind(
|
||||
rowdata,
|
||||
data.frame(
|
||||
tx = extra,
|
||||
gene_id = extra,
|
||||
gene_name = extra
|
||||
)[, colnames(rowdata)]
|
||||
)
|
||||
}
|
||||
|
||||
rowdata <- rowdata[match(rownames(counts), as.character(rowdata[[by_what]])), ]
|
||||
rownames(rowdata) <- rowdata[[by_what]]
|
||||
se <- SummarizedExperiment(
|
||||
assays = list(counts = counts, abundance = tpm),
|
||||
colData = DataFrame(coldata),
|
||||
rowData = rowdata
|
||||
)
|
||||
|
||||
saveRDS(se, file = paste0(tools::file_path_sans_ext(counts_fn), ".rds"))
|
||||
55
src/tx2gene/config.vsh.yaml
Normal file
55
src/tx2gene/config.vsh.yaml
Normal file
@@ -0,0 +1,55 @@
|
||||
name: "tx2gene"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/tx2gene/main.nf]
|
||||
last_sha: 839ac5cab892504514cc96d44e99e70516b239d2
|
||||
description: Get transcript id (tx) to gene names for tximport
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--quant_results"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ";"
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
- name: "--gtf_extra_attributes"
|
||||
type: string
|
||||
default: 'gene_name'
|
||||
- name: "--gtf_group_features"
|
||||
type: string
|
||||
default: 'gene_id'
|
||||
- name: "--quant_type"
|
||||
type: string
|
||||
description: Method used for quantification
|
||||
choices: ["salmon", "kallisto"]
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--tsv"
|
||||
type: file
|
||||
direction: output
|
||||
default: tx2gene.tsv
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# copied from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/tx2gene.py
|
||||
- path: tx2gene.py
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [pip, unzip]
|
||||
- type: python
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
24
src/tx2gene/script.sh
Executable file
24
src/tx2gene/script.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
function clean_up {
|
||||
rm -rf "$tmpdir"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXXXX")
|
||||
|
||||
IFS=";" read -ra results <<< $par_quant_results
|
||||
for result in ${results[*]}
|
||||
do
|
||||
cp -r $result $tmpdir
|
||||
done
|
||||
|
||||
python3 "$meta_resources_dir/tx2gene.py" \
|
||||
--quant_type $par_quant_type \
|
||||
--gtf $par_gtf \
|
||||
--quants $tmpdir \
|
||||
--id $par_gtf_group_features \
|
||||
--extra $par_gtf_extra_attributes \
|
||||
-o $par_tsv
|
||||
52
src/tx2gene/test.sh
Normal file
52
src/tx2gene/test.sh
Normal file
@@ -0,0 +1,52 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "> Prepare test data"
|
||||
|
||||
cat > "sample1_quant_results.sf" << HERE
|
||||
Name Length EffectiveLength TPM NumReads
|
||||
ENSSASG00005000004 3822 3572 15216.8 753
|
||||
ENSSASG00005000003 13218 12968 1502.34 269.9
|
||||
ENSSASG00005000002 21290 21040 23916.3 6971.1
|
||||
HERE
|
||||
|
||||
cat > "sample2_quant_results.sf" << HERE
|
||||
Name Length EffectiveLength TPM NumReads
|
||||
ENSSASG00005000004 3822 3572 23713.5 703
|
||||
ENSSASG00005000003 13218 12968 14280 1536.92
|
||||
ENSSASG00005000002 21290 21040 37447.4 6539.08
|
||||
HERE
|
||||
|
||||
cat > "genes.gtf" << HERE
|
||||
MN908947.3 ensembl transcript 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl exon 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000002"; exon_version "1";
|
||||
MN908947.3 ensembl CDS 266 21552 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000002"; protein_version "1";
|
||||
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl stop_codon 21553 21555 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl transcript 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl exon 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000003"; exon_version "1";
|
||||
MN908947.3 ensembl CDS 266 13480 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000003"; protein_version "1";
|
||||
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl stop_codon 13481 13483 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl transcript 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl exon 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000004"; exon_version "1";
|
||||
MN908947.3 ensembl CDS 21563 25381 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000004"; protein_version "1";
|
||||
MN908947.3 ensembl start_codon 21563 21565 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl stop_codon 25382 25384 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
HERE
|
||||
|
||||
echo "> Run test"
|
||||
"$meta_executable" \
|
||||
--quant_results "sample1_quant_results.sf;sample2_quant_results.sf" \
|
||||
--gtf "genes.gtf" \
|
||||
--quant_type "salmon" \
|
||||
--gtf_extra_attributes "gene_name" \
|
||||
--gtf_group_features "gene_id" \
|
||||
--tsv "tx2gene.tsv"
|
||||
|
||||
echo "> Check results"
|
||||
[ ! -f "tx2gene.tsv" ] && echo "tx2gene.tsv was not created" && exit 1
|
||||
[ ! -s "tx2gene.tsv" ] && echo "tx2gene.tsv is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
169
src/tx2gene/tx2gene.py
Executable file
169
src/tx2gene/tx2gene.py
Executable file
@@ -0,0 +1,169 @@
|
||||
|
||||
#!/usr/bin/env python
|
||||
|
||||
# Written by Lorena Pantano with subsequent reworking by Jonathan Manning. Released under the MIT license.
|
||||
|
||||
import logging
|
||||
import argparse
|
||||
import glob
|
||||
import os
|
||||
import re
|
||||
from collections import Counter, defaultdict, OrderedDict
|
||||
from collections.abc import Set
|
||||
from typing import Dict
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
|
||||
logger = logging.getLogger(__name__)
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
|
||||
def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
|
||||
"""
|
||||
Read the top 100 transcripts from the quantification file.
|
||||
|
||||
Parameters:
|
||||
quant_dir (str): Directory where quantification files are located.
|
||||
file_pattern (str): Pattern to match quantification files.
|
||||
|
||||
Returns:
|
||||
set: A set containing the top 100 transcripts.
|
||||
"""
|
||||
try:
|
||||
# Find the quantification file within the directory
|
||||
quant_file_path = glob.glob(os.path.join(quant_dir, file_pattern))[0]
|
||||
with open(quant_file_path, "r") as file_handle:
|
||||
# Read the file and extract the top 100 transcripts
|
||||
return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
|
||||
except IndexError:
|
||||
# Log an error and raise a FileNotFoundError if the quant file does not exist
|
||||
logger.error("No quantification files found.")
|
||||
raise FileNotFoundError("Quantification file not found.")
|
||||
|
||||
|
||||
def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
|
||||
"""
|
||||
Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.
|
||||
|
||||
Parameters:
|
||||
gtf_file (str): Path to the GTF file.
|
||||
transcripts (Set[str]): A set of transcripts to match in the GTF file.
|
||||
|
||||
Returns:
|
||||
str: The attribute name that corresponds to transcripts in the GTF file.
|
||||
"""
|
||||
|
||||
votes = Counter()
|
||||
with open(gtf_file) as inh:
|
||||
# Read GTF file, skipping header lines
|
||||
for line in filter(lambda x: not x.startswith("#"), inh):
|
||||
cols = line.split("\t")
|
||||
|
||||
# Use regular expression to correctly split the attributes string
|
||||
attributes_str = cols[8]
|
||||
attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))
|
||||
|
||||
votes.update(key for key, value in attributes.items() if value in transcripts)
|
||||
|
||||
if not votes:
|
||||
# Log a warning if no matching attribute is found
|
||||
logger.warning("No attribute in GTF matching transcripts")
|
||||
return ""
|
||||
|
||||
# Check if 'transcript_id' is among the attributes with the highest votes
|
||||
if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
|
||||
logger.info("Attribute 'transcript_id' corresponds to transcripts.")
|
||||
return "transcript_id"
|
||||
|
||||
# If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
|
||||
attribute, _ = votes.most_common(1)[0]
|
||||
logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
|
||||
return attribute
|
||||
|
||||
|
||||
def parse_attributes(attributes_text: str) -> Dict[str, str]:
|
||||
"""
|
||||
Parse the attributes column of a GTF file.
|
||||
|
||||
:param attributes_text: The attributes column as a string.
|
||||
:return: A dictionary of the attributes.
|
||||
"""
|
||||
# Split the attributes string by semicolon and strip whitespace
|
||||
attributes = attributes_text.strip().split(";")
|
||||
attr_dict = OrderedDict()
|
||||
|
||||
# Iterate over each attribute pair
|
||||
for attribute in attributes:
|
||||
# Split the attribute into key and value, ensuring there are two parts
|
||||
parts = attribute.strip().split(" ", 1)
|
||||
if len(parts) == 2:
|
||||
key, value = parts
|
||||
# Remove any double quotes from the value
|
||||
value = value.replace('"', "")
|
||||
attr_dict[key] = value
|
||||
|
||||
return attr_dict
|
||||
|
||||
|
||||
def map_transcripts_to_gene(
|
||||
quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
|
||||
) -> bool:
|
||||
"""
|
||||
Map transcripts to gene names and write the output to a file.
|
||||
|
||||
Parameters:
|
||||
quant_type (str): The quantification method used (e.g., 'salmon').
|
||||
gtf_file (str): Path to the GTF file.
|
||||
quant_dir (str): Directory where quantification files are located.
|
||||
gene_id (str): The gene ID attribute in the GTF file.
|
||||
extra_id_field (str): Additional ID field in the GTF file.
|
||||
output_file (str): The output file path.
|
||||
|
||||
Returns:
|
||||
bool: True if the operation was successful, False otherwise.
|
||||
"""
|
||||
# Read the top transcripts based on quantification type
|
||||
transcripts = read_top_transcripts(quant_dir, "*quant_results.sf" if quant_type == "salmon" else "*abundance.tsv")
|
||||
# Discover the attribute that corresponds to transcripts in the GTF
|
||||
transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)
|
||||
|
||||
if not transcript_attribute:
|
||||
# If no attribute is found, return False
|
||||
return False
|
||||
|
||||
# Open GTF and output file to write the mappings
|
||||
# Initialize the set to track seen combinations
|
||||
seen = set()
|
||||
|
||||
with open(gtf_file) as inh, open(output_file, "w") as output_handle:
|
||||
# Parse each line of the GTF, mapping transcripts to genes
|
||||
for line in filter(lambda x: not x.startswith("#"), inh):
|
||||
cols = line.split("\t")
|
||||
attr_dict = parse_attributes(cols[8])
|
||||
if gene_id in attr_dict and transcript_attribute in attr_dict:
|
||||
# Create a unique identifier for the transcript-gene combination
|
||||
transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])
|
||||
|
||||
# Check if the combination has already been seen
|
||||
if transcript_gene_pair not in seen:
|
||||
# If it's a new combination, write it to the output and add to the seen set
|
||||
extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
|
||||
output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
|
||||
seen.add(transcript_gene_pair)
|
||||
|
||||
return True
|
||||
|
||||
|
||||
# Main function to parse arguments and call the mapping function
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
|
||||
parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
|
||||
parser.add_argument("--gtf", type=str, help="GTF file", required=True)
|
||||
parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
|
||||
parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
|
||||
parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
|
||||
parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")
|
||||
|
||||
args = parser.parse_args()
|
||||
if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
|
||||
logger.error("Failed to map transcripts to genes.")
|
||||
79
src/tximport/config.vsh.yaml
Normal file
79
src/tximport/config.vsh.yaml
Normal file
@@ -0,0 +1,79 @@
|
||||
name: "tximport"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/tximport/main.nf]
|
||||
last_sha: 489bcb4efdc7bd58839b22b0360d26b4d80b87a8
|
||||
description: Get dataframe linking transcript ID, gene ID, and gene name
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--quant_results"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ";"
|
||||
- name: "--tx2gene_tsv"
|
||||
type: file
|
||||
- name: "--quant_type"
|
||||
type: string
|
||||
description: Method used for quantification
|
||||
choices: ["salmon", "kallisto"]
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--tpm_gene"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.gene_tpm.tsv
|
||||
- name: "--counts_gene"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.gene_counts.tsv
|
||||
- name: "--counts_gene_length_scaled"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.gene_counts_length_scaled.tsv
|
||||
- name: "--counts_gene_scaled"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.gene_counts_scaled.tsv
|
||||
- name: "--lengths_gene"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.gene_lengths.tsv
|
||||
- name: "--tpm_transcript"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.transcript_tpm.tsv
|
||||
- name: "--counts_transcript"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.transcript_counts.tsv
|
||||
- name: "--lengths_transcript"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged.transcript_lengths.tsv
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# copied from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/tximport.r
|
||||
- path: tximport.r
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ r-base, libcurl4-openssl-dev, libssl-dev, libxml2-dev ]
|
||||
- type: r
|
||||
bioc: [ SummarizedExperiment, tximport, tximeta ]
|
||||
cran: [ jsonlite ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
23
src/tximport/script.sh
Executable file
23
src/tximport/script.sh
Executable file
@@ -0,0 +1,23 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
function clean_up {
|
||||
rm -rf "$tmpdir"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXXXX")
|
||||
|
||||
IFS=";" read -ra results <<< $par_quant_results
|
||||
for result in ${results[*]}
|
||||
do
|
||||
cp -r $result $tmpdir
|
||||
done
|
||||
|
||||
Rscript "$meta_resources_dir/tximport.r" \
|
||||
NULL \
|
||||
$tmpdir \
|
||||
merged \
|
||||
$par_quant_type \
|
||||
$par_tx2gene_tsv
|
||||
59
src/tximport/test.sh
Normal file
59
src/tximport/test.sh
Normal file
@@ -0,0 +1,59 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "> Prepare test data"
|
||||
|
||||
cat > "sample1_quant_results.sf" << HERE
|
||||
Name Length EffectiveLength TPM NumReads
|
||||
ENSSASG00005000004 3822 3572 15216.8 753
|
||||
ENSSASG00005000003 13218 12968 1502.34 269.9
|
||||
ENSSASG00005000002 21290 21040 23916.3 6971.1
|
||||
HERE
|
||||
|
||||
cat > "sample2_quant_results.sf" << HERE
|
||||
Name Length EffectiveLength TPM NumReads
|
||||
ENSSASG00005000004 3822 3572 23713.5 703
|
||||
ENSSASG00005000003 13218 12968 14280 1536.92
|
||||
ENSSASG00005000002 21290 21040 37447.4 6539.08
|
||||
HERE
|
||||
|
||||
cat > tx2gene.tsv << HERE
|
||||
ENSSASG00005000002 ENSSASG00005000002 ORF1ab
|
||||
ENSSASG00005000003 ENSSASG00005000003 ORF1ab
|
||||
ENSSASG00005000004 ENSSASG00005000004 S
|
||||
HERE
|
||||
|
||||
echo "> Run test"
|
||||
"$meta_executable" \
|
||||
--quant_results "sample1_quant_results.sf;sample2_quant_results.sf" \
|
||||
--tx2gene_tsv "tx2gene.tsv" \
|
||||
--quant_type "salmon" \
|
||||
# --tpm_gene gene_tpm.tsv \
|
||||
# --counts_gene gene_counts.tsv \
|
||||
# --counts_gene_length_scaled gene_counts_length_scaled.tsv \
|
||||
# --counts_gene_scaled gene_counts_scaled.tsv \
|
||||
# --lengths_gene gene_length.tsv \
|
||||
# --tpm_transcript transcript_tpm.tsv \
|
||||
# --counts_transcript transcript_counts.tsv \
|
||||
# --lengths_transcript transcript_length.tsv
|
||||
|
||||
echo "> Check results"
|
||||
[ ! -f "merged.gene_tpm.tsv" ] && echo "merged.gene_tpm.tsv was not created" && exit 1
|
||||
[ ! -s "merged.gene_tpm.tsv" ] && echo "merged.gene_tpm.tsv is empty" && exit 1
|
||||
[ ! -f "merged.gene_counts.tsv" ] && echo "merged.gene_counts.tsv was not created" && exit 1
|
||||
[ ! -s "merged.gene_counts.tsv" ] && echo "merged.gene_counts.tsv is empty" && exit 1
|
||||
[ ! -f "merged.gene_counts_length_scaled.tsv" ] && echo "merged.gene_counts_length_scaled.tsv was not created" && exit 1
|
||||
[ ! -s "merged.gene_counts_length_scaled.tsv" ] && echo "merged.gene_counts_length_scaled.tsv is empty" && exit 1
|
||||
[ ! -f "merged.gene_counts_scaled.tsv" ] && echo "merged.gene_counts_scaled.tsv was not created" && exit 1
|
||||
[ ! -s "merged.gene_counts_scaled.tsv" ] && echo "merged.gene_counts_scaled.tsv is empty" && exit 1
|
||||
[ ! -f "merged.gene_lengths.tsv" ] && echo "merged.gene_lengths.tsv was not created" && exit 1
|
||||
[ ! -s "merged.gene_lengths.tsv" ] && echo "merged.gene_lengths.tsv is empty" && exit 1
|
||||
[ ! -f "merged.transcript_tpm.tsv" ] && echo "merged.transcript_tpm.tsv was not created" && exit 1
|
||||
[ ! -s "merged.transcript_tpm.tsv" ] && echo "merged.transcript_tpm.tsv is empty" && exit 1
|
||||
[ ! -f "merged.transcript_counts.tsv" ] && echo "merged.transcript_counts.tsv was not created" && exit 1
|
||||
[ ! -s "merged.transcript_counts.tsv" ] && echo "merged.transcript_counts.tsv is empty" && exit 1
|
||||
[ ! -f "merged.transcript_lengths.tsv" ] && echo "merged.transcript_lengths.tsv was not created" && exit 1
|
||||
[ ! -s "merged.transcript_lengths.tsv" ] && echo "merged.transcript_lengths.tsv is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
141
src/tximport/tximport.r
Executable file
141
src/tximport/tximport.r
Executable file
@@ -0,0 +1,141 @@
|
||||
#!/usr/bin/env Rscript
|
||||
|
||||
# Script for importing and processing transcript-level quantifications.
|
||||
# Written by Lorena Pantano, later modified by Jonathan Manning, and released under the MIT license.
|
||||
|
||||
# Loading required libraries
|
||||
library(SummarizedExperiment)
|
||||
library(tximport)
|
||||
|
||||
# Parsing command line arguments
|
||||
args <- commandArgs(trailingOnly=TRUE)
|
||||
if (length(args) < 4) {
|
||||
stop("Usage: tximport.r <coldata_path> <path> <prefix> <quant_type> <tx2gene_path>",
|
||||
call.=FALSE)
|
||||
}
|
||||
|
||||
# Assigning command line arguments to variables
|
||||
coldata_path <- args[1]
|
||||
path <- args[2]
|
||||
prefix <- args[3]
|
||||
quant_type <- args[4]
|
||||
tx2gene_path <- args[5]
|
||||
|
||||
## Functions
|
||||
|
||||
# Build a table from a SummarizedExperiment object
|
||||
build_table <- function(se.obj, slot) {
|
||||
cbind(rowData(se.obj)[,1:2], assays(se.obj)[[slot]])
|
||||
}
|
||||
|
||||
# Write a table to a file with given parameters
|
||||
write_se_table <- function(params) {
|
||||
file_name <- paste0(prefix, ".", params$suffix)
|
||||
write.table(build_table(params$obj, params$slot), file_name,
|
||||
sep="\t", quote=FALSE, row.names = FALSE)
|
||||
}
|
||||
|
||||
# Read transcript metadata from a given path
|
||||
read_transcript_info <- function(tinfo_path){
|
||||
info <- file.info(tinfo_path)
|
||||
if (info$size == 0) {
|
||||
stop("tx2gene file is empty")
|
||||
}
|
||||
|
||||
transcript_info <- read.csv(tinfo_path, sep="\t", header = FALSE,
|
||||
col.names = c("tx", "gene_id", "gene_name"))
|
||||
|
||||
extra <- setdiff(rownames(txi[[1]]), as.character(transcript_info[["tx"]]))
|
||||
transcript_info <- rbind(transcript_info, data.frame(tx=extra, gene_id=extra, gene_name=extra))
|
||||
transcript_info <- transcript_info[match(rownames(txi[[1]]), transcript_info[["tx"]]), ]
|
||||
rownames(transcript_info) <- transcript_info[["tx"]]
|
||||
|
||||
list(transcript = transcript_info,
|
||||
gene = unique(transcript_info[,2:3]),
|
||||
tx2gene = transcript_info[,1:2])
|
||||
}
|
||||
|
||||
# Read and process sample/column data from a given path
|
||||
read_coldata <- function(coldata_path){
|
||||
if (file.exists(coldata_path)) {
|
||||
coldata <- read.csv(coldata_path, sep="\t")
|
||||
coldata <- coldata[match(names, coldata[,1]),]
|
||||
coldata <- cbind(files = fns, coldata)
|
||||
} else {
|
||||
message("ColData not available: ", coldata_path)
|
||||
coldata <- data.frame(files = fns, names = names)
|
||||
}
|
||||
rownames(coldata) <- coldata[["names"]]
|
||||
}
|
||||
|
||||
# Create a SummarizedExperiment object with given data
|
||||
create_summarized_experiment <- function(counts, abundance, length, col_data, row_data) {
|
||||
SummarizedExperiment(assays = list(counts = counts, abundance = abundance, length = length),
|
||||
colData = col_data,
|
||||
rowData = row_data)
|
||||
}
|
||||
|
||||
# Main script starts here
|
||||
|
||||
# Define pattern for file names based on quantification type
|
||||
pattern <- ifelse(quant_type == "kallisto", "abundance.tsv", ".*quant_results\\.sf")
|
||||
fns <- list.files(path, pattern = pattern, recursive = T, full.names = T)
|
||||
names <- basename(fns)
|
||||
names(fns) <- names
|
||||
dropInfReps <- quant_type == "kallisto"
|
||||
|
||||
# Import transcript-level quantifications
|
||||
txi <- tximport(fns, type = quant_type, txOut = TRUE, dropInfReps = dropInfReps)
|
||||
|
||||
# Read transcript and sample data
|
||||
transcript_info <- read_transcript_info(tx2gene_path)
|
||||
coldata <- read_coldata(coldata_path)
|
||||
|
||||
# Create initial SummarizedExperiment object
|
||||
se <- create_summarized_experiment(txi[["counts"]], txi[["abundance"]], txi[["length"]],
|
||||
DataFrame(coldata), transcript_info$transcript)
|
||||
|
||||
# Setting parameters for writing tables
|
||||
params <- list(
|
||||
list(obj = se, slot = "abundance", suffix = "transcript_tpm.tsv"),
|
||||
list(obj = se, slot = "counts", suffix = "transcript_counts.tsv"),
|
||||
list(obj = se, slot = "length", suffix = "transcript_lengths.tsv")
|
||||
)
|
||||
|
||||
# Process gene-level data if tx2gene mapping is available
|
||||
if ("tx2gene" %in% names(transcript_info) && !is.null(transcript_info$tx2gene)) {
|
||||
tx2gene <- transcript_info$tx2gene
|
||||
gi <- summarizeToGene(txi, tx2gene = tx2gene)
|
||||
gi.ls <- summarizeToGene(txi, tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
|
||||
gi.s <- summarizeToGene(txi, tx2gene = tx2gene, countsFromAbundance = "scaledTPM")
|
||||
|
||||
gene_info <- transcript_info$gene[match(rownames(gi[[1]]), transcript_info$gene[["gene_id"]]),]
|
||||
rownames(gene_info) <- gene_info[["tx"]]
|
||||
|
||||
col_data_frame <- DataFrame(coldata)
|
||||
|
||||
# Create gene-level SummarizedExperiment objects
|
||||
gse <- create_summarized_experiment(gi[["counts"]], gi[["abundance"]], gi[["length"]],
|
||||
col_data_frame, gene_info)
|
||||
gse.ls <- create_summarized_experiment(gi.ls[["counts"]], gi.ls[["abundance"]], gi.ls[["length"]],
|
||||
col_data_frame, gene_info)
|
||||
gse.s <- create_summarized_experiment(gi.s[["counts"]], gi.s[["abundance"]], gi.s[["length"]],
|
||||
col_data_frame, gene_info)
|
||||
|
||||
params <- c(params, list(
|
||||
list(obj = gse, slot = "length", suffix = "gene_lengths.tsv"),
|
||||
list(obj = gse, slot = "abundance", suffix = "gene_tpm.tsv"),
|
||||
list(obj = gse, slot = "counts", suffix = "gene_counts.tsv"),
|
||||
list(obj = gse.ls, slot = "abundance", suffix = "gene_tpm_length_scaled.tsv"),
|
||||
list(obj = gse.ls, slot = "counts", suffix = "gene_counts_length_scaled.tsv"),
|
||||
list(obj = gse.s, slot = "abundance", suffix = "gene_tpm_scaled.tsv"),
|
||||
list(obj = gse.s, slot = "counts", suffix = "gene_counts_scaled.tsv")
|
||||
))
|
||||
}
|
||||
|
||||
# Writing tables for each set of parameters
|
||||
done <- lapply(params, write_se_table)
|
||||
|
||||
# Output session information and citations
|
||||
citation("tximeta")
|
||||
sessionInfo()
|
||||
51
src/ucsc/bedclip/config.vsh.yaml
Normal file
51
src/ucsc/bedclip/config.vsh.yaml
Normal file
@@ -0,0 +1,51 @@
|
||||
name: "bedclip"
|
||||
namespace: "ucsc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/ucsc/bedclip/main.nf, modules/nf-core/ucsc/bedclip/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: Remove lines from bed file that refer to off-chromosome locations
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input_bedgraph"
|
||||
type: file
|
||||
description: bedGraph file which should be converted
|
||||
- name: "--sizes"
|
||||
type: file
|
||||
description: File with chromosome sizes
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_bedgraph"
|
||||
type: file
|
||||
direction: output
|
||||
description: bedGraph file after clipping
|
||||
default: $id.$key.bedgraph
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bedgraph
|
||||
- path: /testData/unit_test_resources/sarscov2/genome.sizes
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages:
|
||||
- rsync
|
||||
- libcurl4
|
||||
- type: docker
|
||||
run: |
|
||||
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedClip /usr/local/bin/
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
5
src/ucsc/bedclip/script.sh
Normal file
5
src/ucsc/bedclip/script.sh
Normal file
@@ -0,0 +1,5 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
bedClip $par_input_bedgraph $par_sizes $par_output_bedgraph
|
||||
15
src/ucsc/bedclip/test.sh
Normal file
15
src/ucsc/bedclip/test.sh
Normal file
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bedgraph $meta_resources_dir/test.bedgraph \
|
||||
--sizes $meta_resources_dir/genome.sizes \
|
||||
--output_bedgraph test.clipped.bedgraph
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[ ! -f "test.clipped.bedgraph" ] && echo "File 'test.clipped.bedgraph' does not exist!" && exit 1
|
||||
[ ! -s "test.clipped.bedgraph" ] && echo "File 'test.clipped.bedgraph' is empty!" && exit 1
|
||||
|
||||
echo ">>> Test finished successfully"
|
||||
exit 0
|
||||
51
src/ucsc/bedgraphtobigwig/config.vsh.yaml
Normal file
51
src/ucsc/bedgraphtobigwig/config.vsh.yaml
Normal file
@@ -0,0 +1,51 @@
|
||||
name: "bedgraphtobigwig"
|
||||
namespace: "ucsc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/ucsc/bedgraphtobigwig/main.nf, modules/nf-core/ucsc/bedgraphtobigwig/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: Convert a bedGraph file to bigWig format
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--bedgraph"
|
||||
type: file
|
||||
description: bedGraph file which should be converted
|
||||
- name: "--sizes"
|
||||
type: file
|
||||
description: File with chromosome sizes
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--bigwig"
|
||||
type: file
|
||||
direction: output
|
||||
description: bigWig coverage file relative to genes on the input file
|
||||
default: $id.$key.bigwig
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bedgraph
|
||||
- path: /testData/unit_test_resources/sarscov2/genome.sizes
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages:
|
||||
- rsync
|
||||
- libcurl4
|
||||
- type: docker
|
||||
run: |
|
||||
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig /usr/local/bin/
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
5
src/ucsc/bedgraphtobigwig/script.sh
Normal file
5
src/ucsc/bedgraphtobigwig/script.sh
Normal file
@@ -0,0 +1,5 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
bedGraphToBigWig $par_bedgraph $par_sizes $par_bigwig
|
||||
15
src/ucsc/bedgraphtobigwig/test.sh
Normal file
15
src/ucsc/bedgraphtobigwig/test.sh
Normal file
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--bedgraph $meta_resources_dir/test.bedgraph \
|
||||
--sizes $meta_resources_dir/genome.sizes \
|
||||
--bigwig test.bigwig
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[ ! -f "test.bigwig" ] && echo "File 'test.bigwig' does not exist!" && exit 1
|
||||
[ ! -s "test.bigwig" ] && echo "File 'test.bigwig' is empty!" && exit 1
|
||||
|
||||
echo ">>> Test finished successfully"
|
||||
exit 0
|
||||
197
src/workflows/genome_alignment_and_quant/config.vsh.yaml
Normal file
197
src/workflows/genome_alignment_and_quant/config.vsh.yaml
Normal file
@@ -0,0 +1,197 @@
|
||||
name: genome_alignment_and_quant
|
||||
namespace: workflows
|
||||
description: |
|
||||
A viash sub-workflow for genome alignment and quantification stage of nf-core/rnaseq pipeline.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--id"
|
||||
required: true
|
||||
type: string
|
||||
description: ID of the sample.
|
||||
example: foo
|
||||
- name: "--fastq_1"
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: Path to the sample (or read 1 of paired end sample).
|
||||
required: true
|
||||
example: input.fastq.gz
|
||||
- name: "--fastq_2"
|
||||
type: file
|
||||
required: false
|
||||
description: Path to read 2 of the sample.
|
||||
- name: "--strandedness"
|
||||
type: string
|
||||
required: false
|
||||
description: Sample strand-specificity. Must be one of unstranded, forward, or reverse
|
||||
choices: [forward, reverse, unstranded]
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
description: GTF file
|
||||
- name: "--transcript_fasta"
|
||||
type: file
|
||||
description: Fasta file of the reference transcriptome.
|
||||
- name: "--star_index"
|
||||
type: file
|
||||
description: STAR index directory.
|
||||
- name: "--star_ignore_sjdbgtf"
|
||||
type: boolean
|
||||
default: false
|
||||
description: When using pre-built STAR indices do not re-extract and use splice junctions from the GTF file
|
||||
- name: --star_sjdb_gtf_feature_exon
|
||||
type: string
|
||||
description: Feature type in GTF file to be used as exons for building transcripts
|
||||
- name: "--bam_csi_index"
|
||||
type: boolean
|
||||
default: false
|
||||
description: Create a CSI index for BAM files instead of the traditional BAI index. This will be required for genomes with larger chromosome sizes.
|
||||
- name: "--umi_dedup_stats"
|
||||
type: boolean
|
||||
description: Generate output stats when running "umi_tools dedup".
|
||||
default: false
|
||||
- name: "--with_umi"
|
||||
type: boolean
|
||||
description: Enable UMI-based read deduplication.
|
||||
default: false
|
||||
- name: "--salmon_quant_libtype"
|
||||
type: string
|
||||
description: Override Salmon library type inferred based on strandedness defined in meta object.
|
||||
- name: "--gtf_group_features"
|
||||
type: string
|
||||
default: 'gene_id'
|
||||
description: Define the attribute type used to group features in the GTF file when running Salmon.
|
||||
- name: "--gtf_extra_attributes"
|
||||
type: string
|
||||
default: 'gene_name'
|
||||
description: By default, the pipeline uses the gene_name field to obtain additional gene identifiers from the input GTF file when running Salmon.
|
||||
- name: "--aligner"
|
||||
type: string
|
||||
description: Specifies the alignment algorithm to use - available options are 'star_salmon', 'star_rsem' and 'hisat2'.
|
||||
choices: [star_salmon, star_rsem, hisat2]
|
||||
default: "star_salmon"
|
||||
- name: "--rsem_index"
|
||||
type: file
|
||||
description: Path to directory for pre-built RSEM index.
|
||||
- name: "--salmon_index"
|
||||
type: file
|
||||
description: Path to directory for pre-built Salmon index.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--star_multiqc"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id_star.log
|
||||
- name: "--genome_bam_sorted"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.genome.bam
|
||||
- name: "--genome_bam_index"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.genome.bam.bai
|
||||
- name: "--genome_bam_stats"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.genome.stats
|
||||
- name: "--genome_bam_flagstat"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.genome.flagstat
|
||||
- name: "--genome_bam_idxstats"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.genome.idxstats
|
||||
- name: "--transcriptome_bam"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.transcriptome.bam
|
||||
- name: "--transcriptome_bam_index"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.transcriptome.bam.bai
|
||||
- name: "--transcriptome_bam_stats"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.transcriptome.stats
|
||||
- name: "--transcriptome_bam_flagstat"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.transcriptome.flagstat
|
||||
- name: "--transcriptome_bam_idxstats"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.transcriptome.idxstats
|
||||
- name: "--quant_out_dir"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.salmon_quant
|
||||
- name: "--quant_results_file"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.quant.sf
|
||||
- name: "--salmon_multiqc"
|
||||
type: file
|
||||
direction: output
|
||||
- name: "--rsem_counts_gene"
|
||||
type: file
|
||||
description: Expression counts on gene level
|
||||
default: $id.genes.tsv
|
||||
direction: output
|
||||
- name: "--rsem_counts_transcripts"
|
||||
type: file
|
||||
description: Expression counts on transcript level
|
||||
default: $id.isoforms.tsv
|
||||
direction: output
|
||||
- name: "--rsem_multiqc"
|
||||
type: file
|
||||
description: RSEM statistics
|
||||
default: $id.rsem_stats
|
||||
direction: output
|
||||
- name: "--bam_star_rsem"
|
||||
type: file
|
||||
description: BAM file generated by STAR (optional)
|
||||
default: $id.STAR_sligned_genome.bam
|
||||
direction: output
|
||||
- name: "--bam_genome_rsem"
|
||||
type: file
|
||||
description: Genome BAM file (optional)
|
||||
default: $id.genome_rsem.bam
|
||||
direction: output
|
||||
- name: "--bam_transcript_rsem"
|
||||
type: file
|
||||
description: Transcript BAM file (optional)
|
||||
default: $id.transcript_rsem.bam
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: nextflow_script
|
||||
path: main.nf
|
||||
entrypoint: run_wf
|
||||
|
||||
dependencies:
|
||||
- name: star/star_align_reads
|
||||
repository: biobox
|
||||
- name: samtools/samtools_sort
|
||||
repository: biobox
|
||||
- name: samtools/samtools_index
|
||||
repository: biobox
|
||||
- name: samtools/samtools_stats
|
||||
repository: biobox
|
||||
- name: samtools/samtools_flagstat
|
||||
repository: biobox
|
||||
- name: samtools/samtools_idxstats
|
||||
repository: biobox
|
||||
- name: umi_tools/umi_tools_dedup
|
||||
repository: biobox
|
||||
- name: umi_tools/umi_tools_prepareforrsem
|
||||
repository: biobox
|
||||
- name: salmon/salmon_quant
|
||||
repository: biobox
|
||||
- name: rsem/rsem_calculate_expression
|
||||
repository: biobox
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
401
src/workflows/genome_alignment_and_quant/main.nf
Normal file
401
src/workflows/genome_alignment_and_quant/main.nf
Normal file
@@ -0,0 +1,401 @@
|
||||
workflow run_wf {
|
||||
take:
|
||||
input_ch
|
||||
|
||||
main:
|
||||
output_ch = input_ch
|
||||
| map { id, state ->
|
||||
def input = state.fastq_2 ? [ state.fastq_1, state.fastq_2 ] : [ state.fastq_1 ]
|
||||
def paired = input.size() == 2
|
||||
[ id, state + [ paired: paired, input: input ] ]
|
||||
}
|
||||
|
||||
| star_align_reads.run (
|
||||
runIf: { id, state -> state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "fastq_1",
|
||||
"input_r2": "fastq_2",
|
||||
"genome_dir": "star_index",
|
||||
"sjdb_gtf_file": "gtf",
|
||||
"out_sam_attr_rg_line": "star_sam_attr_rg_line",
|
||||
"sjdb_gtf_feature_exon": "star_sjdb_gtf_feature_exon"
|
||||
],
|
||||
toState: [
|
||||
"genome_bam": "aligned_reads",
|
||||
"transcriptome_bam": "reads_aligned_to_transcriptome",
|
||||
"star_multiqc": "log"
|
||||
],
|
||||
args: [
|
||||
quant_mode: "TranscriptomeSAM",
|
||||
twopass_mode: "Basic",
|
||||
out_sam_type: "BAM;Unsorted",
|
||||
run_rng_seed: 0,
|
||||
out_filter_multimap_nmax: 20,
|
||||
align_sjdb_overhang_min: 1,
|
||||
out_sam_attributes: "NH;HI;AS;NM;MD",
|
||||
quant_transcriptome_sam_output: "BanSingleEnd"
|
||||
]
|
||||
)
|
||||
|
||||
// GENOME BAM
|
||||
| samtools_sort.run (
|
||||
runIf: { id, state -> state.aligner == 'star_salmon' },
|
||||
fromState: ["input": "genome_bam"],
|
||||
toState: ["genome_bam_sorted": "output"],
|
||||
key: "genome_sorted"
|
||||
)
|
||||
| samtools_index.run (
|
||||
runIf: { id, state -> state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "genome_bam_sorted",
|
||||
"csi": "bam_csi_index"
|
||||
],
|
||||
toState: [ "genome_bam_index": "output" ],
|
||||
key: "genome_sorted"
|
||||
)
|
||||
| samtools_stats.run (
|
||||
runIf: { id, state -> state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_stats": "output" ],
|
||||
key: "genome_stats"
|
||||
)
|
||||
| samtools_flagstat.run (
|
||||
runIf: { id, state -> state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_flagstat": "output" ],
|
||||
key: "genome_flagstat"
|
||||
)
|
||||
| samtools_idxstats.run(
|
||||
runIf: { id, state -> state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_idxstats": "output" ],
|
||||
key: "genome_idxstats"
|
||||
)
|
||||
|
||||
//
|
||||
// Remove duplicate reads from BAM file based on UMIs
|
||||
//
|
||||
|
||||
// Deduplicate genome BAM file
|
||||
| umi_tools_dedup.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: { id, state ->
|
||||
def output_stats = state.umi_dedup_stats ? state.id :
|
||||
[ paired: state.paired,
|
||||
input: state.genome_bam,
|
||||
bai: state.genome_bam_index,
|
||||
output_stats: output_stats]
|
||||
},
|
||||
toState: [ "genome_bam_sorted": "output" ],
|
||||
key: "genome_deduped"
|
||||
)
|
||||
| samtools_index.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "genome_bam_sorted",
|
||||
"csi": "bam_csi_index"
|
||||
],
|
||||
toState: [ "genome_bam_index": "output" ],
|
||||
key: "genome_deduped"
|
||||
)
|
||||
| samtools_stats.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_stats": "output" ],
|
||||
key: "genome_deduped_stats"
|
||||
)
|
||||
| samtools_flagstat.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_flagstat": "output" ],
|
||||
key: "genome_deduped_flagstat"
|
||||
)
|
||||
| samtools_idxstats.run(
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta",
|
||||
],
|
||||
toState: [ "genome_bam_idxstats": "output" ],
|
||||
key: "genome_deduped_idxstats"
|
||||
)
|
||||
|
||||
// Deduplicate transcriptome BAM file
|
||||
|
||||
| samtools_sort.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [ "input": "transcriptome_bam" ],
|
||||
toState: [ "transcriptome_bam": "output" ],
|
||||
key: "transcriptome_sorted"
|
||||
)
|
||||
| samtools_index.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "transcriptome_bam",
|
||||
"csi": "bam_csi_index"
|
||||
],
|
||||
toState: [ "transcriptome_bam_index": "output" ],
|
||||
key: "transcriptome_sorted"
|
||||
)
|
||||
| samtools_stats.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "transcriptome_bam",
|
||||
"bai": "transcriptome_bam_index",
|
||||
],
|
||||
toState: [ "transcriptome_bam_stats": "output" ],
|
||||
key: "transcriptome_stats"
|
||||
)
|
||||
| samtools_flagstat.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "transcriptome_bam",
|
||||
"bai": "transcriptome_bam_index"
|
||||
],
|
||||
toState: [ "transcriptome_bam_flagstat": "output" ],
|
||||
key: "transcriptome_flagstat"
|
||||
)
|
||||
| samtools_idxstats.run(
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "transcriptome_bam",
|
||||
"bai": "transcriptome_bam_index"
|
||||
],
|
||||
toState: [ "transcriptome_bam_idxstats": "output" ],
|
||||
key: "transcriptome_idxstats"
|
||||
)
|
||||
|
||||
| umi_tools_dedup.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: { id, state ->
|
||||
def output_stats = state.umi_dedup_stats ? state.id :
|
||||
[ paired: state.paired,
|
||||
input: state.transcriptome_bam,
|
||||
bai: state.transcriptome_bam_index,
|
||||
output_stats: output_stats]
|
||||
},
|
||||
toState: [ "transcriptome_bam_deduped": "output" ],
|
||||
key: "transcriptome_deduped"
|
||||
)
|
||||
| samtools_sort.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [ "input": "transcriptome_bam_deduped" ],
|
||||
toState: [ "transcriptome_bam": "output" ],
|
||||
key: "transcriptome_deduped_sorted"
|
||||
)
|
||||
| samtools_index.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "transcriptome_bam",
|
||||
"csi": "bam_csi_index"
|
||||
],
|
||||
toState: [ "transcriptome_bam_index": "output" ],
|
||||
key: "transcriptome_deduped_sorted"
|
||||
)
|
||||
| samtools_stats.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"input": "transcriptome_bam",
|
||||
"bai": "transcriptome_bam_index"
|
||||
],
|
||||
toState: [ "transcriptome_bam_stats": "output" ],
|
||||
key: "transcriptome_deduped_stats"
|
||||
)
|
||||
| samtools_flagstat.run (
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "transcriptome_bam",
|
||||
"bai": "transcriptome_bam_index"
|
||||
],
|
||||
toState: [ "transcriptome_bam_flagstat": "output" ],
|
||||
key: "transcriptome_deduped_flagstat"
|
||||
)
|
||||
| samtools_idxstats.run(
|
||||
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"bam": "transcriptome_bam",
|
||||
"bai": "transcriptome_bam_index"
|
||||
],
|
||||
toState: [ "transcriptome_bam_idxstats": "output" ],
|
||||
key: "transcriptome_deduped_idxstats"
|
||||
)
|
||||
|
||||
// Fix paired-end reads in name sorted BAM file
|
||||
| umi_tools_prepareforrsem.run (
|
||||
runIf: { id, state -> state.with_umi && state.paired && state.aligner == 'star_salmon' },
|
||||
fromState: [ "input": "transcriptome_bam" ],
|
||||
toState: [ "transcriptome_bam": "output" ]
|
||||
)
|
||||
|
||||
// Infer lib-type for salmon quant
|
||||
| map { id, state ->
|
||||
def lib_type = (state.paired) ?
|
||||
(
|
||||
(state.strandedness == "forward") ?
|
||||
"ISF" :
|
||||
(
|
||||
(state.strandedness == "reverse") ? "ISR" : "IU"
|
||||
)
|
||||
)
|
||||
: (
|
||||
(state.strandedness == "forward") ?
|
||||
"SF" :
|
||||
(
|
||||
(state.strandedness == "reverse") ? "SR" : "U"
|
||||
)
|
||||
)
|
||||
[ id, state + [lib_type: lib_type] ]
|
||||
}
|
||||
|
||||
// Count reads from BAM alignments using Salmon
|
||||
| salmon_quant.run (
|
||||
runIf: { id, state -> state.aligner == 'star_salmon' },
|
||||
fromState: [
|
||||
"lib_type": "lib_type",
|
||||
"alignments": "transcriptome_bam",
|
||||
"targets": "transcript_fasta",
|
||||
"gene_map": "gtf"
|
||||
],
|
||||
toState: [
|
||||
"quant_out_dir": "output",
|
||||
"quant_results_file": "quant_results"
|
||||
]
|
||||
)
|
||||
|
||||
| map { id, state ->
|
||||
def mod_state = (state.aligner == 'star_salmon') ? state + [salmon_multiqc: state.quant_out_dir] : state
|
||||
[ id, mod_state ]
|
||||
}
|
||||
|
||||
| rsem_calculate_expression.run (
|
||||
runIf: { id, state -> state.aligner == 'star_rsem' },
|
||||
fromState: [
|
||||
"id": "id",
|
||||
"strandedness": "strandedness",
|
||||
"paired": "paired",
|
||||
"input": "input",
|
||||
"index": "rsem_index",
|
||||
"counts_gene": "rsem_counts_gene",
|
||||
"counts_transcripts": "rsem_counts_transcripts",
|
||||
"stat": "rsem_multiqc",
|
||||
"logs": "star_multiqc",
|
||||
"bam_star": "bam_star_rsem",
|
||||
"bam_genome": "bam_genome_rsem",
|
||||
"bam_transcript": "bam_transcript_rsem"
|
||||
],
|
||||
args: [
|
||||
star: true,
|
||||
star_output_genome_bam: true,
|
||||
star_gzipped_read_file: true,
|
||||
estimate_rspd: true,
|
||||
seed: 1
|
||||
],
|
||||
toState: [
|
||||
"rsem_counts_gene": "counts_gene",
|
||||
"rsem_counts_transcripts": "counts_transcripts",
|
||||
"rsem_multiqc": "stat",
|
||||
"star_multiqc": "logs",
|
||||
"bam_star_rsem": "bam_star",
|
||||
"bam_genome_rsem": "bam_genome",
|
||||
"bam_transcript_rsem": "bam_transcript"
|
||||
]
|
||||
)
|
||||
|
||||
// RSEM_Star BAM
|
||||
| samtools_sort.run (
|
||||
runIf: { id, state -> state.aligner == 'star_rsem' },
|
||||
fromState: ["input": "bam_star_rsem"],
|
||||
toState: ["genome_bam_sorted": "output"],
|
||||
key: "genome_sorted"
|
||||
)
|
||||
| samtools_index.run (
|
||||
runIf: { id, state -> state.aligner == 'star_rsem' },
|
||||
fromState: [
|
||||
"input": "genome_bam_sorted",
|
||||
"csi": "bam_csi_index"
|
||||
],
|
||||
toState: [ "genome_bam_index": "output" ],
|
||||
key: "genome_sorted"
|
||||
)
|
||||
| samtools_stats.run (
|
||||
runIf: { id, state -> state.aligner == 'star_rsem' },
|
||||
fromState: [
|
||||
"input": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_stats": "output" ],
|
||||
key: "genome_stats"
|
||||
)
|
||||
| samtools_flagstat.run (
|
||||
runIf: { id, state -> state.aligner == 'star_rsem' },
|
||||
fromState: [
|
||||
"bam": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_flagstat": "output" ],
|
||||
key: "genome_flagstat"
|
||||
)
|
||||
| samtools_idxstats.run(
|
||||
runIf: { id, state -> state.aligner == 'star_rsem' },
|
||||
fromState: [
|
||||
"bam": "genome_bam_sorted",
|
||||
"bai": "genome_bam_index",
|
||||
"fasta": "fasta"
|
||||
],
|
||||
toState: [ "genome_bam_idxstats": "output" ],
|
||||
key: "genome_idxstats"
|
||||
)
|
||||
|
||||
| map { id, state ->
|
||||
def mod_state = state.findAll { key, value -> value instanceof java.nio.file.Path && value.exists() }
|
||||
[ id, mod_state ]
|
||||
}
|
||||
|
||||
| setState (
|
||||
[ "star_multiqc": "star_multiqc",
|
||||
"rsem_multiqc": "rsem_multiqc",
|
||||
"salmon_multiqc": "salmon_multiqc",
|
||||
"genome_bam_sorted": "genome_bam_sorted",
|
||||
"genome_bam_index": "genome_bam_index",
|
||||
"genome_bam_stats": "genome_bam_stats",
|
||||
"genome_bam_flagstat": "genome_bam_flagstat",
|
||||
"genome_bam_idxstats": "genome_bam_idxstats",
|
||||
"transcriptome_bam": "transcriptome_bam",
|
||||
"transcriptome_bam_index": "transcriptome_bam_index",
|
||||
"transcriptome_bam_stats": "transcriptome_bam_stats",
|
||||
"transcriptome_bam_flagstat": "transcriptome_bam_flagstat",
|
||||
"transcriptome_bam_idxstats": "transcriptome_bam_idxstats",
|
||||
"quant_out_dir": "quant_out_dir",
|
||||
"quant_results_file": "quant_results_file",
|
||||
"rsem_counts_gene": "rsem_counts_gene",
|
||||
"rsem_counts_transcripts": "rsem_counts_transcripts",
|
||||
"bam_genome_rsem": "bam_genome_rsem",
|
||||
"bam_transcript_rsem": "bam_transcript_rsem" ]
|
||||
)
|
||||
|
||||
emit:
|
||||
output_ch
|
||||
}
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user