Build branch htrnaseq/v0.11 with version v0.11.0 to htrnaseq on branch v0.11 (213fd5f)

Build pipeline: viash-hub.htrnaseq.v0.11-tl5df

Source commit: 213fd5f7d0

Source message: Merge remote-tracking branch 'origin/main' into v0.11
CI
2025-09-04 15:12:13 +00:00
commit 61eb413b0a
300 changed files with 137512 additions and 0 deletions

19
.gitignore vendored Normal file

@@ -0,0 +1,19 @@
target
testData
resources_test
# Nextflow related files
.nextflow
.nextflow.log*
work
# Python related files
*__pycache__*
.venv
# R related files
.Rproj.user
htrnaseq.Rproj
# vscode
.vscode

310
CHANGELOG.md Normal file

@@ -0,0 +1,310 @@
# htrnaseq v0.11.0
## Breaking changes
* `runner`: feature annotation data (fData) is now output to a subfolder `fData` (PR #68).
## New features
* `runner`: add output results to state so that the workflow can be used as a subworkflow (PR #68, PR #70, PR #71).
## Bug fixes
* `runner`: disable `publishFilesProc` because this workflow handles publishing itself (PR #68).
## Minor changes
* Bump craftbox to v0.3.0 (PR #69).
# htrnaseq v0.10.0
## Breaking changes
* `runner`: Replaced the `ignore` parameter with the `pools` parameter. When set, only the selected pools are included for analysis.
By default all pools are selected (PR #66).
## Bug fixes
* Fix an error where processing FASTQ files from multiple lanes would cause an assertion error requesting the well demultiplexing
output to reside in one directory (PR #67).
## Minor changes
* `generate_well_statistics`: update base image to `python:3.13-trixie` (PR #67).
# htrnaseq v0.9.1
## Bug fixes
* Reverted functionality to set `fastq_publish_dir` and `results_publish_dir` using fromState (PR #64).
* `runner`: fix detection of FASTQ files with non-numerical characters in the sample name (PR #65).
# htrnaseq v0.9.0
## Breaking changes
* `runner`: removed `plain_output` argument (PR #63).
## Minor changes
* `runner`: the `fastq_publish_dir` and `results_publish_dir` can now be set using `fromState` when using the workflow as subworkflow (PR #63).
# htrnaseq v0.8.3
## Minor changes
* Bump craftbox to v0.2.0 (PR #62).
# htrnaseq v0.8.2
## Under the hood
* Add the package config (`_viash.yaml`) to every component's target dir. This makes introspection from, e.g. a `runner` workflow much more robust (PR #61)
# htrnaseq v0.8.1
## Bug fixes
* Fix an issue where the FASTQ files from different samples on the same sequencing run would overwrite each other (PR #56).
## Under the hood
* Moved the test resources to their new location (PR #47).
## Minor changes
* Bump `biobox` and `craftbox` dependencies to versions `0.3.1` and `0.2.0`, respectively (PR #60).
# htrnaseq v0.8.0
## New functionality
* `save_params`: added a component to save workflow input parameters as yaml (PR #48).
* Added `run_params` parameter to `htrnaseq` and `runner` workflows in order to save the input parameters
used for the workflow run (PR #48).
# htrnaseq v0.7.2
## Documentation
* Update README (PR #54)
# htrnaseq v0.7.1
## Bug fixes
* Bump viash version to `0.9.4`. This adds support for Nextflow versions starting from 25.01 and
fixes an issue where an integer being passed to an argument with `type: double` resulted in an error (PR #51).
* `reporting`: updated default colour mapping (PR #50).
## Minor changes
* `create_report`: bump bioconductor version to 3.21 in order to accommodate R version 4.5 (PR #52).
# htrnaseq v0.7.0
## Breaking changes
The `runner` and `htrnaseq` workflows now output FASTQ files corresponding to the barcodes per input ID (per sequencing run).
Previously, when multiple input folders or multiple input FASTQ files were provided
(for the `runner` and `htrnaseq` workflows respectively), the demultiplexed FASTQ files for these inputs were concatenated
and provided as output. For the `htrnaseq` workflow, reads can still be combined by using a newly added `sampleID` argument.
This means that two lists of FASTQ files can be provided for a single sample, and by assigning the same `sampleID`,
these reads will be joined. For example, with other arguments left out for brevity:
```yaml
- id: sample1_run1
  input_r1: [sample_1_L001_1_R1.fastq, sample_1_L002_1_R1.fastq]
  input_r2: [sample_1_L001_1_R2.fastq, sample_1_L002_1_R2.fastq]
  sampleID: "sample_1"
- id: sample1_run2
  input_r1: [sample_1_L001_1_R1.fastq, sample_1_L002_1_R1.fastq]
  input_r2: [sample_1_L001_1_R2.fastq, sample_1_L002_1_R2.fastq]
  sampleID: "sample_1"
- id: sample_2
  input_r1: [sample_2_L001_1_R1.fastq, sample_2_L002_1_R1.fastq]
  input_r2: [sample_2_L001_1_R2.fastq, sample_2_L002_1_R2.fastq]
```
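The joining rule can be illustrated with a small Python sketch. This is not the workflow's implementation: the function name is hypothetical, and it assumes that an entry without a `sampleID` falls back to its own `id`.

```python
from collections import defaultdict


def group_reads_by_sample(entries):
    """Join the FASTQ lists of all entries that share a sampleID.

    Assumption (not stated in the changelog): an entry without
    `sampleID` is keyed by its own `id`.
    """
    grouped = defaultdict(lambda: {"input_r1": [], "input_r2": []})
    for entry in entries:
        key = entry.get("sampleID", entry["id"])
        # Keep forward and reverse lists in the same order so pairs stay aligned.
        grouped[key]["input_r1"].extend(entry["input_r1"])
        grouped[key]["input_r2"].extend(entry["input_r2"])
    return dict(grouped)


entries = [
    {"id": "sample1_run1", "sampleID": "sample_1",
     "input_r1": ["run1_R1.fastq"], "input_r2": ["run1_R2.fastq"]},
    {"id": "sample1_run2", "sampleID": "sample_1",
     "input_r1": ["run2_R1.fastq"], "input_r2": ["run2_R2.fastq"]},
    {"id": "sample_2",
     "input_r1": ["s2_R1.fastq"], "input_r2": ["s2_R2.fastq"]},
]
for sample, reads in group_reads_by_sample(entries).items():
    print(sample, reads["input_r1"], reads["input_r2"])
```

The two entries that share `sample_1` end up in one group, while `sample_2` stays on its own.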
For the runner, concatenation of data across samples is automatically inferred. Previously, multiple IDs (events) could be
provided, which were processed in parallel. This is no longer possible, as providing multiple IDs will cause the matching
samples for these runs to be concatenated.
For example, the following old parameter yaml
```yaml
- id: run1
  input: ["run_folder_1/", "run_folder_2/"]
```
should now be provided as:
```yaml
- id: run1
  input: "run_folder_1/"
- id: run2
  input: "run_folder_2/"
```
## Minor changes
* Updated viash to `0.9.2` (PR #49)
# htrnaseq v0.6.0
## Breaking changes
* `runner`: a subdirectory `data_processed` is now added to the output structure, in between
the experiment ID and the directory with the workflow date and version (PR #45).
# htrnaseq v0.5.5
## New functionality
* Add `umi_length` parameter to the `runner` workflow (PR #46)
# htrnaseq v0.5.4
## Bug fixes
* Fix missing barcodes in the output from `generate_pool_statistics`, which caused an assertion error in `create_pdata`.
In order to resolve the issue, `generate_well_statistics` now outputs results for all chromosomes/scaffolds present in
the genome annotation, even when no reads were mapped to the regions in question. `generate_pool_statistics` will now
remove regions from the output that do not have at least one count across all barcodes (PR #44).
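
The filtering step described above can be sketched in Python. This is only an illustration under assumptions: the nested-dictionary layout and the function name are hypothetical and do not reflect the component's actual data structures.

```python
def drop_empty_regions(counts):
    """Keep only regions that have at least one count across all barcodes.

    `counts` maps region name -> {barcode: number of reads}; this layout
    is an assumption made for the example.
    """
    return {
        region: per_barcode
        for region, per_barcode in counts.items()
        if sum(per_barcode.values()) > 0
    }


counts = {
    "chr1": {"AACAAGGTAC": 12, "ACGCCTTCGT": 0},
    "chrM": {"AACAAGGTAC": 0, "ACGCCTTCGT": 0},  # no reads in any barcode
}
print(sorted(drop_empty_regions(counts)))
```

Every barcode is still reported for each retained region, including barcodes with zero counts, which is what keeps the barcode set complete downstream.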
# htrnaseq v0.5.3
## Bug fixes
* Fix `create_eset` component failing to create when one of the input samples has no counts (PR #43).
# htrnaseq v0.5.2
## Bug fixes
* `create_fdata`: remove duplicate entries from feature data (PR #41).
# htrnaseq v0.5.1
## Bug fixes
* `generate_well_statistics`: fix `ValueError` when an empty .bam file is provided as input (PR #40).
* `create_pdata`: avoid false positive `ValueError` for non-overlapping barcodes when input
data contains empty (`NA`) values (PR #40).
# htrnaseq v0.5.0
## New functionality
* An `ignore` parameter was added to the `runner` workflow in order to pass over certain input files
from the input directory (PR #39).
# htrnaseq v0.4.0
## Breaking changes
An effort has been made to align the inputs for the `htrnaseq` workflow and the mapping and demultiplexing of the wells, in order
to simplify running these steps separately (PR #37).
* Changes to the `parallel_map` component:
- The `barcode` argument has been renamed to `barcodesFasta` and the provided
value for this argument must now be a single FASTA file instead of a list of barcodes.
- The filenames for the provided FASTQ files must now conform to the format `{name}_R(1|2).fasta`,
where `{name}` is the well identifier. The well identifiers correspond to the headers
of the FASTA file containing the barcodes (up until the first whitespace).
Forward and reverse FASTQ files must still be provided in pairs, meaning that the order of
files provided to `input_r1` and `input_r2` remains important.
- The requirement for an equal number of barcodes and FASTQ pairs to be provided has been dropped.
Instead, the barcodes provided with `barcodesFasta` are matched to the input FASTQ files by comparing
the header of the FASTA records to the file names of the provided FASTQ input files. Each barcode must
match exactly one FASTQ input pair (forward and reverse reads), but FASTQ files that were not matched to any
barcode are not processed. Basically, the barcodes fasta can now act as a filter for the FASTQ files to be mapped.
* The `utils/groupWells` workflow has been removed.
* `parallel_map_wf` has been removed as its functionality is now incorporated into the `parallel_map` component.
* The `pool`, `well_id`, `barcode`, `lane`, `pair_end` and `n_wells` output arguments have been dropped from the
`well_demultiplexing` workflow. This workflow now only outputs a list of demultiplexed FASTQ files.
* A `well_metadata` workflow has been implemented that extracts the metadata that is no longer output by the `well_demultiplexing`
workflow from the demultiplexed files and the barcodes FASTA.
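
The barcode-to-file matching described for `parallel_map` above can be sketched as follows. This is a simplified illustration, not the component's code: the function names and the regular expression are assumptions, and the sketch pairs files purely by name instead of by their order in `input_r1` and `input_r2`.

```python
import re
from pathlib import Path

# Assumed file name pattern: "<name>_R1.<ext>" or "<name>_R2.<ext>".
FASTQ_NAME = re.compile(r"^(?P<name>.+)_R(?P<read>[12])\.")


def read_well_ids(fasta_text):
    """Return the well IDs: each FASTA header up to the first whitespace."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].split()
    ]


def match_wells_to_fastqs(well_ids, fastq_paths):
    """Map each well ID to exactly one (R1, R2) pair; ignore unmatched files."""
    by_name = {}
    for path in fastq_paths:
        match = FASTQ_NAME.match(Path(path).name)
        if match is None:
            continue  # not a recognisable FASTQ name
        by_name.setdefault(match["name"], {}).setdefault(match["read"], []).append(path)

    matched = {}
    for well in well_ids:
        reads = by_name.get(well, {})
        if len(reads.get("1", [])) != 1 or len(reads.get("2", [])) != 1:
            raise ValueError(f"Barcode '{well}' must match exactly one FASTQ pair")
        matched[well] = (reads["1"][0], reads["2"][0])
    return matched


wells = read_well_ids(">A1 first well\nACGT\n>B2\nTTGA\n")
files = ["in/A1_R1.fastq", "in/A1_R2.fastq", "in/B2_R1.fastq",
         "in/B2_R2.fastq", "in/C3_R1.fastq", "in/C3_R2.fastq"]
print(match_wells_to_fastqs(wells, files))
```

In this example the `C3` files are silently skipped because no barcode header names them, which is the filtering behaviour the changelog describes.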
## New functionality
* Multiple input directories can now be provided. The input reads from these directories
will be joined per barcode before mapping. This is useful when data has been generated using
multiple sequencing runs in order to increase sequencing depth (PR #38).
# htrnaseq v0.3.0
## New functionality
* Added `umi_length` argument (PR #27).
* Added `runner` workflow (PR #26, see below)
## `runner` workflow
* Removed `wellBarcodesLength` from `parallel_map` workflow (PR #27).
## Major changes
A `runner` workflow has been added, providing two additional features:
1. Start from an input directory containing fastq files rather than a list of input fastq pairs.
2. Improve the output of the workflow
### Input directory
It is now possible to specify a single `--input <basedir>` directory as input and the runner will extract the fastq file pairs. An error will be raised if the filename processing leads to errors.
### Output
The runner provides a completely different approach to output. A couple of things are important here:
- Output is split up in 2 parts:
1. The well-demultiplexed fastq files (`--fastq_publish_dir`)
2. All the other results of the workflow (`--results_publish_dir`)
- The well-demultiplexed fastq files are stored under `--fastq_publish_dir` according to the following format:
```
$fastq_publish_dir/$id/<date-time>_htrnaseq_<version>/$plate_$lane/<well_id>_R1/2_001.fastq
```
- The other results are stored under `--results_publish_dir` according to the following format:
```
$results_publish_dir/$project_id/$experiment_id/<date-time>_htrnaseq_<version>/
```
This is an example listing of this directory:
```
esets
fData
nrReadsNrGenesPerChrom
pData
report.html
star_output
starLogs
```
This output structure can be circumvented by using the `--output_dir` option, which will store all output in a single directory.
1. Using the `htrnaseq` workflow directory rather than the `runner` interface
2. Using the argument `--plain_output` with the `runner`. fastq files and other results will still be published in their respective directories, but not in a directory hierarchy as described above.
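
As an illustration of the fastq path format described above, the following Python sketch assembles one such path. The function and argument names are hypothetical, and the timestamp is passed in as a ready-made string because the changelog does not spell out the `<date-time>` format.

```python
from pathlib import PurePosixPath


def fastq_publish_path(fastq_publish_dir, run_id, timestamp, version,
                       plate, lane, well_id, read):
    """Build <dir>/<id>/<date-time>_htrnaseq_<version>/<plate>_<lane>/<well>_R<read>_001.fastq.

    `timestamp` is used verbatim; its real format is not documented here.
    """
    run_dir = f"{timestamp}_htrnaseq_{version}"
    file_name = f"{well_id}_R{read}_001.fastq"
    return PurePosixPath(fastq_publish_dir) / run_id / run_dir / f"{plate}_{lane}" / file_name


# All values below are made up for the example.
print(fastq_publish_path("fastq_out", "run1", "TIMESTAMP", "v0.3.0",
                         "plateA", "L001", "A1", 1))
```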
## Minor changes
* Use `v0.2.0` version of cutadapt instead of `main` (PR #23).
* Use `v0.3.0` version of cutadapt
* Bump viash to 0.9.1 (PR #31).
* `create_eset`: Update base container image, `R` version and all dependencies
to newer versions (PR #28).
# htrnaseq v0.2.0
## New functionality
* Make sure that the Well ID matches the required format (PR #22 and PR #21).
# htrnaseq v0.1.0
Initial release

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2021 OpenPipelines
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

207
README.md Normal file

@@ -0,0 +1,207 @@
# HT-RNAseq
[![ViashHub](https://img.shields.io/badge/ViashHub-htrnaseq-7a4baa.svg)](https://www.viash-hub.com/packages/htrnaseq)
[![GitHub](https://img.shields.io/badge/GitHub-viash--hub%2Fhtrnaseq-blue.svg)](https://github.com/viash-hub/htrnaseq)
[![GitHub
License](https://img.shields.io/github/license/viash-hub/htrnaseq.svg)](https://github.com/viash-hub/htrnaseq/blob/main/LICENSE)
[![GitHub
Issues](https://img.shields.io/github/issues/viash-hub/htrnaseq.svg)](https://github.com/viash-hub/htrnaseq/issues)
[![Viash
version](https://img.shields.io/badge/Viash-v0.9.4-blue.svg)](https://viash.io)
## Introduction
This workflow is designed to process high-throughput RNA-seq data, where
every well of a microarray plate is a sample. A fasta file provided as
input defines the mapping between sample barcodes and wells.
The workflow is built in a modular fashion, where most of the base
functionality is provided by components from
[`biobox`](https://www.viash-hub.com/packages/biobox/latest)
supplemented by custom base components and workflow components in this
package.
The full workflow is split in two major subworkflows that can be run
independently:
- **Well-demultiplexing:** Split the input (plate/pool level) fastq
files per well.
- **Mapping, counting and QC:** Run per-well mapping, counting and
generate QC reports.
Each of those can be started individually, or the full workflow can be
run in two ways:
1. Run the [main
workflow](https://www.viash-hub.com/packages/htrnaseq/v0.3.0/components/workflows/htrnaseq)
containing the main functionality.
2. Run the [(opinionated)
`runner`](https://www.viash-hub.com/packages/htrnaseq/v0.3.0/components/workflows/runner)
where a number of choices (input/output structure and location) have
been made.
Input for the workflow has to be `fastq` files (zipped or not). For bcl
or other formats, please consider running
[demultiplex](https://www.viash-hub.com/packages/demultiplex) first.
``` mermaid lang="mermaid"
flowchart TB
subgraph runner [runner]
direction TB
subgraph htrnaseq [HT-RNAseq]
direction LR
demultiplex[Well demultiplexing]
map
report
eset
end
end
demultiplex --> map --> report --> eset
class runner container
class htrnaseq container
class demultiplex container-inner
class map container-inner
class report container-inner
class eset container-inner
class demultiplex node
class map node
class report node
class eset node
```
## Example usage
### Test and example data
If you want to explore this workflow, it's possible to use the data we
use as test data: [a DRUGseq
dataset](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176150)
from the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra).
For the unit and integration tests, this data has been (partly)
subsampled to reduce the test runtime. We used
[seqtk](https://github.com/lh3/seqtk) for this with a seed of 1, e.g.:
``` bash
seqtk sample -s1 orig/SRR14730302/VH02001614_S8_R1_001.fastq.gz 10000 > 10k/SRR14730302/VH02001614_S8_R1_001.fastq.gz
```
This data is available at: `gs://viash-hub-test-data/htrnaseq/v1/`.
### Run from Viash Hub
Open [Viash Hub](https://www.viash-hub.com) and browse to the [htrnaseq
component](https://www.viash-hub.com/packages/htrnaseq/v0.3.0/components/workflows/htrnaseq).
Press the Launch button and follow the instructions.
![](assets/htrnaseq-launch-small.png)
We will start an example run loading just one input and using a barcodes
fasta file containing only 2 wells.
In the first step, we add the `local` profile to the list of profiles in
order to limit the cpu and memory requirements of the workflow steps:
![](assets/launch-parameters-1-small.png)
In the next step, we provide the parameters as follows:
- `input_r1`:
`gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730301/VH02001612_S9_R1_001.fastq`
- `input_r2`:
`gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730301/VH02001612_S9_R2_001.fastq`
- `genomeDir`:
`gs://viash-hub-test-data/htrnaseq/v1/genomeDir/subset/Homo_sapiens/v0.0.3/`
- `barcodesFasta`:
`gs://viash-hub-test-data/htrnaseq/v1/2-wells-with-ids.fasta`
- `annotation`:
`gs://viash-hub-test-data/htrnaseq/v1/genomeDir/gencode.v41.annotation.gtf.gz`
Please note that both `input_r1` and `input_r2` can take multiple
values. This means that one has to press ENTER after pasting the input
path.
![](assets/launch-parameters-2-small.png)
Press the Launch button at the end to get the instructions on how to
run the workflow from the CLI.
### Run using NF-Tower / Seqera Cloud
It's possible to run the workflow directly from [Seqera
Cloud](https://cloud.seqera.io). The necessary [Nextflow schema
file](https://nextflow-io.github.io/nf-schema/latest/nextflow_schema/nextflow_schema_specification/)
has been built and provided with the workflows in order to use the
form-based input. However, Seqera Cloud cannot deal with multiple-value
parameters when using the form-based input. Therefore, it's better to
use Viash Hub here as well:
First, select the option to run the workflow using Seqera Cloud. You
will need to create an API token for your account. Once this token is
filled in in the corresponding field, you will get the option to select
a Workspace and a Compute environment.
![](assets/launch-parameters-3-small.png)
Next, we need to fill in the parameters for the run. This is similar to
before:
![](assets/launch-parameters-4-small.png)
In the next screen, pressing the Launch button will actually start the
workflow on Seqera Cloud. A message is shown when the submit was
successful.
![](assets/launch-parameters-5-small.png)
### Run from the CLI
Running from the CLI directly without using Viash Hub is possible. The
easiest way to get started is the integrated help functionality, for
instance:
``` bash
nextflow run https://packages.viash-hub.com/vsh/htrnaseq.git \
-revision v0.8.1 \
-main-script target/nextflow/workflows/runner/main.nf \
--help
```
### (Optional) Resource usage tuning
Nextflow's labels can be used to specify the amount of resources a
process can use. This workflow uses the following labels for CPU and
memory:
- `verylowmem`, `lowmem`, `midmem`, `highmem`
- `verylowcpu`, `lowcpu`, `midcpu`, `highcpu`
The defaults for these labels can be found at
`src/config/labels.config`. Nextflow checks that the specified resources
for a process do not exceed what is available on the machine and will
not start if they do. Create your own config file to tune the labels to
your needs, for example:
    // Resource labels
    withLabel: verylowcpu { cpus = 2 }
    withLabel: lowcpu { cpus = 8 }
    withLabel: midcpu { cpus = 16 }
    withLabel: highcpu { cpus = 32 }
    withLabel: verylowmem { memory = { get_memory( 4.GB * task.attempt ) } }
    withLabel: lowmem { memory = { get_memory( 8.GB * task.attempt ) } }
    withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
    withLabel: highmem { memory = { get_memory( 64.GB * task.attempt ) } }
When starting Nextflow using the CLI, you can use `-c` to provide the
file to Nextflow and override the defaults.
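
The memory labels call a `get_memory` helper that is defined in
`src/config/labels.config`. The idea behind it, scaling the request with
each attempt while never exceeding a configured ceiling, can be
paraphrased in Python as follows. This is a sketch for illustration
only: the function and parameter names are made up, and memory is
represented as a plain number of GB rather than a Nextflow memory unit.

``` python
def capped_memory(requested_gb, attempt, max_memory_gb=None, max_retries=None):
    """Paraphrase of the capping idea; all names here are illustrative."""
    # No ceiling configured: use the request unchanged.
    if not max_memory_gb:
        return requested_gb
    # On the attempt whose number equals max_retries, go straight to the ceiling.
    if max_retries and attempt == max_retries:
        return max_memory_gb
    # Otherwise never request more than the ceiling.
    return min(requested_gb, max_memory_gb)


# highmem label: 64 GB * attempt, with a ceiling of 192 GB and 3 retries
# (the defaults in src/config/labels.config).
for attempt in (1, 2, 3):
    print(attempt, capped_memory(64 * attempt, attempt,
                                 max_memory_gb=192, max_retries=3))
```

When tuning labels for a smaller machine, lowering the ceiling is
usually enough, because every label request is clipped to it.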
## Contributions
Developed in collaboration with Data Intuitive and Open Analytics.
Other contributions are welcome.

149
README.qmd Normal file

@@ -0,0 +1,149 @@
---
format: gfm
---
```{r setup, include=FALSE}
project <- yaml::read_yaml("_viash.yaml")
license <- paste0(project$links$repository, "/blob/main/LICENSE")
contributing <- paste0(project$links$repository, "/blob/main/CONTRIBUTING.md")
```
# HT-RNAseq
[![ViashHub](https://img.shields.io/badge/ViashHub-`r project$name`-7a4baa.svg)](https://www.viash-hub.com/packages/`r project$name`)
[![GitHub](https://img.shields.io/badge/GitHub-viash--hub%2F`r project$name`-blue.svg)](`r project$links$repository`)
[![GitHub License](https://img.shields.io/github/license/viash-hub/`r project$name`.svg)](`r license`)
[![GitHub Issues](https://img.shields.io/github/issues/viash-hub/`r project$name`.svg)](`r project$links$issue_tracker`)
[![Viash version](https://img.shields.io/badge/Viash-v`r gsub("-", "--", project$viash_version)`-blue.svg)](https://viash.io)
## Introduction
`r project$description`
```{mermaid lang='mermaid'}
flowchart TB
subgraph runner [runner]
direction TB
subgraph htrnaseq [HT-RNAseq]
direction LR
demultiplex[Well demultiplexing]
map
report
eset
end
end
demultiplex --> map --> report --> eset
class runner container
class htrnaseq container
class demultiplex container-inner
class map container-inner
class report container-inner
class eset container-inner
class demultiplex node
class map node
class report node
class eset node
```
## Example usage
### Test and example data
If you want to explore this workflow, it's possible to use the data we use as test data: [a DRUGseq dataset](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176150) from the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra). For the unit and integration tests, this data has been (partly) subsampled to reduce the test runtime. We used [seqtk](https://github.com/lh3/seqtk) for this with a seed of 1, e.g.:
```bash
seqtk sample -s1 orig/SRR14730302/VH02001614_S8_R1_001.fastq.gz 10000 > 10k/SRR14730302/VH02001614_S8_R1_001.fastq.gz
```
This data is available at: `gs://viash-hub-test-data/htrnaseq/v1/`.
### Run from Viash Hub
Open [Viash Hub](https://www.viash-hub.com) and browse to the [htrnaseq component](https://www.viash-hub.com/packages/htrnaseq/v0.3.0/components/workflows/htrnaseq). Press the 'Launch' button and follow the instructions.
![](assets/htrnaseq-launch-small.png)
We will start an example run loading just one input and using a barcodes fasta file containing only 2 wells.
In the first step, we add the `local` profile to the list of profiles in order to limit the cpu and memory requirements of the workflow steps:
![](assets/launch-parameters-1-small.png)
In the next step, we provide the parameters as follows:
- `input_r1`: `gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730301/VH02001612_S9_R1_001.fastq`
- `input_r2`: `gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730301/VH02001612_S9_R2_001.fastq`
- `genomeDir`: `gs://viash-hub-test-data/htrnaseq/v1/genomeDir/subset/Homo_sapiens/v0.0.3/`
- `barcodesFasta`: `gs://viash-hub-test-data/htrnaseq/v1/2-wells-with-ids.fasta`
- `annotation`: `gs://viash-hub-test-data/htrnaseq/v1/genomeDir/gencode.v41.annotation.gtf.gz`
Please note that both `input_r1` and `input_r2` can take multiple values. This means that one has to press ENTER after pasting the input path.
![](assets/launch-parameters-2-small.png)
Press the 'Launch' button at the end to get the instructions on how to run the workflow from the CLI.
### Run using NF-Tower / Seqera Cloud
It's possible to run the workflow directly from [Seqera Cloud](https://cloud.seqera.io). The necessary [Nextflow schema file](https://nextflow-io.github.io/nf-schema/latest/nextflow_schema/nextflow_schema_specification/) has been built and provided with the workflows in order to use the form-based input. However, Seqera Cloud cannot deal with multiple-value parameters when using the form-based input. Therefore, it's better to use Viash Hub here as well:
First, select the option to run the workflow using Seqera Cloud. You will need to create an API token for your account. Once this token is filled in in the corresponding field, you will get the option to select a 'Workspace' and a 'Compute environment'.
![](assets/launch-parameters-3-small.png)
Next, we need to fill in the parameters for the run. This is similar to before:
![](assets/launch-parameters-4-small.png)
In the next screen, pressing the 'Launch' button will actually start the workflow on Seqera Cloud. A message is shown when the submit was successful.
![](assets/launch-parameters-5-small.png)
### Run from the CLI
Running from the CLI directly without using Viash Hub is possible. The easiest way to get started is the integrated help functionality, for instance:
```bash
nextflow run https://packages.viash-hub.com/vsh/htrnaseq.git \
-revision v0.8.1 \
-main-script target/nextflow/workflows/runner/main.nf \
--help
```
### (Optional) Resource usage tuning
Nextflow's labels can be used to specify the amount of resources a process can use. This workflow uses the following labels for CPU and memory:
* `verylowmem`, `lowmem`, `midmem`, `highmem`
* `verylowcpu`, `lowcpu`, `midcpu`, `highcpu`
The defaults for these labels can be found at `src/config/labels.config`. Nextflow checks that the specified resources for a process do not exceed what is available on the machine and will not start if they do. Create your own config file to tune the labels to your needs, for example:
```
// Resource labels
withLabel: verylowcpu { cpus = 2 }
withLabel: lowcpu { cpus = 8 }
withLabel: midcpu { cpus = 16 }
withLabel: highcpu { cpus = 32 }
withLabel: verylowmem { memory = { get_memory( 4.GB * task.attempt ) } }
withLabel: lowmem { memory = { get_memory( 8.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 64.GB * task.attempt ) } }
```
When starting Nextflow using the CLI, you can use `-c` to provide the file to Nextflow and override the defaults.
## Contributions
Developed in collaboration with Data Intuitive and Open Analytics.
Other contributions are welcome.

21
_viash.yaml Normal file

@@ -0,0 +1,21 @@
name: htrnaseq
version: v0.11.0
summary: |
  A workflow for high-throughput RNA-seq data analyses.
description: "This workflow is designed to process high-throughput RNA-seq data, where every\nwell of a microarray plate is a sample. A fasta file provided as input\ndefines the mapping between sample barcodes and wells.\n\nThe workflow is built in a modular fashion, where most of the base functionality\nis provided by components from [`biobox`](https://www.viash-hub.com/packages/biobox/latest)\nsupplemented by custom base components and workflow components in this package.\n\nThe full workflow is split in two major subworkflows that can be run independently:\n\n* **Well-demultiplexing:** Split the input (plate/pool level) fastq files per well.\n* **Mapping, counting and QC:** Run per-well mapping, counting and generate QC reports.\n\nEach of those can be started individually, or the full workflow can be run in two ways:\n\n1. Run the [main workflow](https://www.viash-hub.com/packages/htrnaseq/v0.3.0/components/workflows/htrnaseq) \ncontaining the main functionality.\n2. Run the [(opinionated) `runner`](https://www.viash-hub.com/packages/htrnaseq/v0.3.0/components/workflows/runner) where a\nnumber of choices (input/output structure and location) have been made.\n\nInput for the workflow has to be `fastq` files (zipped or not). For bcl or other formats, please consider running\n[demultiplex](https://www.viash-hub.com/packages/demultiplex) first.\n"
license: MIT
keywords: [bioinformatics, sequencing, high-throughput, RNAseq, mapping, counting, pipeline, workflow]
links:
  issue_tracker: https://github.com/viash-hub/htrnaseq/issues
  repository: https://github.com/viash-hub/htrnaseq
viash_version: 0.9.4
info:
  test_resources:
    - path: gs://viash-hub-resources/htrnaseq/v2
      dest: resources_test
config_mods: |
  .requirements.commands := ['ps']
  .runners[.type == 'nextflow'].config.script += 'includeConfig("nextflow_labels.config")'
  .resources += {path: '/src/config/labels.config', dest: 'nextflow_labels.config'}
  .resources += {path: '/_viash.yaml', dest: '_viash.yaml'}
organization: vsh

[12 binary image files not shown (32 KiB to 287 KiB each), including assets/htrnaseq-launch.png]

3
main.nf Normal file

@@ -0,0 +1,3 @@
workflow {
print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
}

12
nextflow.config Normal file

@@ -0,0 +1,12 @@
manifest {
homePage = 'https://github.com/viash-hub/htrnaseq'
description = 'HT-RNAseq pipeline'
mainScript = 'target/nextflow/workflows/htrnaseq/main.nf'
}
process {
withName: publishStatesProc {
publishDir = [ enabled: false ]
}
}

View File

@@ -0,0 +1,11 @@
name: Dries Schaumont
info:
  links:
    email: dries@data-intuitive.com
    github: DriesSchaumont
    orcid: "0000-0002-4389-0440"
    linkedin: dries-schaumont
  organizations:
    - name: Data Intuitive
      href: https://www.data-intuitive.com
      role: Data Scientist

View File

@@ -0,0 +1,10 @@
name: Marijke Van Moerbeke
info:
  links:
    github: mvanmoerbeke
    orcid: 0000-0002-3097-5621
    linkedin: marijke-van-moerbeke-84303a34
  organizations:
    - name: OpenAnalytics
      href: https://www.openanalytics.eu
      role: Statistical Consultant

View File

@@ -0,0 +1,10 @@
name: Toni Verbeiren
info:
  role: Core Team Member
  links:
    github: tverbeiren
    linkedin: verbeiren
  organizations:
    - name: Data Intuitive
      href: https://www.data-intuitive.com
      role: Data Scientist and CEO

108
src/config/labels.config Normal file

@@ -0,0 +1,108 @@
executor {
$k8s {
submitRateLimit = '10sec'
pollInterval = '1 sec'
}
}
process {
container = 'nextflow/bash:latest'
// default resources
memory = { 8.Gb * task.attempt }
cpus = 8
maxForks = 36
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
maxMemory = 192.GB
// Resource labels
withLabel: verylowcpu { cpus = 2 }
withLabel: lowcpu { cpus = 8 }
withLabel: midcpu { cpus = 16 }
withLabel: highcpu { cpus = 32 }
withLabel: verylowmem { memory = { get_memory( 4.GB * task.attempt ) } }
withLabel: lowmem { memory = { get_memory( 8.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 64.GB * task.attempt ) } }
}
profiles {
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
docker {
docker.fixOwnership = true
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
local {
// This config is for local processing.
process {
withName: ".*parallel_map_process" {
maxForks = 1
}
maxMemory = 25.GB
withLabel: verylowcpu { cpus = 2 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 6 }
withLabel: highcpu { cpus = 8 }
withLabel: lowmem { memory = { get_memory( 8.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 12.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 20.GB * task.attempt ) } }
}
}
}
def get_memory(to_compare) {
if (!process.containsKey("maxMemory") || !process.maxMemory) {
return to_compare
}
try {
if (process.containsKey("maxRetries") && process.maxRetries && task.attempt == (process.maxRetries as int)) {
return process.maxMemory
}
else if (to_compare.compareTo(process.maxMemory as nextflow.util.MemoryUnit) == 1) {
return process.maxMemory as nextflow.util.MemoryUnit
}
else {
return to_compare
}
} catch (all) {
println "Error processing memory resources. Please check that process.maxMemory '${process.maxMemory}' and process.maxRetries '${process.maxRetries}' are valid!"
System.exit(1)
}
}
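
The capping rule implemented by `get_memory` above can be sketched outside Nextflow as follows. This is a minimal illustration in Python, not part of the pipeline: the function name `capped_memory_gb`, its parameters, and the use of plain gigabyte numbers instead of `nextflow.util.MemoryUnit` are all assumptions made for the example.

```python
from typing import Optional


def capped_memory_gb(requested_gb: float, attempt: int,
                     max_memory_gb: Optional[float] = None,
                     max_retries: Optional[int] = None) -> float:
    """Hypothetical stand-in for get_memory(), using plain GB values."""
    # Without a configured ceiling the request is passed through unchanged.
    if not max_memory_gb:
        return requested_gb
    # On the attempt whose number equals maxRetries, ask for the full ceiling.
    if max_retries and attempt == max_retries:
        return max_memory_gb
    # Otherwise never exceed the ceiling.
    return min(requested_gb, max_memory_gb)


# 'highmem' label: 64 GB per attempt, ceiling of 192 GB, maxRetries of 3.
for attempt in (1, 2, 3):
    print(attempt, capped_memory_gb(64 * attempt, attempt, 192, 3))
```

With the `local` profile's ceiling of 25 GB, the same rule would clamp a second-attempt `highmem` request of 40 GB down to 25 GB.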

@@ -0,0 +1,56 @@
name: create_eset
namespace: "eset"
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ maintainer ]
  - __merge__: /src/base/authors/marijke_van_moerbeke.yaml
    roles: [ author ]
argument_groups:
  - name: "Arguments"
    arguments:
      - type: file
        name: "--pDataFile"
        required: true
      - type: file
        name: "--fDataFile"
        required: true
      - type: file
        name: "--mappingDir"
        multiple: true
        required: true
      - type: string
        name: --poolName
        required: true
      - name: "--output"
        type: file
        required: true
        direction: output
        default: eset.$id.rds
resources:
  - type: r_script
    path: script.R
test_resources:
  - type: r_script
    path: test.R
  - path: test_data/pData.tsv
  - path: test_data/fData.tsv
  - path: test_data/mapping_dir
engines:
  - type: docker
    image: rocker/r2u:24.04
    setup:
      - type: r
        cran:
          - data.table
          - nlcv
        bioc:
          - Seurat
    test_setup:
      - type: r
        cran:
          - testthat
runners:
  - type: executable
  - type: nextflow

@@ -0,0 +1,431 @@
library(Biobase)
library(data.table)
library(nlcv)
library(Matrix)
library(Seurat)
### VIASH START
par <- list(
pDataFile = "src/eset/create_eset/test_data/pData.tsv",
fDataFile = "src/eset/create_eset/test_data/fData.tsv",
studyType = "Standard",
mappingDir = c("src/eset/create_eset/test_data/mapping_dir/AACAAGGTAC",
"src/eset/create_eset/test_data/mapping_dir/ACGCCTTCGT"),
output = "eset.rds",
poolName = "Foo"
)
### VIASH END
Read10X <- function(data_dir = NULL, gene_column = 2, unique_features = TRUE) {
full.data <- list()
for (i in seq_along(along.with = data_dir)) {
run <- data_dir[i]
if (!dir.exists(paths = run)) {
stop("Directory provided does not exist")
}
barcode.loc <- file.path(run, "barcodes.tsv")
gene.loc <- file.path(run, "features.tsv")
features.loc <- file.path(run, "features.tsv.gz")
matrix.loc <- file.path(run, "matrix.mtx")
pre_ver_3 <- file.exists(gene.loc)
if (!pre_ver_3) {
addgz <- function(s) {
return(paste0(s, ".gz"))
}
barcode.loc <- addgz(s = barcode.loc)
matrix.loc <- addgz(s = matrix.loc)
}
if (!file.exists(barcode.loc)) {
stop("Barcode file missing")
}
if (!pre_ver_3 && !file.exists(features.loc)) {
stop("Gene name or features file missing")
}
if (!file.exists(matrix.loc)) {
stop("Expression matrix file missing")
}
data <- readMM(file = matrix.loc)
cell.names <- readLines(barcode.loc)
if (all(grepl(pattern = "\\-1$", x = cell.names))) {
cell.names <- as.vector(x = as.character(x = sapply(X = cell.names,
FUN = ExtractField, field = 1, delim = "-")))
}
if (is.null(x = names(x = data_dir))) {
if (i < 2) {
colnames(x = data) <- cell.names
}
else {
colnames(x = data) <- paste0(i, "_", cell.names)
}
}
else {
colnames(x = data) <- paste0(names(x = data_dir)[i],
"_", cell.names)
}
feature.names <- read.delim(file = ifelse(test = pre_ver_3,
yes = gene.loc, no = features.loc), header = FALSE,
stringsAsFactors = FALSE)
if (any(is.na(x = feature.names[, gene_column]))) {
warning("Some features names are NA. Replacing NA names with ID from the opposite column requested",
call. = FALSE, immediate. = TRUE)
na.features <- which(x = is.na(x = feature.names[,
gene_column]))
replacement.column <- ifelse(test = gene_column ==
2, yes = 1, no = 2)
feature.names[na.features, gene_column] <- feature.names[na.features,
replacement.column]
}
if (unique_features) {
fcols = ncol(x = feature.names)
if (fcols < gene_column) {
stop(paste0("gene_column was set to ", gene_column,
" but features.tsv.gz (or features.tsv) only has ",
fcols, " columns.", " Try setting the gene_column ",
"argument to a value <= to ",
fcols, "."))
}
rownames(x = data) <- make.unique(names = feature.names[,
gene_column])
}
if (ncol(x = feature.names) > 2) {
data_types <- factor(x = feature.names$V3)
lvls <- levels(x = data_types)
if (length(x = lvls) > 1 && length(x = full.data) == 0) {
message(paste0("10X data contains more than one type and is ",
"being returned as a list containing matrices ",
"of each type."))
}
expr_name <- "Gene Expression"
if (expr_name %in% lvls) {
lvls <- c(expr_name, lvls[-which(x = lvls ==
expr_name)])
}
data <- lapply(X = lvls, FUN = function(l) {
return(data[data_types == l, , drop = FALSE])
})
names(x = data) <- lvls
} else {
data <- list(data)
}
full.data[[length(x = full.data) + 1]] <- data
}
list_of_data <- list()
for (j in 1:length(x = full.data[[1]])) {
list_of_data[[j]] <- do.call(cbind, lapply(X = full.data,
FUN = `[[`, j))
list_of_data[[j]] <- as(object = list_of_data[[j]], Class = "CsparseMatrix")
}
names(x = list_of_data) <- names(x = full.data[[1]])
if (length(x = list_of_data) == 1) {
return(list_of_data[[1]])
} else {
return(list_of_data)
}
}
match_features <- function(exprs_matrix, fdata) {
identical_features <- all(rownames(exprs_matrix) == rownames(fdata))
if (nrow(exprs_matrix) != nrow(fdata) || !identical_features) {
message(paste0("Features in 'fData' and expression matrix differ. ",
"Only matching features are returned."))
}
features <- intersect(rownames(exprs_matrix), rownames(fdata))
exprs_matrix <- exprs_matrix[which(rownames(exprs_matrix) %in% features), ]
fdata <- fdata[which(rownames(fdata) %in% features), ]
fdata[, seq_len(ncol(fdata))] <- lapply(fdata[, seq_len(ncol(fdata)), drop = FALSE], as.character)
# order features in exprs mat according to fdata
exprs_matrix <- exprs_matrix[match(rownames(fdata), rownames(exprs_matrix)), ]
list(exprs_matrix = exprs_matrix, fdata = fdata)
}
create_pdata <- function(sample_file, pool_name, barcodes) {
cols_to_remove <- c("SampleFileName", "Output", "Measure", "Strandedness")
pData <- sample_file[, !colnames(sample_file) %in% cols_to_remove,
drop = FALSE]
rownames(pData) <- paste(pool_name, sample_file$WellBC, sep = "_")
# pData[, ] <- lapply(pData, as.factor)
pData$PoolName <- pool_name
pData <- pData[match(barcodes, pData$WellBC), ]
return(pData)
}
check_sample_file <- function(mapping_dir, sample_file){
message("Checking sample annotation:")
requireNamespace("tools")
mapping_dir <- unlist(lapply(mapping_dir, function(x) {
if (!dir.exists(x)) {
stop(sprintf(paste0("Could not find directory ",
"provided in 'mappingDir' argument (%s)."), x))
}
tools::file_path_as_absolute(x)
}))
# additional check for STARsolo
check_STARsolo_output <- function(x) {
files <- c("barcodes.tsv", "features.tsv", "matrix.mtx")
test <- list.files(x) %in% c(files, paste0(files, ".gz"))
length(test) != 0 && all(test)
}
if (!"WellBC" %in% colnames(sample_file)) {
stop(paste0("STARsolo output is used. The sample annotation must ",
"contain 'WellBC' column providing cell barcodes."))
}
mapping_dir <- unique(mapping_dir)
all_STARsolo_files_present <- all(
unlist(
lapply(mapping_dir, function(x) {
check_STARsolo_output(x)
})
)
)
if (!all_STARsolo_files_present) {
stop(paste0("Could not find files: 'barcodes', 'features' and 'matrix'",
" for STARsolo output. Please check 'mappingDir' argument."))
}
message("- 'WellBC' column and STARsolo output files - OK")
list(sample_expression_files = mapping_dir)
}
create_exprs_matrix <- function(exprs_matrix_path, exprs_file_paths,
output, measure, col_names, cell_barcodes) {
read_matrix <- Read10X(data_dir = exprs_file_paths, gene_column = 1)
# keep index of feature names containing "_" because Seurat
# changes them to "-" and they no longer match with fdata[, "gene_id"]
idx <- grep("_", rownames(read_matrix))
requireNamespace("Seurat")
seurat_object <- Seurat::CreateSeuratObject(counts = read_matrix)
exprs_matrix <- as.matrix(seurat_object[['RNA']]$counts)
# restore "_" in the names of features that contained "_"
# before the conversion to a Seurat object
rownames(exprs_matrix)[idx] <- gsub("-", "_", rownames(exprs_matrix)[idx])
requireNamespace("stringr")
exprs_matrix <- exprs_matrix[, stringr::str_detect(colnames(exprs_matrix),
paste(cell_barcodes, collapse = "|"))]
# check if rownames are ENSEMBL and remove version suffix
isENSEMBL <- all(grepl("ENS", rownames(exprs_matrix)))
if (isENSEMBL) {
# do not use gsub("(.+)[.]\\d+", "\\1", rownames(exprs_matrix)),
# so that ENS000000.1_PAR_Y can be kept
rownames(exprs_matrix) <- gsub("\\.\\d+$", "", rownames(exprs_matrix))
}
colnames(exprs_matrix) <- col_names
exprs_matrix
}
create_eset <- function(feature_annotation_path,
sample_annotation_path,
mapping_dir,
barcodes,
output_path,
pool_name,
exprs_matrix_path = NULL,
path = NULL,
add_eset_annotation = NULL) {
if (!file.exists(feature_annotation_path)) {
stop("Could not find feature annotation at '", feature_annotation_path, "'")
}
if (!file.exists(sample_annotation_path)) {
stop("Could not find sample annotation at '", sample_annotation_path, "'")
}
if(!is.null(exprs_matrix_path)) {
if(!file.exists(exprs_matrix_path)) {
stop("Could not find expression matrix at '", exprs_matrix_path, "'")
}
}
if(!is.null(path)) {
if(!dir.exists(path)) {
stop("Provided 'path': '", path, "' does not exist.")
}
}
##### Import annotation files #####
message("Importing feature annotation")
fdata_file <- read.table(feature_annotation_path, header = TRUE,
sep = "\t", quote = "\"",
comment.char = "", stringsAsFactors = FALSE)
# for backwards compatibility
if("ENSEMBL" %in% colnames(fdata_file) && !all(grepl("ENS", fdata_file[, "ENSEMBL"])) && !"gene_id" %in% colnames(fdata_file)) {
colnames(fdata_file)[which(colnames(fdata_file) == "ENSEMBL")] <- "gene_id"
}
# Check gene annotation
if(!"gene_id" %in% colnames(fdata_file))
stop("'gene_id' column with unique feature identifiers must be present in 'feature_annotation_path'.")
# check if duplicated ids are present
if(any(duplicated(fdata_file$gene_id)))
stop("Duplicated features ids are not allowed. Please check the 'gene_id' column in 'feature_annotation_path'.")
message("Importing sample annotation")
sample_file <- read.table(sample_annotation_path, header = TRUE,
sep = "\t", quote = "\"",
comment.char = "", stringsAsFactors = FALSE)
# Check sample annotation
check_sample_file_list <- check_sample_file(mapping_dir = mapping_dir,
sample_file = sample_file)
output <- "STARsolo"
measure <- "counts"
sample_expression_files <- check_sample_file_list$sample_expression_files
##### Create phenodata #####
pdata_eset <- create_pdata(sample_file = sample_file, pool_name = pool_name,
barcodes = barcodes)
##### Create expression matrix #####
message("Creating expression matrix")
exprs_matrix_eset <- create_exprs_matrix(
exprs_matrix_path = exprs_matrix_path,
exprs_file_paths = sample_expression_files,
output = output,
measure = measure,
col_names = rownames(pdata_eset),
cell_barcodes = barcodes
)
##### Create featuredata #####
message("Creating feature data")
fdata_eset <- fdata_file
rownames(fdata_eset) <- fdata_eset[, "gene_id"]
# intersect features between exprs matrix and fdata
feature_files <- match_features(exprs_matrix = exprs_matrix_eset,
fdata = fdata_eset)
fdata_eset <- feature_files$fdata
exprs_matrix_eset <- feature_files$exprs_matrix
##### Create eSet #####
message("Creating eset")
if (nrow(pdata_eset) != ncol(exprs_matrix_eset)) {
stop("nrow(pData) and ncol(exprsMatrix) differ")
}
if (nrow(fdata_eset) != nrow(exprs_matrix_eset)) {
stop("nrow(fData) and nrow(exprsMatrix) differ")
}
if (!all(rownames(pdata_eset) == colnames(exprs_matrix_eset))) {
stop("rownames(pData) and colnames(exprsMatrix) differ")
}
if (!all(rownames(fdata_eset) == rownames(exprs_matrix_eset))) {
stop("rownames(fData) and rownames(exprsMatrix) differ")
}
if (!inherits(exprs_matrix_eset, "matrix")) {
stop("exprsMatrix must be of class 'matrix'")
}
additional_info <- paste0("Additional information about eSet \n",
" Expression matrix created from ",
output, " output. \n",
" Expression matrix contains non-transformed ",
ifelse(output %in% c("STAR", "STARsolo"),
"counts",
ifelse(measure == "expected_count",
"counts", measure)), ".")
if (!is.null(add_eset_annotation) &&
is.character(add_eset_annotation)) {
additional_info <- paste0(additional_info, "\n", " ", add_eset_annotation)
}
fdata_eset <- new("AnnotatedDataFrame", data = fdata_eset)
pdata_eset <- new("AnnotatedDataFrame", data = pdata_eset)
requireNamespace("Biobase")
eset <- Biobase::ExpressionSet(assayData = exprs_matrix_eset,
phenoData = pdata_eset,
featureData = fdata_eset,
annotation = additional_info)
eset <- eset[, colSums(exprs(eset)) != 0]
saveRDS(eset, file = output_path)
message(paste0("eset created successfully for ", ncol(eset),
" samples and ", nrow(eset),
" genes and saved at ", output_path, "."))
eset
}
p_data_file <- par$pDataFile
f_data_file <- par$fDataFile
pool_name <- par$poolName
mapping_dir <- lapply(par$mappingDir,
\(x) file.path(x, "Solo.out", "Gene", "raw"))
get_barcode_from_mapping_dir <- function(raw_dir) {
barcodes_file <- file.path(raw_dir, "barcodes.tsv")
if (!file.exists(barcodes_file)) {
stop(paste0("Expected the 'Solo.out/Gene/raw' directory at ",
raw_dir, " to contain a 'barcodes.tsv' file."))
}
barcodes <- readLines(barcodes_file)
if (length(barcodes) != 1) {
stop(paste0("A single STAR Solo folder should only have ",
"mapped one (1) barcode, but found '",
length(barcodes), "' for mapping directory ", raw_dir))
}
return(barcodes)
}
barcodes <- lapply(mapping_dir, get_barcode_from_mapping_dir)
print(paste0("mappingDir: ", mapping_dir))
print(paste0("pDataFile: ", p_data_file))
print(paste0("fDataFile: ", f_data_file))
print(paste0("poolName: ", pool_name))
print(paste0("barcodes: ", barcodes))
# CREATE ESET WITH RAW UMI COUNTS
eset <- create_eset(feature_annotation_path = f_data_file,
sample_annotation_path = p_data_file,
mapping_dir = mapping_dir,
barcodes = barcodes,
output_path = par$output,
pool_name = pool_name,
path = NULL,
exprs_matrix_path = NULL)
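
The version-stripping pattern used in `create_exprs_matrix` (`\\.\\d+$`) only removes a numeric suffix at the very end of an identifier, which is why IDs such as `ENS000000.1_PAR_Y` are left untouched. The sketch below mirrors that `gsub` call in Python for illustration only; the helper name `strip_version` and the sample identifiers are assumptions, not part of the component.

```python
import re

# Matches a dot followed by digits, anchored at the end of the identifier.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(feature_id: str) -> str:
    """Remove a trailing '.<digits>' version suffix, if there is one."""
    return _VERSION_SUFFIX.sub("", feature_id)


# Illustrative identifiers: versioned, unversioned, and a PAR_Y-style ID
# whose version number is not at the end of the string.
ids = ["ENSG00000178591.7", "ENSG00000125788", "ENS000000.1_PAR_Y"]
print([strip_version(i) for i in ids])
```

A greedier pattern such as `(.+)[.]\d+` would not be anchored at the end and could therefore alter the `_PAR_Y` identifier, which is the case the comment in the script warns about.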

132
src/eset/create_eset/test.R Normal file
@@ -0,0 +1,132 @@
library(testthat)
library(Biobase)
### VIASH START
meta <- list(
resources_dir = "src/eset/create_eset/test_data",
executable = "target/executable/eset/create_eset/create_eset"
)
### VIASH END
output <- tempfile()
out <- processx::run(meta$executable, c(
"--pDataFile", file.path(meta$resources_dir, "pData.tsv"),
"--fDataFile", file.path(meta$resources_dir, "fData.tsv"),
"--mappingDir", file.path(meta$resources_dir, "mapping_dir", "AACAAGGTAC"),
"--mappingDir", file.path(meta$resources_dir, "mapping_dir", "ACGCCTTCGT"),
"--poolName", "foo",
"--output", output
))
expect_equal(out$status, 0)
expect_true(file.exists(output))
result <- readRDS(output)
stopifnot(length(sampleNames(result)) == 2)
stopifnot(all(sampleNames(result) == c("foo_AACAAGGTAC", "foo_ACGCCTTCGT")))
expected_feature_names <- c(
"ENS0001058", "ENS0000221", "ENS0001387", "ENS0000508", "ENS0001199",
"ENS0000477", "ENS0001457", "ENS0001040", "ENS0000114", "ENS0000821",
"ENS0001429", "ENS0001396", "ENS0000355", "ENS0000122", "ENS0000441",
"ENS0001223", "ENS0001431", "ENS0000042", "ENS0000443", "ENS0000389",
"ENS0001208", "ENS0001140", "ENS0000071", "ENS0001369"
)
stopifnot(length(featureNames(result)) == 24)
stopifnot(all(featureNames(result) == expected_feature_names))
expected_expressions <- matrix(
c(0, 0,
0, 40,
0, 0,
0, 0,
1, 2,
0, 0,
0, 0,
0, 0,
2, 2,
0, 0,
0, 0,
8, 2,
0, 0,
1, 0,
2, 3,
0, 0,
0, 0,
0, 0,
1, 0,
0, 0,
16, 13,
0, 0,
12, 13,
5, 2),
ncol = 2,
nrow = 24,
byrow = TRUE,
)
rownames(expected_expressions) <- expected_feature_names
colnames(expected_expressions) <- c("foo_AACAAGGTAC", "foo_ACGCCTTCGT")
stopifnot(identical(exprs(result), expected_expressions))
input_f_data <- read.table(file.path(meta$resources_dir, "fData.tsv"),
sep = "\t", quote = "\"", comment.char = "",
header = TRUE)
input_f_data <- input_f_data[input_f_data$gene_id %in% expected_feature_names, ]
row.names(input_f_data) <- input_f_data$gene_id
input_f_data[] <- lapply(input_f_data, as.character)
stopifnot(identical(input_f_data, fData(result)))
# Check results filtering of barcodes with no reads
out <- processx::run(meta$executable, c(
"--pDataFile", file.path(meta$resources_dir, "pData.tsv"),
"--fDataFile", file.path(meta$resources_dir, "fData.tsv"),
"--mappingDir", file.path(meta$resources_dir, "mapping_dir", "AACAAGGTAC"),
"--mappingDir", file.path(meta$resources_dir, "mapping_dir", "EMPTY"),
"--poolName", "bar",
"--output", output
))
expect_equal(out$status, 0)
expect_true(file.exists(output))
result <- readRDS(output)
stopifnot(length(sampleNames(result)) == 1)
stopifnot(all(sampleNames(result) == c("bar_AACAAGGTAC")))
expected_feature_names <- c(
"ENS0001058", "ENS0000221", "ENS0001387", "ENS0000508", "ENS0001199",
"ENS0000477", "ENS0001457", "ENS0001040", "ENS0000114", "ENS0000821",
"ENS0001429", "ENS0001396", "ENS0000355", "ENS0000122", "ENS0000441",
"ENS0001223", "ENS0001431", "ENS0000042", "ENS0000443", "ENS0000389",
"ENS0001208", "ENS0001140", "ENS0000071", "ENS0001369"
)
stopifnot(length(featureNames(result)) == 24)
stopifnot(all(featureNames(result) == expected_feature_names))
expected_expressions <- matrix(
c(0,
0,
0,
0,
1,
0,
0,
0,
2,
0,
0,
8,
0,
1,
2,
0,
0,
0,
1,
0,
16,
0,
12,
5),
ncol = 1,
nrow = 24,
byrow = TRUE,
)
rownames(expected_expressions) <- expected_feature_names
colnames(expected_expressions) <- c("bar_AACAAGGTAC")
stopifnot(identical(exprs(result), expected_expressions))

@@ -0,0 +1 @@
AACAAGGTAC

@@ -0,0 +1,25 @@
ENS0001140 209E3 Gene Expression
ENS0001058 A2B9A Gene Expression
ENS0000508 CF168 Gene Expression
ENS0001457 3BA5A Gene Expression
ENS0001431 1C968 Gene Expression
ENS0000821 E5192 Gene Expression
ENS0001040 1821B Gene Expression
ENS0000443 5AD11 Gene Expression
ENS0000441 3F0FF Gene Expression
ENS0001387 265F2 Gene Expression
ENS0001223 28A43 Gene Expression
ENS0001208 58E28 Gene Expression
ENS0001396 6E614 Gene Expression
ENS0001199 EA941 Gene Expression
ENS0001369 99DDC Gene Expression
ENS0000770 AFCC0 Gene Expression
ENS0000389 B58E5 Gene Expression
ENS0000071 7A6C3 Gene Expression
ENS0000114 65424 Gene Expression
ENS0000355 077A2 Gene Expression
ENS0001429 22A4F Gene Expression
ENS0000477 981E6 Gene Expression
ENS0000042 E2D99 Gene Expression
ENS0000122 D90E9 Gene Expression
ENS0000221 97B0F Gene Expression

@@ -0,0 +1,13 @@
%%MatrixMarket matrix coordinate integer general
%
25 1 10
8 1 1
9 1 2
12 1 16
13 1 8
14 1 1
15 1 5
16 1 5
18 1 12
19 1 2
24 1 1

@@ -0,0 +1 @@
ACGCCTTCGT

@@ -0,0 +1,25 @@
ENS0001140 209E3 Gene Expression
ENS0001058 A2B9A Gene Expression
ENS0000508 CF168 Gene Expression
ENS0001457 3BA5A Gene Expression
ENS0001431 1C968 Gene Expression
ENS0000821 E5192 Gene Expression
ENS0001040 1821B Gene Expression
ENS0000443 5AD11 Gene Expression
ENS0000441 3F0FF Gene Expression
ENS0001387 265F2 Gene Expression
ENS0001223 28A43 Gene Expression
ENS0001208 58E28 Gene Expression
ENS0001396 6E614 Gene Expression
ENS0001199 EA941 Gene Expression
ENS0001369 99DDC Gene Expression
ENS0000770 AFCC0 Gene Expression
ENS0000389 B58E5 Gene Expression
ENS0000071 7A6C3 Gene Expression
ENS0000114 65424 Gene Expression
ENS0000355 077A2 Gene Expression
ENS0001429 22A4F Gene Expression
ENS0000477 981E6 Gene Expression
ENS0000042 E2D99 Gene Expression
ENS0000122 D90E9 Gene Expression
ENS0000221 97B0F Gene Expression

@@ -0,0 +1,12 @@
%%MatrixMarket matrix coordinate integer general
%
25 1 9
9 1 3
12 1 13
13 1 2
14 1 2
15 1 2
16 1 3
18 1 13
19 1 2
25 1 40

@@ -0,0 +1 @@
CCCCCCCCCC

@@ -0,0 +1,25 @@
ENS0001140 209E3 Gene Expression
ENS0001058 A2B9A Gene Expression
ENS0000508 CF168 Gene Expression
ENS0001457 3BA5A Gene Expression
ENS0001431 1C968 Gene Expression
ENS0000821 E5192 Gene Expression
ENS0001040 1821B Gene Expression
ENS0000443 5AD11 Gene Expression
ENS0000441 3F0FF Gene Expression
ENS0001387 265F2 Gene Expression
ENS0001223 28A43 Gene Expression
ENS0001208 58E28 Gene Expression
ENS0001396 6E614 Gene Expression
ENS0001199 EA941 Gene Expression
ENS0001369 99DDC Gene Expression
ENS0000770 AFCC0 Gene Expression
ENS0000389 B58E5 Gene Expression
ENS0000071 7A6C3 Gene Expression
ENS0000114 65424 Gene Expression
ENS0000355 077A2 Gene Expression
ENS0001429 22A4F Gene Expression
ENS0000477 981E6 Gene Expression
ENS0000042 E2D99 Gene Expression
ENS0000122 D90E9 Gene Expression
ENS0000221 97B0F Gene Expression

@@ -0,0 +1,3 @@
%%MatrixMarket matrix coordinate integer general
%
25 1 0

@@ -0,0 +1,3 @@
WellBC WellID NumberOfMTReads pctMT NumberOfERCCReads pctERCC NumberOfChromReads pctChrom NumberOfInputReads NumberOfMappedReads PctMappedReads NumberOfReadsMappedToMultipleLoci PectOfReadsMappedToMultipleLoci NumberOfReadsMappedToTooManyLoci PectOfReadsMappedToTooManyLoci NumberOfReadsUnmappedTooManyMismatches PectOfReadsUnmappedTooManyMismatches NumberOfReadsUnmappedTooShort PectOfReadsUnmappedTooShort NumberOfReadsUnmappedOther PectOfReadsUnmappedOther ReadsWithValidBarcodes SequencingSaturation Q30BasesInCB+UMI ReadsMappedToTranscriptome:Unique+MultipeGenes EstimatedNumberOfCells FractionOfReadsInCells MeanReadsPerCell NumberOfUMIs NumberOfGenes NumberOfCountedReads
AACAAGGTAC A1 0 0 0 0 8542 100 141303 23749 16.81 0 0 8458 5.99 0 0 109035 77.16 61 0.04 0.999816 0.0698056 0.979965 0.0618175 1 1 8538 7942 408 9535
ACGCCTTCGT B2 0 0 0 0 5863 100 96430 16869 17.49 0 0 6124 6.35 0 0 73375 76.09 62 0.06 0.999782 0.0665302 0.980077 0.0620969 1 1 5862 5472 377 6463

@@ -0,0 +1,46 @@
name: create_fdata
namespace: eset
description: |
  Create a fdata file
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ maintainer ]
  - __merge__: /src/base/authors/marijke_van_moerbeke.yaml
    roles: [ contributor ]
arguments:
  - name: "--gtf"
    type: file
    description: "Genome annotation file in GTF format."
    required: true
  - name: "--output"
    description: |
      Tab-delimited text file containing information about the 'gene' or 'transcript'
      entries from the input GTF file. The 'transcript' entries are used in case the source
      of the GTF was 'refGene' or 'ncbiRefSeq'.
    type: file
    direction: output
    default: fData.$id.txt
resources:
  - type: python_script
    path: create_fdata.py
test_resources:
  - type: python_script
    path: test.py
  - path: test_annotation.gtf
engines:
  - type: docker
    image: python:3.12-slim
    setup:
      - type: apt
        packages:
          - procps
      - type: python
        packages:
          - pandas
    test_setup:
      - type: python
        packages:
          - viashpy
runners:
  - type: executable
  - type: nextflow

@@ -0,0 +1,137 @@
import logging
import pandas as pd
import numpy as np
from textwrap import fill

### VIASH START
meta = {
    "name": "create_fdata",
}
par = {
    "gtf": "src/eset/create_fdata/test_annotation.gtf",
    "output": "fData.tsv"
}
### VIASH END

logger = logging.getLogger()
console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
logger.setLevel(logging.DEBUG)


def read_gtf(gtf_path: str) -> pd.DataFrame:
    logger.info("Reading %s", gtf_path)
    result = pd.read_csv(gtf_path, sep="\t",
                         header=None, names=("seqname", "source",
                                             "feature", "start", "end",
                                             "score", "strand", "frame",
                                             "attribute"),
                         dtype={
                             "seqname": pd.StringDtype(),
                             "source": pd.StringDtype(),
                             "feature": pd.StringDtype(),
                             "start": pd.Int64Dtype(),
                             "end": pd.Int64Dtype(),
                             "score": pd.StringDtype(),
                             "strand": pd.CategoricalDtype(categories=["+", "-"],
                                                           ordered=False),
                             "frame": pd.StringDtype(),
                             "attribute": pd.StringDtype(),
                         },
                         comment='#'
                         )
    logger.info("Done reading %s. Found %d GTF entries.", gtf_path, result.shape[0])
    logger.info("GTF file is providing information for the following chromosomes: \n%s",
                fill(", ".join(result['seqname'].unique()), width=100))
    logger.info("The following sources were specified in the GTF file:\n%s",
                ", ".join(result["source"].unique()))
    return result


def parse_attributes(attributes_series: pd.Series):
    # Split the 'attribute' column into key/value pairs; values of keys that
    # occur more than once are joined with "|".
    attribute_dict = dict()
    attributes_list = [attr.strip().split(" ")
                       for attr in attributes_series["attribute"].strip(";").split(";")]
    for (attr_name, attr_value) in attributes_list:
        attribute_dict.setdefault(attr_name, []).append(attr_value.strip('"'))
    attribute_dict = {attr_name: "|".join(attr_value)
                      for attr_name, attr_value in attribute_dict.items()}
    return pd.Series(attribute_dict)


def main(par):
    logger.info(f"{meta['name']} started.")
    parameters_str = [f'\t{param}: {param_val}\n' for param, param_val in par.items()]
    logger.info("Parameters:\n%s", "".join(parameters_str).rstrip())
    gtf_file = read_gtf(par["gtf"])
    sources = set(source for source in gtf_file["source"].unique() if source != "ERCC")
    specific_gtf = False
    feature = "gene"
    # A set cannot be indexed, so take its single element with next(iter(...)).
    if len(sources) == 1 and (source := next(iter(sources))) \
            and (source == "refGene" or source == "ncbiRefSeq"):
        feature = "transcript"
        specific_gtf = True
        logger.info("Found specific GTF from %s, forcing filtering on feature type %s", source, feature)
    logger.info("Filtering GTF entries for feature type '%s'.", feature)
    gtf_file = gtf_file[gtf_file["feature"] == feature]
    logger.info("After filtering %d entries are left.", gtf_file.shape[0])
    logger.info("Parsing the GTF attributes")
    annotation = gtf_file[["attribute"]].apply(parse_attributes, result_type="expand", axis=1)
    logger.info("Found the following attributes in the GTF:\n%s", ", ".join(annotation.columns))
    annotation = pd.concat([gtf_file.drop(["attribute"], axis=1), annotation], axis=1)
    if specific_gtf:
        logger.info("Because the source of the GTF is either 'ncbiRefSeq' or 'refGene', which "
                    "caused forced filtering based on %s, the duplicate genes still need to be dropped.",
                    feature)
        annotation = annotation.drop_duplicates(subset=("gene_id", "gene_name"), keep=False)
        logger.info("After dropping duplicates, %d entries are left", annotation.shape[0])
    # detect ensembl ids
    # some GTF files contain version in ENSEMBL, e.g. ENS00000000046319.1
    # we remove the version, because the annotation packages don't contain the version
    if "gene_id" in annotation.columns:
        logger.info("'gene_id' column was detected in attributes. Performing extra parsing of ENSEMBL ids.")
        annotation["ENSEMBL_with_version"] = annotation["gene_id"].where(annotation["gene_id"].str.startswith("ENS"))
        annotation["ENSEMBL"] = annotation["ENSEMBL_with_version"].str.replace(r"\.\d+$", "", regex=True)
        annotation["gene_id"] = annotation["gene_id"].str.replace(r"\.\d+$", "", regex=True)
    possible_name_columns = ("Name", "name", "gene_name")
    found_columns = list(filter(lambda col_name: col_name in annotation, possible_name_columns))
    # The following code selects a value for the SYMBOL column based on the first non-NA column
    if found_columns:
        logger.info("Found one of the following columns: %s; which can be used to populate the SYMBOL column",
                    ", ".join(possible_name_columns))
        # For each row (GTF entry), get the name of the first column that actually holds a value.
        column_to_get = annotation.loc[:, found_columns].apply(pd.Series.first_valid_index, axis=1)
        counts_per_column = column_to_get.value_counts(dropna=False).to_dict()
        counts_per_column_str = [f'\t{col}: {counts}\n' for col, counts in counts_per_column.items()]
        logger.info("Frequencies of the origin for the entries in the SYMBOL column:\n%s",
                    "".join(counts_per_column_str).rstrip())
        # If all columns hold NA for a certain row, first_valid_index will return None.
        # Just use the name of the first column.
        column_to_get = column_to_get.fillna(found_columns[0])
        # We now have one column name per row; use it to select the values.
        # loc cannot be used here because 1 value per row is required,
        # and loc will select for each row all the columns in column_to_get
        idx, cols = pd.factorize(column_to_get)
        symbol_values = annotation.reindex(cols, axis=1).to_numpy()[np.arange(len(annotation)), idx]
        annotation["SYMBOL"] = symbol_values
    logger.info("Dropping unused columns")
    annotation = annotation.drop(["score", "source", "frame", "feature"], axis=1)
    logger.info("Looking for duplicate rows and removing them. Starting with %i entries", annotation.shape[0])
    annotation = annotation.drop_duplicates(keep="first", ignore_index=True)
    logger.info("After removing duplicates: %i entries", annotation.shape[0])
    logger.info("Writing to %s", par["output"])
    annotation.to_csv(par["output"], sep="\t", header=True, index=False, na_rep="NA")
    # Do these checks *after* writing the csv in order to be able to check the data
    logger.info("Checking for unique gene IDs")
    if not annotation["gene_id"].is_unique:
        raise ValueError("Values from the 'gene_id' column are not unique after processing!")
    logger.info("%s finished", meta['name'])


if __name__ == "__main__":
    main(par)

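The per-row selection at the end of the script above (first non-NA column, then `pd.factorize` plus NumPy integer indexing) can be tried in isolation. The following is a minimal sketch on a hypothetical toy table; the frame contents and the name `candidate_columns` are assumptions for illustration, not part of the component.

```python
import numpy as np
import pandas as pd

# Hypothetical toy annotation; values are illustrative only.
annotation = pd.DataFrame({
    "gene_name": ["DEFB125", None, None],
    "gene_id": ["ENSG1", "ENSG2", None],
})
candidate_columns = ["gene_name", "gene_id"]

# Name of the first non-NA column per row (missing when every column is NA).
column_to_get = annotation[candidate_columns].apply(pd.Series.first_valid_index, axis=1)
# Fall back to the first candidate column, which simply yields NA for such rows.
column_to_get = column_to_get.fillna(candidate_columns[0])

# factorize maps each column name to an integer code and returns the unique names.
idx, cols = pd.factorize(column_to_get)
# Order the columns to match the codes, then pick exactly one cell per row.
values = annotation.reindex(cols, axis=1).to_numpy()[np.arange(len(annotation)), idx]
```

Pairing `np.arange(len(annotation))` with `idx` picks one (row, column) cell per row, which is what a label-based `loc` call with a list of column names cannot do.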

@@ -0,0 +1,102 @@
import pytest
import sys
import pandas as pd
from pathlib import Path
from uuid import uuid4
from shutil import copyfile

### VIASH START
meta = {
    "resources_dir": "./src/eset/create_fdata/",
    "executable": "target/executable/eset/create_fdata/create_fdata",
    "config": "src/eset/create_fdata/config.vsh.yaml"
}
### VIASH END


@pytest.fixture
def test_annotation_path():
    return Path(meta["resources_dir"]) / "test_annotation.gtf"


@pytest.fixture
def random_path(tmp_path):
    def wrapper(extension=None):
        extension = "" if not extension else f".{extension}"
        return tmp_path / f"{uuid4()}{extension}"
    return wrapper


def test_create_fdata(run_component, test_annotation_path, random_path):
    output_path = random_path("tsv")
    run_component([
        "--gtf", test_annotation_path,
        "--output", output_path
    ])
    assert output_path.is_file()
    result = pd.read_csv(output_path, sep="\t", dtype=pd.StringDtype())
    expected_dict = {
        "seqname": ["20", "20", "20", "21"],
        "start": ["87250", "142590", "157454", "297570"],
        "end": ["97094", "145751", "159163", "300321"],
        "strand": ["+", "+", "+", "+"],
        "gene_id": ["ENSG00000178591", "ENSG00000125788",
                    "ENSG00000088782", "ENSG00000247315"],
        "gene_version": ["7", "6", "5", "4"],
        "gene_name": ["DEFB125", "DEFB126", "DEFB127", pd.NA],
        "gene_source": ["ensembl_havana", "ensembl_havana",
                        "ensembl_havana", "havana"],
        "gene_biotype": ["protein_coding", "protein_coding",
                         "protein_coding", "protein_coding"],
        "ENSEMBL_with_version": ["ENSG00000178591.7", "ENSG00000125788",
                                 "ENSG00000088782", "ENSG00000247315"],
        "ENSEMBL": ["ENSG00000178591", "ENSG00000125788",
                    "ENSG00000088782", "ENSG00000247315"],
        "SYMBOL": ["DEFB125", "DEFB126", "DEFB127", pd.NA]
    }
    expected = pd.DataFrame.from_dict(expected_dict, dtype=pd.StringDtype())
    pd.testing.assert_frame_equal(expected, result, check_like=True)


def test_make_unique(run_component, test_annotation_path, random_path):
    gtf_with_duplicate_entry_path = random_path("gtf")
    output_path = random_path("tsv")
    # Duplicate of the first gene entry; GTF fields are tab-separated.
    entry_to_add = (
        "\n20\tensembl_havana\tgene\t87250\t97094\t.\t+\t.\tgene_id " +
        "\"ENSG00000178591.7\"; gene_version \"7\"; gene_name \"DEFB125\"; " +
        "gene_source \"ensembl_havana\"; gene_biotype \"protein_coding\";\n"
    )
    copyfile(test_annotation_path, gtf_with_duplicate_entry_path)
    with gtf_with_duplicate_entry_path.open("a") as open_gtf:
        open_gtf.write(entry_to_add)
    run_component([
        "--gtf", gtf_with_duplicate_entry_path,
        "--output", output_path
    ])
    assert output_path.is_file()
    result = pd.read_csv(output_path, sep="\t", dtype=pd.StringDtype())
    expected_dict = {
        "seqname": ["20", "20", "20", "21"],
        "start": ["87250", "142590", "157454", "297570"],
        "end": ["97094", "145751", "159163", "300321"],
        "strand": ["+", "+", "+", "+"],
        "gene_id": ["ENSG00000178591", "ENSG00000125788",
                    "ENSG00000088782", "ENSG00000247315"],
        "gene_version": ["7", "6", "5", "4"],
        "gene_name": ["DEFB125", "DEFB126", "DEFB127", pd.NA],
        "gene_source": ["ensembl_havana", "ensembl_havana",
                        "ensembl_havana", "havana"],
        "gene_biotype": ["protein_coding", "protein_coding",
                         "protein_coding", "protein_coding"],
        "ENSEMBL_with_version": ["ENSG00000178591.7", "ENSG00000125788",
                                 "ENSG00000088782", "ENSG00000247315"],
        "ENSEMBL": ["ENSG00000178591", "ENSG00000125788",
                    "ENSG00000088782", "ENSG00000247315"],
        "SYMBOL": ["DEFB125", "DEFB126", "DEFB127", pd.NA]
    }
    expected = pd.DataFrame.from_dict(expected_dict, dtype=pd.StringDtype())
    pd.testing.assert_frame_equal(expected, result, check_like=True)


if __name__ == '__main__':
    sys.exit(pytest.main([__file__]))


@@ -0,0 +1,45 @@
20 ensembl_havana gene 87250 97094 . + . gene_id "ENSG00000178591.7"; gene_version "7"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
20 havana transcript 87250 97094 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000608838"; transcript_version "1"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "2";
20 havana exon 87250 87359 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000608838"; transcript_version "1"; exon_number "1"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003702629"; exon_version "1"; transcript_support_level "2";
20 havana exon 96005 97094 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000608838"; transcript_version "1"; exon_number "2"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003705060"; exon_version "1"; transcript_support_level "2";
20 ensembl_havana transcript 87672 97094 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana exon 87672 87767 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; exon_number "1"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; exon_id "ENSE00001491993"; exon_version "2"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana CDS 87710 87767 . + 0 gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; exon_number "1"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; protein_id "ENSP00000371847"; protein_version "2"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana start_codon 87710 87712 . + 0 gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; exon_number "1"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana exon 96005 97094 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; exon_number "2"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; exon_id "ENSE00001491984"; exon_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana CDS 96005 96414 . + 2 gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; exon_number "2"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; protein_id "ENSP00000371847"; protein_version "2"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana stop_codon 96415 96417 . + 0 gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; exon_number "2"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana five_prime_utr 87672 87709 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana three_prime_utr 96418 97094 . + . gene_id "ENSG00000178591"; gene_version "7"; transcript_id "ENST00000382410"; transcript_version "3"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB125-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12989"; tag "basic"; transcript_support_level "1 (assigned to previous version 2)";
20 ensembl_havana gene 142590 145751 . + . gene_id "ENSG00000125788"; gene_version "6"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
20 ensembl_havana transcript 142590 145751 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana exon 142590 142686 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; exon_number "1"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; exon_id "ENSE00001491976"; exon_version "4"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana CDS 142629 142686 . + 0 gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; exon_number "1"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; protein_id "ENSP00000371835"; protein_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana start_codon 142629 142631 . + 0 gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; exon_number "1"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana exon 145415 145751 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; exon_number "2"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; exon_id "ENSE00000858522"; exon_version "4"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana CDS 145415 145689 . + 2 gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; exon_number "2"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; protein_id "ENSP00000371835"; protein_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana stop_codon 145690 145692 . + 0 gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; exon_number "2"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana five_prime_utr 142590 142628 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana three_prime_utr 145693 145751 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000382398"; transcript_version "4"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12990"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 havana transcript 142634 145749 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000542572"; transcript_version "1"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "mRNA_start_NF"; transcript_support_level "3";
20 havana exon 142634 142686 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000542572"; transcript_version "1"; exon_number "1"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002285856"; exon_version "1"; tag "mRNA_start_NF"; transcript_support_level "3";
20 havana exon 145415 145488 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000542572"; transcript_version "1"; exon_number "2"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002303512"; exon_version "1"; tag "mRNA_start_NF"; transcript_support_level "3";
20 havana exon 145579 145749 . + . gene_id "ENSG00000125788"; gene_version "6"; transcript_id "ENST00000542572"; transcript_version "1"; exon_number "3"; gene_name "DEFB126"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB126-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002217818"; exon_version "1"; tag "mRNA_start_NF"; transcript_support_level "3";
20 ensembl_havana gene 157454 159163 . + . gene_id "ENSG00000088782"; gene_version "5"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
20 ensembl_havana transcript 157454 159163 . + . gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana exon 157454 157593 . + . gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; exon_number "1"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; exon_id "ENSE00001491947"; exon_version "4"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana CDS 157545 157593 . + 0 gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; exon_number "1"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; protein_id "ENSP00000371825"; protein_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana start_codon 157545 157547 . + 0 gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; exon_number "1"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana exon 158774 159163 . + . gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; exon_number "2"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; exon_id "ENSE00001166560"; exon_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana CDS 158774 159021 . + 2 gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; exon_number "2"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; protein_id "ENSP00000371825"; protein_version "3"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana stop_codon 159022 159024 . + 0 gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; exon_number "2"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana five_prime_utr 157454 157544 . + . gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
20 ensembl_havana three_prime_utr 159025 159163 . + . gene_id "ENSG00000088782"; gene_version "5"; transcript_id "ENST00000382388"; transcript_version "4"; gene_name "DEFB127"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "DEFB127-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS12991"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";
21 havana gene 297570 300321 . + . gene_id "ENSG00000247315"; gene_version "4"; gene_source "havana"; gene_biotype "protein_coding";
21 havana transcript 297570 300321 . + . gene_id "ENSG00000247315"; gene_version "4"; transcript_id "ENST00000500893"; transcript_version "4"; gene_source "havana"; gene_biotype "protein_coding"; transcript_name "ZCCHC3-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS42844"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
21 havana exon 297570 300321 . + . gene_id "ENSG00000247315"; gene_version "4"; transcript_id "ENST00000500893"; transcript_version "4"; exon_number "1"; gene_source "havana"; gene_biotype "protein_coding"; transcript_name "ZCCHC3-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS42844"; exon_id "ENSE00001977652"; exon_version "4"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
21 havana CDS 297587 298795 . + 0 gene_id "ENSG00000247315"; gene_version "4"; transcript_id "ENST00000500893"; transcript_version "4"; exon_number "1"; gene_source "havana"; gene_biotype "protein_coding"; transcript_name "ZCCHC3-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS42844"; protein_id "ENSP00000484056"; protein_version "1"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
21 havana start_codon 297587 297589 . + 0 gene_id "ENSG00000247315"; gene_version "4"; transcript_id "ENST00000500893"; transcript_version "4"; exon_number "1"; gene_source "havana"; gene_biotype "protein_coding"; transcript_name "ZCCHC3-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS42844"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
21 havana stop_codon 298796 298798 . + 0 gene_id "ENSG00000247315"; gene_version "4"; transcript_id "ENST00000500893"; transcript_version "4"; exon_number "1"; gene_source "havana"; gene_biotype "protein_coding"; transcript_name "ZCCHC3-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS42844"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
21 havana five_prime_utr 297570 297586 . + . gene_id "ENSG00000247315"; gene_version "4"; transcript_id "ENST00000500893"; transcript_version "4"; gene_source "havana"; gene_biotype "protein_coding"; transcript_name "ZCCHC3-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS42844"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
21 havana three_prime_utr 298799 300321 . + . gene_id "ENSG00000247315"; gene_version "4"; transcript_id "ENST00000500893"; transcript_version "4"; gene_source "havana"; gene_biotype "protein_coding"; transcript_name "ZCCHC3-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS42844"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";

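The ninth field of each GTF record above is a list of `key "value";` pairs. As a rough sketch of how that field can be turned into a dictionary and how a versioned gene ID such as `ENSG00000178591.7` maps to its unversioned form, consider the snippet below. The helper `parse_gtf_attributes` is hypothetical and is not necessarily how `create_fdata` parses the file.

```python
import re


def parse_gtf_attributes(attribute_field: str) -> dict[str, str]:
    """Parse the ninth GTF column into a dict (hypothetical helper)."""
    pairs = re.findall(r'(\S+) "([^"]*)";', attribute_field)
    # Repeated keys such as `tag` keep only their last value in this sketch.
    return dict(pairs)


attributes = parse_gtf_attributes(
    'gene_id "ENSG00000178591.7"; gene_version "7"; gene_name "DEFB125";'
)
# Strip an optional ".<version>" suffix from the gene ID.
unversioned_id = attributes["gene_id"].split(".", 1)[0]
```

Keeping only the last value of a repeated key is a simplification; a parser that needs every `tag` entry would have to collect values in a list instead.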

@@ -0,0 +1,55 @@
name: create_pdata
namespace: eset
description: |
  Create a pdata file by combining the mapping statistics
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ maintainer ]
  - __merge__: /src/base/authors/marijke_van_moerbeke.yaml
    roles: [ contributor ]
arguments:
  - name: "--star_stats_file"
    type: file
    description: |
      Tab-delimited text file containing statistics (per column) that were generated
      from the STAR log files (Log.final.out, Summary.csv, ReadsPerGene.out.tab).
      Each entry (row) in the file describes the values for one well (barcode).
    required: true
  - name: "--nrReadsNrGenesPerChromPool"
    type: file
    description: |
      Pivot table in tsv format of the combined nrReadsNrGenesPerChrom files from STAR.
      Describes per chromosome (as columns) the number of reads, as well as the total number
      of reads per cell barcode and the percentage of nuclear, ERCC and mitochondrial
      reads.
    required: true
  - name: "--output"
    type: file
    direction: output
    default: pData.$id.txt
resources:
  - type: python_script
    path: create_pdata.py
test_resources:
  - type: python_script
    path: test.py
  - path: nrReadsNrGenesPerChromPool.txt
  - path: starLogs.txt
engines:
  - type: docker
    image: python:3.12-slim
    setup:
      - type: apt
        packages:
          - procps
      - type: python
        packages:
          - pandas
    test_setup:
      - type: python
        packages:
          - viashpy
runners:
  - type: executable
  - type: nextflow


@@ -0,0 +1,63 @@
from itertools import batched
import pandas as pd
import logging

### VIASH START
meta = {
    "name": "create_pdata",
}
par = {
    "star_stats_file": "src/eset/create_pdata/starLogs.txt",
    "nrReadsNrGenesPerChromPool": "src/eset/create_pdata/nrReadsNrGenesPerChromPool.txt",
    "output": "pData.tsv"
}
### VIASH END

logger = logging.getLogger()
console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
logger.setLevel(logging.DEBUG)


def main(par):
    logger.info(f"{meta['name']} started.")
    parameters_str = [f'\t{param}: {param_val}\n' for param, param_val in par.items()]
    logger.info("Parameters:\n%s", "".join(parameters_str).rstrip())
    logger.info("Reading %s", par["star_stats_file"])
    star_log_stats = pd.read_csv(par["star_stats_file"], sep="\t", index_col=0)
    logger.info("STAR log statistics file contains information for the following barcodes: %s",
                ", ".join(star_log_stats.index))
    logger.info("Reading %s", par["nrReadsNrGenesPerChromPool"])
    reads_and_genes_per_chr_stats = pd.read_csv(par["nrReadsNrGenesPerChromPool"], sep="\t", index_col=0)
    logger.info("Reads per gene and chromosome table contains information for the following barcodes: %s",
                ", ".join(reads_and_genes_per_chr_stats.index))
    logger.info("Filtering mapping statistics file columns.")
    cols_to_keep = ("WellID", "NumberOfMTReads", "pctMT", "NumberOfERCCReads",
                    "pctERCC", "NumberOfChromReads", "pctChrom")
    try:
        reads_and_genes_per_chr_stats = reads_and_genes_per_chr_stats.loc[:, cols_to_keep]
    except KeyError as e:
        raise KeyError("When trying to subset the reads per genes and chromosomes file, "
                       "a column was missing. Available columns in the file: "
                       f"{', '.join(reads_and_genes_per_chr_stats.columns)}.") from e
    # Each barcode should be present in both tables. An alternative approach could be
    # to just do the concatenation and check for the NA values that are filled in for
    # non-overlapping index values, but the dataframes can already contain NA values.
    if not star_log_stats.index.sort_values().equals(reads_and_genes_per_chr_stats.index.sort_values()):
        raise ValueError("Error while combining two log files. It seems that the entries (barcodes) "
                         f"do not fully overlap. Barcodes in '{par['star_stats_file']}': "
                         f"{', '.join(star_log_stats.index)}. Barcodes in "
                         f"'{par['nrReadsNrGenesPerChromPool']}': "
                         f"{', '.join(reads_and_genes_per_chr_stats.index)}")
    combined_stats = pd.concat([reads_and_genes_per_chr_stats, star_log_stats], axis=1)
    logger.info("Summary of final output:\n%s\n",
                "\n".join(repr(combined_stats.loc[:, columns].describe())
                          for columns in batched(combined_stats.columns, 3)))
    logger.info("Writing to %s", par["output"])
    combined_stats.reset_index("WellBC").to_csv(par["output"], sep="\t", header=True, index=False)
    logger.info("Finished %s.", meta["name"])


if __name__ == "__main__":
    main(par)

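The barcode check in the script above compares the two sorted indexes and, on failure, prints both full barcode lists. A possible refinement, shown here only as a sketch on hypothetical toy tables, is to report just the barcodes that occur in one table but not the other using `Index.symmetric_difference`; the frame names and values below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy tables indexed by well barcode, deliberately in different row order.
star_stats = pd.DataFrame(
    {"NumberOfInputReads": [96430, 10158]},
    index=pd.Index(["ACGCCTTCGT", "GTCTCGAGTG"], name="WellBC"),
)
chrom_stats = pd.DataFrame(
    {"WellID": ["C5", "A2"]},
    index=pd.Index(["GTCTCGAGTG", "ACGCCTTCGT"], name="WellBC"),
)

# Barcodes present in only one of the two tables; empty when they fully overlap.
mismatched = star_stats.index.symmetric_difference(chrom_stats.index)
if not mismatched.empty:
    raise ValueError(f"Barcodes found in only one table: {', '.join(mismatched)}")

# Column-wise concatenation aligns rows on the index, so the row order may differ.
combined = pd.concat([chrom_stats, star_stats], axis=1)
```

Because `pd.concat(..., axis=1)` aligns on index labels, the two input files do not need to list the wells in the same order, which matches the differently ordered test resources `starLogs.txt` and `nrReadsNrGenesPerChromPool.txt`.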

@@ -0,0 +1,8 @@
WellBC WellID 20 pctChrom pctMT pctERCC SumReads NumberOfGenes NumberOfERCCReads NumberOfChromReads NumberOfMTReads
AACAAGGTAC A1 8542 100 0 0 8542 408 0 8542 0
ACGCCTTCGT A2 5863 100 0 0 5863 377 0 5863 0
CCATACTGAC A3 7396 100 0 0 7396 391 0 7396 0
GCAAGCGAAT B1 10092 100 0 0 10092 420 0 10092 0
GTCTCGAGTG C5 470 100 0 0 470 150 0 470 0
TGCGCTCATT D6 7650 100 0 0 7650 407 0 7650 0
TTGTGTTCGA E19 9422 100 0 0 9422 420 0 9422 0


@@ -0,0 +1,8 @@
WellBC NumberOfInputReads NumberOfMappedReads PctMappedReads NumberOfReadsMappedToMultipleLoci PectOfReadsMappedToMultipleLoci NumberOfReadsMappedToTooManyLoci PectOfReadsMappedToTooManyLoci NumberOfReadsUnmappedTooManyMismatches PectOfReadsUnmappedTooManyMismatches NumberOfReadsUnmappedTooShort PectOfReadsUnmappedTooShort NumberOfReadsUnmappedOther PectOfReadsUnmappedOther ReadsWithValidBarcodes SequencingSaturation Q30BasesInCB+UMI ReadsMappedToTranscriptome:Unique+MultipeGenes EstimatedNumberOfCells FractionOfReadsInCells MeanReadsPerCell NumberOfUMIs NumberOfGenes NumberOfCountedReads
ACGCCTTCGT 96430 16869 17.49 0 0 6124 6.35 0 0 73375 76.09 62 0.06 0.999782 0.0665302 0.980077 0.0620969 1 1 5862 5472 377 6463
GTCTCGAGTG 10158 1902 18.72 0 0 967 9.52 0 0 7280 71.67 9 0.09 0.999803 0.0553191 0.984451 0.0476472 1 1 470 444 150 533
GCAAGCGAAT 156134 24005 15.37 0 0 7961 5.1 0 0 124096 79.48 72 0.05 0.999744 0.0680872 0.982779 0.0658665 1 1 10090 9403 420 11273
CCATACTGAC 113577 17319 15.25 0 0 5905 5.2 0 0 90292 79.5 61 0.05 0.999859 0.0717282 0.982313 0.066554 1 1 7389 6859 391 8299
TGCGCTCATT 126989 19272 15.18 0 0 7141 5.62 0 0 100515 79.15 61 0.05 0.999843 0.0667974 0.986581 0.0616668 1 1 7650 7139 407 8444
TTGTGTTCGA 142560 22129 15.52 0 0 7045 4.94 0 0 113324 79.49 62 0.04 0.999783 0.060828 0.986622 0.0676838 1 1 9420 8847 420 10383
AACAAGGTAC 141303 23749 16.81 0 0 8458 5.99 0 0 109035 77.16 61 0.04 0.999816 0.0698056 0.979965 0.0618175 1 1 8538 7942 408 9535


@@ -0,0 +1,173 @@
import pytest
import sys
import pandas as pd
from pathlib import Path
from uuid import uuid4

### VIASH START
meta = {
    "resources_dir": "./src/eset/create_pdata/",
    "executable": "target/executable/eset/create_pdata/create_pdata",
    "config": "src/eset/create_pdata/config.vsh.yaml"
}
### VIASH END


@pytest.fixture
def test_reads_and_genes_per_chr_path():
    return Path(meta["resources_dir"]) / "nrReadsNrGenesPerChromPool.txt"


@pytest.fixture
def test_star_logs_summary_path():
    return Path(meta["resources_dir"]) / "starLogs.txt"


@pytest.fixture
def random_path(tmp_path):
    def wrapper(extension=None):
        extension = "" if not extension else f".{extension}"
        return tmp_path / f"{uuid4()}{extension}"
    return wrapper


def test_create_pdata(run_component, test_reads_and_genes_per_chr_path,
                      test_star_logs_summary_path, random_path):
    output_path = random_path("tsv")
    run_component([
        "--star_stats_file", test_star_logs_summary_path,
        "--nrReadsNrGenesPerChromPool", test_reads_and_genes_per_chr_path,
        "--output", output_path
    ])
    assert output_path.is_file()
    result = pd.read_csv(output_path, sep="\t", dtype=pd.StringDtype())
    expected_dict = {
        'WellBC': ['AACAAGGTAC', 'ACGCCTTCGT', 'CCATACTGAC', 'GCAAGCGAAT',
                   'GTCTCGAGTG', 'TGCGCTCATT', 'TTGTGTTCGA'],
        'WellID': ['A1', 'A2', 'A3', 'B1', 'C5', 'D6', 'E19'],
        'NumberOfMTReads': ['0', '0', '0', '0', '0', '0', '0'],
        'pctMT': ['0', '0', '0', '0', '0', '0', '0'],
        'NumberOfERCCReads': ['0', '0', '0', '0', '0', '0', '0'],
        'pctERCC': ['0', '0', '0', '0', '0', '0', '0'],
        'NumberOfChromReads': ['8542', '5863', '7396', '10092', '470',
                               '7650', '9422'],
        'pctChrom': ['100', '100', '100', '100', '100', '100', '100'],
        'NumberOfInputReads': ['141303', '96430', '113577', '156134', '10158',
                               '126989', '142560'],
        'NumberOfMappedReads': ['23749', '16869', '17319', '24005', '1902',
                                '19272', '22129'],
        'PctMappedReads': ['16.81', '17.49', '15.25', '15.37', '18.72',
                           '15.18', '15.52'],
        'NumberOfReadsMappedToMultipleLoci': ['0', '0', '0', '0', '0', '0', '0'],
        'PectOfReadsMappedToMultipleLoci': ['0', '0', '0', '0', '0', '0', '0'],
        'NumberOfReadsMappedToTooManyLoci': ['8458', '6124', '5905', '7961', '967',
                                             '7141', '7045'],
        'PectOfReadsMappedToTooManyLoci': ['5.99', '6.35', '5.2', '5.1', '9.52',
                                           '5.62', '4.94'],
        'NumberOfReadsUnmappedTooManyMismatches': ['0', '0', '0', '0', '0', '0', '0'],
        'PectOfReadsUnmappedTooManyMismatches': ['0', '0', '0', '0', '0', '0', '0'],
        'NumberOfReadsUnmappedTooShort': ['109035', '73375', '90292', '124096',
                                          '7280', '100515', '113324'],
        'PectOfReadsUnmappedTooShort': ['77.16', '76.09', '79.5', '79.48',
                                        '71.67', '79.15', '79.49'],
        'NumberOfReadsUnmappedOther': ['61', '62', '61', '72', '9', '61', '62'],
        'PectOfReadsUnmappedOther': ['0.04', '0.06', '0.05', '0.05',
                                     '0.09', '0.05', '0.04'],
        'ReadsWithValidBarcodes': ['0.999816', '0.999782', '0.999859', '0.999744',
                                   '0.999803', '0.999843', '0.999783'],
        'SequencingSaturation': ['0.0698056', '0.0665302', '0.0717282', '0.0680872',
                                 '0.0553191', '0.0667974', '0.060828'],
        'Q30BasesInCB+UMI': ['0.979965', '0.980077', '0.982313', '0.982779',
                             '0.984451', '0.986581', '0.986622'],
        'ReadsMappedToTranscriptome:Unique+MultipeGenes': ['0.0618175', '0.0620969',
                                                           '0.066554', '0.0658665',
                                                           '0.0476472', '0.0616668',
                                                           '0.0676838'],
        'EstimatedNumberOfCells': ['1', '1', '1', '1', '1', '1', '1'],
        'FractionOfReadsInCells': ['1', '1', '1', '1', '1', '1', '1'],
        'MeanReadsPerCell': ['8538', '5862', '7389',
                             '10090', '470', '7650', '9420'],
        'NumberOfUMIs': ['7942', '5472', '6859', '9403',
                         '444', '7139', '8847'],
        'NumberOfGenes': ['408', '377', '391', '420', '150', '407', '420'],
        'NumberOfCountedReads': ['9535', '6463', '8299', '11273',
                                 '533', '8444', '10383']
    }
    expected = pd.DataFrame.from_dict(expected_dict, dtype=pd.StringDtype())
    pd.testing.assert_frame_equal(result, expected, check_like=True)
def test_na(run_component, test_reads_and_genes_per_chr_path,
test_star_logs_summary_path, random_path):
"""
The star log summary can contain NA values.
"""
output_path = random_path("tsv")
summary_with_na_path = random_path("txt")
original_summary = pd.read_csv(test_star_logs_summary_path,
sep="\t", index_col=0)
original_summary.loc["GTCTCGAGTG", "FractionOfReadsInCells"] = pd.NA
original_summary.reset_index("WellBC").to_csv(summary_with_na_path, sep="\t",
header=True, index=False)
run_component([
"--star_stats_file", summary_with_na_path,
"--nrReadsNrGenesPerChromPool", test_reads_and_genes_per_chr_path,
"--output", output_path
])
expected_dict = {
'WellBC': ['AACAAGGTAC', 'ACGCCTTCGT', 'CCATACTGAC', 'GCAAGCGAAT',
'GTCTCGAGTG', 'TGCGCTCATT', 'TTGTGTTCGA'],
'WellID': ['A1', 'A2', 'A3', 'B1', 'C5', 'D6', 'E19'],
'NumberOfMTReads': ['0', '0', '0', '0', '0', '0', '0'],
'pctMT': ['0', '0', '0', '0', '0', '0', '0'],
'NumberOfERCCReads': ['0', '0', '0', '0', '0', '0', '0'],
'pctERCC': ['0', '0', '0', '0', '0', '0', '0'],
'NumberOfChromReads': ['8542', '5863', '7396', '10092', '470',
'7650', '9422'],
'pctChrom': ['100', '100', '100', '100', '100', '100', '100'],
'NumberOfInputReads': ['141303', '96430', '113577', '156134', '10158',
'126989', '142560'],
'NumberOfMappedReads': ['23749', '16869', '17319', '24005', '1902',
'19272', '22129'],
'PctMappedReads': ['16.81', '17.49', '15.25', '15.37', '18.72',
'15.18', '15.52'],
'NumberOfReadsMappedToMultipleLoci': ['0', '0', '0', '0', '0', '0', '0'],
'PectOfReadsMappedToMultipleLoci': ['0', '0', '0', '0', '0', '0', '0'],
'NumberOfReadsMappedToTooManyLoci': ['8458', '6124', '5905', '7961', '967',
'7141', '7045'],
'PectOfReadsMappedToTooManyLoci': ['5.99', '6.35', '5.2', '5.1', '9.52',
'5.62', '4.94'],
'NumberOfReadsUnmappedTooManyMismatches': ['0', '0', '0', '0', '0', '0', '0'],
'PectOfReadsUnmappedTooManyMismatches': ['0', '0', '0', '0', '0', '0', '0'],
'NumberOfReadsUnmappedTooShort': ['109035', '73375', '90292', '124096',
'7280', '100515', '113324'],
'PectOfReadsUnmappedTooShort': ['77.16', '76.09', '79.5', '79.48',
'71.67', '79.15', '79.49'],
'NumberOfReadsUnmappedOther': ['61', '62', '61', '72', '9', '61', '62'],
'PectOfReadsUnmappedOther': ['0.04', '0.06', '0.05', '0.05',
'0.09', '0.05', '0.04'],
'ReadsWithValidBarcodes': ['0.999816', '0.999782', '0.999859', '0.999744',
'0.999803', '0.999843', '0.999783'],
'SequencingSaturation': ['0.0698056', '0.0665302', '0.0717282', '0.0680872',
'0.0553191', '0.0667974', '0.060828'],
'Q30BasesInCB+UMI': ['0.979965', '0.980077', '0.982313', '0.982779',
'0.984451', '0.986581', '0.986622'],
'ReadsMappedToTranscriptome:Unique+MultipeGenes': ['0.0618175', '0.0620969',
'0.066554', '0.0658665',
'0.0476472', '0.0616668',
'0.0676838'],
'EstimatedNumberOfCells': ['1', '1', '1', '1', '1', '1', '1'],
'FractionOfReadsInCells': ['1.0', '1.0', '1.0', '1.0', pd.NA, '1.0', '1.0'],
'MeanReadsPerCell': ['8538', '5862', '7389',
'10090', '470', '7650', '9420'],
'NumberOfUMIs': ['7942', '5472', '6859', '9403',
'444', '7139', '8847'],
'NumberOfGenes': ['408', '377', '391', '420', '150', '407', '420'],
'NumberOfCountedReads': ['9535', '6463', '8299', '11273',
'533', '8444', '10383']
}
result = pd.read_csv(output_path, sep="\t", dtype=pd.StringDtype())
expected = pd.DataFrame.from_dict(expected_dict, dtype=pd.StringDtype())
pd.testing.assert_frame_equal(result, expected, check_like=True)
if __name__ == '__main__':
sys.exit(pytest.main([__file__]))

@@ -0,0 +1,34 @@
name: "check_eset"
namespace: "integration_test_components/htrnaseq"
description: "This component tests the ExpressionSet object as output by the main pipeline."
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ author, maintainer ]
argument_groups:
  - name: Inputs
    arguments:
      - name: "--eset"
        type: file
        required: true
        description: Path to an ExpressionSet object.
        example: eset.rds
      - name: "--star_output"
        type: file
        required: true
        multiple: true
resources:
  - type: r_script
    path: script.R
engines:
  - type: docker
    image: bioconductor/bioconductor_docker:3.19
    setup:
      - type: r
        cran:
          - bit64
        bioc:
          - Biobase
runners:
  - type: executable
  - type: nextflow

@@ -0,0 +1,198 @@
library(Biobase)
library(testthat)
library(Matrix)
sample_1_result <- readRDS(par$eset)
expected_sample_names <- c(
"sample_one_AACAAGGTAC", "sample_one_AACAATCAGG", "sample_one_AACACCTAGT",
"sample_one_AACAGGCAAT", "sample_one_AACATGGAGA", "sample_one_AACATTACCG",
"sample_one_AACCAGCCAG", "sample_one_AACCAGTTGA", "sample_one_AACCGCGACT",
"sample_one_AACCGGAAGG", "sample_one_AACCGGCGTA", "sample_one_AACCTAGTCC",
"sample_one_AACCTCATAG", "sample_one_AACGTAAGCT", "sample_one_AACTCTACAC",
"sample_one_AACTGTGTCA", "sample_one_AAGACGGATT", "sample_one_AAGATCGGCG",
"sample_one_AAGATGTCCA", "sample_one_AAGCATATGG", "sample_one_AAGCGATGTT",
"sample_one_AAGCGTTCAG", "sample_one_AAGCTCACCT", "sample_one_AAGGCATGCG",
"sample_one_AAGGTCTGGA", "sample_one_AAGTTAGCGC", "sample_one_AAGTTCCTTG",
"sample_one_AATACCGGTA", "sample_one_AATAGCCACA", "sample_one_AATCACGCGA",
"sample_one_AATCCATCTG", "sample_one_AATCCGCTCC", "sample_one_AATCCTACCA",
"sample_one_AATCGTCCGC", "sample_one_AATGAACACG", "sample_one_AATGACCTTC",
"sample_one_AATGAGAGCA", "sample_one_AATGTCAGTG", "sample_one_AATTAGGCCG",
"sample_one_AATTGCGATG", "sample_one_ACAACAGTCG", "sample_one_ACAACCATAC",
"sample_one_ACAACGGAGC", "sample_one_ACAAGCGCGA", "sample_one_ACACAATCTC",
"sample_one_ACACAGTGAA", "sample_one_ACACCGAATT", "sample_one_ACACGCAGTA",
"sample_one_ACACGGTCCT", "sample_one_ACACTTGCTG", "sample_one_ACAGTGCCAA",
"sample_one_ACATGTGTGC", "sample_one_ACCAGGACCA", "sample_one_ACCATAACAC",
"sample_one_ACCGAACCGT", "sample_one_ACCGAGAGTC", "sample_one_ACCGGTACAG",
"sample_one_ACCGTACTTC", "sample_one_ACCTCCGACA", "sample_one_ACCTCTCTCC",
"sample_one_ACCTGTCCGA", "sample_one_ACCTTATGTG", "sample_one_ACGAATGACA",
"sample_one_ACGCCTCAAC", "sample_one_ACGCCTTCGT", "sample_one_ACGCTGGATA",
"sample_one_ACGGTCCGTT", "sample_one_ACGTAGGCAC", "sample_one_ACGTGCTGAT",
"sample_one_ACTCCAAGCC", "sample_one_ACTGGCGCAT", "sample_one_ACTGGCTTCC",
"sample_one_ACTTAACTGC", "sample_one_ACTTCATCAC", "sample_one_ACTTCGTTGA",
"sample_one_ACTTCTCCTG", "sample_one_ACTTGAGGAA", "sample_one_ACTTGTAAGG",
"sample_one_AGAACCACGG", "sample_one_AGAAGCAATC", "sample_one_AGACCGTTAT",
"sample_one_AGACTAGCAT", "sample_one_AGAGATGCAG", "sample_one_AGAGCTTACA",
"sample_one_AGAGTGTAAC", "sample_one_AGAGTTCTGC", "sample_one_AGATAGTGCT",
"sample_one_AGCAATGCGC", "sample_one_AGCATGTCAT", "sample_one_AGCCACTAGC",
"sample_one_AGCCAGAATA", "sample_one_AGCCAGCTCT", "sample_one_AGCGATAACG",
"sample_one_AGCGTACAAT", "sample_one_AGCTATTCCA", "sample_one_AGCTCCTCAG",
"sample_one_AGGAGGCATA", "sample_one_AGGCGTCTGT", "sample_one_AGTAACTCAC",
"sample_one_AGTAAGCGTT", "sample_one_AGTCTGTACG", "sample_one_AGTGCAATGT",
"sample_one_ATAAGGTGCA", "sample_one_ATACACGACA", "sample_one_ATAGGCCATT",
"sample_one_ATATCCGCAT", "sample_one_ATCAGCACTT", "sample_one_ATCAGCGAGG",
"sample_one_ATCCAATACG", "sample_one_ATCCGCTGTG", "sample_one_ATCCGTCCAT",
"sample_one_ATCGACGGCT", "sample_one_ATCGCGATTA", "sample_one_ATCGGTAGGC",
"sample_one_ATCTAAGGAG", "sample_one_ATGACGGTAA", "sample_one_ATGACTCAGT",
"sample_one_ATGCACCGGA", "sample_one_ATGCGGACTG", "sample_one_ATGCTTCCTA",
"sample_one_ATGGACCAAC", "sample_one_ATGGTCTTAG", "sample_one_ATGGTGAGCG",
"sample_one_ATGTGGAAGC", "sample_one_ATTATCGGAC", "sample_one_ATTCGGAACA",
"sample_one_CAACAATCCA", "sample_one_CAAGAAGCAT", "sample_one_CAAGATGAGG",
"sample_one_CAAGCCAACG", "sample_one_CAAGTGGATC", "sample_one_CACAGTTCAT",
"sample_one_CACGAGTCTG", "sample_one_CACGCTCCAA", "sample_one_CACTGAGCAC",
"sample_one_CAGATCAATG", "sample_one_CAGTGCTCTT", "sample_one_CAGTTAAGCA",
"sample_one_CATAGCTATC", "sample_one_CATCACCACC", "sample_one_CATGTACGCC",
"sample_one_CATTACACTG", "sample_one_CATTCGACGA", "sample_one_CCAACTATGG",
"sample_one_CCAAGGAGTT", "sample_one_CCAATTGTTC", "sample_one_CCACAAGTGC",
"sample_one_CCAGCTTAGT", "sample_one_CCATAACTTG", "sample_one_CCATACTGAC",
"sample_one_CCATAGATCA", "sample_one_CCATGTGCTT", "sample_one_CCATTCAGCG",
"sample_one_CCGAACAAGC", "sample_one_CCGAACCTAA", "sample_one_CCGAAGACCT",
"sample_one_CCGAATAGTG", "sample_one_CCGACTTCTC", "sample_one_CCGATCCACT",
"sample_one_CCGATGATAC", "sample_one_CCGCGTTATG", "sample_one_CCGCTAGCTT",
"sample_one_CCGGAGTATC", "sample_one_CCGGCCAATT", "sample_one_CCGGTCTCTA",
"sample_one_CCGTACGATG", "sample_one_CCGTCAGAAC", "sample_one_CCTAGACACG",
"sample_one_CCTAGTTGAG", "sample_one_CCTATTCTGT", "sample_one_CCTCAACCGA",
"sample_one_CCTCCATAAG", "sample_one_CCTGATGCCA", "sample_one_CCTGCAATAC",
"sample_one_CCTTGTATTC", "sample_one_CGAGATCTCT", "sample_one_CGAGGAACAA",
"sample_one_CGATAACCGC", "sample_one_CGATCCTGTG", "sample_one_CGCCAACCAT",
"sample_one_CGCCAGTGTT", "sample_one_CGCCTTGTAC", "sample_one_CGCGGATTCA",
"sample_one_CGCTTAAGGC", "sample_one_CGCTTACTAA", "sample_one_CGCTTCTTGG",
"sample_one_CGGAAGCTGT", "sample_one_CGGAATACAC", "sample_one_CGGAGATTGG",
"sample_one_CGGAGCTCAA", "sample_one_CGGATCGGTA", "sample_one_CGGATTCTAG",
"sample_one_CGGCAACTTA", "sample_one_CGGCTCATCA", "sample_one_CGGTCGTATT",
"sample_one_CGGTGACATC", "sample_one_CGTAACGGAT", "sample_one_CGTAAGATTC",
"sample_one_CGTACTGTAA", "sample_one_CGTAGAAGAC", "sample_one_CGTCCTAGGA",
"sample_one_CGTCGGCAAT", "sample_one_CGTGAGTTAT", "sample_one_CGTGTCAAGC",
"sample_one_CTAACTTCAG", "sample_one_CTAATAGCGT", "sample_one_CTACACCAGG",
"sample_one_CTAGCACAAT", "sample_one_CTATGAACGG", "sample_one_CTCAAGGACC",
"sample_one_CTCACCTGTC", "sample_one_CTCCTATTGT", "sample_one_CTCGCAACGT",
"sample_one_CTCGTGCCTA", "sample_one_CTGGATTGAC", "sample_one_CTGTAGTCAG",
"sample_one_CTGTCGCTTC", "sample_one_CTGTCTGTGT", "sample_one_CTTCATATCG",
"sample_one_CTTGCTGACG", "sample_one_GAAGGATTAG", "sample_one_GAATCGAGCC",
"sample_one_GACCATCTAA", "sample_one_GACGACCACA", "sample_one_GAGACATCTT",
"sample_one_GAGCGAGTCA", "sample_one_GAGTAGACCA", "sample_one_GATACGCTTA",
"sample_one_GATAGACTGT", "sample_one_GATAGAGGCG", "sample_one_GATAGGTCAA",
"sample_one_GATATCAGGA", "sample_one_GATCTCATTC", "sample_one_GATCTGGTCG",
"sample_one_GATGAGTGAC", "sample_one_GATGGATACA", "sample_one_GATGTGACAG",
"sample_one_GATTAAGTCC", "sample_one_GATTGCACGC", "sample_one_GCAAGCGAAT",
"sample_one_GCAATGTAAG", "sample_one_GCACACTATA", "sample_one_GCACTCGGAA",
"sample_one_GCACTGCGTT", "sample_one_GCACTTAATC", "sample_one_GCAGGAGATG",
"sample_one_GCAGTACTGG", "sample_one_GCATATGAGT", "sample_one_GCATCCGATC",
"sample_one_GCCAAGTACA", "sample_one_GCCACGATTC", "sample_one_GCCATAGGTT",
"sample_one_GCCATATCGA", "sample_one_GCCGTCAATA", "sample_one_GCCTGGACAT",
"sample_one_GCGTAATTAC", "sample_one_GCTATTATCC", "sample_one_GCTCAGTAAT",
"sample_one_GCTGCTTATA", "sample_one_GGAATAAGCA", "sample_one_GGACGATGCT",
"sample_one_GGCATCGTGA", "sample_one_GGCATTATTG", "sample_one_GGCCGAGATT",
"sample_one_GGCGCTATAA", "sample_one_GGCGTTAAGT", "sample_one_GGCTATTGAT",
"sample_one_GGCTGCTACT", "sample_one_GGTAATGTGT", "sample_one_GGTGGTTGGA",
"sample_one_GGTGTTCACC", "sample_one_GGTTAGATCT", "sample_one_GGTTATGGCG",
"sample_one_GGTTCACTGG", "sample_one_GGTTGTGCAA", "sample_one_GTAACCAGTA",
"sample_one_GTAACCTTGG", "sample_one_GTAAGAACCT", "sample_one_GTAAGGCTCC",
"sample_one_GTAATCCACG", "sample_one_GTATTGTGGA", "sample_one_GTCCGCATCA",
"sample_one_GTCCTTCGGT", "sample_one_GTCGCTCTCT", "sample_one_GTCGGTGACA",
"sample_one_GTCTCGAGTG", "sample_one_GTCTCTTAAG", "sample_one_GTCTTCCGAG",
"sample_one_GTGACTATAC", "sample_one_GTGGTTAATG", "sample_one_GTGTGCCTGT",
"sample_one_GTGTGTGTCC", "sample_one_GTTCATTGCC", "sample_one_GTTCCGGTGA",
"sample_one_GTTCGTCGAA", "sample_one_GTTGAATTGG", "sample_one_GTTGATCCGC",
"sample_one_GTTGTATGCT", "sample_one_TAACCGTAGC", "sample_one_TAACGTCGAT",
"sample_one_TAAGGTACGG", "sample_one_TACGGACATA", "sample_one_TACTACCGCC",
"sample_one_TACTGTCAAG", "sample_one_TAGCGAACGC", "sample_one_TAGCGCCAAC",
"sample_one_TAGGACGCCT", "sample_one_TAGGTTGCAA", "sample_one_TAGTAGTCTC",
"sample_one_TAGTCCGCTG", "sample_one_TAGTGGAACT", "sample_one_TATCATGCAG",
"sample_one_TATCGTTACG", "sample_one_TCAAGTGCAG", "sample_one_TCACAGATAC",
"sample_one_TCACCGCCTA", "sample_one_TCACGCCACT", "sample_one_TCACGTTGGC",
"sample_one_TCATTGTCCA", "sample_one_TCCACACTAG", "sample_one_TCCACGGTCA",
"sample_one_TCCACTCGCT", "sample_one_TCCGACTAAC", "sample_one_TCCGTTATCT",
"sample_one_TCCTAAGAGA", "sample_one_TCCTCTAGTA", "sample_one_TCGAAGCATT",
"sample_one_TCGAGAGAGC", "sample_one_TCGCACTTGA", "sample_one_TCGCCTACTG",
"sample_one_TCGCGTAGCA", "sample_one_TCGGCGTTAA", "sample_one_TCTACATCCG",
"sample_one_TCTCCACATT", "sample_one_TCTCTCCTAT", "sample_one_TCTTGCTCGG",
"sample_one_TGAACTAACC", "sample_one_TGAAGAAGGT", "sample_one_TGAGCGTTCC",
"sample_one_TGAGTACGTA", "sample_one_TGGAATGGAG", "sample_one_TGTCATTCGC",
"sample_one_TGTGCTTCAG", "sample_one_TGTTCAGGAT", "sample_one_TTACACACGT",
"sample_one_TTACTGTGAC", "sample_one_TTATAGGAGG", "sample_one_TTATCGCGTT",
"sample_one_TTATGCCGCG", "sample_one_TTCACGGAAG", "sample_one_TTCAGGAGTA",
"sample_one_TTCCATCGAG", "sample_one_TTCGAGTGAT", "sample_one_TTCTGTACCT",
"sample_one_TTGGCAATTC", "sample_one_TTGGCTCCAC", "sample_one_TTGGTAACAG",
"sample_one_TTGGTCAGTA", "sample_one_TTGTCGGCCA", "sample_one_TTGTGTTCGA"
)
stopifnot(identical(sampleNames(sample_1_result), expected_sample_names))
expected_var_labels <- c(
"WellBC",
"WellID",
"NumberOfMTReads",
"pctMT",
"NumberOfERCCReads",
"pctERCC",
"NumberOfChromReads",
"pctChrom",
"NumberOfInputReads",
"NumberOfMappedReads",
"PctMappedReads",
"NumberOfReadsMappedToMultipleLoci",
"PectOfReadsMappedToMultipleLoci",
"NumberOfReadsMappedToTooManyLoci",
"PectOfReadsMappedToTooManyLoci",
"NumberOfReadsUnmappedTooManyMismatches",
"PectOfReadsUnmappedTooManyMismatches",
"NumberOfReadsUnmappedTooShort",
"PectOfReadsUnmappedTooShort",
"NumberOfReadsUnmappedOther",
"PectOfReadsUnmappedOther",
"ReadsWithValidBarcodes",
"SequencingSaturation",
"Q30BasesInCB.UMI",
"ReadsMappedToTranscriptome.Unique.MultipeGenes",
"EstimatedNumberOfCells",
"FractionOfReadsInCells",
"MeanReadsPerCell",
"NumberOfUMIs",
"NumberOfGenes",
"NumberOfCountedReads",
"PoolName"
)
stopifnot(identical(varLabels(sample_1_result), expected_var_labels))
read_mm <- function(mapping_dir) {
  market_matrix_file <- file.path(mapping_dir, "Solo.out",
                                  "Gene", "raw", "matrix.mtx")
  result <- readMM(market_matrix_file)
  feature_file <- file.path(mapping_dir, "Solo.out",
                            "Gene", "raw", "features.tsv")
  features <- read.table(feature_file, sep = "\t", header = FALSE,
                         col.names = c("ID", "Name", "Type"))$ID
  # Strip a trailing version suffix (a dot followed by digits) from the feature IDs
  rownames(result) <- gsub("\\.\\d+$", "", features)
  barcodes_file <- file.path(mapping_dir,
                             "Solo.out", "Gene", "raw", "barcodes.tsv")
  if (!file.exists(barcodes_file)) {
    stop(paste0("Expected the 'Solo.out/Gene/raw' directory at ",
                mapping_dir, " to contain a 'barcodes.tsv' file."))
  }
  barcodes <- readLines(barcodes_file)
  if (length(barcodes) != 1) {
    stop(paste0("A single STAR Solo folder should contain exactly ",
                "one (1) mapped barcode, but found '",
                length(barcodes), "' for mapping directory ", mapping_dir))
  }
  colnames(result) <- paste0("sample_one_", barcodes)
  return(result)
}
expected_matrices <- lapply(par$star_output, read_mm)
expected_matrix <- as.matrix(do.call(cbind, expected_matrices))
result_counts <- exprs(sample_1_result)
stopifnot(length(setdiff(colnames(expected_matrix),
                         colnames(result_counts))) == 0)
stopifnot(length(setdiff(rownames(expected_matrix),
                         rownames(result_counts))) == 0)
expected_matrix_sorted <- expected_matrix[, colnames(result_counts)]
stopifnot(identical(result_counts, expected_matrix_sorted))

@@ -0,0 +1,41 @@
name: "check_cutadapt_output"
namespace: "integration_test_components/well_demultiplexing"
description: "This component tests the cutadapt output from the well_demultiplex subworkflow."
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ author, maintainer ]
argument_groups:
  - name: Inputs
    arguments:
      - name: "--fastq_r1"
        type: file
        required: true
        multiple: true
        description: Path to the forward reads to test.
      - name: "--fastq_r2"
        type: file
        required: true
        multiple: true
        description: Path to the reverse reads to test.
      - name: "--ids"
        type: string
        description: "Well IDs for the corresponding fastq input"
        required: true
        multiple: true
resources:
  - type: python_script
    path: script.py
engines:
  - type: docker
    image: python:3.12-slim
    setup:
      - type: apt
        packages:
          - procps
      - type: python
        packages:
          - dnaio
runners:
  - type: executable
  - type: nextflow

@@ -0,0 +1,78 @@
import dnaio
from operator import itemgetter

## VIASH START
par = {
}
## VIASH END

def assert_number_of_reads(reads):
    expected_number_of_reads = {
        "SRR14730301__A1": 165,
        "SRR14730301__B1": 194,
        "SRR14730302__A1": 141,
        "SRR14730302__B1": 213,
        "SRR14730302__unknown": 99646,
        "SRR14730301__unknown": 99641,
    }
    for input_id, expected_reads in expected_number_of_reads.items():
        num_reads = len(reads[input_id])
        assert num_reads == expected_reads, \
            f"Expected number of output reads for {input_id} to be {expected_reads}, was {num_reads}."

def string_difference(string1, string2):
    # Count the positions at which the two strings differ (case-insensitive).
    result = 0
    for char1, char2 in zip(string1, string2, strict=True):
        if char1.lower() != char2.lower():
            result += 1
    return result

def assert_barcodes_not_removed(reads):
    barcodes = {
        "SRR14730301__A1": "ACACCGAATT",
        "SRR14730302__A1": "ACACCGAATT",
        "SRR14730301__B1": "GGCTATTGAT",
        "SRR14730302__B1": "GGCTATTGAT"
    }
    for sample_id, barcode in barcodes.items():
        sample_reads = reads[sample_id]
        forward_reads = map(itemgetter(0), sample_reads)
        for i, forward_read in enumerate(forward_reads):
            read_sequence = forward_read.sequence
            read_barcode_start = read_sequence[: len(barcode)]
            # A 10% difference is allowed.
            assert string_difference(read_barcode_start, barcode) <= (0.1 * len(barcode)), \
                (f"Expected barcode {barcode} to be present for sample {sample_id} "
                 f"in read {i}. Found {read_barcode_start}")

def create_input_mapping(sample_ids, inputs_r1, inputs_r2):
    return {sample_id: [input_r1, input_r2]
            for sample_id, input_r1, input_r2
            in zip(sample_ids, inputs_r1, inputs_r2, strict=True)}

def read_input_files(input_mapping):
    expected_keys = {"SRR14730301__A1", "SRR14730301__B1",
                     "SRR14730302__A1", "SRR14730302__B1",
                     "SRR14730301__unknown", "SRR14730302__unknown"}
    difference = set(input_mapping.keys()) - expected_keys
    assert not difference, f"Found unexpected output id(s): {difference}"
    result = {}
    for input_id, input_files in input_mapping.items():
        input_r1, input_r2 = input_files
        # This reads the files into memory,
        # but they are reasonably small
        with dnaio.open(input_r1) as r1_reads, dnaio.open(input_r2) as r2_reads:
            for r1_read, r2_read in zip(r1_reads, r2_reads, strict=True):
                result.setdefault(input_id, []).append((r1_read, r2_read))
    return result

def main(par):
    inputs = create_input_mapping(par["ids"], par["fastq_r1"], par["fastq_r2"])
    reads = read_input_files(inputs)
    assert_number_of_reads(reads)
    assert_barcodes_not_removed(reads)

if __name__ == "__main__":
    main(par)

15
src/io/publish_fastqs/code.sh Executable file

@@ -0,0 +1,15 @@
#!/bin/bash

echo "Publishing $par_input -> $par_output"

echo
echo "Creating directory if it does not exist:"
mkdir -p "$par_output" && echo "$par_output created"

echo
echo "Copying files..."
# Split the ';'-separated list of inputs into an array
IFS=";" read -ra input <<< "$par_input"
for i in "${input[@]}"; do
  cp -rL "$i" "$par_output/"
done

@@ -0,0 +1,34 @@
name: "publish_fastqs"
namespace: "io"
description: "Publish the fastq files per well"
argument_groups:
  - name: Input arguments
    arguments:
      - name: --input
        description: FASTQ files to publish
        type: file
        multiple: true
        required: true
  - name: Output arguments
    arguments:
      - name: --output
        description: Directory to write fastq data to
        type: file
        direction: output
        # ID is the well barcode
        default: "$id/"
resources:
  - type: bash_script
    path: ./code.sh
engines:
  - type: docker
    image: debian:stable-slim
    setup:
      - type: apt
        packages:
          - procps
runners:
  - type: executable
  - type: nextflow

93
src/io/publish_results/code.sh Executable file

@@ -0,0 +1,93 @@
#!/bin/bash
set -eo pipefail
echo "Publishing results to multiple output directories"
# Create output directories for multiple files
echo "Creating output directories..."
path_pars=(
par_star_output_dir
par_nrReadsNrGenesPerChrom_dir
par_star_qc_metrics_dir
par_eset_dir
par_f_data_dir
par_p_data_dir
par_html_report_output
par_run_params_output
)
# Resolve every output path to an absolute path (the path does not need to exist yet)
for par in "${path_pars[@]}"; do
  curr_val="${!par}"
  new_value=$(realpath --canonicalize-missing "$curr_val")
  declare -g "$par=$new_value"
done
mkdir -p "$par_star_output_dir" && echo "$par_star_output_dir created"
mkdir -p "$par_nrReadsNrGenesPerChrom_dir" && echo "$par_nrReadsNrGenesPerChrom_dir created"
mkdir -p "$par_star_qc_metrics_dir" && echo "$par_star_qc_metrics_dir created"
mkdir -p "$par_eset_dir" && echo "$par_eset_dir created"
mkdir -p "$par_f_data_dir" && echo "$par_f_data_dir created"
mkdir -p "$par_p_data_dir" && echo "$par_p_data_dir created"
echo
echo "Copying STAR output files..."
IFS=";" read -ra star_output <<<$par_star_output
for i in "${star_output[@]}"; do
echo "Copying $i to $par_star_output_dir/"
cp -rL "$i" "$par_star_output_dir/"
done
echo
echo "Copying nrReadsNrGenesPerChrom files..."
IFS=";" read -ra nrReadsNrGenesPerChrom <<<$par_nrReadsNrGenesPerChrom
for i in "${nrReadsNrGenesPerChrom[@]}"; do
echo "Copying $i to $par_nrReadsNrGenesPerChrom_dir/"
cp -rL "$i" "$par_nrReadsNrGenesPerChrom_dir/"
done
echo
echo "Copying STAR QC metrics files..."
IFS=";" read -ra star_qc_metrics <<<$par_star_qc_metrics
for i in "${star_qc_metrics[@]}"; do
echo "Copying $i to $par_star_qc_metrics_dir/"
cp -rL "$i" "$par_star_qc_metrics_dir/"
done
echo
echo "Copying eset files..."
IFS=";" read -ra eset <<<$par_eset
for i in "${eset[@]}"; do
echo "Copying $i to $par_eset_dir/"
cp -rL "$i" "$par_eset_dir/"
done
echo
echo "Copying f_data files..."
IFS=";" read -ra f_data <<<$par_f_data
for i in "${f_data[@]}"; do
echo "Copying $i to $par_f_data_dir/"
cp -rL "$i" "$par_f_data_dir/"
done
echo
echo "Copying p_data files..."
IFS=";" read -ra p_data <<<$par_p_data
for i in "${p_data[@]}"; do
echo "Copying $i to $par_p_data_dir/"
cp -rL "$i" "$par_p_data_dir/"
done
echo
echo "Copying single files directly..."
# Quote the command substitution so that paths containing spaces are not split
mkdir -p "$(dirname "$par_html_report_output")"
echo "Copying $par_html_report to $par_html_report_output"
cp -L "$par_html_report" "$par_html_report_output"
mkdir -p "$(dirname "$par_run_params_output")"
echo "Copying $par_run_params to $par_run_params_output"
cp -L "$par_run_params" "$par_run_params_output"
echo
echo "Publishing completed successfully!"
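
# --- Illustrative sketch, not part of the original component ---
# The six copy loops above share one pattern: split a ';'-separated argument
# and copy every entry into a target directory. Assuming that pattern holds,
# it could be factored into a helper such as the hypothetical `copy_all` below.
# The function name and its argument order are assumptions made for this sketch.
function copy_all {
  local sources target src
  # Split the ';'-separated list of source paths into an array
  IFS=";" read -ra sources <<< "$1"
  target="$2"
  mkdir -p "$target"
  for src in "${sources[@]}"; do
    echo "Copying $src to $target/"
    # -r: recurse into directories, -L: follow symlinks (same flags as the loops above)
    cp -rL "$src" "$target/"
  done
}
# Possible usage: copy_all "$par_eset" "$par_eset_dir"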

@@ -0,0 +1,91 @@
name: "publish_results"
namespace: "io"
description: "Publish the results"
argument_groups:
  - name: Input arguments
    arguments:
      - name: --star_output
        description: Output from mapping with STAR
        type: file
        multiple: true
        required: true
      - name: "--nrReadsNrGenesPerChrom"
        type: file
        multiple: true
        required: true
      - name: "--star_qc_metrics"
        type: file
        multiple: true
        required: true
      - name: "--eset"
        type: file
        multiple: true
        required: true
      - name: "--f_data"
        type: file
        multiple: true
        required: true
      - name: "--p_data"
        type: file
        multiple: true
        required: true
      - name: "--html_report"
        type: file
        required: true
      - name: "--run_params"
        type: file
        required: true
  - name: Output directory
    description: |
      Determines the name of output directories
    arguments:
      - name: --star_output_dir
        type: file
        direction: output
        default: "star_output"
      - name: --nrReadsNrGenesPerChrom_dir
        type: file
        direction: output
        default: "nrReadsNrGenesPerChrom"
      - name: --star_qc_metrics_dir
        type: file
        direction: output
        default: "starLogs"
      - name: --eset_dir
        type: file
        direction: output
        default: "esets"
      - name: --f_data_dir
        type: file
        direction: output
        default: "fData"
      - name: --p_data_dir
        type: file
        direction: output
        default: "pData"
  - name: "Output file arguments"
    description: Determines the name of output files
    arguments:
      - name: "--run_params_output"
        type: file
        direction: output
      - name: "--html_report_output"
        type: file
        direction: output
resources:
  - type: bash_script
    path: ./code.sh
engines:
  - type: docker
    image: debian:stable-slim
    setup:
      - type: apt
        packages:
          - procps
runners:
  - type: executable
  - type: nextflow

BIN
src/parallel_map/STAR Executable file

Binary file not shown.

@@ -0,0 +1,124 @@
name: parallel_map
description: |
  Map wells in batch, using STAR
  Spliced Transcripts Alignment to a Reference (C) Alexander Dobin
  https://github.com/alexdobin/STAR
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ maintainer ]
  - __merge__: /src/base/authors/toni_verbeiren.yaml
    roles: [ author, maintainer ]
requirements:
  commands:
    - STAR
    - file
    - parallel
argument_groups:
  - name: Input arguments
    arguments:
      - name: "--input_r1"
        description: |
          Input FASTQ files for the forward reads. All FASTQ file names must start with the prefix '{well_id}_R1', where
          'well_id' can be found as the sequence identifier in the barcodes FASTA file (see 'barcodesFasta' argument).
          For each FASTQ file, a matching FASTQ file for the reverse reads must be provided to the 'input_r2' argument,
          meaning that their 'well_id' prefix must match. The number of items provided for 'input_r1' must be equal
          to the number of items for 'input_r2'.
        type: file
        required: true
        multiple: true
      - name: "--input_r2"
        description: |
          Input FASTQ files for the reverse reads. All FASTQ file names must start with the prefix '{well_id}_R2', where
          'well_id' can be found as the sequence identifier in the barcodes FASTA file (see 'barcodesFasta' argument).
          For each FASTQ file, a matching FASTQ file for the forward reads must be provided to the 'input_r1' argument,
          meaning that their 'well_id' prefix must match. The number of items provided for 'input_r1' must be equal
          to the number of items for 'input_r2'.
        type: file
        required: true
        multiple: true
      - name: "--genomeDir"
        description: |
          Reference genome index to map against. Can be generated from genomic FASTA sequences and a genome annotation
          by using STAR with '--runMode genomeGenerate'.
        type: file
        required: true
      - name: "--barcodesFasta"
        type: file
        required: true
        description: |
          FASTA file where each entry specifies a unique barcode sequence present at the start of the forward input reads
          (input_r1). The ID of each barcode (the start of the FASTA header up until the first whitespace character) must
          match the start of the file names of one input FASTQ pair.
  - name: Barcode arguments
    arguments:
      - name: "--umiLength"
        type: integer
        required: true
        description: |
          Length of the Unique Molecular Identifiers (UMIs). The UMIs are expected to be located after the barcodes in the
          forward reads.
      - name: "--limitBAMsortRAM"
        type: string
        default: "10000000000"
  - name: Runtime arguments
    arguments:
      - name: "--runThreadN"
        description: "Number of threads to use for a single STAR execution."
        type: integer
        default: 1
  - name: Output arguments
    arguments:
      - name: "--output"
        type: file
        description: |
          A list of output folders, each the result of using STAR to map one input FASTQ pair to the reference genome.
          The order of the items DOES NOT necessarily match the order of the entries in the barcodes FASTA file or
          of the input FASTQ pairs.
        required: true
        multiple: true
        direction: output
        default: './*'
      - name: "--joblog"
        type: file
        description: Where to store the log file listing all the jobs.
        required: false
        direction: output
        default: "execution_log.txt"
resources:
  - type: bash_script
    path: script.sh
  - path: STAR
test_resources:
  - type: bash_script
    path: test.sh
engines:
  - type: docker
    image: debian:stable-slim
    setup:
      - type: apt
        packages:
          - procps
          - wget
          - automake
          - make
          - gcc
          - g++
          - zlib1g-dev
          - parallel
          - file
          - seqkit
      - type: docker
        build_args:
          - STAR_V=2.7.6a
        env:
          - STAR_SOURCE="https://github.com/alexdobin/STAR/archive/refs/tags/$STAR_V.tar.gz"
          - STAR_TARGET="/app/star-$STAR_V.tar.gz"
          - STAR_INSTALL_DIR="/app/STAR-$STAR_V"
          - STAR_BINARY=STAR
        copy:
          - STAR /usr/local/bin/$STAR_BINARY
runners:
  - type: executable
  - type: nextflow

342
src/parallel_map/script.sh Executable file

@@ -0,0 +1,342 @@
#!/bin/bash
## VIASH START
par_input_r1="work/2c/5b8b3a2dd4a988b8838e3f72d38a37/_viash_par/input_r1_1/two__ACACCGAATT.concat_text_r1.output.txt"
par_input_r2="work/2c/5b8b3a2dd4a988b8838e3f72d38a37/_viash_par/input_r2_1/two__ACACCGAATT.concat_text_r2.output.txt"
par_barcodes="ACACCGAATT;GGCTATTGAT"
par_output="./*"
par_genomeDir="star"
par_umiLength=10
par_limitBAMsortRAM="10000000000"
meta_cpus=2
par_runThreadN=1
## VIASH END
set -eo pipefail
# Check if wildcard character is present in output folder template
printf "Checking if output folder template ($par_output) contains a single wildcard character '*'. "
output_glob_character="${par_output//[^\*]}"
if [[ "${#output_glob_character}" -ne "1" ]]; then
echo "The value for --output must contain exactly one '*' character. Exiting..."
exit 1
else
echo "Done, wildcard character found!"
fi
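
# --- Illustrative sketch, not part of the original script ---
# The check above counts wildcards by deleting every character that is not '*'
# and measuring the length of what remains. The hypothetical helper
# `count_wildcards` (a name assumed for this sketch) isolates that parameter
# expansion so it can be tried on other values.
function count_wildcards {
  # "${1//[^\*]}" removes all characters except '*' from the first argument
  local only_stars="${1//[^\*]}"
  echo "${#only_stars}"
}
# Possible usage: count_wildcards "$par_output"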
# Split the delimited strings into arrays
IFS=';' read -r -a input_r1 <<< "$par_input_r1"
IFS=';' read -r -a input_r2 <<< "$par_input_r2"
# Read barcodes FASTA
# seqkit will make sure to take the leading non-whitespace as sequence identifier (ID)
# Luckily, this is the same as how cutadapt determines an adapter name from the FASTA header.
readarray -t well_ids < <(seqkit seq --name "$par_barcodesFasta" )
readarray -t barcodes < <(seqkit seq --seq --upper-case --remove-gaps --gap-letters '^' --validate-seq "$par_barcodesFasta")
# Function to test for unique values in array
function arrayContainsUniqueValues {
# Pass the argument by reference
local -n arr=$1
# Create a temporary associative array
# in order to use its uniqueness of keys
# 'declare' in a function is automatically local
declare -A uniq_tmp
for item in "${arr[@]}"; do
uniq_tmp[$item]=0 # assigning a placeholder
done
local unique_array_values=(${!uniq_tmp[@]})
if [ "${#unique_array_values[@]}" -eq "${#arr[@]}" ]; then
return
fi
false
}
# Test the function directly in the condition: with 'set -e' active, a bare
# failing call would abort the script before the message below is printed.
if ! arrayContainsUniqueValues barcodes; then
  echo "The provided barcodes should be unique!"
  echo "Values: ${barcodes[*]}"
  exit 1
fi
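
# --- Illustrative sketch, not part of the original script ---
# When the uniqueness check fails, it can help to report which values clash.
# The hypothetical helper `print_duplicates` below uses 'sort' and 'uniq -d',
# which is a different technique from the associative array used above;
# it prints each duplicated value once, in sorted order.
function print_duplicates {
  # One value per line, sorted so that equal values are adjacent for 'uniq -d'
  printf '%s\n' "$@" | sort | uniq -d
}
# Possible usage: print_duplicates "${barcodes[@]}"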
# Check that the number of values provided for the fastq files are the same.
num_r1_inputs="${#input_r1[@]}"
num_r2_inputs="${#input_r2[@]}"
if [ "$num_r1_inputs" -ne "$num_r2_inputs" ]; then
  echo "The number of values for arguments" \
       "'input_r1' ($num_r1_inputs) and 'input_r2' ($num_r2_inputs)" \
       "should be the same."
  exit 1
else
  echo "Checked that the same number of R1 FASTQ ($num_r1_inputs) and R2 FASTQ files" \
       "($num_r2_inputs) were provided. Seems OK!"
fi
# Loop over the well IDs and match them to the input FASTQ files
# The FASTQ file names should have the format {well_id}_R(1|2).fastq,
# which is the output format that the cutadapt component uses for demultiplexing.
# sorted_input_r1 and sorted_input_r2 are the input FASTQ files sorted by the order
# of the barcodes in the barcodes array (i.e. the order in the barcodes FASTA file).
declare -a sorted_input_r1=()
declare -a sorted_input_r2=()
for barcode_index in "${!barcodes[@]}"; do
barcode="${barcodes[$barcode_index]}"
well_id="${well_ids[$barcode_index]}"
echo "Finding FASTQ files for barcode ${barcode}, well ID '${well_id}'."
# The FASTQ files for a particular barcode must match the following regex:
input_file_regex="^${well_id}_R[1-2]"
for r1_index in "${!input_r1[@]}"; do
r1_file_path=${input_r1[$r1_index]}
r2_file_path=${input_r2[$r1_index]}
# Get the file names from the full path
r1_file_name=$(basename -- "$r1_file_path")
r2_file_name=$(basename -- "$r2_file_path")
# Check if the file names match the regex
if [[ $r1_file_name =~ $input_file_regex ]]; then
echo "Matched with $r1_file_name and $r2_file_name."
# If the R1 FASTQ file matched the regex,
# the R2 file must have also been matched
if ! [[ $r2_file_name =~ $input_file_regex ]]; then
echo "File ${r1_file_name} matched with regex ${input_file_regex} "\
"but ${r2_file_name} did not! Make sure that the order of "\
"the R1 and R2 input files match."
exit 1
fi
# Add the matched R1/R2 pair to the sorted arrays
sorted_input_r1+=("$r1_file_path")
sorted_input_r2+=("$r2_file_path")
# Do not continue looking for more files for this barcode
# '2' to affect the *outer* loop (which indeed loops barcodes)!
continue 2
fi
done
echo "Did not find FASTQ files for well ${well_id}!"\
"Make sure that the input files have the correct file name format."\
"Input files: ${input_r1[*]}"
exit 1
done
# Define the function that will be used to run a single job
function _run() {
local par_UMIlength="$1"
local par_output="$2"
local par_genomeDir="$3"
local par_limitBAMsortRAM="$4"
local par_runThreadN="$5"
local barcode="$6"
local input_R1="$7"
local input_R2="$8"
local barcode_length="${#barcode}"
local umi_start="$(($barcode_length + 1))"
set -eo pipefail
cat <<-EOF
Processing $barcode
For the following inputs:
$input_R1
$input_R2
EOF
echo "Writing barcode '$barcode' to $barcode.txt and using it as input."
# Note that there is no possible conflict between jobs here
# because the barcodes are unique (and the barcode is part of the name
# of the file).
echo "$barcode" > "$barcode.txt"
local dir="${par_output//\*/$barcode}/"
echo "Setting output for barcode '$barcode' to '$dir'."
mkdir -p "$dir"
# Create a job-specific temporary directory (for decompressed inputs and STAR temp files)
local TMPDIR=$(mktemp -d "$meta_temp_dir/parallel_map-$barcode-XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
}
trap clean_up RETURN
# Decompress the input files when needed
# NOTE: for some reason, using STAR's --readFilesCommand does not always work
# This might be because STAR creates fifo files (see https://man7.org/linux/man-pages/man7/fifo.7.html)
# and this requires a filesystem that supports this. Another cause might be that the input files
# are symlinks. When testing this, using '--readFilesCommand "zcat"'
# always produced empty BAM files, but also a successful exit code (0) so the problem is not reported.
# However, the logs showed the following error: "gzip -: unexpected end of file".
function is_gzipped {
printf "Checking if input '$1' (barcode '$barcode') is gzipped... "
if file "$1" | grep -q 'gzip'; then
echo "Done, detected compressed file."
return
fi
echo "Done, file does not need decompression."
false
}
# Resolve symbolic links to actual file paths
input_R1=$(realpath "$input_R1")
input_R2=$(realpath "$input_R2")
if is_gzipped "$input_R1"; then
local compressed_file_name_r1="$(basename -- "$input_R1")"
local uncompressed_file_r1="$TMPDIR/${compressed_file_name_r1%.gz}"
printf "Unpacking input to $uncompressed_file_r1... "
zcat "$input_R1" > "$uncompressed_file_r1"
echo "Decompression done."
else
local uncompressed_file_r1="$input_R1"
fi
if is_gzipped "$input_R2"; then
local compressed_file_name_r2="$(basename -- "$input_R2")"
local uncompressed_file_r2="$TMPDIR/${compressed_file_name_r2%.gz}"
printf "Unpacking input to $uncompressed_file_r2... "
zcat "$input_R2" > "$uncompressed_file_r2"
echo "Decompression done."
else
local uncompressed_file_r2="$input_R2"
fi
local n_input_lines_r1=$(wc -l < "$uncompressed_file_r1")
local n_input_lines_r2=$(wc -l < "$uncompressed_file_r2")
printf "Checking if length of input file mates match. "
if (( n_input_lines_r1 != n_input_lines_r2 )); then
echo "The length of file $input_R1 ($n_input_lines_r1) does not match with $input_R2 ($n_input_lines_r2)"
return 1
else
echo "Seems OK, $n_input_lines_r1 input lines."
fi
echo "Starting STAR for barcode '$barcode'"
# soloType 'Droplet' is the same as 'CB_UMI_Simple': one UMI and one cell barcode of fixed length.
# By default in this mode, STAR will look for the cell barcode and the UMI in the last file specified with --readFilesIn.
# So we need to specify R2 first and R1 second, because R1 contains the barcode and UMI.
# Also, you might be tempted to use '--soloBarcodeMate 1' to alter this behavior, but this requires clipping
# the barcode from this mate by specifying --clip5pNbases and/or --clip3pNbases, which we do not want to do.
STAR \
--readFilesIn "$uncompressed_file_r2" "$uncompressed_file_r1" \
--soloType Droplet \
--quantMode GeneCounts \
--genomeLoad LoadAndKeep \
--limitBAMsortRAM "$par_limitBAMsortRAM" \
--runThreadN "$par_runThreadN" \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--soloCBstart 1 \
--readFilesType "Fastx" \
--soloCBlen "$barcode_length" \
--soloUMIstart "$umi_start" \
--soloUMIlen "$par_UMIlength" \
--soloBarcodeReadLength 0 \
--soloStrand Unstranded \
--soloFeatures Gene \
--genomeDir "$par_genomeDir" \
--outReadsUnmapped Fastx \
--outSAMunmapped Within \
--outSAMattributes NH HI nM AS CR UR CB UB GX GN \
--soloCBwhitelist "$barcode.txt" \
--outFileNamePrefix "$dir" \
--outTmpDir "$TMPDIR/STARtemp/"
printf "Done running STAR. "
# Check if the number of processed reads is equal to the number of input reads
local n_input_reads=$(($n_input_lines_r1 / 4))
local nr_output_reads=$(grep -Po "Number\ of\ input\ reads \\|\W*\K\d+" "$dir/Log.final.out")
if (( $nr_output_reads != $n_input_reads )); then
echo "Not all input reads were processed for barcode $barcode."
return 1
else
echo "Processed $nr_output_reads reads for barcode $barcode."
fi
printf "Making sure that the output has the proper permissions. "
find "$dir" -type d -exec chmod o+x {} \;
chmod -R o+r "$dir"
echo "Done"
}
# Export the function - requires bash
export -f _run
# Load reference genome
echo "Loading reference genome"
STAR --genomeLoad LoadAndExit --genomeDir "$par_genomeDir"
# Run the concurrent jobs using GNU parallel
# Make sure that parallel uses the correct shell
export PARALLEL_SHELL="/bin/bash"
# Some notes:
# --halt soon,fail=1: instruct parallel to stop starting new jobs once a job has failed; jobs that are already running are allowed to finish.
#
# ::: is a special syntax for GNU parallel to delineate inputs
# If multiple ::: are given, each group will be treated as an input source, and all combinations of input
# sources will be generated. E.g. ::: 1 2 ::: a b c will result in the combinations (1,a) (1,b) (1,c) (2,a) (2,b) (2,c)
# The delimiter :::+ (note the extra '+') links the argument to the previous argument, and one argument from each of the input
# sources will be read.
parallel_cmd=("parallel" "--jobs" "80%" "--verbose" "--memfree" "2G"
"--tmpdir" "$meta_temp_dir"
"--retry-failed" "--retries" "4" "--halt" "soon,fail=1"
"--joblog" "$par_joblog" "_run" "{}")
# Arguments for which there is one value, so these will not create extra jobs
parallel_cmd+=(":::" "$par_umiLength" ":::" "$par_output" ":::" "$par_genomeDir" ":::" "$par_limitBAMsortRAM" ":::" "$par_runThreadN")
# Argument which in fact will cause extra jobs to be spawned, per job one item from each argument will be selected
# Thus, these argument lists should have the same length.
parallel_cmd+=(":::" "${barcodes[@]}" ":::+" "${sorted_input_r1[@]}" ":::+" "${sorted_input_r2[@]}")
set +eo pipefail
"${parallel_cmd[@]}"
exit_code=$?
set -eo pipefail
echo "GNU parallel finished!"
# Unload reference
printf "Unloading reference genome. "
STAR --genomeLoad Remove --genomeDir "$par_genomeDir"
echo "Done!"
# Exit code from GNU parallel:
# If fail=1 is used, the exit status will be the exit status of the failing job.
echo "Checking exit code"
if ((exit_code>0)); then
# Note that the ending HERE must be indented with TAB characters (not spaces)
# in order to remove leading indentation
MESSAGE=$(
cat <<-HERE
==================================================================
!!! An error occurred for one of the jobs.
Exit code of the failing job: $exit_code.
%s
==================================================================
HERE
)
printf "$MESSAGE" "$(<"$par_joblog")"
exit 1
else
cat <<-HERE
==================================================================
Mapping went fine (exit code '$exit_code'), zero errors occurred
==================================================================
HERE
fi
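
The `_run` function above derives each per-barcode output directory by replacing the `*` in `--output` with the barcode. A minimal sketch of that substitution, combined with an explicit check that the wildcard is present, could look like the block below. The function name `expand_output_template` and its error message are hypothetical and not part of the component; only the `${par_output//\*/$barcode}/` expansion is taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the component): expand an output template
# such as "results/output_*" into a per-barcode directory path.
expand_output_template() {
  local template="$1"
  local barcode="$2"
  # Refuse templates without a wildcard: every barcode would otherwise
  # be written to the same directory.
  if [[ "$template" != *'*'* ]]; then
    echo "Output template '$template' must contain a '*' wildcard." >&2
    return 1
  fi
  # Same parameter expansion as in _run: replace every '*' by the barcode.
  printf '%s\n' "${template//\*/$barcode}/"
}

expand_output_template "results/output_*" "ACAGTCACAG"
```

Keeping the check next to the expansion means a template without `*` fails before any STAR job is started, rather than letting jobs overwrite each other's output.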

466
src/parallel_map/test.sh Executable file
View File

@@ -0,0 +1,466 @@
set -eo pipefail
## VIASH START
meta_executable=$(realpath "target/executable/parallel_map/parallel_map")
## VIASH END
# Some helper functions
assert_directory_exists() {
[ -d "$1" ] || { echo "Directory '$1' does not exist" && exit 1; }
}
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
echo "> Prepare test data in $meta_temp_dir"
TMPDIR=$(mktemp -d --tmpdir="$meta_temp_dir")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
}
trap clean_up EXIT
# Sample 1, barcode ACAGTCACAG, UMI CTACGGATGA
cat > "$TMPDIR/sample1_R1.fastq" <<'EOF'
@SAMPLE_1_SEQ_ID1
ACAGTCACAGCTACGGATGAGCCTCATAAGCCTCACACATCCGCGCCTATGTTGTGACTCTCTGTGAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SAMPLE_1_SEQ_ID2
ACAGTCACAGCTACGGATGAGCCTCATAAGCCTCACACATCCGCGCCTATGTTGTGACTCTCTGTGAG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
EOF
cat > "$TMPDIR/sample1_R2.fastq" <<'EOF'
@SAMPLE_1_SEQ_ID1
CTCACAGAGAGTCACAACATAGGCGCGGATGTGTGAGGCTTATGAGGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SAMPLE_1_SEQ_ID2
CTCACAGAGAGTCACAACATAGGCGCGGATGTGTGAGGCTTATGAGGC
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
EOF
# Sample 2, barcode CGGGTTTACC, UMI GCTAGCTAGC
cat > "$TMPDIR/sample2_R1.fastq" << 'EOF'
@SAMPLE_2_SEQ_ID1
CGGGTTTACCGCTAGCTAGCCACCACTATGGTTGGCCGGTTAGTAGTGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SAMPLE_2_SEQ_ID2
CGGGTTTACCGCTAGCTAGCCACCACTATGGTTGGCCGGTTAGTAGTGT
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
EOF
cat > "$TMPDIR/sample2_R2.fastq" <<'EOF'
@SAMPLE_2_SEQ_ID1
ACACTACTAACCGGCCAACCATAGTGGTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SAMPLE_2_SEQ_ID2
ACACTACTAACCGGCCAACCATAGTGGTG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
EOF
cat > "$TMPDIR/barcodes.fasta" <<'EOF'
>sample1
ACAGTCACAG
>sample2
CGGGTTTACC
EOF
# Note that there is a sjdbGTFchrPrefix argument for STAR:
# prefix for chromosome names in a GTF file (default: '-')
cat > "$TMPDIR/genome.fasta" <<'EOF'
>1
TGGCATGAGCCAACGAACGCTGCCTCATAAGCCTCACACATCCGCGCCTATGTTGTGACTCTCTGTGAGCGTTCGTGGG
GCTCGTCACCACTATGGTTGGCCGGTTAGTAGTGTGACTCCTGGTTTTCTGGAGCTTCTTTAAACCGTAGTCCAGTCAA
TGCGAATGGCACTTCACGACGGACTGTCCTTAGCTCAGGGGA
EOF
cat > "$TMPDIR/genes.gtf" <<'EOF'
1 example_source gene 0 72 . + . gene_id "gene1"; gene_name: "GENE1;
1 example_source exon 20 71 . + . gene_id "gene1"; gene_name: "GENE1"; exon_id: gene1_exon1;
1 example_source gene 80 160 . + . gene_id "gene2"; gene_name: "GENE2;
1 example_source exon 80 159 . + . gene_id "gene2"; gene_name: "GENE2"; exon_id: gene2_exon1;
EOF
echo "> Generate index"
STAR \
${meta_cpus:+--runThreadN $meta_cpus} \
--runMode genomeGenerate \
--genomeDir "$TMPDIR/index/" \
--genomeFastaFiles "$TMPDIR/genome.fasta" \
--sjdbGTFfile "$TMPDIR/genes.gtf" \
--genomeSAindexNbases 2 > /dev/null 2>&1
echo "> Run test 1"
run_1_dir="$TMPDIR/run_1"
mkdir -p "$run_1_dir"
pushd "$run_1_dir" > /dev/null
"$meta_executable" \
--input_r1 "$TMPDIR/sample1_R1.fastq;$TMPDIR/sample2_R1.fastq" \
--input_r2 "$TMPDIR/sample1_R2.fastq;$TMPDIR/sample2_R2.fastq" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_*"
popd
echo ">> Check if output directories exist"
sample1_out="$TMPDIR/output_ACAGTCACAG"
sample2_out="$TMPDIR/output_CGGGTTTACC"
assert_directory_exists "$sample1_out"
assert_directory_exists "$sample2_out"
echo ">> Check if output files have been created"
for sample in "$sample1_out" "$sample2_out"; do
assert_file_exists "$sample/Aligned.sortedByCoord.out.bam"
assert_file_exists "$sample/Unmapped.out.mate1"
assert_file_exists "$sample/Unmapped.out.mate2"
assert_file_exists "$sample/Log.out"
assert_file_exists "$sample/Log.final.out"
assert_file_exists "$sample/ReadsPerGene.out.tab"
done
echo ">> Check if Solo output is present"
for sample in "$sample1_out" "$sample2_out"; do
assert_directory_exists "$sample/Solo.out"
assert_directory_exists "$sample/Solo.out/Gene"
assert_file_exists "$sample/Solo.out/Barcodes.stats"
assert_file_exists "$sample/Solo.out/Gene/raw/barcodes.tsv"
assert_file_exists "$sample/Solo.out/Gene/raw/features.tsv"
assert_file_exists "$sample/Solo.out/Gene/raw/matrix.mtx"
assert_file_exists "$sample/Solo.out/Gene/filtered/barcodes.tsv"
assert_file_exists "$sample/Solo.out/Gene/filtered/features.tsv"
assert_file_exists "$sample/Solo.out/Gene/filtered/matrix.mtx"
done
echo ">> Check contents of output"
echo ">>> Sample 1"
assert_file_contains "$sample1_out/Solo.out/Barcodes.stats" "nExactMatch 2"
assert_file_contains "$sample1_out/Log.final.out" "Uniquely mapped reads number | 2"
assert_file_contains "$sample1_out/Log.final.out" "Number of input reads | 2"
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/barcodes.tsv" || { echo "Barcodes file is different"; exit 1; }
ACAGTCACAG
EOF
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/features.tsv" || { echo "Features file is different"; exit 1; }
gene1 gene1 Gene Expression
gene2 gene2 Gene Expression
EOF
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/matrix.mtx" || { echo "Matrix file is different"; exit 1; }
%%MatrixMarket matrix coordinate integer general
%
2 1 1
1 1 1
EOF
echo ">>> Sample 2"
assert_file_contains "$sample2_out/Solo.out/Barcodes.stats" "nExactMatch 2"
assert_file_contains "$sample2_out/Log.final.out" "Uniquely mapped reads number | 2"
assert_file_contains "$sample2_out/Log.final.out" "Number of input reads | 2"
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/barcodes.tsv" || { echo "Barcodes file is different"; exit 1; }
CGGGTTTACC
EOF
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/features.tsv" || { echo "Features file is different"; exit 1; }
gene1 gene1 Gene Expression
gene2 gene2 Gene Expression
EOF
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/matrix.mtx" || { echo "Matrix file is different"; exit 1; }
%%MatrixMarket matrix coordinate integer general
%
2 1 1
2 1 1
EOF
echo "> Run test 2 (compressed input)"
gzip -c "$TMPDIR/sample1_R1.fastq" > "$TMPDIR/sample1_R1.fastq.gz"
gzip -c "$TMPDIR/sample2_R1.fastq" > "$TMPDIR/sample2_R1.fastq.gz"
gzip -c "$TMPDIR/sample1_R2.fastq" > "$TMPDIR/sample1_R2.fastq.gz"
gzip -c "$TMPDIR/sample2_R2.fastq" > "$TMPDIR/sample2_R2.fastq.gz"
run_2_dir="$TMPDIR/run_2"
mkdir -p "$run_2_dir"
pushd "$run_2_dir" > /dev/null
"$meta_executable" \
--input_r1 "$TMPDIR/sample1_R1.fastq.gz;$TMPDIR/sample2_R1.fastq.gz" \
--input_r2 "$TMPDIR/sample1_R2.fastq.gz;$TMPDIR/sample2_R2.fastq.gz" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_gz_*" > /dev/null 2>&1
popd > /dev/null
echo ">> Check if output directories exist"
sample1_out="$TMPDIR/output_gz_ACAGTCACAG"
sample2_out="$TMPDIR/output_gz_CGGGTTTACC"
assert_directory_exists "$sample1_out"
assert_directory_exists "$sample2_out"
echo ">> Check if output files have been created"
for sample in "$sample1_out" "$sample2_out"; do
assert_file_exists "$sample/Aligned.sortedByCoord.out.bam"
assert_file_exists "$sample/Unmapped.out.mate1"
assert_file_exists "$sample/Unmapped.out.mate2"
assert_file_exists "$sample/Log.out"
assert_file_exists "$sample/Log.final.out"
assert_file_exists "$sample/ReadsPerGene.out.tab"
done
echo ">> Check if Solo output is present"
for sample in "$sample1_out" "$sample2_out"; do
assert_directory_exists "$sample/Solo.out"
assert_directory_exists "$sample/Solo.out/Gene"
assert_file_exists "$sample/Solo.out/Barcodes.stats"
assert_file_exists "$sample/Solo.out/Gene/raw/barcodes.tsv"
assert_file_exists "$sample/Solo.out/Gene/raw/features.tsv"
assert_file_exists "$sample/Solo.out/Gene/raw/matrix.mtx"
assert_file_exists "$sample/Solo.out/Gene/filtered/barcodes.tsv"
assert_file_exists "$sample/Solo.out/Gene/filtered/features.tsv"
assert_file_exists "$sample/Solo.out/Gene/filtered/matrix.mtx"
done
echo ">> Check contents of output"
echo ">>> Sample 1"
assert_file_contains "$sample1_out/Solo.out/Barcodes.stats" "nExactMatch 2"
assert_file_contains "$sample1_out/Log.final.out" "Uniquely mapped reads number | 2"
assert_file_contains "$sample1_out/Log.final.out" "Number of input reads | 2"
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/barcodes.tsv" || { echo "Barcodes file is different"; exit 1; }
ACAGTCACAG
EOF
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/features.tsv" || { echo "Features file is different"; exit 1; }
gene1 gene1 Gene Expression
gene2 gene2 Gene Expression
EOF
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/matrix.mtx" || { echo "Matrix file is different"; exit 1; }
%%MatrixMarket matrix coordinate integer general
%
2 1 1
1 1 1
EOF
echo ">>> Sample 2"
assert_file_contains "$sample2_out/Solo.out/Barcodes.stats" "nExactMatch 2"
assert_file_contains "$sample2_out/Log.final.out" "Uniquely mapped reads number | 2"
assert_file_contains "$sample2_out/Log.final.out" "Number of input reads | 2"
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/barcodes.tsv" || { echo "Barcodes file is different"; exit 1; }
CGGGTTTACC
EOF
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/features.tsv" || { echo "Features file is different"; exit 1; }
gene1 gene1 Gene Expression
gene2 gene2 Gene Expression
EOF
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/matrix.mtx" || { echo "Matrix file is different"; exit 1; }
%%MatrixMarket matrix coordinate integer general
%
2 1 1
2 1 1
EOF
cat > "$TMPDIR/wrong_number_of_barcodes.fasta" <<'EOF'
>A1
ACAGTCACAG
EOF
echo "> Check that a wrong number of barcodes is detected."
run_3_dir="$TMPDIR/run_3"
mkdir -p "$run_3_dir"
pushd "$run_3_dir" > /dev/null
set +eo pipefail
"$meta_executable" \
--input_r1 "$TMPDIR/sample1_R1.fastq.gz;$TMPDIR/sample2_R1.fastq.gz" \
--input_r2 "$TMPDIR/sample1_R2.fastq.gz;$TMPDIR/sample2_R2.fastq.gz" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/wrong_number_of_barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_gz_*" > /dev/null 2>&1 && echo "Expected non-zero exit code " && exit 1
set -eo pipefail
popd > /dev/null
echo "> Check that missing wildcard character is detected."
run_4_dir="$TMPDIR/run_4"
mkdir -p "$run_4_dir"
pushd "$run_4_dir" > /dev/null
set +eo pipefail
"$meta_executable" \
--input_r1 "$TMPDIR/sample1_R1.fastq.gz;$TMPDIR/sample2_R1.fastq.gz" \
--input_r2 "$TMPDIR/sample1_R2.fastq.gz;$TMPDIR/sample2_R2.fastq.gz" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_run4" > /dev/null 2>&1 && echo "Expected non-zero exit code." && exit 1
set -eo pipefail
popd > /dev/null
echo "> Check that a mismatch in the length of the input mates is detected."
empty_input_file="$TMPDIR/empty.fastq"
touch "$empty_input_file"
run_5_dir="$TMPDIR/run_5"
mkdir -p "$run_5_dir"
pushd "$run_5_dir" > /dev/null
set +eo pipefail
"$meta_executable" \
--input_r1 "$TMPDIR/sample1_R1.fastq;$empty_input_file" \
--input_r2 "$TMPDIR/sample1_R2.fastq;$TMPDIR/sample2_R2.fastq" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_run5_*" > /dev/null 2>&1 && echo "Expected non-zero exit code " && exit 1
set -eo pipefail
popd > /dev/null
echo "> Check that wrong number of input files is detected."
run_6_dir="$TMPDIR/run_6"
mkdir -p "$run_6_dir"
pushd "$run_6_dir" > /dev/null
set +eo pipefail
"$meta_executable" \
--input_r1 "$TMPDIR/sample1_R1.fastq" \
--input_r2 "$TMPDIR/sample1_R2.fastq;$TMPDIR/sample2_R2.fastq" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_run_6_*" > /dev/null 2>&1 && echo "Expected non-zero exit code " && exit 1
set -eo pipefail
popd > /dev/null
echo "> Check that wrong FASTQ order is detected."
run_7_dir="$TMPDIR/run_7"
mkdir -p "$run_7_dir"
pushd "$run_7_dir" > /dev/null
set +eo pipefail
"$meta_executable" \
--input_r1 "$TMPDIR/sample2_R1.fastq.gz;$TMPDIR/sample1_R1.fastq.gz" \
--input_r2 "$TMPDIR/sample1_R2.fastq;$TMPDIR/sample2_R2.fastq" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_run_7_*" > /dev/null 2>&1 && echo "Expected non-zero exit code " && exit 1
set -eo pipefail
popd > /dev/null
echo "> Check that the order of input FASTQ files need not match the order of the barcodes"
run_8_dir="$TMPDIR/run_8"
mkdir -p "$run_8_dir"
pushd "$run_8_dir" > /dev/null
"$meta_executable" \
--input_r1 "$TMPDIR/sample2_R1.fastq;$TMPDIR/sample1_R1.fastq" \
--input_r2 "$TMPDIR/sample2_R2.fastq;$TMPDIR/sample1_R2.fastq" \
--genomeDir "$TMPDIR/index/" \
--barcodesFasta "$TMPDIR/barcodes.fasta" \
--umiLength 10 \
--runThreadN 2 \
--output "$TMPDIR/output_*" > /dev/null 2>&1
popd
echo ">> Check if output directories exist"
sample1_out="$TMPDIR/output_ACAGTCACAG"
sample2_out="$TMPDIR/output_CGGGTTTACC"
assert_directory_exists "$sample1_out"
assert_directory_exists "$sample2_out"
echo ">> Check if output files have been created"
for sample in "$sample1_out" "$sample2_out"; do
assert_file_exists "$sample/Aligned.sortedByCoord.out.bam"
assert_file_exists "$sample/Unmapped.out.mate1"
assert_file_exists "$sample/Unmapped.out.mate2"
assert_file_exists "$sample/Log.out"
assert_file_exists "$sample/Log.final.out"
assert_file_exists "$sample/ReadsPerGene.out.tab"
done
echo ">> Check if Solo output is present"
for sample in "$sample1_out" "$sample2_out"; do
assert_directory_exists "$sample/Solo.out"
assert_directory_exists "$sample/Solo.out/Gene"
assert_file_exists "$sample/Solo.out/Barcodes.stats"
assert_file_exists "$sample/Solo.out/Gene/raw/barcodes.tsv"
assert_file_exists "$sample/Solo.out/Gene/raw/features.tsv"
assert_file_exists "$sample/Solo.out/Gene/raw/matrix.mtx"
assert_file_exists "$sample/Solo.out/Gene/filtered/barcodes.tsv"
assert_file_exists "$sample/Solo.out/Gene/filtered/features.tsv"
assert_file_exists "$sample/Solo.out/Gene/filtered/matrix.mtx"
done
echo ">> Check contents of output"
echo ">>> Sample 1"
assert_file_contains "$sample1_out/Solo.out/Barcodes.stats" "nExactMatch 2"
assert_file_contains "$sample1_out/Log.final.out" "Uniquely mapped reads number | 2"
assert_file_contains "$sample1_out/Log.final.out" "Number of input reads | 2"
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/barcodes.tsv" || { echo "Barcodes file is different"; exit 1; }
ACAGTCACAG
EOF
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/features.tsv" || { echo "Features file is different"; exit 1; }
gene1 gene1 Gene Expression
gene2 gene2 Gene Expression
EOF
cat << EOF | cmp -s "$sample1_out/Solo.out/Gene/filtered/matrix.mtx" || { echo "Matrix file is different"; exit 1; }
%%MatrixMarket matrix coordinate integer general
%
2 1 1
1 1 1
EOF
echo ">>> Sample 2"
assert_file_contains "$sample2_out/Solo.out/Barcodes.stats" "nExactMatch 2"
assert_file_contains "$sample2_out/Log.final.out" "Uniquely mapped reads number | 2"
assert_file_contains "$sample2_out/Log.final.out" "Number of input reads | 2"
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/barcodes.tsv" || { echo "Barcodes file is different"; exit 1; }
CGGGTTTACC
EOF
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/features.tsv" || { echo "Features file is different"; exit 1; }
gene1 gene1 Gene Expression
gene2 gene2 Gene Expression
EOF
cat << EOF | cmp -s "$sample2_out/Solo.out/Gene/filtered/matrix.mtx" || { echo "Matrix file is different"; exit 1; }
%%MatrixMarket matrix coordinate integer general
%
2 1 1
2 1 1
EOF
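
The negative tests above repeat the same pattern: disable `set -e`, run the component, and abort if it unexpectedly succeeds. As a sketch only, that pattern could be factored into a small helper. The name `assert_fails` and its message are assumptions for illustration and do not exist in the test script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of test.sh): succeed only if the given
# command exits with a non-zero status.
assert_fails() {
  local description="$1"
  shift
  # A command used as an 'if' condition does not trigger 'set -e',
  # so there is no need to toggle 'set +eo pipefail' around the call.
  if "$@" > /dev/null 2>&1; then
    echo "Expected non-zero exit code: $description" >&2
    return 1
  fi
}

# Example with a command that is known to fail.
assert_fails "false must fail" false && echo "Failure detected as expected."
```

With such a helper, each negative test reduces to one call that passes the description followed by the component command line.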

Binary file not shown.

View File

@@ -0,0 +1,77 @@
name: create_report
namespace: "report"
description: |
  Create a basic QC report in HTML format based on a number of esets.
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ maintainer ]
  - __merge__: /src/base/authors/marijke_van_moerbeke.yaml
    roles: [ author, maintainer ]
argument_groups:
  - name: "Arguments"
    arguments:
      - type: file
        name: "--eset"
        required: true
        multiple: true
      - type: file
        name: "--output_report"
        required: true
        direction: output
        example: report.html
resources:
  - type: r_script
    path: script.R
  - type: r_script
    path: template.Rmd
  - type: r_script
    path: plateLayouts.R
  - path: OutputSTARsolo.png
    type: file
test_resources:
  - type: r_script
    path: test.R
  - path: ./test_data
engines:
  - type: docker
    image: rocker/r2u:24.04
    setup:
      - type: apt
        packages:
          - procps
          - pandoc
      - type: r
        script:
          - install.packages("BiocManager")
          - BiocManager::install(version = "3.21", type = "source", checkBuilt = TRUE)
      - type: r
        bioc:
          - Biobase
          - ComplexHeatmap
        cran:
          - ggplot2
          - knitr
          - gridExtra
          - RColorBrewer
          - processx
          - whisker
          - rmarkdown
          - bookdown
          - data.table
          - platetools
          - htmltools
          - DT
          - logger
          - bit64
        script:
          - install.packages("oaStyle", repos = c(rdepot = "https://repos.openanalytics.eu/repo/public", getOption("repos")))
    test_setup:
      - type: r
        packages:
          - testthat
          - R.utils
runners:
  - type: executable
  - type: nextflow

447
src/report/plateLayouts.R Normal file
View File

@@ -0,0 +1,447 @@
#' Displays the annotation of the wells in a plateLayout
#' @param plateData a data.table object containing the information
#' of the plate. This must contain the columns "WellID" and "SampleName".
#' @param plateName The plate name
#' @param valueVariable The name of the variable in 'plateData' to
#' be visualized in a plate layout.
#' @param textVariable The name of the variable in 'plateData' to be
#' shown in the wells of the plate layout. If NULL, the valueVariable
#' is shown.
#' @param colours A named character vector containing the colours
#' for the different levels of the valueVariable. The names should
#' correspond to the dose levels. If not specified, numeric levels get
#' a scheme of blues, other levels a scheme of greens and "failed" is red.
#' @param breaks Numeric vector indicating breaks for plot coloring.
#' @param colourWellText Colour to display the text in the wells.
#' @param layout Integer vector of length two with number of rows and
#' columns in a plate, e.g. \code{c(16,24)}
#' @param legend.title A title for the legend
#' @param plot.title A title for the plot, will be concatenated
#' with the plate name
#' @param textFontSize Font size of the text shown in the wells
#' @param ... additional arguments for the \code{plateLayout} function
#' @import data.table
#' @importFrom platetools fill_plate
#' @export
plateLayout.annotation <- function(
plateData,
plateName = character(),
valueVariable = "Dose",
textVariable = NULL,
breaks = NULL, colours = NULL,
colourWellText = "black",
layout = c(16, 24),
legend.title = "Dose",
plot.title = "Plate Annotation - ",
textFontSize = 9, ...
) {
WellID <- Label <- NULL
if (!(all(c("WellID", "SampleName") %in% colnames(plateData)))) {
stop(" 'WellID' and 'SampleName' column required in plateData object")
}
#Check WellID Format
checkWellID <- grepl("^[[:upper:]]{1,2}[[:digit:]]{1,2}$", plateData$WellID)
if(!all(checkWellID)){
stop("WellID does not have the correct format")
}
plateData[, WellID := paste0(
sub(".*([[:alpha:]]).+", "\\1", plateData$WellID),
sprintf(
"%02d", as.numeric(sub(".*[[:alpha:]](.+)", "\\1", plateData$WellID))
)
)]
plateData <- platetools::fill_plate(plateData, "WellID", plate = layout[1]*layout[2])
plateData$column <- factor(
sprintf(
"%02d",
as.numeric(sub(".*[[:alpha:]](.+)", "\\1", plateData$WellID))
),
levels = sprintf("%02d", seq(1, layout[2]))
)
plateData$row <- factor(sub(".*([[:alpha:]]).+", "\\1", plateData$WellID),
levels = LETTERS[seq(1, layout[1])])
if (!is.null(valueVariable)){
plateData[, values := as.character(plateData[, ..valueVariable][[1]])]
valueVar <- "values"
}else{
plateData[, values := "grey"]
valueVar <- "values"
colours <- setNames("grey", "grey")
}
if (is.null(colours)) {
blues <- colorRampPalette(c("#d6e0ff", "#2171B5"))
greens <- colorRampPalette(c("light green", "dark green"))
numLevels <- sort(as.numeric(as.character(unique(plateData[, values])[
grepl(
"^[[:digit:]]+([.][[:digit:]]+)?$",
trimws(unique(plateData[, values]))
)
])))
otherLevels <- sort(as.character(unique(plateData[, values])[
!grepl(
"^[[:digit:]]+([.][[:digit:]]+)?$",
trimws(unique(plateData[,values]))
)
]))
colours <- c(blues(length(numLevels)), greens(length(otherLevels)), "red")
names(colours) <- c(numLevels, otherLevels, "failed")
}
if (!is.null(textVariable)) {
plateData[,
Label := do.call(paste, c(.SD, sep = "\n ")),
.SDcols = textVariable
]
plateData[, Label := gsub("-", "-\n", Label)]
plateData[, Label := gsub("_", "_\n", Label)]
textVar <- "Label"
} else {
textVar <- NULL
}
if (is.null(breaks)){
breaks <- seq_len(length(colours))
}
plateLayout(
plateData = plateData, valueVariable = valueVar,
textVariable = textVar, plateName = plateName,
breaks = breaks, colourWellText = colourWellText,
legend.title = legend.title, layout = layout,
colours = colours, plot.title = plot.title,
textFontSize = textFontSize, ...
)
}
#' Create a heatmap of values in a plateLayout view. The values can be
#' library sizes, number of genes, qcScore (0/1) or a factor.
#' @param plateData A data.table of the values to be visualized with
#' at least the column of interest (specified in 'valueVariable')
#' and a 'WellID' column indicating the wells in the plate. The WellID
#' is a combination of a letter (row in the plate) and an integer
#' (column in the plate).
#' @param valueVariable The name of the variable in 'plateData'
#' to be visualized in a plate layout
#' @param textVariable The name of the variable in 'plateData'
#' to be shown in the wells of the plate layout. Defaults to the
#' valueVariable and if NULL, no text will be displayed.
#' @param breaks Numeric vector indicating breaks for plot coloring.
#' @param colours Colours to be used for levels specified by
#' the breaks. If NULL, a colour scheme of purples is shown.
#' @param colourWellText Colour to display the text in the wells.
#' @param layout Integer vector of length two with number of rows
#' and columns in a plate, e.g. \code{c(16,24)}
#' @param makeContourColours Logical, whether or not the plate
#' layout will contain contour colours for the wells based on the
#' parameters in 'contourColours' and 'categories'
#' @param contourVariable The variable used for the contour colouring
#' @param contourColours Character vector specifying a colour for
#' each range in 'categories'
#' @param labelsCategories Character vector specifying the names
#' (labels) for each range in 'categories'
#' @param categories If 'contourVariable' is not a factor, a numeric
#' vector specifying the categories to divide the 'varOfInterest',
#' including the lower and upper limits.
#' @param plateName The plate name
#' @param plot.title A title for the plot, will be concatenated with
#' the plate name
#' @param legend.title A title for the legend
#' @param displayHeatmap Logical, whether to display the plateLayout heatmap
#' @param saveHeatmap Logical, whether to save the plateLayout heatmap
#' @param outputDir The directory where the plateLayout heatmap should be saved
#' @param prefix The prefix to the file name of the saved plateLayout heatmap
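#' @param textFontSize Font size of the text shown in the wells
#' @param lwdContours Numeric vector specifying a line width for the
#' contour of each range in 'categories'
#' @param legendFontSize Font size of the legend labels
#' @param legendFontSizeTitle Font size of the legend title
#' @param row_split,col_split Vectors used to split the heatmap rows and
#' columns into slices, passed to \code{ComplexHeatmap::Heatmap}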
#' @importFrom platetools fill_plate
#' @importFrom RColorBrewer brewer.pal
#' @importFrom ComplexHeatmap Heatmap
#' @importFrom circlize colorRamp2
#' @importFrom grid grid.text grid.rect gpar legendGrob unit
#' @importFrom stats quantile
#' @importFrom grDevices dev.off png
#' @importFrom graphics title
#' @export
plateLayout <- function(
plateData, valueVariable, textVariable = valueVariable,
breaks = NULL, colours = NULL, colourWellText = "white", textFontSize = 6,
layout = c(16, 24), makeContourColours = FALSE, contourVariable = character(),
contourColours = c("red", "orange", "seagreen3"), lwdContours = c(1, 1, 1),
labelsCategories = c("1", "2", "3"), categories = NULL, plateName = character(),
plot.title = character(), legend.title = NULL, legendFontSize = 15,
row_split = rep("A", 16), col_split = rep("A", 24), legendFontSizeTitle = 15,
displayHeatmap = TRUE, saveHeatmap = FALSE, outputDir = ".", prefix = ""
) {
WellID <- NULL
if (!(all(c("WellID", "SampleName") %in% colnames(plateData)))) {
stop(" 'WellID' and 'SampleName' column required in plateData object")
}
plateData[, WellID := paste0(
sub(".*([[:alpha:]]).+", "\\1", plateData$WellID),
sprintf(
"%02d",
as.numeric(sub(".*[[:alpha:]](.+)", "\\1", plateData$WellID))
)
)]
plateData <- platetools::fill_plate(plateData, "WellID", plate = 384)
plateData$column <- factor(
sprintf("%02d", as.numeric(
sub(".*[[:alpha:]](.+)", "\\1", plateData$WellID)
)),
levels = sprintf("%02d", seq(1, layout[2]))
)
plateData$row <- factor(sub(".*([[:alpha:]]).+", "\\1", plateData$WellID),
levels = LETTERS[seq(1, layout[1])])
plateValues <- plateLayoutFormat(
plateData,
varOfInterest = valueVariable,
rows = layout[1],
cols = layout[2]
)
if (!is.null(textVariable)) {
plateText <- plateLayoutFormat(
plateData, varOfInterest = textVariable,
rows = layout[1],
cols = layout[2]
)
}
plot.title <- gsub(
"^([a-z])", "\\U\\1",
gsub("([A-Z])", " \\1",
plot.title, perl = TRUE), perl = TRUE
)
mainTitle <- paste0(plot.title, plateName)
# Wells without a contour category stay NA so they can be skipped when drawing
plateContourColours <- matrix(NA_character_, nrow = layout[1], ncol = layout[2])
if (makeContourColours) {
contourData <- plateData[WellType %in% c("nonEmpty", "Treated Wells"), ]
if (is.numeric(contourData[, ..contourVariable][[1]])) {
# Bin the numeric variable; include.lowest = TRUE keeps the lower limit
contourData$contours <- cut(
contourData[, ..contourVariable][[1]],
categories, include.lowest = TRUE,
right = TRUE,
labels = labelsCategories)
} else {
contourData$contours <- contourData[, ..contourVariable][[1]]
}
names(contourColours) <- labelsCategories
names(lwdContours) <- labelsCategories
for (i in seq_len(layout[1])) {
for (j in seq_len(layout[2])) {
tryCatch({
sampleHit <- which(
as.character(contourData$WellID) == paste0(
LETTERS[i], sprintf("%02d", j)
)
)
if (length(sampleHit) == 1) {
plateContourColours[i, j] <- as.character(
contourData[sampleHit,'contours'][[1]]
)
}
},
error = function(e) {
print(paste0(LETTERS[i], sprintf("%02d", j), " is missing."))
}
)
}
}
}
plateValues$contours <- plateContourColours
colnames(plateValues$values) <- seq_len(ncol(plateValues$values))
if (is.null(breaks)) {
breakValues <- plateValues$values
breakValues[which(is.na(breakValues))] <- 0
if (all(breakValues >= 0)) {
breaks <- computeBreaks(7, max(plateValues$values, na.rm = TRUE))
} else {
breaks <- quantile(plateValues$values, probs = seq(0, 1, 0.125), na.rm = TRUE)
}
}
if (is.null(colours)) {
colours <- tryCatch({
circlize::colorRamp2(
breaks = breaks,
colors = brewer.pal(length(breaks), "Purples")
)
},
error = function(cond){
message("Recomputed breaks for proper colour mapping")
breakValues <- plateValues$values
breakValues[which(is.na(breakValues))] <- 0
if (all(breakValues >= 0)) {
breaks <- computeBreaks(7, max(plateValues$values, na.rm = TRUE))
} else {
breaks <- quantile(plateValues$values, probs = seq(0, 1, 0.125), na.rm = TRUE)
}
circlize::colorRamp2(
breaks = breaks,
colors = brewer.pal(length(breaks), "Purples")
)
})
}
ht <- Heatmap(
plateValues$values,
column_title = mainTitle, column_title_side = "top",
rect_gp = gpar(lwd = 0.4),
cluster_rows = FALSE, cluster_columns = FALSE,
col = colours, row_title = NULL,
row_split = row_split, column_split = col_split,
row_names_side = "left",
cluster_row_slices = FALSE,
cluster_column_slices = FALSE,
show_heatmap_legend = TRUE,
heatmap_legend_param = list(
title = ifelse(
is.null(legend.title),
paste0(valueVariable, "\n"),
paste0(legend.title, "\n")
),
grid_height = unit(9, "mm"), border = "black",
labels_gp = gpar(fontsize = legendFontSize),
title_gp = gpar(fontsize = legendFontSizeTitle)
),
cell_fun = function(j, i, x, y, width, height, fill) {
if (is.na(plateValues$values[i, j])) {
grid.rect(
x, y, width, height,
gp = gpar(fill = "white", alpha = 0.7, lwd = 0.7, col = "white")
)
}
else if (!is.null(textVariable)) {
grid.text(
plateText$values[i, j], x, y,
just = "centre",
gp = gpar(fontsize = textFontSize, col = colourWellText)
)
}
if (makeContourColours) {
if (!is.na(plateValues$contours[i, j])) {
grid.rect(
x, y, width, height,
gp = gpar(
col = contourColours[as.character(plateValues$contours[i, j])],
fill = NA,
lwd = lwdContours[as.character(plateValues$contours[i, j])]
)
)
}
}
}
)
if (displayHeatmap) {
print(ht)
}
if (saveHeatmap) {
png(
file.path(
outputDir,
paste0(prefix, gsub(" |-", "", plot.title), "_", plateName, ".png")
),
width = 30, height = 10, units = "cm", res = 1200
)
print(ht)
dev.off()
}
return(ht)
}
#' Return numerical matrix with number of reads that corresponds to the
#' plate layout
#' @param data A data.table of the values to be visualized with at least
#' the column of interest (specified in 'varOfInterest') and a 'WellID' column
#' indicating the wells in the plate. The WellID is a combination of a
#' letter (row in the plate) and a zero-padded two-digit integer (column in
#' the plate), e.g. 'A01'.
#' @param varOfInterest The name of the variable in 'data' to be visualized
#' in a plate layout
#' @param rows number of rows in a plate layout
#' @param cols number of columns in a plate layout
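#' @param verbose if \code{TRUE}, samples missing from the plate
#' will be reported
#' @return A list with a single element 'values': a matrix with one row
#' per plate row (named by letter) and one column per plate column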
#' @export
plateLayoutFormat <- function(
data, varOfInterest,
rows = 16, cols = 24,
verbose = FALSE
) {
plateValues <- matrix(NA, nrow = rows, ncol = cols)
for (i in seq_len(rows)) {
for (j in seq_len(cols)) {
tryCatch({
sampleHit <- which(
as.character(data$WellID) == paste0(LETTERS[i], sprintf("%02d", j))
)
if (length(sampleHit) == 1) {
plateValues[i, j] <- data[sampleHit, ..varOfInterest][[1]]
}
},
error = function(e) {
if (isTRUE(verbose)) {
print(paste0(LETTERS[i], sprintf("%02d", j), " is missing."))
}
}
)
}
}
row.names(plateValues) <- LETTERS[seq_len(rows)]
return(list("values" = plateValues))
}
#' Helper function to automate break selection for raw count data
#'
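#' This function creates an exponentially increasing vector of a given number
#' of breaks between zero and the maximum of the data. It is particularly
#' useful for raw counts or raw counts per million.
#'
#' @param nBreaks Number of breaks to be generated
#' @param variable Numeric vector of data entries; its maximum (ignoring
#' missing values) is used as the largest break. If it holds a single unique
#' value, the breaks are 0, 0.5 and that value (or 1 if the value is below 1).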
#' @export
computeBreaks <- function(nBreaks, variable) {
maxElement <- max(variable, na.rm = TRUE)
if (length(unique(variable)) == 1) {
breaks <- c(0, 0.5, ifelse(maxElement < 1, 1, maxElement))
} else {
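# Solve a + b = 0 and a + (nBreaks - 1) * b = log(maxElement) for (a, b),
# so that exp(a + k * b) runs from 1 (k = 1) to maxElement (k = nBreaks - 1)
coefSystem <- solve(
rbind(c(1, 1), c(1, (nBreaks - 1)))
) %*% c(0, log(maxElement))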
coefExp <- c(exp(coefSystem[1]), coefSystem[2])
breaks <- coefExp[1] * exp((1:(nBreaks - 1)) * coefExp[2])
breaks <- unique(c(0, breaks))
}
return(breaks)
}

33
src/report/script.R Normal file
View File

@@ -0,0 +1,33 @@
library(whisker)
library(logger)
log_info("Setting temporary directory to: {meta$temp_dir}")
Sys.setenv(TMP = meta$temp_dir)
temp_folder <- tempdir(check = TRUE)
log_info("Created temporary directory {temp_folder}")
template <- file.path(meta$resources_dir, "template.Rmd")
esets_normalized <- lapply(par$eset, function(eset_path) {
return(file.path(normalizePath(dirname(eset_path)), basename(eset_path)))
})
log_info(paste0(
"Rendering markdown {template} to HTML ",
"{par$output_report} with esets {paste(esets_normalized, collapse = ', ')}"
))
rmarkdown::render(
normalizePath(template),
output_file = basename(par$output_report),
output_dir = dirname(par$output_report),
runtime = "static",
intermediates_dir = par$report_dir,
clean = TRUE,
params = list(
esets = esets_normalized,
outputDir = par$report_dir
)
)
log_info("Done")

977
src/report/template.Rmd Normal file
View File

@@ -0,0 +1,977 @@
---
title: "Exploratory Data Report"
date: "`r format(Sys.time(), '%d %B, %Y')`"
editor_options:
chunk_output_type: console
output:
oaStyle::html_report
# parameters which are overwritten by the script
params:
outputDir: 'output/'
esets:
- sample1.rds
- sample2.rds
---
<!---
Copy this template in your working directory (where you want to run the report).
This template can be used as a starting document to run a preliminary DRUGseq report
-->
<!---
Use full page width
-->
<style type="text/css">
div.main-container {
max-width: 1600px !important;
margin-left: auto;
margin-right: auto;
}
</style>
```{r params, eval = TRUE, include = FALSE}
outputDir <- params$outputDir
esets <- params$esets
```
```{r outputDir, echo = FALSE}
## Required: ABSOLUTE outputDir
outputDir <- file.path(outputDir)
# When working on a windows computer it should be
# "/Users/..." instead of "C:/Users/..."
if (.Platform$OS.type == "windows") {
outputDir <- paste0(
"/",
paste(
unlist(strsplit(outputDir, split = "/"))[-1], collapse = "/"
),
"/"
)
}
```
```{r optionsChunkDoNotModify, echo = FALSE, message = FALSE, warning=FALSE}
## Chunk with options for knitr. This chunk should not be modified.
knitr::opts_chunk$set(
eval = TRUE,
echo = FALSE,
message = FALSE,
cache = FALSE,
warning = FALSE,
error = FALSE,
comment = NA, #"#",
tidy = FALSE,
collapse = TRUE,
out.width = "100%",
fig.width = 20,
fig.height = 10,
results = "asis")
knitr::opts_knit$set(root.dir = getwd())
options(warn = 1, width = 200)
```
```{r libraries_and_functions}
source("plateLayouts.R")
library(ComplexHeatmap)
library(data.table)
library(ggplot2)
library(knitr)
library(Biobase)
library(gridExtra)
library(RColorBrewer)
```
```{r dataImport}
# Create esetList
esetList <- sapply(
esets, simplify = FALSE,
USE.NAMES = TRUE,
function(eset_raw) {
if (!file.exists(eset_raw)) {
stop(paste0("Provided path '", eset_raw, "' is not a file."))
}
eset <- readRDS(eset_raw)
}
)
pools <- sapply(esetList, function(eset) {
unique(eset$PoolName)
})
names(esetList) <- unlist(pools)
# Create qcData
pDataList <- lapply(esetList, function(eset) data.table(pData(eset)))
qcData <- rbindlist(pDataList, fill = TRUE)
textVars <- "SampleName"
annotationVar <- "PoolName"
if (!"SampleName" %in% names(qcData)) {
qcData[, SampleName := paste0(PoolName, "_", WellBC)]
}
qcData[, log10LibSize := round(log10(NumberOfInputReads))]
qcData[, (annotationVar) := lapply(.SD, as.factor), .SDcols = annotationVar]
colourList <- list()
Design_levels <- sort(
as.character(unique(qcData[, ..annotationVar][[1]])),
decreasing = TRUE
)
if (length(Design_levels) == 1) {
colours <- c("#d6e0ff", "lightgrey")
names(colours) <- c(Design_levels, "Empty")
colourList[[annotationVar]] <- list(
"colours" = colours,
"annotVar" = annotationVar,
"text" = textVars
)
} else if (length(Design_levels) == 2) {
colours <- c("#d6e0ff", "#FF9999")
names(colours) <- c(Design_levels)
colourList[[annotationVar]] <- list(
"colours" = colours,
"annotVar" = annotationVar,
"text" = textVars
)
} else if (length(Design_levels) <= 20) {
if (length(Design_levels) > 12) {
colours <- c(
brewer.pal(12, "Set3"),
brewer.pal((length(Design_levels) - 12),
"Pastel2")
)
} else {
colours <- c(brewer.pal(length(Design_levels), "Set3"))
}
names(colours) <- c(Design_levels)
colourList[[annotationVar]] <- list(
"colours" = colours,
"annotVar" = annotationVar,
"text" = textVars
)
} else {
colours <- c("#d6e0ff")
names(colours) <- c("nonEmpty")
colourList[[annotationVar]] <- list(
"colours" = colours,
"annotVar" = annotationVar,
"text" = annotationVar
)
}
```
# Pool Description
For each pool in this study, several pool layout plots are shown, based on the
* number of STAR input reads (= library size)
* log10 transformed number of STAR input reads
* number of mapped reads
* number of chromosomal reads
* number of detected UMIs
* number of detected genes
* percentage of mitochondrial reads
* percentage of ERCC reads
> The values for the different samples within each pool are expected to be comparable if the content of the different pools is equally diverse.
```{r plateAnnotation, out.width = "100%",fig.width = 20, fig.height= 10}
plateVars <- c("NumberOfInputReads", "log10LibSize", "NumberOfMappedReads",
"NumberOfChromReads", "NumberOfUMIs", "NumberOfGenes",
"pctMT", "pctERCC")
breaksVars <- lapply(
plateVars,
function(var) {
# Pass the column as a plain vector so that unique() counts distinct values
computeBreaks(7, qcData[[var]])
}
)
names(breaksVars) <- plateVars
for (pool in pools){
cat("\n\n")
cat(paste0("## ", pool, " {.tabset} \n\n"))
poolData <- qcData[PoolName == pool]
lapply(plateVars, function(plateVar) {
cat("\n\n")
cat(sprintf("### %s {.unnumbered}", plateVar))
cat("\n\n")
plateLayout(
poolData, valueVariable = plateVar,
textFontSize = 10, legendFontSize = 12,
plateName = pool, plot.title = paste0(plateVar, " - "),
legend.title = plateVar, breaks = breaksVars[[plateVar]]
)
cat("\n\n")
})
cat("\n\n")
}
```
<br>
# Data Distributions
## Reads Distributions {.tabset}
The box plots below show, per pool, the distribution across samples of:
* the number of STAR input reads
* the number and percentage of STAR mapped reads
* the number and percentage of chromosomal reads
* the number of detected UMIs
* the number of detected genes
> The distributions contribute to the QC metrics mentioned in Par 3. The higher these values, the better.
> The data range for the different plates is expected to be comparable if the content of the different plates is equally diverse.
### Number of Input Reads {.tabset .unnumbered}
```{r settings_1}
nColPlots <- 1
figHeight <- 7
```
#### Distribution {.tabset .unnumbered}
```{r boxplots_input_plate, fig.height = figHeight}
ggplot(
qcData,
aes(
x = PoolName,
y = NumberOfInputReads, colour = PoolName
)
) + geom_boxplot() + ylab("Number of Input Reads") +
ggtitle("Number of Input Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
### Number of Mapped Reads {.tabset .unnumbered}
#### Distribution {.unnumbered}
```{r boxplots_mapped_plate, fig.height = figHeight}
ggplot(
qcData,
aes(x = PoolName, y = NumberOfMappedReads, colour = PoolName)
) + geom_boxplot() + ylab("Number of Mapped Reads") +
ggtitle("Number of Mapped Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
#### pct Mapped Reads {.unnumbered}
```{r boxplots_pctMapped_plate, fig.height = figHeight}
ggplot(
qcData,
aes(x = PoolName, y = PctMappedReads, colour = PoolName)
) +
geom_boxplot() +
ylab("pct Mapped Reads") +
ggtitle("pct Mapped Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
### Number of Chromosomal Reads {.tabset .unnumbered}
#### Distribution {.unnumbered}
```{r boxplots_chrom_plate, fig.height = figHeight}
ggplot(
qcData,
aes(x = PoolName, y = NumberOfChromReads, colour = PoolName)
) + geom_boxplot() + ylab("Number of Chromosomal Reads") +
ggtitle("Number of Chromosomal Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
#### pct Chromosomal Reads {.unnumbered}
```{r boxplots_pctChrom_plate, fig.height = figHeight}
ggplot(
qcData,
aes(x = PoolName, y = pctChrom, colour = PoolName)
) + geom_boxplot() + ylab("pct Chromosomal Reads") +
ggtitle("pct Chromosomal Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
### Number of UMIs {.tabset .unnumbered}
#### Distribution {.tabset .unnumbered}
```{r boxplots_umi_plate, fig.height = figHeight}
ggplot(
qcData,
aes(x = PoolName, y = NumberOfUMIs, colour = PoolName)
) + geom_boxplot() + ylab("Number of UMIs") +
ggtitle('Number of UMIs') +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
#### Density distribution {.unnumbered}
```{r density_numberOfUMIs}
## Pre-filtering data exploration
dt_plot <- melt(
qcData,
id.vars = c("SampleName", "PoolName", "WellID"),
measure.vars = c("NumberOfInputReads", "NumberOfMappedReads", "NumberOfUMIs")
)
readsDensity_plot <- ggplot(dt_plot, aes(value))
readsDensity_plot <- readsDensity_plot +
geom_density(aes(fill = variable), alpha=0.8) +
facet_grid(~ PoolName, scales = "free_x", space = "fixed", drop = TRUE) +
geom_vline(
xintercept = 5e5,
linetype = "dashed",
color = "steelblue3", size = 2
) +
annotate(
"text",
x = 3.5e5, y = 2e-6, label = "500k",
angle = 90, color = "steelblue3", size = 10
) +
geom_vline(
xintercept = 1.5e6, linetype = "dashed",
color = "forestgreen", size = 2
) +
annotate(
"text", x = 1.35e6, y = 2e-6, label = "1.5M",
angle = 90, color = "forestgreen", size = 10
) +
labs(
title = "Density plot",
subtitle = paste0(
"# Samples with NumberOfMappedReads > 1.5M: ",
length(which(qcData$NumberOfMappedReads > 1.5e6)),
"\n# Samples with NumberOfUMIs > 500k: ",
length(which(qcData$NumberOfUMIs > 5e5))
),
caption = paste0("# Total samples (after removing empty): ", nrow(qcData)),
x = "Count",
fill = "Variable"
) +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 5),
axis.text.x = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 17),
plot.caption = element_text(size = 15),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()
)
readsDensity_plot
```
### Number of Genes {.tabset .unnumbered}
#### Distribution {.unnumbered}
```{r boxplots_genes_plate, fig.height = figHeight}
ggplot(
qcData,
aes(x = PoolName, y = NumberOfGenes, colour = PoolName)
) +
geom_boxplot() + ylab("Number of Genes") +
ggtitle("Number of Genes") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
## {.tabset .toc-ignore .unnumbered}
In addition, several scatter plots visualize the efficiency of the reads-to-genes translation:
* the number of mapped reads vs the number of input reads
* the number of mapped reads vs the number of chromosomal reads
* the number of mapped reads vs the number of UMIs
* the number of genes vs the number of mapped reads
* the number of genes vs the number of chromosomal reads
### Mapping Efficiency {.tabset .unnumbered}
#### Number of Input Reads {.unnumbered}
```{r mapping_efficiency_1_plate, fig.height = 7}
ggplot(
qcData,
aes(x = NumberOfInputReads, y = NumberOfMappedReads, colour = PoolName)
) +
geom_point() +
xlab("Number of Input Reads") +
ylab("Number of Mapped Reads") +
ggtitle("Number of Mapped Reads vs Number of Input Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
#### Number of Chromosomal Reads {.unnumbered}
```{r mapping_efficiency_2_plate, fig.height = 7}
ggplot(
qcData,
aes(x = NumberOfChromReads, y = NumberOfMappedReads, colour = PoolName)
) + geom_point() +
xlab("Number of Chromosomal Reads") + ylab("Number of Mapped Reads") +
ggtitle("Number of Mapped Reads vs Number of Chromosomal Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
#### Number of UMIs {.unnumbered}
```{r mapping_efficiency_4_plate, fig.height = 7}
ggplot(
qcData,
aes(x = NumberOfUMIs, y = NumberOfMappedReads, colour = PoolName)
) + geom_point() +
ylab("Number of Mapped Reads") + xlab("Number of UMIs") +
ggtitle("Number of Mapped Reads vs Number of UMIs") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
### Counting Efficiency {.tabset .unnumbered}
#### Number of Mapped Reads {.unnumbered}
```{r gene_efficiency_1_plate, fig.height = 7}
ggplot(
qcData,
aes(x = NumberOfMappedReads, y = NumberOfGenes, colour = PoolName)
) + geom_point() +
ylab("Number of Genes") + xlab("Number of Mapped Reads") +
ggtitle("Number of Genes vs Number of Mapped Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
#### Number of Chromosomal Reads {.unnumbered}
```{r gene_efficiency_2_plate, fig.height = 7}
ggplot(
qcData,
aes(x = NumberOfChromReads, y = NumberOfGenes, colour = PoolName)
) + geom_point() +
ylab("Number of Genes") + xlab("Number of Chromosomal Reads") +
ggtitle("Number of Genes vs Number of Chromosomal Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
## Sequencing Saturation {.tabset}
The barplots below represent the sequencing saturation per sample as determined by STAR, split per pool.
The HT-RNAseq platform aims for shallow sequencing resulting in relatively low sequencing saturations of 10-20%.
In addition, the sequencing saturation vs the number of input reads is shown.
### Sequencing Saturation {.unnumbered}
```{r sequencingSaturation, fig.height = figHeight}
ggplot(
qcData,
aes(x = WellID, y = SequencingSaturation, fill = PoolName)
) + geom_bar(stat = "identity", position = "dodge") +
xlab("Samples") + ggtitle("Sequencing Saturation per Sample") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(1, "lines"),
text = element_text(size = 10),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15),
axis.text.x = element_blank(),
axis.text.y = element_text(size = 15),
axis.ticks.x = element_blank()
)
```
### Sequencing Saturation - Input Reads {.unnumbered}
```{r sequencingSaturation_inputReads, fig.height = figHeight}
ggplot(
qcData,
aes(x = NumberOfInputReads, y = SequencingSaturation, colour = PoolName)
) + geom_point() +
ggtitle("Sequencing Saturation vs Number of Input Reads") +
theme(strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
### Sequencing Saturation - Chromosomal Reads {.unnumbered}
```{r sequencingSaturation_mappedReads, fig.height = figHeight}
ggplot(
qcData,
aes(x = NumberOfChromReads, y = SequencingSaturation, colour = PoolName)
) + geom_point() +
ggtitle("Sequencing Saturation vs Number of Chromosomal Reads") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
<br>
## Genomic Origin {.tabset}
The 3 boxplots below represent, per pool, the distributions of the percentage of reads mapping to:
* chromosomal regions
* mitochondrial regions
* ERCC spike-ins
The 4th plot summarises the above results per pool, using the median across samples.
> The percentage ERCC contributes to the QC metrics mentioned in Par 3. This value is ideally as low as possible (but non-zero, to ensure that the spike-ins have been added) and comparable across the different pools.
### pctChrom {.tabset .unnumbered}
```{r genomicOrigin_chrom_plate, fig.height = figHeight}
ggplot(
qcData, aes(x = PoolName, y = pctChrom, colour = PoolName)
) +
geom_boxplot() +
ggtitle("pctChrom") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
### pctMT {.tabset .unnumbered}
```{r genomicOrigin_mt_plate, fig.height = figHeight}
ggplot(
qcData,
aes(x = PoolName, y = pctMT, colour = PoolName)
) +
geom_boxplot() + ggtitle("pctMT") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
### pctERCC {.tabset .unnumbered}
```{r genomicOrigin_ercc_plate, fig.height = figHeight}
ggplot(qcData, aes(x = PoolName, y = pctERCC, colour = PoolName)) +
geom_boxplot() +
ggtitle("pctERCC") +
theme(
strip.text.x = element_text(size = 20),
panel.spacing = unit(2, "lines"),
text = element_text(size = 10),
axis.text.y = element_text(angle = 90, size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title.y = element_text(size = 15),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
```
### Genomic Summary {.tabset .unnumbered}
```{r genomicOrigin_summary_plate}
# Summarise the genomic origin per pool using the median across samples
medianPctData <- qcData[, .(
"pctChrom" = median(pctChrom),
"pctMT" = median(pctMT),
"pctERCC" = median(pctERCC)
), by = PoolName]
medianPctDataLong <- melt(
medianPctData,
id.vars = "PoolName",
measure.vars = c("pctChrom", "pctMT", "pctERCC"),
variable.name = "Origin", value.name = "pct"
)
ggplot(
medianPctDataLong,
aes(fill = Origin, y = pct, x = PoolName)) +
geom_bar(position = "stack", stat = "identity") +
ggtitle("Genomic Origin") +
theme(
text = element_text(size = 10),
axis.text = element_text(angle = 90, size = 15),
plot.title = element_text(size = 18),
legend.text = element_text(size = 15),
legend.title = element_text(size = 17),
axis.title = element_text(size = 15)
)
```
# Depletion {.tabset}
<div align="center">
```{r depletion}
for (eset_name in pools) {
cat("\n\n")
cat(paste0("## ", eset_name, " {.unnumbered}"))
cat("\n\n")
eset <- esetList[[eset_name]]
average_reads <- sort(apply(exprs(eset), 1, mean), decreasing = TRUE)
plotData <- data.table(
ENSGID = names(average_reads),
av_count = average_reads
)
gen_descript <- data.table(
ENSGID = fData(eset)$gene_id,
Description = fData(eset)$GENENAME
)
order_gen_descript <- gen_descript[
match(plotData$ENSGID, gen_descript$ENSGID),
]
g <- ggplot(
plotData[c(1:100)],
aes(x = reorder(ENSGID, -av_count), y = av_count)
) + geom_bar(stat = "identity") +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 12),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 15),
legend.title = element_text(size = 15),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20)
) + ylab("Average Counts") + xlab("Genes")
print(g)
cat("\n\n")
cat("<br>")
cat("<br>")
print(htmltools::tagList((DT::datatable(order_gen_descript[1:100, ]))))
}
```
</div>
<br>
<br>
<br>
<br>
# Glossary {.unnumbered}
## Read {.unlisted .unnumbered}
A read is an oligonucleotide (a short RNA fragment) that has been sequenced. It consists of a fixed number of base pairs (bp) and therefore has a specific read length.
## Input Read {.unlisted .unnumbered}
Each read of the fastq file used as input to the STAR aligner is considered an input read.
## Read With Valid Barcode {.unlisted .unnumbered}
A read with a valid barcode is a read for which the barcode matches the whitelist of barcodes within the allowed number of mismatches. The number of reads with a valid barcode is lower than or equal to the number of input reads.
## Mapped Read {.unlisted .unnumbered}
A read that has been aligned against the reference genome and for which one or more suitable matching locations have been found is a mapped read. Depending on the number of allowed mismatches, this might or might not be an exact match. The number of mapped reads is lower than or equal to the number of reads with a valid barcode.
## Uniquely Mapped Read {.unlisted .unnumbered}
A read for which one and only one suitable matching location in the reference genome was found is a uniquely mapped read. The number of uniquely mapped reads is lower than or equal to the number of mapped reads.
## Counted Read {.unlisted .unnumbered}
A mapped read will only be counted if it overlaps (by 1 nucleotide or more) with one and only one gene. The number of counted reads is lower than or equal to the number of (uniquely) mapped reads.
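
A minimal sketch of how the number of counted reads can be derived from a STAR `ReadsPerGene.out.tab` file: the summary rows (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`) are skipped and the unstranded counts of the gene rows are summed. The function name `counted_reads` is hypothetical, and filtering on the `N_` prefix instead of skipping the first four rows is a choice made for this sketch.

```python
import csv
import io

# Contents modelled on a small ReadsPerGene.out.tab file: gene ID, unstranded
# count, and the two strand-specific counts, separated by tabs.
READS_PER_GENE = """\
N_unmapped\t11111\t22222\t33333
N_multimapping\t0\t0\t0
N_noFeature\t44444\t55555\t66666
N_ambiguous\t77777\t88888\t99999
gene1\t2\t0\t0
gene2\t0\t0\t0
gene3\t6\t0\t6
gene5\t9\t6\t3
"""


def counted_reads(handle) -> int:
    """Sum the unstranded counts (second column) over all gene rows."""
    rows = csv.reader(handle, delimiter="\t")
    return sum(int(row[1]) for row in rows if row and not row[0].startswith("N_"))


print(counted_reads(io.StringIO(READS_PER_GENE)))  # 17
```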
## UMIs {.unlisted .unnumbered}
Unique molecular identifiers (UMIs) are short sequences used to uniquely tag each molecule in a sample library. Sequencing with UMIs allows bioinformatics software to filter out duplicate reads and PCR errors with a high level of accuracy and to report unique reads.
The reported number of UMIs is the number of distinct UMIs among the set of reads that map to a unique gene, i.e. the reads are deduplicated.
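
The deduplication can be illustrated with a short sketch that counts distinct (gene, UMI) pairs. The read list is hypothetical, and only exact duplicates are collapsed here; an aligner may additionally merge UMIs that differ by a sequencing error, depending on its settings.

```python
from collections import Counter

# Hypothetical (gene, UMI) pairs, one per read that maps to a unique gene.
reads = [
    ("geneA", "AACGT"),
    ("geneA", "AACGT"),  # PCR duplicate of the read above
    ("geneA", "GGTCA"),
    ("geneB", "AACGT"),  # same UMI, different gene: a distinct molecule
]

# Each distinct (gene, UMI) pair is counted once.
umis_per_gene = Counter(gene for gene, _umi in set(reads))
number_of_umis = sum(umis_per_gene.values())

print(number_of_umis)  # 3
```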
## pctERCC {.unlisted .unnumbered}
The percentage of reads mapping to the ERCC genes among the total number of **mapped** reads.
## pctMT {.unlisted .unnumbered}
The percentage of reads mapping to the MT genes among the total number of **mapped** reads.
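
Both percentages follow the same pattern, sketched below with hypothetical per-chromosome read counts for a single well. The helper `percentage`, the chromosome names, and the choice to return NaN for a well without reads are assumptions made for this example.

```python
def percentage(part: int, total: int, digits: int = 2) -> float:
    """Share of `part` in `total` as a percentage; NaN when total is 0."""
    if total == 0:
        return float("nan")
    return round(part / total * 100, digits)


# Hypothetical number of mapped reads per chromosome for one well.
reads_per_chrom = {"1": 700, "2": 200, "MT": 60, "ERCC-00002": 30, "ERCC-00003": 10}

total = sum(reads_per_chrom.values())
mt_reads = reads_per_chrom.get("MT", 0)
ercc_reads = sum(n for name, n in reads_per_chrom.items() if name.startswith("ERCC"))

print(percentage(mt_reads, total))    # 6.0
print(percentage(ercc_reads, total))  # 4.0
```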
## Sequencing Saturation {.unlisted .unnumbered}
The sequencing saturation measures the fraction of the library complexity that has been captured by sequencing. The inverse of one minus the sequencing saturation can be interpreted as the number of additional reads it would take to detect a new transcript. Consequently, a low sequencing saturation indicates shallow sequencing, in which a new transcript can be discovered with only a few additional reads.
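
A worked example, assuming the commonly used definition saturation = 1 - UMIs / reads, where both counts refer to reads assigned to a unique gene. This definition is an assumption of the sketch, although it reproduces the value that STARsolo reports for the counts used below; the function name is hypothetical.

```python
def sequencing_saturation(n_reads: int, n_umis: int) -> float:
    """Fraction of reads that duplicate an already observed molecule."""
    if n_reads == 0:
        return float("nan")  # STARsolo reports -nan when no reads are counted
    return 1 - n_umis / n_reads


# Counts as found in a STARsolo Summary.csv:
# 'Reads in Cells Mapped to Unique Genes' and 'UMIs in Cells'.
saturation = sequencing_saturation(n_reads=53602, n_umis=50370)
reads_per_new_umi = 1 / (1 - saturation)

print(f"{saturation:.6g}")         # 0.0602963
print(f"{reads_per_new_umi:.3f}")  # 1.064
```

With a saturation of about 6%, roughly 1.06 additional reads are needed per newly detected molecule, which indicates that the library is far from exhausted.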
<br>
<br>
<br>
<br>
<center>
![](OutputSTARsolo.png)
</center>
<br>
<br>

41
src/report/test.R Normal file
View File

@@ -0,0 +1,41 @@
library(whisker)
library(testthat)
library(R.utils)
cat(">> Creating temporary directory \n")
Sys.setenv(TMP = meta$temp_dir)
temp_folder <- tempdir(check = TRUE)
cat(">> Running component create_report for test case \n")
input_dir <- file.path(meta$resources_dir, "test_data")
stopifnot(file.exists(input_dir))
out <- processx::run(meta$executable, c(
"--eset", file.path(meta$resources_dir, "test_data", "eset.sample_one.rds"),
"--eset", file.path(meta$resources_dir, "test_data", "eset.sample_two.rds"),
"--output_report", "report.html"
))
expect_equal(out$status, 0)
expect_true(file.exists("report.html"))
cat(">> Test successful \n")
cat(">> Running component create_report with symbolic links \n")
link_sample_1 <- file.path(temp_folder, "eset.sample_one.rds")
link_sample_2 <- file.path(temp_folder, "eset.sample_two.rds")
createLink(link = link_sample_1,
target = file.path(meta$resources_dir, "test_data", "eset.sample_one.rds"))
createLink(link = link_sample_2,
target = file.path(meta$resources_dir, "test_data", "eset.sample_two.rds"))
out <- processx::run(meta$executable, c(
"--eset", link_sample_1,
"--eset", link_sample_2,
"--output_report", "report2.html"
))
expect_true(file.exists("report2.html"))

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,72 @@
name: combine_star_logs
namespace: "stats"
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ author, maintainer ]
argument_groups:
  - name: "Arguments"
    arguments:
      - name: "--barcodes"
        type: string
        multiple: true
        required: true
        description: |
          Barcodes corresponding to the respective log files.
      - name: "--star_logs"
        type: file
        multiple: true
        required: true
        description: |
          Paths to the STAR log files (most frequently called Log.final.out)
        direction: input
        example: "Log.final.out"
      - name: "--gene_summary_logs"
        direction: input
        type: file
        multiple: true
        required: true
        description: |
          Paths to the Summary.csv files from the STAR Solo output. Can be found in
          the 'Solo.out/Gene' folder relative to the root of the STAR output directory.
        example: "Summary.csv"
      - name: "--reads_per_gene_logs"
        direction: input
        type: file
        multiple: true
        required: true
        description: |
          Paths to the 'ReadsPerGene.out.tab' files as output by STAR.
      - name: "--output"
        type: file
        direction: output
        default: "starLogs.txt"
        description: |
          Tab-delimited file describing for each barcode (as the rows), the metrics (as columns)
          gathered from the different input files.
resources:
  - type: python_script
    path: script.py
test_resources:
  - type: python_script
    path: test.py
  - path: test_data
engines:
  - type: docker
    image: python:3.12-slim
    setup:
      - type: apt
        packages:
          - procps
      - type: python
        packages:
          - pandas
    test_setup:
      - type: python
        packages:
          - viashpy
runners:
  - type: executable
  - type: nextflow

View File

@@ -0,0 +1,228 @@
import logging
import pandas as pd
from itertools import batched, starmap

### VIASH START
meta = {
    "name": "combine_star_logs",
}
par = {
    "star_logs": ["src/stats/combine_star_logs/test_data/barcode_1/Log.final.out",
                  "src/stats/combine_star_logs/test_data/barcode_2/Log.final.out"],
    "gene_summary_logs": ["src/stats/combine_star_logs/test_data/barcode_1/summary.csv",
                          "src/stats/combine_star_logs/test_data/barcode_2/summary.csv"],
    "reads_per_gene_logs": ["src/stats/combine_star_logs/test_data/barcode_1/ReadsPerGene.out.tab",
                            "src/stats/combine_star_logs/test_data/barcode_2/ReadsPerGene.out.tab"],
    "output": "output.txt",
    "barcodes": ["ACGG", "TTTT"],
}
### VIASH END

logger = logging.getLogger()
console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
logger.setLevel(logging.DEBUG)


def handle_percentages(column_value):
    # TODO: handle this more gracefully
    if column_value:
        return column_value.strip('%')
    return column_value


def star_log_to_dataframe(barcode: str, log_path) -> pd.DataFrame:
    logger.info("Reading STAR log %s for barcode '%s'", log_path, barcode)
    result = pd.read_table(log_path, sep=r"\|\t+", converters={"Value": handle_percentages},
                           engine="python", header=None, skip_blank_lines=True,
                           skipinitialspace=True, names=["Category", "Value"], index_col=0,
                           skiprows=[0, 1, 2])
    logger.info("Read %d row(s) and %d column(s) from STAR logs at %s",
                *result.shape, log_path)
    return result


def summary_to_dataframe(barcode: str, summary_path) -> pd.DataFrame:
    logger.info("Reading summary log %s for barcode %s", summary_path, barcode)
    result = pd.read_table(summary_path, sep=",",
                           header=None, names=["Category", "Value"],
                           index_col=0, dtype=pd.StringDtype())
    logger.info("Read %d row(s) and %d column(s) from summary file at %s",
                *result.shape, summary_path)
    return result


def reads_per_gene_to_dataframe(barcode, read_per_gene_path) -> pd.DataFrame:
    logger.info("Reading reads per gene file %s for barcode %s", read_per_gene_path, barcode)
    result = pd.read_table(read_per_gene_path, skiprows=[0, 1, 2, 3], header=None, sep="\t",
                           dtype={"geneID": pd.StringDtype(),
                                  "Unstranded": pd.Int64Dtype(),
                                  "posStrand": pd.Int64Dtype(),
                                  "negStrand": pd.Int64Dtype()},
                           index_col=0, names=["geneID", "Unstranded", "posStrand", "negStrand"])
    result = result[["Unstranded"]]  # Do not use .loc here because we need a DataFrame, not a Series
    df = pd.DataFrame({"Value": result.sum()})
    df = df.rename({"Unstranded": "NumberOfCountedReads"}, errors="raise")
    df.index.name = "Category"
    logger.info("Read %d row(s) and %d column(s) from reads per gene file at %s",
                *df.shape, read_per_gene_path)
    return df


def star_log_remove_unwanted_entries_and_adjust_format(barcode, df: pd.DataFrame) -> pd.DataFrame:
    """
    For a single star log (Log.final.out) in dataframe format, filter out the
    entries that are not needed and format the labels for some metrics:
      - Replace '%' with 'pect' in the labels.
      - Remove labels ending with ':'
        (mostly the section separators like 'MULTI-MAPPING READS:' and 'UNMAPPED READS:')
      - Remove the metrics we do not need based on the following keywords:
        Mapping speed, Average, Number of splices, per base, chimeric reads, average

    The dataframe provided as input must have an index with 1 level with the metric names.
    """
    # Remove index values ending with ':' (rows like 'MULTI-MAPPING READS:','UNIQUE READS:')
    logger.info("Filtering STAR logs for barcode %s. Starting with %d row(s) and %d column(s)", barcode, *df.shape)
    to_keep = ~df.index.to_series().str.endswith(":")
    # Remove index values that contain any of these substrings
    regex_columns_to_remove = "Mapping speed|Average|Number of splices|per base|chimeric reads|average"
    to_keep = to_keep & ~df.index.to_series().str.contains(regex_columns_to_remove, regex=True)
    logger.info("Removed the following log entries for barcode '%s':\n\t%s",
                barcode,
                "\n\t".join(to_keep[~to_keep].index.to_list()))
    result = df.loc[to_keep]
    # Replace % by pect, remove colons, use camel case and remove spaces
    # You might be tempted to use .title() to make everything uppercase,
    # but characters which are already uppercase should stay that way.
    # (example: NumberOfUMIs and not NumberOfUmis)
    result.index = result.index.str.replace("%", "pect")\
                               .str.replace(":", "")\
                               .str.replace(r"(?:^|\s).", lambda m: m.group(0).upper(), regex=True)\
                               .str.replace(" ", "")
    result = result.rename({"UniquelyMappedReadsNumber": "NumberOfMappedReads",
                            "UniquelyMappedReadsPect": "PctMappedReads"}, errors="raise")
    logger.info("Done filtering STAR logs for barcode %s. Result has %d row(s) and %d column(s). "
                "Found entries:\n\t%s",
                barcode, *result.shape, "\n\t".join(result.index.to_list()))
    return result


def summary_remove_unwanted_entries_and_adjust_format(barcode, df: pd.DataFrame) -> pd.DataFrame:
    logger.info("Filtering and formatting summary logs for barcode %s. "
                "Starting with %d row(s) and %d column(s)", barcode, *df.shape)
    columns_to_remove = (
        "Number of Reads",
        "Q30 Bases in RNA read",
        "Reads Mapped to Genome: Unique",
        "Reads Mapped to Transcriptome: Unique Genes",
        "Reads in Cells Mapped to Unique Genes",
        "Median UMI per Cell",
        "Median Genes per Cell",
        "Reads Mapped to Genome: Unique+Multiple",
        "Median Reads per Cell",
        "Mean UMI per Cell",
        "Mean Genes per Cell",
    )
    to_keep = ~df.index.isin(columns_to_remove)
    logger.info("Removed the following summary entries for barcode '%s':\n\t%s",
                barcode,
                "\n\t".join(df.loc[~to_keep].index.to_list()))
    result = df.loc[to_keep]
    result.index = result.index.str.replace(r"(?:^|\s).", lambda m: m.group(0).upper(),
                                            regex=True).str.replace(" ", "")
    to_rename = {"UMIsInCells": "NumberOfUMIs",
                 "TotalGenesDetected": "NumberOfGenes"}
    try:
        result = result.rename(to_rename, errors="raise")
    except KeyError as e:
        # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
        raise KeyError(f"Tried to rename log entries ({','.join(to_rename)}) in the summary "
                       f"log for barcode {barcode}, but an entry was not found in the file. "
                       "Make sure that you are using the correct version of STAR. "
                       f"Available entries: {', '.join(result.index.to_list())}") from e
    logger.info("Done filtering summary logs for barcode %s. Result has %d row(s) and %d column(s). "
                "Found entries:\n\t%s",
                barcode, *result.shape, "\n\t".join(result.index.to_list()))
    return result


def join_dfs(df_list, barcodes) -> pd.DataFrame:
    # Combine the dataframes together and add the barcodes as a level to the dataframe
    # in order to make a 2-level index (first level the barcodes and second level the metrics).
    result = pd.concat(dict(zip(barcodes, df_list)), names=["WellBC"])
    # Pivot the table by moving the metrics to the columns. They are added as an extra level,
    # so we can just drop the 'Value' level that was already there
    result = result.unstack(level="Category").droplevel(0, axis="columns")
    return result


def main(par):
    logger.info("Component started.")
    # Provide an overview of the parameters in the logs
    parameters_str = [f'\t{param}: {param_val}\n' for param, param_val in par.items()]
    logger.info("Parameters:\n%s", "".join(parameters_str).rstrip())
    star_logs, gene_summary_logs, reads_per_gene_logs, barcodes = par["star_logs"], \
        par["gene_summary_logs"], par["reads_per_gene_logs"], par["barcodes"]
    number_of_inputs = tuple(len(i) for i in (star_logs, gene_summary_logs,
                                              reads_per_gene_logs, barcodes))
    if len(set(number_of_inputs)) != 1:
        raise ValueError("Expected the same number of inputs for 'star_logs' (%d), "
                         "'gene_summary_logs' (%d), 'reads_per_gene_logs' (%d) "
                         "and 'barcodes' (%d)." % number_of_inputs)
    logs_to_process = [
        (star_log_to_dataframe, star_log_remove_unwanted_entries_and_adjust_format, star_logs),
        (summary_to_dataframe, summary_remove_unwanted_entries_and_adjust_format, gene_summary_logs),
        (reads_per_gene_to_dataframe, None, reads_per_gene_logs),
    ]
    logger.info("Formatting the contents of the log files.")
    all_logs_data = []
    for df_generator, formatter, data in logs_to_process:
        data_as_df = list(starmap(df_generator, zip(barcodes, data)))
        data_formatted = data_as_df
        if formatter:
            data_formatted = list(starmap(formatter, zip(barcodes, data_as_df)))
        data_joined = join_dfs(data_formatted, barcodes)
        all_logs_data.append(data_joined)
    logger.info("Joining entries across the different logs together.")
    all_stats = pd.concat(all_logs_data, axis=1)
    logger.info("Log statistics were gathered for the following barcodes: %s",
                ", ".join(all_stats.index.to_list()))
    dtypes = {
        'NumberOfInputReads': pd.UInt64Dtype(),
        'NumberOfMappedReads': pd.UInt64Dtype(),
        'PctMappedReads': pd.Float64Dtype(),
        'NumberOfReadsMappedToMultipleLoci': pd.UInt64Dtype(),
        'PectOfReadsMappedToMultipleLoci': pd.Float64Dtype(),
        'NumberOfReadsMappedToTooManyLoci': pd.UInt64Dtype(),
        'PectOfReadsMappedToTooManyLoci': pd.Float64Dtype(),
        'NumberOfReadsUnmappedTooManyMismatches': pd.UInt64Dtype(),
        'PectOfReadsUnmappedTooManyMismatches': pd.Float64Dtype(),
        'NumberOfReadsUnmappedTooShort': pd.UInt64Dtype(),
        'PectOfReadsUnmappedTooShort': pd.Float64Dtype(),
        'NumberOfReadsUnmappedOther': pd.UInt64Dtype(),
        'PectOfReadsUnmappedOther': pd.Float64Dtype(),
        'ReadsWithValidBarcodes': pd.Float64Dtype(),
        'SequencingSaturation': pd.Float64Dtype(),
        'Q30BasesInCB+UMI': pd.Float64Dtype(),
        'ReadsMappedToTranscriptome:Unique+MultipeGenes': pd.Float64Dtype(),
        'EstimatedNumberOfCells': pd.UInt64Dtype(),
        'FractionOfReadsInCells': pd.Float64Dtype(),
        'MeanReadsPerCell': pd.UInt64Dtype(),
        'NumberOfUMIs': pd.UInt64Dtype(),
        'NumberOfGenes': pd.UInt64Dtype(),
        'NumberOfCountedReads': pd.UInt64Dtype(),
    }
    all_stats = all_stats.astype(dtypes)
    # batched() is used here to print a limited number of columns at a time
    # to make sure that they are all displayed (pandas might limit the view for readability)
    logger.info("Summary of final output:\n%s\n",
                "\n".join(repr(all_stats.loc[:, columns].describe())
                          for columns in batched(all_stats.columns, 3)))
    logger.info("Writing output to %s", par["output"])
    all_stats.reset_index("WellBC").to_csv(par["output"], sep="\t", header=True,
                                           index=False, float_format='%g')
    logger.info("Finished %s.", meta["name"])


if __name__ == "__main__":
    main(par)

View File

@@ -0,0 +1,182 @@
import pytest
import sys
import re
import pandas as pd
from pathlib import Path
from uuid import uuid4
from subprocess import CalledProcessError

### VIASH START
meta = {
    "resources_dir": "./src/stats/combine_star_logs/",
    "executable": "target/executable/stats/combine_star_logs/combine_star_logs",
    "config": "src/stats/combine_star_logs/config.vsh.yaml"
}
### VIASH END


@pytest.fixture
def test_resources_path():
    return Path(meta["resources_dir"]) / "test_data"


@pytest.fixture
def barcode_1_star_log(test_resources_path):
    return test_resources_path / "barcode_1" / "Log.final.out"


@pytest.fixture
def barcode_1_reads_per_gene_file(test_resources_path):
    return test_resources_path / "barcode_1" / "ReadsPerGene.out.tab"


@pytest.fixture
def barcode_1_summary(test_resources_path):
    return test_resources_path / "barcode_1" / "summary.csv"


@pytest.fixture
def barcode_2_star_log(test_resources_path):
    return test_resources_path / "barcode_2" / "Log.final.out"


@pytest.fixture
def barcode_2_reads_per_gene_file(test_resources_path):
    return test_resources_path / "barcode_2" / "ReadsPerGene.out.tab"


@pytest.fixture
def barcode_2_summary(test_resources_path):
    return test_resources_path / "barcode_2" / "summary.csv"


@pytest.fixture
def no_reads_mapped_star_log(test_resources_path):
    return test_resources_path / "empty" / "Log.final.out"


@pytest.fixture
def no_reads_mapped_reads_per_gene_file(test_resources_path):
    return test_resources_path / "empty" / "ReadsPerGene.out.tab"


@pytest.fixture
def no_reads_mapped_summary(test_resources_path):
    return test_resources_path / "empty" / "summary.csv"


@pytest.fixture
def random_path(tmp_path):
    def wrapper(extension=None):
        extension = "" if not extension else f".{extension}"
        return tmp_path / f"{uuid4()}{extension}"
    return wrapper


def test_incorrect_number_of_inputs_raises(run_component,
                                           barcode_1_star_log, barcode_2_star_log,
                                           barcode_1_reads_per_gene_file, barcode_2_reads_per_gene_file,
                                           barcode_1_summary, barcode_2_summary,
                                           random_path):
    output_path = random_path("txt")
    with pytest.raises(CalledProcessError) as err:
        run_component([
            "--barcodes", "foo;bar",
            "--star_logs", f"{barcode_1_star_log}",
            "--reads_per_gene_logs", f"{barcode_1_reads_per_gene_file};{barcode_2_reads_per_gene_file}",
            "--gene_summary_logs", f"{barcode_1_summary};{barcode_2_summary}",
            "--output", output_path,
        ])
    assert re.search(r"ValueError: Expected the same number of inputs for 'star_logs' \(1\), "
                     r"'gene_summary_logs' \(2\), 'reads_per_gene_logs' \(2\) and 'barcodes' \(2\)\.",
                     err.value.stdout.decode('utf-8'))


def test_equal_number_of_argument(run_component,
                                  barcode_1_star_log, barcode_2_star_log,
                                  barcode_1_reads_per_gene_file, barcode_2_reads_per_gene_file,
                                  barcode_1_summary, barcode_2_summary,
                                  random_path):
    output_path = random_path("txt")
    run_component([
        "--barcodes", "foo;bar",
        "--star_logs", f"{barcode_1_star_log};{barcode_2_star_log}",
        "--reads_per_gene_logs", f"{barcode_1_reads_per_gene_file};{barcode_2_reads_per_gene_file}",
        "--gene_summary_logs", f"{barcode_1_summary};{barcode_2_summary}",
        "--output", output_path,
    ])
    # We use strings here to make a comparison of the file contents without
    # doing any inferences of the numerical data type (i.e. exact file contents).
    expected_dict = {
        'NumberOfInputReads': ["96398", "10155"],
        'NumberOfMappedReads': ["70824", "7179"],
        'PctMappedReads': ["73.47", "70.69"],
        'NumberOfReadsMappedToMultipleLoci': ["0", "0"],
        'PectOfReadsMappedToMultipleLoci': ["0", "0"],
        'NumberOfReadsMappedToTooManyLoci': ["22281", "2248"],
        'PectOfReadsMappedToTooManyLoci': ["23.11", "22.14"],
        'NumberOfReadsUnmappedTooManyMismatches': ["0", "0"],
        'PectOfReadsUnmappedTooManyMismatches': ["0", "0"],
        'NumberOfReadsUnmappedTooShort': ["2697", "553"],
        'PectOfReadsUnmappedTooShort': ["2.8", "5.45"],
        'NumberOfReadsUnmappedOther': ["596", "175"],
        'PectOfReadsUnmappedOther': ["0.62", "1.72"],
        'ReadsWithValidBarcodes': ["0.999782", "0.999803"],
        'SequencingSaturation': ["0.0602963", "0.0539344"],
        'Q30BasesInCB+UMI': ["0.980096", "0.984461"],
        'ReadsMappedToTranscriptome:Unique+MultipeGenes': ["0.60411", "0.530871"],
        'EstimatedNumberOfCells': ["1", "1"],
        'FractionOfReadsInCells': ["1", "1"],
        'MeanReadsPerCell': ["53602", "4969"],
        'NumberOfUMIs': ["50370", "4701"],
        'NumberOfGenes': ["8767", "2397"],
        'NumberOfCountedReads': ["17", "15"],
    }
    expected = pd.DataFrame.from_dict(expected_dict, dtype=pd.StringDtype())
    expected.index = pd.Index(["foo", "bar"], name="WellBC", dtype=pd.StringDtype())
    assert output_path.is_file()
    contents = pd.read_csv(output_path, sep="\t", index_col=0, dtype=pd.StringDtype())
    assert set(("NumberOfInputReads", "SequencingSaturation",
                "NumberOfGenes", "NumberOfUMIs", "NumberOfCountedReads",
                "PctMappedReads")).issubset(set(contents.columns))
    pd.testing.assert_frame_equal(contents, expected)


def test_empty(run_component, no_reads_mapped_star_log,
               no_reads_mapped_reads_per_gene_file, no_reads_mapped_summary,
               random_path):
    """
    Sometimes the summary.csv contains '-nan' values, make sure they
    are properly handled.
    """
    output_path = random_path("txt")
    run_component([
        "--barcodes", "foo",
        "--star_logs", no_reads_mapped_star_log,
        "--reads_per_gene_logs", no_reads_mapped_reads_per_gene_file,
        "--gene_summary_logs", no_reads_mapped_summary,
        "--output", output_path,
    ])
    expected_dict = {
        'NumberOfInputReads': ["1327"],
        'NumberOfMappedReads': ["116"],
        'PctMappedReads': ["8.74"],
        'NumberOfReadsMappedToMultipleLoci': ["0"],
        'PectOfReadsMappedToMultipleLoci': ["0"],
        'NumberOfReadsMappedToTooManyLoci': ["43"],
        'PectOfReadsMappedToTooManyLoci': ["3.24"],
        'NumberOfReadsUnmappedTooManyMismatches': ["0"],
        'PectOfReadsUnmappedTooManyMismatches': ["0"],
        'NumberOfReadsUnmappedTooShort': ["1166"],
        'PectOfReadsUnmappedTooShort': ["87.87"],
        'NumberOfReadsUnmappedOther': ["2"],
        'PectOfReadsUnmappedOther': ["0.15"],
        'ReadsWithValidBarcodes': ["0.023361"],
        'SequencingSaturation': [pd.NA],
        'Q30BasesInCB+UMI': ["0.917408"],
        'ReadsMappedToTranscriptome:Unique+MultipeGenes': ["0"],
        'EstimatedNumberOfCells': ["0"],
        'FractionOfReadsInCells': [pd.NA],
        'MeanReadsPerCell': ["0"],
        'NumberOfUMIs': ["0"],
        'NumberOfGenes': ["0"],
        'NumberOfCountedReads': ["0"],
    }
    expected = pd.DataFrame.from_dict(expected_dict, dtype=pd.StringDtype())
    expected.index = pd.Index(["foo"], name="WellBC", dtype=pd.StringDtype())
    contents = pd.read_csv(output_path, sep="\t", index_col=0, dtype=pd.StringDtype())
    pd.testing.assert_frame_equal(contents, expected)


if __name__ == '__main__':
    sys.exit(pytest.main([__file__]))

View File

@@ -0,0 +1,37 @@
Started job on | Jun 26 09:38:11
Started mapping on | Jun 26 09:38:14
Finished on | Jun 26 09:38:23
Mapping speed, Million of reads per hour | 38.56
Number of input reads | 96398
Average input read length | 57
UNIQUE READS:
Uniquely mapped reads number | 70824
Uniquely mapped reads % | 73.47%
Average mapped length | 56.93
Number of splices: Total | 6432
Number of splices: Annotated (sjdb) | 6285
Number of splices: GT/AG | 6331
Number of splices: GC/AG | 33
Number of splices: AT/AC | 2
Number of splices: Non-canonical | 66
Mismatch rate per base, % | 0.61%
Deletion rate per base | 0.01%
Deletion average length | 1.38
Insertion rate per base | 0.00%
Insertion average length | 1.24
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 22281
% of reads mapped to too many loci | 23.11%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 2697
% of reads unmapped: too short | 2.80%
Number of reads unmapped: other | 596
% of reads unmapped: other | 0.62%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

View File

@@ -0,0 +1,8 @@
N_unmapped 11111 22222 33333
N_multimapping 0 0 0
N_noFeature 44444 55555 66666
N_ambiguous 77777 88888 99999
gene1 2 0 0
gene2 0 0 0
gene3 6 0 6
gene5 9 6 3

View File

@@ -0,0 +1,20 @@
Number of Reads,96398
Reads With Valid Barcodes,0.999782
Sequencing Saturation,0.0602963
Q30 Bases in CB+UMI,0.980096
Q30 Bases in RNA read,0.799904
Reads Mapped to Genome: Unique+Multiple,0.734704
Reads Mapped to Genome: Unique,0.734704
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.60411
Reads Mapped to Transcriptome: Unique Genes,0.556049
Estimated Number of Cells,1
Reads in Cells Mapped to Unique Genes,53602
Fraction of Reads in Cells,1
Mean Reads per Cell,53602
Median Reads per Cell,53602
UMIs in Cells,50370
Mean UMI per Cell,50370
Median UMI per Cell,50370
Mean Genes per Cell,8767
Median Genes per Cell,8767
Total Genes Detected,8767

View File

@@ -0,0 +1,37 @@
Started job on | Jun 26 09:38:56
Started mapping on | Jun 26 09:39:00
Finished on | Jun 26 09:39:02
Mapping speed, Million of reads per hour | 18.28
Number of input reads | 10155
Average input read length | 57
UNIQUE READS:
Uniquely mapped reads number | 7179
Uniquely mapped reads % | 70.69%
Average mapped length | 56.36
Number of splices: Total | 526
Number of splices: Annotated (sjdb) | 495
Number of splices: GT/AG | 502
Number of splices: GC/AG | 4
Number of splices: AT/AC | 1
Number of splices: Non-canonical | 19
Mismatch rate per base, % | 0.85%
Deletion rate per base | 0.00%
Deletion average length | 1.09
Insertion rate per base | 0.00%
Insertion average length | 1.07
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 2248
% of reads mapped to too many loci | 22.14%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 553
% of reads unmapped: too short | 5.45%
Number of reads unmapped: other | 175
% of reads unmapped: other | 1.72%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

View File

@@ -0,0 +1,8 @@
N_unmapped 101010 202020 303030
N_multimapping 0 0 0
N_noFeature 404040 505050 606060
N_ambiguous 707070 808080 909090
gene1 0 0 0
gene2 0 0 0
gene6 5 5 0
gene4 10 2 8

View File

@@ -0,0 +1,20 @@
Number of Reads,10155
Reads With Valid Barcodes,0.999803
Sequencing Saturation,0.0539344
Q30 Bases in CB+UMI,0.984461
Q30 Bases in RNA read,0.786064
Reads Mapped to Genome: Unique+Multiple,0.706942
Reads Mapped to Genome: Unique,0.706942
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.530871
Reads Mapped to Transcriptome: Unique Genes,0.489316
Estimated Number of Cells,1
Reads in Cells Mapped to Unique Genes,4969
Fraction of Reads in Cells,1
Mean Reads per Cell,4969
Median Reads per Cell,4969
UMIs in Cells,4701
Mean UMI per Cell,4701
Median UMI per Cell,4701
Mean Genes per Cell,2397
Median Genes per Cell,2397
Total Genes Detected,2397

View File

@@ -0,0 +1,37 @@
Started job on | Jun 26 09:38:56
Started mapping on | Jun 26 09:39:00
Finished on | Jun 26 09:39:02
Mapping speed, Million of reads per hour | 18.28
Number of input reads | 1327
Average input read length | 58
UNIQUE READS:
Uniquely mapped reads number | 116
Uniquely mapped reads % | 8.74%
Average mapped length | 54.11
Number of splices: Total | 6
Number of splices: Annotated (sjdb) | 4
Number of splices: GT/AG | 4
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 2
Mismatch rate per base, % | 6.63%
Deletion rate per base | 0.13%
Deletion average length | 2.00
Insertion rate per base | 0.00%
Insertion average length | 0.00
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 43
% of reads mapped to too many loci | 3.24%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 1166
% of reads unmapped: too short | 87.87%
Number of reads unmapped: other | 2
% of reads unmapped: other | 0.15%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

View File

@@ -0,0 +1,8 @@
N_unmapped 1211 1211 1211
N_multimapping 0 0 0
N_noFeature 23 26 109
N_ambiguous 6 2 0
gene1 0 0 0
gene2 0 0 0
gene6 0 0 0
gene4 0 0 0

View File

@@ -0,0 +1,20 @@
Number of Reads,1327
Reads With Valid Barcodes,0.023361
Sequencing Saturation,-nan
Q30 Bases in CB+UMI,0.917408
Q30 Bases in RNA read,0.711711
Reads Mapped to Genome: Unique+Multiple,0.0874152
Reads Mapped to Genome: Unique,0.0874152
Reads Mapped to Transcriptome: Unique+Multipe Genes,0
Reads Mapped to Transcriptome: Unique Genes,0
Estimated Number of Cells,0
Reads in Cells Mapped to Unique Genes,0
Fraction of Reads in Cells,-nan
Mean Reads per Cell,0
Median Reads per Cell,0
UMIs in Cells,0
Mean UMI per Cell,0
Median UMI per Cell,0
Mean Genes per Cell,0
Median Genes per Cell,0
Total Genes Detected,0

View File

@@ -0,0 +1,56 @@
name: generate_pool_statistics
namespace: "stats"
authors:
  - __merge__: /src/base/authors/dries_schaumont.yaml
    roles: [ author, maintainer ]
  - __merge__: /src/base/authors/marijke_van_moerbeke.yaml
    roles: [ contributor ]
argument_groups:
  - name: "Arguments"
    arguments:
      - name: "--nrReadsNrGenesPerChrom"
        type: file
        multiple: true
        description: |
          Path to an output file that contains a .tsv formatted table describing
          per chromosome the number of reads that were mapped to that chromosome (NumberOfReads
          column) and the number of genes on that chromosome that had at least one
          read mapped to it (NumberOfGenes).
        direction: input
        default: [processedBamFile_well1.tsv, processedBamfile_well2.tsv]
      - name: "--nrReadsNrGenesPerChromPool"
        direction: output
        type: file
        multiple: false
        description: |
          Pivot table in tsv format of the combined input nrReadsNrGenesPerChrom files. Describes
          per chromosome (as columns) the number of reads, as well as the total number
          of reads per cell barcode and the percentage of nuclear, ERCC and mitochondrial
          reads.
        example: "nrReadsNrGenesPerChrom.txt"
resources:
  - type: python_script
    path: script.py
test_resources:
  - type: python_script
    path: test.py
engines:
  - type: docker
    image: python:3.12-slim
    setup:
      - type: apt
        packages:
          - procps
      - type: python
        packages:
          - pandas
    test_setup:
      - type: python
        packages:
          - viashpy
runners:
  - type: executable
  - type: nextflow

View File

@@ -0,0 +1,94 @@
import pandas as pd
from pathlib import Path
import re

### VIASH START
par = {
    "nrReadsNrGenesPerChrom": ["src/stats/generate_pool_statistics/test1.tsv", "src/stats/generate_pool_statistics/test2.tsv"],
    "nrReadsNrGenesPerChromPool": "nrReadsNrGenesPerChrom_pool.txt"
}
### VIASH END

INDEX_COL = ["WellBC", "WellID"]

if __name__ == "__main__":
    #########
    # nrReadsNrGenesPerChrom file
    #########
    nr_reads_nr_genes_wells = []
    par["nrReadsNrGenesPerChrom"] = list(map(Path, par["nrReadsNrGenesPerChrom"]))
    for nr_reads_nr_genes_file in par["nrReadsNrGenesPerChrom"]:
        nr_reads_nr_gene_well = pd.read_csv(nr_reads_nr_genes_file,
                                            header=0, delimiter="\t",
                                            dtype={"WellBC": pd.StringDtype(),
                                                   "WellID": pd.StringDtype(),
                                                   "Chr": pd.StringDtype(),
                                                   "NumberOfReads": pd.UInt64Dtype(),
                                                   "NumberOfGenes": pd.UInt64Dtype()})
        if nr_reads_nr_gene_well.empty:
            raise ValueError(f"{nr_reads_nr_genes_file.name} does not seem to contain any information!")
        nr_reads_nr_genes_wells.append(nr_reads_nr_gene_well)
    nr_reads_nr_genes_pool = pd.concat(nr_reads_nr_genes_wells, ignore_index=True)

    # One row per well, one column per chromosome
    total_nr_reads_per_chromosome = nr_reads_nr_genes_pool.pivot_table(index=INDEX_COL, columns="Chr",
                                                                       values=["NumberOfReads"], fill_value=0,
                                                                       aggfunc="sum").droplevel(0, axis=1)
    total_nr_reads_per_chromosome.columns.name = None
    # Remove scaffolds/chromosomes with no counts
    total_nr_reads_per_chromosome = total_nr_reads_per_chromosome.loc[:, (total_nr_reads_per_chromosome != 0).any(axis=0)]

    ##### Total number of genes from all chromosomes
    total_nr_genes = nr_reads_nr_genes_pool.loc[:, INDEX_COL + ['NumberOfGenes']].groupby(["WellBC", "WellID"]).sum()

    ##### Total read counts per well (irrespective of chromosome)
    total_sum_of_reads = total_nr_reads_per_chromosome.sum(numeric_only=True, axis=1)

    ##### Logic to split up chromosomes per type
    chromosome_names = total_nr_reads_per_chromosome.columns.to_list()
    chr_regex = re.compile(r"^(chr)?\d+")
    matching_chromosomes = [chr_name for chr_name
                            in chromosome_names
                            if chr_regex.match(chr_name)]
    sex_chromosome_names = ["X", "Y"]
    mitochondrial_chr_name = "MT"
    # This is logic from the original HT pipeline:
    # only when all of the matched chromosomes start with "chr" should the mitochondrial,
    # X and Y chromosomes also start with "chr".
    if all(chr_name.startswith("chr") for chr_name in matching_chromosomes):
        sex_chromosome_names += ["chrX", "chrY"]
        mitochondrial_chr_name = "chrM"

    ###### Counts for mitochondrial reads
    try:
        mitochondrial_reads = total_nr_reads_per_chromosome.loc[:, mitochondrial_chr_name]
    except KeyError:
        mitochondrial_reads = 0
    percentage_mitochondrial_reads = round(mitochondrial_reads / total_sum_of_reads * 100, 2)

    ###### Counts for ERCC reads
    total_ercc_reads = total_nr_reads_per_chromosome.filter(regex=r"^ERCC").sum(axis=1)
    percentage_ercc_reads = round(total_ercc_reads / total_sum_of_reads * 100, 2)

    ###### Counts for nuclear chromosomes
    total_chromosomal_reads = total_nr_reads_per_chromosome.loc[:, matching_chromosomes].sum(axis=1)
    percentage_chromosomal_reads = round(total_chromosomal_reads / total_sum_of_reads * 100, 2)

    cols_to_add = {
        "pctChrom": percentage_chromosomal_reads,
        "pctMT": percentage_mitochondrial_reads,
        "pctERCC": percentage_ercc_reads,
        "SumReads": total_sum_of_reads,
        "NumberOfGenes": total_nr_genes,
        "NumberOfERCCReads": total_ercc_reads,
        "NumberOfChromReads": total_chromosomal_reads,
        "NumberOfMTReads": mitochondrial_reads,
    }
    total_nr_reads_per_chromosome = total_nr_reads_per_chromosome.assign(
        **cols_to_add
    )
    total_nr_reads_per_chromosome.reset_index(names=INDEX_COL)\
        .to_csv(par["nrReadsNrGenesPerChromPool"], sep="\t",
                header=True, index=False, float_format="%g",
                columns=tuple(INDEX_COL) + tuple(chromosome_names) + tuple(cols_to_add.keys())
                )
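The chromosome naming rule used by the script can be isolated in a small pure-Python function. The sketch below is illustrative only: the function name `classify_chromosomes` is hypothetical, and it reproduces the rule as read from the script without any pandas dependency.

```python
import re

CHR_REGEX = re.compile(r"^(chr)?\d+")


def classify_chromosomes(chromosome_names):
    """Return (numbered chromosomes, sex chromosome names, mitochondrial name)."""
    numbered = [name for name in chromosome_names if CHR_REGEX.match(name)]
    sex_names = ["X", "Y"]
    mito_name = "MT"
    # Switch to 'chr'-prefixed names only when every numbered chromosome
    # carries the prefix, mirroring the rule in the script.
    if all(name.startswith("chr") for name in numbered):
        sex_names += ["chrX", "chrY"]
        mito_name = "chrM"
    return numbered, sex_names, mito_name


prefixed = classify_chromosomes(["chr1", "chr2", "chrM", "chrX", "ERCC-1"])
plain = classify_chromosomes(["1", "2", "MT", "X"])
```

One property worth keeping in mind: `all()` over an empty sequence is true, so input without any numbered chromosome falls into the prefixed branch.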

View File

@@ -0,0 +1,269 @@
from uuid import uuid4
from textwrap import dedent
from subprocess import CalledProcessError
import pandas as pd
import re
import pytest
import sys
from pathlib import Path
### VIASH START
meta = {
"resources_dir": "./src/stats/generate_pool_statistics/",
"executable": "target/executable/stats/generate_pool_statistics/generate_pool_statistics",
"config": "src/stats/generate_pool_statistics/config.vsh.yaml"
}
### VIASH END
@pytest.fixture
def random_path(tmp_path):
def wrapper(extension=None):
extension = "" if not extension else f".{extension}"
return Path(tmp_path / f"{uuid4()}{extension}")
return wrapper
@pytest.fixture
def random_tsv_path(random_path):
def wrapper():
return random_path("tsv")
return wrapper
@pytest.fixture
def simple_input_file_one(random_tsv_path, request):
prefix = request.param
mito_name = f"{prefix}M{'T' if not prefix else ''}"
contents = dedent(
f"""\
WellBC WellID Chr NumberOfReads NumberOfGenes
AGG A1 {prefix}1 2 1
AGG A1 {prefix}2 3 2
AGG A1 {prefix}3 4 2
AGG A1 {mito_name} 4 2
AGG A1 {prefix}X 2 3
AGG A1 ERCC-1 1 1
AGG A1 ERCC-2 1 1
""")
output_file = random_tsv_path()
with output_file.open("w") as open_file:
open_file.write(contents)
return output_file
@pytest.fixture
def simple_input_file_two(random_tsv_path, request):
prefix = request.param
contents = dedent(
f"""\
WellBC WellID Chr NumberOfReads NumberOfGenes
CCC B2 {prefix}2 2 1
CCC B2 {prefix}3 3 2
CCC B2 {prefix}5 4 2
CCC B2 {prefix}1 4 2
CCC B2 {prefix}Y 2 3
CCC B2 {prefix}X 2 3
CCC B2 ERCC-3 1 1
CCC B2 ERCC-2 1 1
""")
output_file = random_tsv_path()
with output_file.open("w") as open_file:
open_file.write(contents)
return output_file
@pytest.fixture
def empty_input_file(random_tsv_path):
contents = dedent(
f"""\
WellBC WellID Chr NumberOfReads NumberOfGenes
""")
output_file = random_tsv_path()
with output_file.open("w") as open_file:
open_file.write(contents)
return output_file
@pytest.mark.parametrize("simple_input_file_one,simple_input_file_two,expected", [("chr", "chr", "chr"), ("", "", "")],
indirect=["simple_input_file_one", "simple_input_file_two"])
def test_generate_pool_statistics_simple(run_component, simple_input_file_one,
simple_input_file_two, random_tsv_path, expected):
output_path = random_tsv_path()
run_component([
"--nrReadsNrGenesPerChrom", simple_input_file_one,
"--nrReadsNrGenesPerChrom", simple_input_file_two,
"--nrReadsNrGenesPerChromPool", output_path
])
mito_name = f"{expected}M{'T' if not expected else ''}"
expected_dict = {
"WellBC": ["AGG", "CCC"],
"WellID": ["A1", "B2"],
"ERCC-1": ["1", "0"],
"ERCC-2": ["1", "1"],
"ERCC-3": ["0", "1"],
f"{expected}1": ["2", "4"],
f"{expected}2": ["3", "2"],
f"{expected}3": ["4", "3"],
f"{expected}5": ["0", "4"],
f"{mito_name}": ["4", "0"],
f"{expected}X": ["2", "2"],
f"{expected}Y": ["0", "2"],
"SumReads": ["17", "19"],
"pctMT": ["23.53", "0"],
"pctERCC": ["11.76", "10.53"],
"pctChrom": ["52.94", "68.42"],
"NumberOfGenes": ["12", "15"],
"NumberOfMTReads": ["4", "0"],
"NumberOfChromReads": ["9", "13"],
"NumberOfERCCReads": ["2", "2"],
}
expected_frame = pd.DataFrame.from_dict(expected_dict, dtype=pd.StringDtype())
assert output_path.is_file()
contents = pd.read_csv(output_path, sep="\t", dtype=pd.StringDtype())
pd.testing.assert_frame_equal(contents, expected_frame, check_like=True)
def test_only_numerical_chromosomes(run_component, random_tsv_path):
"""
The chromosome column might be read as an integer instead of a string;
make sure that input with only numerical chromosome names works.
"""
output_path = random_tsv_path()
contents1 = dedent(
f"""\
WellBC WellID Chr NumberOfReads NumberOfGenes
CCC B2 2 2 1
CCC B2 3 3 2
CCC B2 5 4 2
CCC B2 1 4 2
""")
input_file_1 = random_tsv_path()
with input_file_1.open("w") as open_file:
open_file.write(contents1)
contents2 = dedent(
f"""\
WellBC WellID Chr NumberOfReads NumberOfGenes
AGG A1 2 2 1
AGG A1 3 3 2
AGG A1 5 4 2
AGG A1 1 4 2
""")
input_file_2 = random_tsv_path()
with input_file_2.open("w") as open_file:
open_file.write(contents2)
output_path = random_tsv_path()
run_component([
"--nrReadsNrGenesPerChrom", input_file_1,
"--nrReadsNrGenesPerChrom", input_file_2,
"--nrReadsNrGenesPerChromPool", output_path
])
expected_dict = {
"WellBC": ["AGG", "CCC"],
"WellID": ["A1", "B2"],
"1": ["4", "4"],
"2": ["2", "2"],
"3": ["3", "3"],
"5": ["4", "4"],
"pctChrom": ["100", "100"],
"pctMT": ["0", "0"],
"pctERCC": ["0", "0"],
"SumReads": ["13", "13"],
"NumberOfGenes": ["7", "7"],
"NumberOfERCCReads": ["0", "0"],
"NumberOfChromReads": ["13", "13"],
"NumberOfMTReads": ["0", "0"],
}
expected_frame = pd.DataFrame.from_dict(expected_dict,
dtype=pd.StringDtype())
assert output_path.is_file()
contents = pd.read_csv(output_path, sep="\t", dtype=pd.StringDtype())
pd.testing.assert_frame_equal(contents, expected_frame, check_like=True)
@pytest.mark.parametrize("simple_input_file_one", [("")],
indirect=["simple_input_file_one"])
def test_empty_input_raises(run_component, simple_input_file_one, empty_input_file, random_tsv_path):
"""
When an input file contains no data, raise an error.
"""
output_path = random_tsv_path()
with pytest.raises(CalledProcessError) as err:
run_component([
"--nrReadsNrGenesPerChrom", simple_input_file_one,
"--nrReadsNrGenesPerChrom", empty_input_file,
"--nrReadsNrGenesPerChromPool", output_path
])
assert re.search(
rf"{empty_input_file.name} does not seem to contain any information",
err.value.stdout.decode("utf-8"),
)
def test_remove_chromosomes_with_no_counts(run_component, random_tsv_path):
"""
If a chromosome has no counts across all of the wells, it should
not be included in the output
"""
output_path = random_tsv_path()
contents1 = dedent(
f"""\
WellBC WellID Chr NumberOfReads NumberOfGenes
CCC B2 2 2 1
CCC B2 3 3 2
CCC B2 5 4 2
CCC B2 1 4 2
CCC B2 empty 0 0
""")
input_file_1 = random_tsv_path()
with input_file_1.open("w") as open_file:
open_file.write(contents1)
contents2 = dedent(
f"""\
WellBC WellID Chr NumberOfReads NumberOfGenes
AGG A1 2 2 1
AGG A1 3 3 2
AGG A1 5 4 2
AGG A1 1 4 2
AGG A1 empty 0 0
""")
input_file_2 = random_tsv_path()
with input_file_2.open("w") as open_file:
open_file.write(contents2)
output_path = random_tsv_path()
run_component([
"--nrReadsNrGenesPerChrom", input_file_1,
"--nrReadsNrGenesPerChrom", input_file_2,
"--nrReadsNrGenesPerChromPool", output_path
])
# Here, the chromosome called "empty" should not be included
expected_dict = {
"WellBC": ["AGG", "CCC"],
"WellID": ["A1", "B2"],
"1": ["4", "4"],
"2": ["2", "2"],
"3": ["3", "3"],
"5": ["4", "4"],
"pctChrom": ["100", "100"],
"pctMT": ["0", "0"],
"pctERCC": ["0", "0"],
"SumReads": ["13", "13"],
"NumberOfGenes": ["7", "7"],
"NumberOfERCCReads": ["0", "0"],
"NumberOfChromReads": ["13", "13"],
"NumberOfMTReads": ["0", "0"],
}
expected_frame = pd.DataFrame.from_dict(expected_dict,
dtype=pd.StringDtype())
assert output_path.is_file()
contents = pd.read_csv(output_path, sep="\t", dtype=pd.StringDtype())
pd.testing.assert_frame_equal(contents, expected_frame, check_like=True)
if __name__ == '__main__':
sys.exit(pytest.main([__file__]))

View File

@@ -0,0 +1,93 @@
name: generate_well_statistics
namespace: "stats"
description: Generate summary statistics from BAM files generated by STAR solo.
authors:
- __merge__: /src/base/authors/dries_schaumont.yaml
roles: [ author, maintainer ]
- __merge__: /src/base/authors/marijke_van_moerbeke.yaml
roles: [ contributor ]
argument_groups:
- name: "Arguments"
arguments:
- name: "--input"
type: file
description: "The .bam file as returned by the mapping tool STAR."
direction: input
example: "input.bam"
- name: "--barcode"
type: string
description: |
The barcode of the well that is being processed. Only used to add a metadata
column to all output files.
required: true
- name: "--well_id"
type: string
description: |
ID of this well. Only used to add a metadata column to the output files.
required: true
- name: "--processedBAMFile"
type: file
description: |
Path to a .tsv file listing, per read in the BAM file,
the values of the "CB", "UB", "GX" and "GN" tags, together with the
chromosome to which the read was mapped.
direction: output
default: processedBamFile.txt
- name: "--nrReadsNrGenesPerChrom"
type: file
description: |
Path to an output file that contains a .tsv formatted table describing
per chromosome the number of reads that were mapped to that chromosome (NumberOfReads
column) and the number of genes on that chromosome that had at least one
read mapped to it (NumberOfGenes).
default: nrReadsNrGenesPerChrom.txt
direction: output
- name: "--nrReadsNrUMIsPerCB"
type: file
description: |
Path to an output file that contains a .tsv formatted table describing
per barcode the number of UMIs (nrUMIs) and the total number of reads (NumberOfReads).
direction: output
default: nrReadsNrUMIsPerCB.txt
- name: "--umiFreqTop"
type: file
description: |
Path to an output file that contains a .tsv formatted table describing
per UMI (column UB) the frequency at which they occur in the reads (column
N). Only the top 100 UMIs are included.
default: umiFreqTop100.txt
direction: output
- name: "--threads"
type: integer
description: |
Number of threads to use for decompressing BAM files.
min: 1
default: 1
resources:
- type: python_script
path: script.py
test_resources:
- type: python_script
path: test.py
- path: test.sam
- path: empty.sam
engines:
- type: docker
image: python:3.13-trixie
setup:
- type: apt
packages:
- procps
- type: python
packages:
- pysam
- pandas
test_setup:
- type: python
packages:
- viashpy
runners:
- type: executable
- type: nextflow
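The two UMI-related outputs described in this config (`--umiFreqTop` and `--nrReadsNrUMIsPerCB`) can be sketched with the standard library. This is an illustration under assumptions, not the component's implementation: the function names are hypothetical and the input is a hand-written list of (CB, UB) tag pairs that mirrors the reads in test.sam.

```python
from collections import Counter, defaultdict

# One (cell barcode, UMI) pair per read; assumed values mirroring test.sam.
reads = [("ACA", "CGG"), ("ACA", "CGG"), ("GGG", "GTT"), ("GGG", "GTC")]


def top_umis(tag_pairs, n=100):
    """Return the n most frequent UMIs as (UB, N) pairs, most frequent first."""
    return Counter(ub for _, ub in tag_pairs).most_common(n)


def reads_and_umis_per_cb(tag_pairs):
    """Map each cell barcode to (NumberOfReads, nrUMIs)."""
    counts = Counter()
    umis = defaultdict(set)
    for cb, ub in tag_pairs:
        counts[cb] += 1
        umis[cb].add(ub)
    return {cb: (counts[cb], len(umis[cb])) for cb in counts}


umi_table = top_umis(reads)
cb_table = reads_and_umis_per_cb(reads)
```

The order of UMIs with equal frequency is not something this sketch tries to match with the component's output.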

View File

@@ -0,0 +1,3 @@
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:200
@SQ SN:2 LN:50

View File

@@ -0,0 +1,83 @@
import pysam
import pandas as pd
import logging

### VIASH START
par = {
    "input": "src/stats/generate_well_statistics/test.sam",
    "processedBAMFile": "processedBamFile.txt",
    "nrReadsNrGenesPerChrom": "nrReadsNrGenesPerChrom.txt",
    "nrReadsNrUMIsPerCB": "nrReadsNrUMIsPerCB.txt",
    "umiFreqTop": "umiFreqTop.txt",
    "threads": 1,
    "barcode": "ACGT",
    "well_id": "A1",
}
### VIASH END

logger = logging.getLogger()
console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
logger.setLevel(logging.DEBUG)

if __name__ == "__main__":
    logger.info("Component started.")
    parameters_str = [f'\t{param}: {param_val}\n' for param, param_val in par.items()]
    logger.info("Parameters:\n%s", "".join(parameters_str).rstrip())
    logger.info("Opening '%s'", par["input"])
    samfile = pysam.AlignmentFile(par["input"], "rb", threads=par["threads"])
    all_tags = []
    index = []
    tags_selection = ("CB", "UB", "GX", "GN")
    for aligned_segment in samfile:
        tags = dict(aligned_segment.get_tags())
        all_tags.append(tags)
        reference_name = aligned_segment.reference_name
        # Unmapped reads have no reference name; label them with "*"
        index.append("*" if not reference_name else reference_name)
    if not index:
        # Workaround for https://github.com/pandas-dev/pandas/issues/58594
        tag_dataframe = pd.DataFrame([], index=[], columns=tags_selection)
    else:
        tag_dataframe = pd.DataFrame.from_records(all_tags, index=index,
                                                  columns=tags_selection)
    logger.info("Done reading BAM file. Found %i entries", tag_dataframe.shape[0])
    tag_dataframe.assign(WellBC=par["barcode"], WellID=par["well_id"])\
        .reset_index(names="Chr")\
        .to_csv(par["processedBAMFile"], sep="\t", na_rep="",
                header=True, index=False,
                columns=("WellBC", "WellID", "Chr") + tags_selection)
    logger.info("Done constructing dataframe.")

    # Number of genes that had a read mapped to them per chromosome,
    # and the number of reads mapped to those genes per chromosome.
    nr_reads_nr_genes = tag_dataframe.dropna(subset=["GX"]).groupby(level=0).agg(
        NumberOfReads=pd.NamedAgg("GX", aggfunc="size"),
        NumberOfGenes=pd.NamedAgg(column="GX", aggfunc="nunique")
    )
    # Make sure every reference from the header is listed, also those without reads
    nr_reads_nr_genes = nr_reads_nr_genes.reindex(samfile.header.references, fill_value=0)
    logger.info("Done calculating number of reads and number of genes per chromosome. Writing to %s",
                par['nrReadsNrGenesPerChrom'])
    nr_reads_nr_genes.reset_index(names="Chr").assign(WellBC=par["barcode"], WellID=par["well_id"])\
        .to_csv(par["nrReadsNrGenesPerChrom"], sep="\t",
                header=True, index=False,
                columns=("WellBC", "WellID", "Chr", "NumberOfReads", "NumberOfGenes"))

    # Number of reads mapped to the reference, grouped by UMI
    nr_read_per_umi = tag_dataframe.groupby('UB').size()\
        .drop("", errors="ignore").sort_values(ascending=False).head(100)
    nr_read_per_umi_df = nr_read_per_umi.to_frame(name="N")
    logger.info("Done calculating number of mapped reads per UMI, writing to %s", par["umiFreqTop"])
    nr_read_per_umi_df.assign(WellBC=par["barcode"], WellID=par["well_id"]).reset_index(names="UB")\
        .to_csv(par["umiFreqTop"], header=True, sep="\t",
                index=False, columns=("WellBC", "WellID", "UB", "N"))

    # Total number of mapped reads and total number of UMIs (not grouped per chromosome)
    nr_reads_and_umi_per_barcode = tag_dataframe.groupby(by="CB").agg(
        NumberOfReads=pd.NamedAgg("CB", "size"),
        nrUMIs=pd.NamedAgg("UB", "nunique")
    )
    logger.info("Done calculating number of mapped reads and number of UMIs per Cell Barcode, writing to %s",
                par["nrReadsNrUMIsPerCB"])
    nr_reads_and_umi_per_barcode.assign(WellBC=par["barcode"], WellID=par["well_id"]).reset_index(names="CB")\
        .to_csv(par["nrReadsNrUMIsPerCB"], sep="\t", header=True,
                index=False, columns=("WellBC", "WellID", "CB", "NumberOfReads", "nrUMIs"))
    logger.info("Finished!")
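The per-chromosome aggregation (count reads that carry a gene tag, count distinct genes, then list every reference from the header with zeros where nothing mapped) can be sketched without pandas or pysam. The function name and the tuple-based input are assumptions made for illustration; the script itself works on a dataframe built from the BAM tags.

```python
from collections import defaultdict


def reads_and_genes_per_chromosome(records, references):
    """Map each reference to (NumberOfReads, NumberOfGenes).

    records: iterable of (chromosome, gene_id or None), one entry per read.
    references: reference names as listed in the alignment header.
    """
    n_reads = defaultdict(int)
    genes = defaultdict(set)
    for chrom, gene_id in records:
        if gene_id is None:
            # Reads without a gene assignment are not counted.
            continue
        n_reads[chrom] += 1
        genes[chrom].add(gene_id)
    # Report every reference from the header, using 0 when nothing mapped.
    # Names that are not in the header (such as "*") are dropped.
    return {ref: (n_reads[ref], len(genes[ref])) for ref in references}


records = [("1", "gene1"), ("1", "gene1"), ("2", "gene2"), ("2", "gene3"), ("*", None)]
per_chrom = reads_and_genes_per_chromosome(records, ["1", "2"])
empty = reads_and_genes_per_chromosome([], ["1", "2"])
```

The second call corresponds to the empty-input case: both references are still reported, each with zero reads and zero genes.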

View File

@@ -0,0 +1,166 @@
import sys
import pytest
import pysam
from uuid import uuid4
from pathlib import Path
from textwrap import dedent
### VIASH START
meta = {
"resources_dir": "./src/stats/generate_well_statistics/",
"executable": "target/executable/stats/generate_well_statistics/generate_well_statistics",
"config": "src/stats/generate_well_statistics/config.vsh.yaml"
}
### VIASH END
def assert_file_content_equals(file_to_check, expected):
with file_to_check.open('r') as open_file:
contents = open_file.read()
assert contents == expected
@pytest.fixture
def input_sam_path():
return Path(meta["resources_dir"]) / "test.sam"
@pytest.fixture
def random_path(tmp_path):
def wrapper(extension=None):
extension = "" if not extension else f".{extension}"
return tmp_path / f"{uuid4()}{extension}"
return wrapper
@pytest.fixture
def random_bam_path(random_path):
def wrapper():
return random_path("bam")
return wrapper
@pytest.fixture
def sam_to_bam(random_bam_path):
def wrapper(sam_file):
out_path = random_bam_path()
with pysam.AlignmentFile(sam_file, "r") as infile, \
pysam.AlignmentFile(out_path, "wb", template=infile) as outfile:
for s in infile:
outfile.write(s)
infile.close()
return out_path
return wrapper
@pytest.fixture
def empty_sam_path():
return Path(meta["resources_dir"]) / "empty.sam"
def test_generate_well_statistics_simple_bam(run_component, input_sam_path, sam_to_bam, random_path):
bam_file = sam_to_bam(input_sam_path)
processed_bam = random_path("tsv")
reads_per_chromosome = random_path("tsv")
nr_reads_nr_umis_per_cb = random_path("tsv")
top_onehundred_umis = random_path("tsv")
run_component([
"--input", bam_file,
"--processedBAMFile", processed_bam,
"--nrReadsNrGenesPerChrom", reads_per_chromosome,
"--nrReadsNrUMIsPerCB", nr_reads_nr_umis_per_cb,
"--umiFreqTop", top_onehundred_umis,
"--barcode", "ACGT",
"--well_id", "A1",
])
for file_path in (processed_bam, reads_per_chromosome,
nr_reads_nr_umis_per_cb, top_onehundred_umis):
assert file_path.is_file()
expected_processed_bam = \
dedent("""\
WellBC WellID Chr CB UB GX GN
ACGT A1 1 ACA CGG gene1 gene1
ACGT A1 1 ACA CGG gene1 gene1
ACGT A1 2 GGG GTT gene2 gene2
ACGT A1 2 GGG GTC gene3 gene3
""")
expected_reads_per_chromosome = \
dedent("""\
WellBC WellID Chr NumberOfReads NumberOfGenes
ACGT A1 1 2 1
ACGT A1 2 2 2
""")
expected_nr_reads_nr_umis_per_cb = \
dedent("""\
WellBC WellID CB NumberOfReads nrUMIs
ACGT A1 ACA 2 1
ACGT A1 GGG 2 2
""")
expected_top_onehundred_umis = \
dedent("""\
WellBC WellID UB N
ACGT A1 CGG 2
ACGT A1 GTC 1
ACGT A1 GTT 1
""")
assert_file_content_equals(processed_bam, expected_processed_bam)
assert_file_content_equals(reads_per_chromosome, expected_reads_per_chromosome)
assert_file_content_equals(nr_reads_nr_umis_per_cb, expected_nr_reads_nr_umis_per_cb)
assert_file_content_equals(top_onehundred_umis, expected_top_onehundred_umis)
def test_empty_sam(run_component, empty_sam_path, sam_to_bam, random_path):
"""
Test an empty BAM file. Make sure that chromosomes without mapped reads
are still represented. Ran into issue https://github.com/pandas-dev/pandas/pull/59258
"""
bam_file = sam_to_bam(empty_sam_path)
processed_bam = random_path("tsv")
reads_per_chromosome = random_path("tsv")
nr_reads_nr_umis_per_cb = random_path("tsv")
top_onehundred_umis = random_path("tsv")
run_component([
"--input", bam_file,
"--processedBAMFile", processed_bam,
"--nrReadsNrGenesPerChrom", reads_per_chromosome,
"--nrReadsNrUMIsPerCB", nr_reads_nr_umis_per_cb,
"--umiFreqTop", top_onehundred_umis,
"--barcode", "ACGT",
"--well_id", "A1",
])
for file_path in (processed_bam, reads_per_chromosome,
nr_reads_nr_umis_per_cb, top_onehundred_umis):
assert file_path.is_file()
expected_processed_bam = \
dedent("""\
WellBC WellID Chr CB UB GX GN
""")
expected_reads_per_chromosome = \
dedent("""\
WellBC WellID Chr NumberOfReads NumberOfGenes
ACGT A1 1 0 0
ACGT A1 2 0 0
""")
expected_nr_reads_nr_umis_per_cb = \
dedent("""\
WellBC WellID CB NumberOfReads nrUMIs
""")
expected_top_onehundred_umis = \
dedent("""\
WellBC WellID UB N
""")
assert_file_content_equals(processed_bam, expected_processed_bam)
assert_file_content_equals(reads_per_chromosome, expected_reads_per_chromosome)
assert_file_content_equals(nr_reads_nr_umis_per_cb, expected_nr_reads_nr_umis_per_cb)
assert_file_content_equals(top_onehundred_umis, expected_top_onehundred_umis)
if __name__ == '__main__':
sys.exit(pytest.main([__file__]))

View File

@@ -0,0 +1,7 @@
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:200
@SQ SN:2 LN:50
test_1 16 1 22 255 1M * 0 0 C I NH:i:1 HI:i:1 nM:i:0 AS:i:47 CR:Z:ACA UR:Z:CGG GX:Z:gene1 GN:Z:gene1 CB:Z:ACA UB:Z:CGG
test_2 16 1 22 255 1M * 0 0 G ! NH:i:1 HI:i:1 nM:i:0 AS:i:47 CR:Z:ACA UR:Z:CGG GX:Z:gene1 GN:Z:gene1 CB:Z:ACA UB:Z:CGG
test_3 0 2 40 255 1M * 0 0 T ! NH:i:1 HI:i:1 nM:i:0 AS:i:47 CR:Z:GGG UR:Z:GTT GX:Z:gene2 GN:Z:gene2 CB:Z:GGG UB:Z:GTT
test_4 0 2 60 255 1M * 0 0 C ! NH:i:1 HI:i:1 nM:i:0 AS:i:47 CR:Z:GGG UR:Z:GTC GX:Z:gene3 GN:Z:gene3 CB:Z:GGG UB:Z:GTC

View File

@@ -0,0 +1,43 @@
name: concatRuns
namespace: utils
description: |
Concatenate well FASTQ files from different runs in order to increase sequencing depth.
arguments:
- name: "--input_r1"
type: file
required: true
multiple: true
- name: "--input_r2"
type: file
required: true
multiple: true
- name: "--sample_id"
type: string
required: true
- name: "--output_r1"
type: file
multiple: true
description: Path to read 1 fastq/fasta file
direction: output
- name: "--output_r2"
type: file
multiple: true
description: Path to read 2 fastq/fasta file
direction: output
resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf
dependencies:
- name: concat_text
repository: cb
repositories:
- name: cb
type: vsh
repo: craftbox
tag: v0.3.0
runners:
- type: nextflow
engines:
- type: native

View File

@@ -0,0 +1,128 @@
workflow run_wf {
take:
input_ch
main:
// Count the number of input events per sample
// Results from events with the same sample ID need to be concatenated.
event_counts_ch = input_ch
| map {id, state ->
def new_state = state + ["event_id": id]
def new_event = [state.sample_id, new_state]
return new_event
}
| groupTuple(by: 0)
| flatMap { id, states ->
def orig_event_ids = states.collect{it.event_id}
def new_events = orig_event_ids.collect{ orig_event_id ->
[orig_event_id, ["n_events": states.size()]]
}
return new_events
}
// The number of events per sample is passed to `groupTuple()` (through `groupKey`)
// so that it can emit each group as soon as it is complete. This makes sure
// that the samples are processed asynchronously.
output_ch = input_ch.join(event_counts_ch)
| flatMap {id, state_demultiplex, state_event_counts ->
assert state_demultiplex.input_r1.size() == state_demultiplex.input_r2.size(),
"Expected output from well demultiplexing to contain an equal number of forward and reverse FASTQ files."
def new_states = [state_demultiplex.input_r1, state_demultiplex.input_r2].transpose().collect{ fastq_files ->
def (r1_file, r2_file) = fastq_files
def regex = ~/^(\w+)_R[12]{1}_001\.fastq(\.gz)?$/
def parsed_file_name = r1_file.name =~ regex
def parsed_file_name_r2 = r2_file.name =~ regex
def well_id = parsed_file_name[0][1]
def well_id_r2 = parsed_file_name_r2[0][1]
assert (well_id.length() != 0) && (well_id == well_id_r2)
def new_state = state_demultiplex + [
"input_r1": r1_file,
"input_r2": r2_file,
"event_id": id,
]
def group_settings = groupKey("${state_demultiplex.sample_id}_${well_id}", state_event_counts.n_events)
return [group_settings, new_state]
}
return new_states
}
| groupTuple(by: 0, sort: "hash", remainder: true)
| map {group_settings, sample_states ->
def input_r1 = sample_states.collect{it.input_r1}.flatten()
def input_r2 = sample_states.collect{it.input_r2}.flatten()
def event_ids = sample_states.collect{it.event_id}
def sample_id_list = sample_states.collect{it.sample_id}.unique()
assert sample_id_list.size() == 1
def sample_id = sample_id_list[0]
assert input_r1.size() == input_r2.size()
def new_state = [
"input_r1": input_r1,
"input_r2": input_r2,
"event_id": event_ids,
"sample_id": sample_id,
]
return [group_settings.target, new_state]
}
| concat_text.run(
directives: [label: ["lowmem", "lowcpu"]],
key: "concat_samples_r1",
runIf: {id, state -> state.input_r1.size() > 1},
fromState: { id, state ->
def output_file_name = state.input_r1[0].name
[
input: state.input_r1,
gzip_output: false,
output: output_file_name
]
},
toState: { id, result, state ->
def newState = state + [ input_r1: [ result.output ] ]
return newState
}
)
| concat_text.run(
directives: [label: ["lowmem", "lowcpu"]],
key: "concat_samples_r2",
runIf: {id, state -> state.input_r2.size() > 1},
fromState: { id, state ->
def output_file_name = state.input_r2[0].name
[
input: state.input_r2,
gzip_output: false,
output: output_file_name
]
},
toState: { id, result, state ->
def newState = state + [ input_r2: [ result.output ] ]
return newState
}
)
| map {id, state ->
def new_state = [state.sample_id, state]
return new_state
}
| groupTuple(by: 0, sort: 'hash')
| map {id, states ->
def new_state = [
"input_r1": states.collect{it.input_r1}.flatten(),
"input_r2": states.collect{it.input_r2}.flatten(),
"_meta": ["join_id": states[0].event_id[0]]
]
return [id, new_state]
}
| setState(
[
"output_r1": "input_r1",
"output_r2": "input_r2",
"_meta": "_meta"
]
)
emit:
output_ch
}
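The grouping key used in this workflow combines the sample ID with a well ID parsed from the FASTQ file name. The following Python sketch shows that parsing and grouping step in isolation. It is not a translation of the Nextflow code: the function names are hypothetical, events are reduced to lists of file names, and the asynchronous emission handled by `groupKey` is not modelled.

```python
import re
from collections import defaultdict

# Same pattern as in the workflow, written for Python's `re` module.
WELL_REGEX = re.compile(r"^(\w+)_R[12]_001\.fastq(\.gz)?$")


def well_id(file_name):
    """Extract the well ID from a demultiplexed FASTQ file name."""
    match = WELL_REGEX.match(file_name)
    if match is None:
        raise ValueError(f"{file_name} does not look like a well FASTQ file")
    return match.group(1)


def group_by_sample_and_well(events):
    """events: iterable of (sample_id, list of R1 file names)."""
    groups = defaultdict(list)
    for sample_id, r1_files in events:
        for name in r1_files:
            groups[f"{sample_id}_{well_id(name)}"].append(name)
    return dict(groups)


events = [
    ("pool1", ["A1_R1_001.fastq", "B2_R1_001.fastq"]),  # first run
    ("pool1", ["A1_R1_001.fastq", "B2_R1_001.fastq"]),  # second run
]
grouped = group_by_sample_and_well(events)
```

Each group then holds one file per run for a given well, which is the set of files that would be concatenated.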

View File

@@ -0,0 +1,45 @@
name: listInputDir
namespace: utils
description: List the contents of a directory and parse contained fastq files
arguments:
- name: "--input"
alternatives: [-i]
type: file
description: Path to the directory containing fastq files
required: true
example: fastq_dir
- name: --pools
description: "Pool names to include. By default all pools are selected for analysis."
type: string
multiple: true
- name: "--r1_output"
type: file
description: Path to read 1 fastq/fasta file
direction: output
- name: "--r2_output"
type: file
description: Path to read 2 fastq/fasta file
direction: output
- name: "--lane"
type: string
description: Lane number
direction: output
- name: "--sample"
type: string
description: Sample number
direction: output
- name: "--sample_id"
type: string
description: Sample name
direction: output
resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf
runners:
- type: nextflow
engines:
- type: native

View File

@@ -0,0 +1,72 @@
workflow run_wf {
take: in_
main:
out_ = in_
| flatMap{ id, state ->
println "Looking for fastq files in ${state.input}"
def allFastqs = state.input
.listFiles()
.findAll{
it.isFile() &&
it.name ==~ /^.+\.fastq\.gz$|^.+\.fastq$|^.+\.fasta$/
}
println "Found ${allFastqs.size()} fastq/fasta files in ${state.input}"
assert allFastqs.size() > 0: "No fastq/fasta files found"
println("Extracting information from fastq/fasta filenames")
def processed_fastqs = allFastqs.collect { f ->
def regex = ~/^(\S+)_S(\d+)_(L(\d+)_)?R(\d)_(\d+)\.fast[qa](\.gz)?$/
def validFastq = f.name ==~ regex
assert validFastq: "${f} does not match the regex ${regex}"
def parsedFastq = f.name =~ regex
def lane = parsedFastq[0][3]
// Remove the trailing '_'
def lane_remove_trailing = lane == null ? "" : lane.replaceAll('_$', "")
def sample_id = parsedFastq[0][1]
if (sample_id in ["Undetermined"] || (state.pools && !state.pools.isEmpty() && !state.pools.contains(sample_id))) {
return null
}
return [
"fastq": f,
"sample_id": sample_id,
"sample": parsedFastq[0][2],
"lane": lane_remove_trailing,
"read": parsedFastq[0][5],
]
}
println("Group paired fastq/fasta files")
def grouped = processed_fastqs
.findAll{it != null}
.groupBy({it.sample_id}, {it.lane})
.collectMany{ sample_id, states_per_lane ->
def result = states_per_lane.collect{lane, lane_states ->
assert lane_states.size() == 2, "Expected to find two fastq files per lane! " +
"Found ${lane_states.size()}. State: ${states_per_lane}"
def r1_state = lane_states.find({it.read == "1"})
def r2_state = lane_states.find({it.read == "2"})
def fastq_state = [
"r1_output": r1_state.fastq,
"r2_output": r2_state.fastq
]
def new_state = fastq_state +
r1_state.findAll{it.key in ["sample_id", "sample", "lane"]} +
["_meta": ["join_id": id]]
def new_id = lane?.trim() ? "${sample_id}_${lane}".toString() : sample_id
return [new_id, new_state]
}
return result
}
return grouped
}
emit: out_
}
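The file name pattern used above to extract the sample ID, sample number, lane and read can be tried out in Python. This is a sketch under assumptions: `parse_fastq_name` is a hypothetical helper, the example file names are made up, and only the parsing step is shown (the pool filtering and pairing of R1/R2 are left out).

```python
import re

# Same pattern as in the workflow, written for Python's `re` module.
FASTQ_REGEX = re.compile(r"^(\S+)_S(\d+)_(L(\d+)_)?R(\d)_(\d+)\.fast[qa](\.gz)?$")


def parse_fastq_name(file_name):
    """Split an Illumina-style FASTQ file name into its components."""
    match = FASTQ_REGEX.match(file_name)
    if match is None:
        raise ValueError(f"{file_name} does not match {FASTQ_REGEX.pattern}")
    lane = match.group(3)  # e.g. "L001_", or None when the lane is absent
    return {
        "sample_id": match.group(1),
        "sample": match.group(2),
        # The lane group ends in exactly one '_', which is removed here.
        "lane": "" if lane is None else lane.rstrip("_"),
        "read": match.group(5),
    }


with_lane = parse_fastq_name("poolA_S1_L001_R1_001.fastq.gz")
without_lane = parse_fastq_name("poolA_S1_R2_001.fastq")
```

File names without a lane component yield an empty lane string, which matches how the workflow then falls back to the plain sample ID as the event ID.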

View File

@@ -0,0 +1,47 @@
name: save_params
namespace: utils
description: |
Save parameters to a YAML file
argument_groups:
- name: Inputs
arguments:
- name: "--id"
description: |
The id of the job
type: string
required: true
- name: "--params_yaml"
description: |
base64 encoded yaml containing the state
type: string
required: true
- name: Outputs
arguments:
- name: "--output"
description: |
The output YAML file
type: file
direction: output
required: true
example: "output.yaml"
resources:
- type: python_script
path: script.py
engines:
- type: docker
image: python:3.12-slim
setup:
- type: apt
packages:
- procps
- type: python
packages:
- pyyaml
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,28 @@
import yaml
import base64

## VIASH START
par = {
    "id": "sample_one",
    "params_yaml": "cGFyYW1zX3lhbWw6IHt9Cg==",
    "output": "output.yaml"
}
## VIASH END


class Dumper(yaml.Dumper):
    # Indent list items nested under a mapping key
    def increase_indent(self, flow=False, indentless=False):
        return super(Dumper, self).increase_indent(flow, False)


def decode_params_yaml(encoded_yaml):
    # The workflow passes the state as base64-encoded YAML
    yaml_bytes = base64.b64decode(encoded_yaml)
    yaml_string = yaml_bytes.decode('utf-8')
    yaml_data = yaml.safe_load(yaml_string)
    return yaml_data


params = decode_params_yaml(par['params_yaml'])
with open(par["output"], 'w') as f:
    yaml.dump(params, f, default_flow_style=False, Dumper=Dumper)
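The `--params_yaml` argument carries YAML text that has been base64 encoded. The round trip can be checked with the standard library alone, as in the sketch below. The helper names are hypothetical, and the YAML parsing step (done with PyYAML in the script) is deliberately left out so that the example has no third-party dependency.

```python
import base64


def encode_params(yaml_string):
    """Encode YAML text the way it is expected on the command line."""
    return base64.b64encode(yaml_string.encode("utf-8")).decode("ascii")


def decode_params(encoded_yaml):
    """Recover the YAML text from its base64 representation."""
    return base64.b64decode(encoded_yaml).decode("utf-8")


# The placeholder value from the script's VIASH START block.
decoded = decode_params("cGFyYW1zX3lhbWw6IHt9Cg==")
round_trip = encode_params(decoded)
```

Decoding the placeholder yields a one-line YAML document with a single empty mapping under the key `params_yaml`.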

View File

@@ -0,0 +1,144 @@
name: htrnaseq
namespace: workflows
authors:
- __merge__: /src/base/authors/dries_schaumont.yaml
roles: [ maintainer ]
argument_groups:
- name: Input arguments
arguments:
- name: --input_r1
description: |
Forward reads in FASTQ format. Multiple files corresponding to different lanes can be provided; these will
be demultiplexed separately before the results are joined for each individual well.
type: file
required: true
multiple: true
- name: --input_r2
description: |
Reverse reads in FASTQ format. Multiple files corresponding to different lanes can be provided; these will
be demultiplexed separately before the results are joined for each individual well.
type: file
required: true
multiple: true
- name: --barcodesFasta
type: file
required: true
- name: "--umi_length"
description: |
Length of the UMI sequences
type: integer
min: 1
default: 10
- name: --genomeDir
type: file
required: true
- name: --annotation
type: file
required: true
- name: --sample_id
type: string
required: false
description: |
Sample ID for the provided input files. If not provided, the value of --id
will be used. Input files will always be demultiplexed separately,
but the FASTQs for wells with matching sample IDs will be concatenated before mapping.
- name: Output arguments
arguments:
- name: "--fastq_output"
description: "Directory containing output fastq files"
type: file
multiple: true
required: true
default: "fastq/*"
direction: output
- name: --star_output
description: Output from mapping with STAR
type: file
direction: output
multiple: true
required: true
default: star.$id/*
- name: "--nrReadsNrGenesPerChrom"
type: file
direction: output
required: true
default: "nrReadsNrGenesPerChrom.$id.txt"
- name: "--star_qc_metrics"
type: file
direction: output
required: true
default: "starLogs.$id.txt"
- name: "--eset"
type: file
direction: output
required: true
default: eset.$id.rds
- name: "--f_data"
type: file
direction: output
required: true
default: fData.$id.tsv
- name: "--p_data"
type: file
direction: output
required: true
default: pData.$id.tsv
- name: "--html_report"
type: file
direction: output
required: true
default: report.$id.html
- name: "--run_params"
type: file
direction: output
required: false
default: params.$id.yaml
resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf
test_resources:
- type: nextflow_script
path: test.nf
entrypoint: test_wf
- type: nextflow_script
path: test.nf
entrypoint: test_wf2
dependencies:
- name: stats/combine_star_logs
repository: local
- name: stats/generate_pool_statistics
repository: local
- name: stats/generate_well_statistics
repository: local
- name: workflows/well_demultiplex
repository: local
- name: workflows/well_metadata
repository: local
- name: parallel_map
repository: local
- name: eset/create_eset
repository: local
- name: eset/create_fdata
repository: local
- name: eset/create_pdata
repository: local
- name: report/create_report
repository: local
- name: utils/concatRuns
repository: local
- name: utils/save_params
repository: local
repositories:
- name: local
type: local
- name: bb
type: vsh
repo: biobox
tag: v0.3.1
runners:
- type: nextflow
engines:
- type: native

@@ -0,0 +1,33 @@
#!/bin/bash
# Get the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)
# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"
# Make sure the workflow is built
viash ns build --setup cb --parallel
export NXF_VER=24.04.4
set -eo pipefail
nextflow \
run . \
-main-script src/workflows/htrnaseq/test.nf \
-config ./src/config/labels.config \
-entry test_wf \
-resume \
-profile docker,local \
--publish_dir output
nextflow \
run . \
-main-script src/workflows/htrnaseq/test.nf \
-config ./src/config/labels.config \
-entry test_wf2 \
-resume \
-profile docker,local \
--publish_dir output2

@@ -0,0 +1,359 @@
workflow run_wf {
take:
raw_ch
main:
input_ch = raw_ch
// Use the event ID as the default for the sample ID
| map {id, state ->
def sample_id = state.sample_id ?: id
def newState = state + ["sample_id": sample_id, "run_id": id]
return [id, newState]
}
| save_params.run(
runIf: { id, state ->
state.run_params != null
},
fromState: {id, state ->
// Declare the closure variable first so that the closure can call itself recursively
def convertPaths
convertPaths = { value ->
if (value instanceof java.nio.file.Path)
return value.toUriString()
else if (value instanceof List)
return value.collect { convertPaths(it) }
else if (value instanceof Collection)
throw new UnsupportedOperationException("Collections other than Lists are not supported")
else
return value
}
// Apply conversion to all state values
def convertedState = state.collectEntries { k, v -> [(k): convertPaths(v)] }
def yaml = new org.yaml.snakeyaml.Yaml()
def yamlString = yaml.dump(convertedState)
def encodedYaml = yamlString.bytes.encodeBase64().toString()
return [
"id": id,
"params_yaml": encodedYaml,
"output": state.run_params
]
},
toState: ["run_params": "output"]
)
// The featureData has only one requirement: the genome annotation,
// so it can be generated straight away. Most of the time, one annotation is shared
// by all of the inputs and the fData should only be calculated once.
// The state is manipulated such that one event is created per unique
// input annotation file. The featureData file can then be joined back into the original
// input channel, which allows it to be shared across events if required.
f_data_ch = input_ch
| toSortedList()
| flatMap {ids_and_states ->
def annotation_files = ids_and_states.inject([:]){ old_state, id_and_state ->
def (id, state) = id_and_state
def annotation_file = state.annotation
def new_state = old_state + [(annotation_file): (old_state.getOrDefault(annotation_file, []) + [id])]
return new_state
}
def file_names = annotation_files.keySet().collect{it.name}
assert (file_names.toSet().size() == file_names.size()),
"Please make sure that the annotation files have unique file names."
def new_states = annotation_files.collect{annotation_file, value ->
def new_state = [annotation_file.name , ["annotation": annotation_file, "event_ids": value]]
return new_state
}
return new_states
}
| create_fdata.run(
directives: [label: ["lowmem", "lowcpu"]],
fromState: [
"gtf": "annotation",
"output": "f_data"
],
toState: ["f_data": "output"]
)
| flatMap {_, state ->
def new_states = state.event_ids.collect{event_id ->
[event_id, ["f_data": state.f_data]]
}
return new_states
}
// Demultiplex the input FASTQ files per well.
demultiplex_ch = input_ch
| well_demultiplex.run(
fromState: [
"input_r1": "input_r1",
"input_r2": "input_r2",
"barcodesFasta": "barcodesFasta",
],
toState: {id, result, state ->
def all_fastq = result.output_r1 + result.output_r2
def output_dir = all_fastq.collect{it.parent}.unique()
assert output_dir.size() == 1, "Expected output from well demultiplexing (id $id) to reside in one directory. Found: $output_dir"
def new_state = state + [
"input_r1": result.output_r1,
"input_r2": result.output_r2,
"fastq_output_directory": output_dir[0],
]
return new_state
}
)
fastq_output_directory_ch = demultiplex_ch
| map {id, state ->
def new_event = [state.sample_id, state]
return new_event
}
| groupTuple(by: 0, sort: "hash")
| map {id, states ->
def fastq_output_dirs = states.collect{it.fastq_output_directory}
def new_state = ["fastq_output_directory": fastq_output_dirs]
def new_event = [id, new_state]
return new_event
}
concat_samples_ch = demultiplex_ch.join(f_data_ch)
| map {id, demultiplex_state, f_data_state ->
def newState = demultiplex_state + ["f_data": f_data_state["f_data"]]
[id, newState]
}
| concatRuns.run(
fromState: [
"input_r1": "input_r1",
"input_r2": "input_r2",
"sample_id": "sample_id",
],
toState: {id, result, state ->
def state_overwite = [
"input_r1": result.output_r1,
"input_r2": result.output_r2,
"_meta": ["join_id": state.run_id]
]
return state + state_overwite
}
)
pool_ch = concat_samples_ch.join(fastq_output_directory_ch)
| map {id, concat_state, fastq_output_directory_state ->
def new_state = concat_state + fastq_output_directory_state
return [id, new_state]
}
| parallel_map.run(
directives: ["label": ["highmem", "lowcpu"]],
fromState: {id, state ->
[
"input_r1": state.input_r1,
"input_r2": state.input_r2,
"barcodesFasta": state.barcodesFasta,
"umiLength": state.umi_length,
"output": state.star_output[0],
"genomeDir": state.genomeDir,
]
},
toState: [
"star_output": "output",
]
)
// Split the events from 1 event per pool into events per well
// and add extra metadata about the wells to the state.
| well_metadata.run(
fromState: [
"barcodesFasta": "barcodesFasta",
"input_r1": "input_r1",
"input_r2": "input_r2",
"star_mapping": "star_output"
],
toState: [
"input_r1": "output_r1",
"input_r2": "output_r2",
"pool": "pool",
"well_id": "well_id",
"barcode": "barcode",
"lane": "lane",
"n_wells": "n_wells",
"star_mapping": "well_star_mapping",
]
)
// Use the bam file to generate statistics
| generate_well_statistics.run(
directives: [label: ["verylowmem", "verylowcpu"]],
fromState: { id, state ->
[
"input": state.star_mapping.resolve('Aligned.sortedByCoord.out.bam'),
"barcode": state.barcode,
"well_id": state.well_id,
]
},
toState: [
"nrReadsNrGenesPerChromWell": "nrReadsNrGenesPerChrom",
]
)
// Join the events back to pool-level
| map {id, state ->
// Create a special groupKey, such that groupTuple
// knows when all the barcodes have been grouped into 1 event.
// This way the processing is as distributed as possible.
def key = groupKey(state.pool, state.n_wells)
def newEvent = [key, state]
return newEvent
}
// Use a custom sorting function because sort: 'hash'
// requires a hash to be calculated on every entry of the state.
// This is inefficient when the number of events is large
// (i.e. a large number of barcodes).
// Sorting on the lexicographical order of the barcodes is sufficient here.
| groupTuple(sort: {a, b -> a.barcode <=> b.barcode})
| map {id, states ->
// Gather the keys from all states. For some state items,
// we need to gather all the different items from across the states.
def barcodes = states.collect{it.barcode}
assert barcodes.clone().unique().size() == barcodes.size(), \
"Error when gathering information for pool ${id}, barcodes are not unique!"
def well_ids = states.collect{it.well_id}
assert well_ids.clone().unique().size() == well_ids.size(), \
"Error when gathering information for pool ${id}, well IDs are not unique!"
def custom_state = [
"input_r1": states.collect{it.input_r1},
"input_r2": states.collect{it.input_r2},
"barcode": barcodes,
"well_id": well_ids,
"star_mapping": states.collect{it.star_mapping},
// Well and pool stats should be carefully kept separate.
// The workflow argument points to the name for the pool statistics:
"nrReadsNrGenesPerChromWell": states.collect{it.nrReadsNrGenesPerChromWell},
"nrReadsNrGenesPerChromPool": states[0].nrReadsNrGenesPerChrom
]
// For many state items, the value is the same across states.
def other_state_keys = states.inject([].toSet()){ current_keys, state ->
def new_keys = current_keys + state.keySet()
return new_keys
}.minus(custom_state.keySet())
// All other state items should have a single unique value
def old_state_items = other_state_keys.inject([:]){ old_state, argument_name ->
def argument_values = states.collect{it.get(argument_name)}.unique()
assert argument_values.size() == 1, "Arguments should be the same across modalities. Please report this \
as a bug. Argument name: $argument_name, \
argument value: $argument_values"
def argument_value
argument_values.each { argument_value = it }
def current_state = old_state + [(argument_name): argument_value]
return current_state
}
def new_state = custom_state + old_state_items
[id.getGroupTarget(), new_state]
}
pool_statistics_ch = pool_ch
| generate_pool_statistics.run(
directives: ["label": ["lowmem", "verylowcpu"]],
fromState: [
"nrReadsNrGenesPerChrom": "nrReadsNrGenesPerChromWell",
"nrReadsNrGenesPerChromPool": "nrReadsNrGenesPerChromPool"
],
toState: [
"nrReadsNrGenesPerChromPool": "nrReadsNrGenesPerChromPool"
]
)
// The statistics from the STAR logs of different wells are joined
// on pool level
star_logs_ch = pool_ch
| combine_star_logs.run(
directives: ["label": ["lowmem", "verylowcpu"]],
fromState: {id, state -> [
"star_logs": state.star_output.collect{it.resolve("Log.final.out")},
"gene_summary_logs": state.star_output.collect{it.resolve("Solo.out/Gene/Summary.csv")},
"reads_per_gene_logs": state.star_output.collect{it.resolve("ReadsPerGene.out.tab")},
"barcodes": state.barcode,
"output": state.star_qc_metrics
]
},
toState: [
"star_qc_metrics": "output",
]
)
eset_ch = star_logs_ch.join(pool_statistics_ch, remainder: true)
| map {id, star_logs_state, pool_statistics_state ->
def newState = star_logs_state + ["nrReadsNrGenesPerChromPool": pool_statistics_state.nrReadsNrGenesPerChromPool]
return [id, newState]
}
| create_pdata.run(
directives: [label: ["lowmem", "lowcpu"]],
fromState: [
"star_stats_file": "star_qc_metrics",
"nrReadsNrGenesPerChromPool": "nrReadsNrGenesPerChromPool",
"output": "p_data"
],
toState: ["p_data": "output"],
)
| create_eset.run(
directives: [label: ["lowmem", "lowcpu"]],
fromState: [
"pDataFile": "p_data",
"fDataFile": "f_data",
"mappingDir": "star_output",
"output": "eset",
"barcodes": "barcode",
"poolName": "pool",
],
toState: [
"eset": "output",
]
)
report_channel = eset_ch
| toSortedList()
| map {ids_and_states ->
def states = ids_and_states.collect{it[1]}
def html_report = states[0].html_report
def ids = ids_and_states.collect{it[0]}
def esets = states.collect{it.eset}
["report", ["esets": esets, "html_report": html_report, "original_ids": ids]]
}
| create_report.run(
fromState: [
"eset": "esets",
"output_report": "html_report",
],
toState: [
"html_report": "output_report"
]
)
| flatMap {id, state ->
state.original_ids.collect{original_id ->
[original_id, ["html_report": state.html_report]]
}
}
output_ch = eset_ch.join(report_channel)
| map {id, state_eset, state_report ->
def new_state = state_eset + [
"html_report": state_report.html_report,
]
[id, new_state]
}
| setState([
"star_output": "star_output",
"fastq_output": "fastq_output_directory",
"nrReadsNrGenesPerChrom": "nrReadsNrGenesPerChromPool",
"star_qc_metrics": "star_qc_metrics",
"eset": "eset",
"f_data": "f_data",
"p_data": "p_data",
"html_report": "html_report",
"run_params": "run_params",
"_meta": "_meta",
])
emit:
output_ch
}

@@ -0,0 +1,8 @@
params {
rootDir = java.nio.file.Paths.get("$projectDir/../../../").toAbsolutePath().normalize().toString()
}
// include common settings
includeConfig("${params.rootDir}/src/config/labels.config")

@@ -0,0 +1,70 @@
nextflow.enable.dsl=2
targetDir = params.rootDir + "/target/nextflow"
include { htrnaseq } from targetDir + "/workflows/htrnaseq/main.nf"
include { check_eset } from targetDir + "/integration_test_components/htrnaseq/check_eset/main.nf"
params.resources_test = "gs://viash-hub-test-data/htrnaseq/v1/"
workflow test_wf {
resources_test_file = file(params.resources_test)
input_ch = Channel.fromList([
[
id: "sample_one",
input_r1: resources_test_file.resolve("100k/SRR14730301/VH02001612_S9_R1_001.fastq"),
input_r2: resources_test_file.resolve("100k/SRR14730301/VH02001612_S9_R2_001.fastq"),
genomeDir: resources_test_file.resolve("genomeDir/gencode.v41.star.sparse"),
barcodesFasta: resources_test_file.resolve("360-wells-with-ids.fasta"),
annotation: resources_test_file.resolve("genomeDir/gencode.v41.annotation.gtf.gz")
],
[
id: "sample_two",
input_r1: resources_test_file.resolve("100k/SRR14730302/VH02001614_S8_R1_001.fastq"),
input_r2: resources_test_file.resolve("100k/SRR14730302/VH02001614_S8_R2_001.fastq"),
genomeDir: resources_test_file.resolve("genomeDir/gencode.v41.star.sparse"),
barcodesFasta: resources_test_file.resolve("360-wells-with-ids.fasta"),
annotation: resources_test_file.resolve("genomeDir/gencode.v41.annotation.gtf.gz")
]
])
| map{ state -> [state.id, state] }
| view { "Input: $it" }
| htrnaseq.run(
toState: [
"eset": "eset",
"star_output": "star_output",
]
)
| check_eset.run(
runIf: {id, state -> id == "sample_one"},
toState: [
"eset": "eset",
"star_output": "star_output"
]
)
}
workflow test_wf2 {
// Test the edge case where one of the barcodes has no reads
resources_test_file = file(params.resources_test)
input_ch = Channel.fromList([
[
id: "sample_one",
input_r1: resources_test_file.resolve("100k/SRR14730301/VH02001612_S9_R1_001.fastq"),
input_r2: resources_test_file.resolve("100k/SRR14730301/VH02001612_S9_R2_001.fastq"),
genomeDir: resources_test_file.resolve("genomeDir/gencode.v41.star.sparse"),
barcodesFasta: resources_test_file.resolve("2-wells-1-no-reads.fasta"),
annotation: resources_test_file.resolve("genomeDir/gencode.v41.annotation.gtf.gz")
],
])
| map{ state -> [state.id, state] }
| view { "Input: $it" }
| htrnaseq.run(
toState: [
"eset": "eset",
"star_output": "star_output",
]
)
}

@@ -0,0 +1,126 @@
name: runner
namespace: workflows
description: Runner for HT RNA-seq pipeline
argument_groups:
- name: Input arguments
arguments:
- name: --input
description: |
Base directory of the form `s3://<bucket>/Sequencing/<Sequencer>/<RunID>/<demultiplex_dir>`.
Must contain FASTQ files in the format `PoolName_S*_L*_R1_001.fastq.gz` where
* PoolName is a unique ID for the microwell plates or combination thereof.
* S followed by a running number: the sample number based on the order
that samples are listed in the sample sheet (that was used to demultiplex the pools)
starting with 1 (e.g. S1)
* (Optional) the lane number (e.g. L001)
* _001 fixed suffix.
type: file
required: true
- name: --barcodesFasta
type: file
required: true
- name: --genomeDir
type: file
required: true
- name: --annotation
type: file
required: true
- name: --pools
description: |
Filter the FASTQ files in the input directory to only include pools from the provided list.
Pool names are inferred from the FASTQ file names (see input argument for more information).
By default all pools are included.
type: string
multiple: true
- name: "--umi_length"
description: |
Length of the UMI sequences
type: integer
min: 1
default: 10
- name: Metadata arguments
arguments:
- name: --id
description: Unique identifier for the run
type: string
- name: --project_id
description: Project ID
type: string
required: true
- name: --experiment_id
description: Experiment ID
type: string
required: true
- name: Publish arguments
arguments:
- name: --fastq_publish_dir
type: string
required: true
- name: --results_publish_dir
type: string
required: true
- name: Output arguments
description: |
Parameters that determine the structure of the output. These parameters are provided for internal use only
and their defaults should not be overridden.
arguments:
- name: "--run_params"
type: file
direction: output
default: params.yaml
- name: "--star_output_dir"
type: file
direction: output
default: "star_output"
- name: "--nrReadsNrGenesPerChrom_dir"
type: file
direction: output
default: "nrReadsNrGenesPerChrom"
- name: "--star_qc_metrics_dir"
type: file
direction: output
default: "starLogs"
- name: "--eset_dir"
type: file
direction: output
default: "esets"
- name: "--f_data_dir"
type: file
direction: output
default: "fData"
- name: "--p_data_dir"
type: file
direction: output
default: "pData"
resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf
- path: disable_publishfiles_process.config
test_resources:
- type: nextflow_script
path: test.nf
entrypoint: test_wf
dependencies:
- name: utils/listInputDir
repository: local
- name: workflows/htrnaseq
repository: local
- name: io/publish_fastqs
repository: local
- name: io/publish_results
repository: local
- name: utils/save_params
repository: local
runners:
- type: nextflow
config:
script:
- includeConfig("disable_publishfiles_process.config")
engines:
- type: native
