Build branch openpipeline_composed/set-public-scope with version set-public-scope to openpipeline_composed on branch set-public-scope (5068271)

Build pipeline: openpipelines-bio.openpipeline-composed.main-vnhhz

Source commit: 50682718d6

Source message: Merge branch 'set-public-scope' of github.com:openpipelines-bio/openpipeline_composed into set-public-scope
CI
2026-04-08 12:30:26 +00:00
commit 432d646588
460 changed files with 298208 additions and 0 deletions

26
.gitignore vendored Normal file

@@ -0,0 +1,26 @@
# IDEs and editors
/.idea
.project
.classpath
*.launch
.settings/
.vscode
# Temp
gitignore
test_results
# System Files
.DS_Store
Thumbs.db
# Nextflow
work
.nextflow*
trace-*.txt
# viash
/resources_test/
# pycache
*__pycache__*

19
CHANGELOG.md Normal file

@@ -0,0 +1,19 @@
# openpipeline_composed x.x.x
## MINOR CHANGES
* `workflows/single_cell/process_integrate_annotate`: Set scope to `private` (PR #6).
* Bump `openpipeline` dependency version to `v4.0.4` (PR #9).
* Bump `viash` version to `0.9.7` (PR #10).
# openpipeline_composed 0.1.1
## MINOR CHANGES
* Add a README (PR #4).
# openpipeline_composed 0.1.0
Initial release containing a single-cell meta-workflow to process single-cell omics samples and perform batch integration and/or label projection.

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 openpipelines-bio
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

42
README.md Normal file

@@ -0,0 +1,42 @@
# OpenPipeline Composed
OpenPipeline Composed provides a comprehensive meta-workflow that combines multiple stand-alone workflows from the [OpenPipeline](https://github.com/openpipelines-bio/openpipeline/) package. The meta-workflow combines sample processing, batch integration, and cell type annotation into a unified pipeline for single-cell multi-omics data analysis.
[![ViashHub](https://img.shields.io/badge/ViashHub-openpipeline_composed-7a4baa.svg)](https://www.viash-hub.com/packages/openpipeline_composed)
[![GitHub](https://img.shields.io/badge/GitHub-openpipelines--bio%2Fopenpipeline_composed-blue.svg)](https://github.com/openpipelines-bio/openpipeline_composed)
[![GitHub License](https://img.shields.io/github/license/openpipelines-bio/openpipeline_composed.svg)](https://github.com/openpipelines-bio/openpipeline_composed/blob/main/LICENSE)
[![GitHub Issues](https://img.shields.io/github/issues/openpipelines-bio/openpipeline_composed.svg)](https://github.com/openpipelines-bio/openpipeline_composed/issues)
[![Viash version](https://img.shields.io/badge/Viash-v0.9.7-blue.svg)](https://viash.io)
## Overview
The sole purpose of this package is to provide a meta-workflow that orchestrates and combines various stand-alone workflows from the [OpenPipeline](https://github.com/openpipelines-bio/openpipeline/) package. By integrating multiple processing steps into a single workflow, it enables seamless processing from raw data to fully annotated, integrated datasets suitable for downstream analysis and atlas generation.
## Functionality
The meta-workflow combines three core OpenPipeline workflows:
- [**Sample Processing**](https://www.viash-hub.com/packages/openpipeline/latest/components/workflows/multiomics/process_samples): Initial quality control, filtering, and preprocessing
- [**Batch Integration**](https://www.viash-hub.com/packages/openpipeline/latest/components?search=workflows%2Fintegration): Integration using **Harmony** or **scVI** methods
- [**Cell Type Annotation**](https://www.viash-hub.com/packages/openpipeline/latest/components?search=workflows%2Fannotation): Annotation using **scANVI** or **CellTypist** methods
## Key Features
- 🔄 **End-to-End Processing**: Complete pipeline from raw data to annotated results
- 📊 **Atlas Generation**: Create comprehensive atlases from multiple datasets and sources
- 🔬 **Multi-Modal Support**: Process RNA-seq, ATAC-seq, protein, and spatial data
- 🎯 **Method Flexibility**: Choose from multiple integration and annotation approaches
- 🧬 **Reference Integration**: Leverage existing reference datasets for annotation
## Execution via CLI or Seqera Cloud
The openpipeline_composed package is available via [Viash Hub](https://www.viash-hub.com/packages/openpipeline_composed/latest/), where you can find instructions on how to run the end-to-end workflow as well as individual subworkflows or components.
The workflow can also be run directly from Seqera Cloud: the necessary Nextflow schema files are built and shipped with the workflows to support the form-based input. However, Seqera Cloud cannot handle multiple-value parameters, which are needed to batch-process multiple samples. It is therefore better to launch the workflow on Seqera Cloud through Viash Hub as well:
* Navigate to the [Viash Hub package page](https://www.viash-hub.com/packages/openpipeline_composed/latest/), select the workflow you want to launch, and click the `launch` button.
* Select the execution environment of choice (e.g. `Seqera Cloud`, `Nextflow` or `Executable`).
* Fill in the form with the required parameters and launch the workflow.
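As a rough illustration of the CLI route, the sketch below writes a multi-sample parameter file and assembles a `nextflow run` command for the meta-workflow. It is not taken from the package documentation: the package URL, the release tag `v0.1.1`, the `-main-script` path, the `docker` profile, and every parameter name (`id`, `input`, `publish_dir`) and file path are assumptions modelled on how OpenPipeline workflows are commonly invoked, so check the Viash Hub page for the actual values. The command is only printed unless `RUN=1` is set.

```shell
#!/bin/bash
set -e

# Hypothetical release tag; look up the real one on Viash Hub.
VERSION="${VERSION:-v0.1.1}"
PARAMS_FILE="${TMPDIR:-/tmp}/composed_params.yaml"

# One param_list entry per sample (ids and paths are placeholders).
cat > "$PARAMS_FILE" << HERE
param_list:
  - id: "sample_one"
    input: "path/to/sample_one.h5mu"
  - id: "sample_two"
    input: "path/to/sample_two.h5mu"
publish_dir: "output/"
HERE

# Assemble the command as positional arguments so it can be printed or run.
# The URL, main-script path and profile are assumed, not confirmed.
set -- nextflow run https://packages.viash-hub.com/vsh/openpipeline_composed \
  -r "$VERSION" \
  -main-script target/nextflow/workflows/single_cell/process_integrate_annotate/main.nf \
  -profile docker \
  -params-file "$PARAMS_FILE"

if [ "${RUN:-0}" = "1" ]; then
  "$@"        # actually launch the workflow
else
  echo "$*"   # dry run: only show the command
fi
```

Listing samples under `param_list` is one way to pass several samples in a single CLI run; the exact set of required parameters depends on the workflow you select.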
## Support
For issues specific to the composed meta-workflow, please use the [GitHub issues tracker](https://github.com/openpipelines-bio/openpipeline_composed/issues). For general OpenPipeline questions, refer to the main [OpenPipeline documentation](https://openpipelines.bio/).

32
_viash.yaml Normal file

@@ -0,0 +1,32 @@
viash_version: 0.9.7
source: src
target: target
name: openpipeline_composed
organization: vsh
links:
  repository: https://github.com/openpipelines-bio/openpipeline_composed
  docker_registry: ghcr.io
repositories:
  - name: openpipeline
    repo: openpipeline
    type: vsh
    tag: v4.0.4
  - name: openpipeline_qc
    repo: openpipeline_qc
    type: vsh
    tag: v0.2.2
  - name: biobox
    repo: biobox
    type: vsh
    tag: v0.4.2
info:
  test_resources:
    - type: s3
      path: s3://openpipelines-bio/openpipeline_incubator/resources_test
      dest: resources_test
config_mods: |
  .requirements.commands := ['ps']
  .runners[.type == 'nextflow'].directives.tag := '$id'
  .resources += {path: '/src/configs/labels.config', dest: 'nextflow_labels.config'}
  .runners[.type == 'nextflow'].config.script := 'includeConfig("nextflow_labels.config")'
version: set-public-scope

0
main.nf Normal file

0
nextflow.config Normal file

@@ -0,0 +1,170 @@
#!/bin/bash
set -eo pipefail
# ensure that the command below is run from the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)
cd "$REPO_ROOT"
# settings
ID=10x_5k_anticmv
OUT=resources_test/$ID
# create raw directory
raw_dir="$OUT/raw"
mkdir -p "$raw_dir"
# Check whether seqkit is available
if ! command -v seqkit &> /dev/null; then
echo "This script requires seqkit. Please make sure the binary is added to your PATH."
exit 1
fi
# dataset page:
# https://www.10xgenomics.com/resources/datasets/integrated-gex-totalseqc-and-tcr-analysis-of-connect-generated-library-from-5k-cmv-t-cells-2-standard
# check whether reference is available
reference_dir="resources_test/reference_gencodev41_chr1/"
genome_tar="$reference_dir/reference_cellranger.tar.gz"
if [[ ! -f "$genome_tar" ]]; then
echo "$genome_tar does not exist. Please create the reference genome first"
exit 1
fi
# download and untar source fastq files
tar_dir="$HOME/.cache/openpipeline/5k_human_antiCMV_T_TBNK_connect_Multiplex"
if [[ ! -d "$tar_dir" ]]; then
mkdir -p "$tar_dir"
# download fastqs and untar
wget "https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-vdj/6.1.2/5k_human_antiCMV_T_TBNK_connect_Multiplex/5k_human_antiCMV_T_TBNK_connect_Multiplex_fastqs.tar" -O "$tar_dir.tar"
tar -xvf "$tar_dir.tar" -C "$tar_dir" --strip-components=1
rm "$tar_dir.tar"
fi
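function seqkit_head {
  input="$1"
  output="$2"
  # skip files that were already subsampled
  if [[ ! -f "$output" ]]; then
    echo "> Processing $(basename "$input")"
    # keep only the first 200000 reads to obtain a small test dataset
    seqkit head -n 200000 "$input" | gzip > "$output"
  fi
}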
orig_sample_id="5k_human_antiCMV_T_TBNK_connect"
seqkit_head "$tar_dir/gex_1/${orig_sample_id}_GEX_1_S1_L001_R1_001.fastq.gz" "$raw_dir/${orig_sample_id}_GEX_1_subset_S1_L001_R1_001.fastq.gz"
seqkit_head "$tar_dir/gex_1/${orig_sample_id}_GEX_1_S1_L001_R2_001.fastq.gz" "$raw_dir/${orig_sample_id}_GEX_1_subset_S1_L001_R2_001.fastq.gz"
seqkit_head "$tar_dir/ab/${orig_sample_id}_AB_S2_L004_R1_001.fastq.gz" "$raw_dir/${orig_sample_id}_AB_subset_S2_L004_R1_001.fastq.gz"
seqkit_head "$tar_dir/ab/${orig_sample_id}_AB_S2_L004_R2_001.fastq.gz" "$raw_dir/${orig_sample_id}_AB_subset_S2_L004_R2_001.fastq.gz"
seqkit_head "$tar_dir/vdj/${orig_sample_id}_VDJ_S1_L001_R1_001.fastq.gz" "$raw_dir/${orig_sample_id}_VDJ_subset_S1_L001_R1_001.fastq.gz"
seqkit_head "$tar_dir/vdj/${orig_sample_id}_VDJ_S1_L001_R2_001.fastq.gz" "$raw_dir/${orig_sample_id}_VDJ_subset_S1_L001_R2_001.fastq.gz"
# download immune panel fasta if needed
feature_reference="$raw_dir/feature_reference.csv"
if [[ ! -f "$feature_reference" ]]; then
wget "https://cf.10xgenomics.com/samples/cell-vdj/6.1.2/5k_human_antiCMV_T_TBNK_connect_Multiplex/5k_human_antiCMV_T_TBNK_connect_Multiplex_count_feature_reference.csv" -O "$feature_reference"
fi
# download vdj reference if needed
vdj_ref="$raw_dir/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz"
if [[ ! -f "$vdj_ref" ]]; then
wget "https://cf.10xgenomics.com/supp/cell-vdj/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz" -O "$vdj_ref"
fi
# Run mapping pipeline
cat > /tmp/params.yaml << HERE
param_list:
  - id: "$ID"
    input: "$raw_dir"
    library_id:
      - "${orig_sample_id}_GEX_1_subset"
      - "${orig_sample_id}_AB_subset"
      - "${orig_sample_id}_VDJ_subset"
    library_type:
      - "Gene Expression"
      - "Antibody Capture"
      - "VDJ"
gex_reference: "$genome_tar"
vdj_reference: "$vdj_ref"
feature_reference: "$feature_reference"
HERE
nextflow \
run https://packages.viash-hub.com/vsh/openpipeline \
-r v4.0.4 \
-main-script target/nextflow/mapping/cellranger_multi/main.nf \
-resume \
--publish_dir "$OUT/processed" \
-profile docker,mount_temp \
-params-file /tmp/params.yaml \
-c ./src/configs/labels_ci.config
# Convert to h5mu
cat > /tmp/params.yaml << HERE
id: "$orig_sample_id"
input: "$OUT/processed/10x_5k_anticmv.cellranger_multi.output"
publish_dir: "$OUT/"
output: "*.h5mu"
HERE
nextflow \
run https://packages.viash-hub.com/vsh/openpipeline \
-r v4.0.4 \
-main-script target/nextflow/convert/from_cellranger_multi_to_h5mu/main.nf \
-resume \
-profile docker,mount_temp \
-params-file /tmp/params.yaml \
-c ./src/configs/labels_ci.config
mv "$OUT/0.h5mu" "$OUT/${orig_sample_id}.h5mu"
# run qc workflow
cat > /tmp/params.yaml << HERE
id: "$ID"
input: "$OUT/$orig_sample_id.h5mu"
var_name_mitochondrial_genes: mitochondrial
var_name_ribosomal_genes: ribosomal
publish_dir: "$OUT/"
output: "${orig_sample_id}_qc.h5mu"
HERE
nextflow \
run https://packages.viash-hub.com/vsh/openpipeline \
-r v4.0.4 \
-main-script target/nextflow/workflows/qc/qc/main.nf \
-resume \
-profile docker,mount_temp \
-params-file /tmp/params.yaml \
-c ./src/configs/labels_ci.config
# Run full pipeline
cat > /tmp/params.yaml << HERE
id: "$ID"
input: "$OUT/${orig_sample_id}_qc.h5mu"
publish_dir: "$OUT/"
output: "${orig_sample_id}_mms.h5mu"
HERE
nextflow \
run https://packages.viash-hub.com/vsh/openpipeline \
-r v4.0.4 \
-main-script target/nextflow/workflows/multiomics/process_samples/main.nf \
-resume \
-profile docker,mount_temp \
-params-file /tmp/params.yaml \
-c ./src/configs/labels_ci.config
aws s3 sync \
"$OUT" \
s3://openpipelines-bio/openpipeline_incubator/resources_test/"$ID" \
--exclude "*.yaml" \
--delete \
--dryrun


@@ -0,0 +1,166 @@
#!/bin/bash
set -eo pipefail
# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)
# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"
ID=annotation_test_data
OUT=resources_test/$ID/
# ideally, this would be a versioned pipeline run
[ -d "$OUT" ] || mkdir -p "$OUT"
# Download Tabula Sapiens Blood reference h5ad from https://doi.org/10.5281/zenodo.7587774
wget "https://zenodo.org/record/7587774/files/TS_Blood_filtered.h5ad?download=1" -O "${OUT}/tmp_TS_Blood_filtered.h5ad"
# Download Tabula Sapiens Blood pretrained model from https://doi.org/10.5281/zenodo.7580707
wget "https://zenodo.org/record/7580707/files/pretrained_models_Blood_ts.tar.gz?download=1" -O "${OUT}/tmp_pretrained_models_Blood_ts.tar.gz"
# Download PopV specific CL ontology files - needed for OnClass
# OUT_ONTOLOGY="${OUT}/ontology"
# [ -d "$OUT_ONTOLOGY" ] || mkdir -p "$OUT_ONTOLOGY"
# wget https://raw.githubusercontent.com/czbiohub/PopV/main/ontology/cl.obo \
# -O "${OUT_ONTOLOGY}/cl.obo"
# wget https://raw.githubusercontent.com/czbiohub/PopV/main/ontology/cl.ontology \
# -O "${OUT_ONTOLOGY}/cl.ontology"
# wget https://raw.githubusercontent.com/czbiohub/PopV/main/ontology/cl.ontology.nlp.emb \
# -O "${OUT_ONTOLOGY}/cl.ontology.nlp.emb"
# Process Tabula Sapiens Blood reference h5ad
# (Select one individual and 100 cells per cell type)
# normalize and log1p transform data
# Add treatment and disease columns
python <<HEREDOC
import anndata as ad
import scanpy as sc
import numpy as np
# Read in data
ref_adata = ad.read_h5ad("${OUT}/tmp_TS_Blood_filtered.h5ad")
sub_ref_adata = ref_adata[ref_adata.obs["donor_assay"] == "TSP14_10x 3' v3"]
n=100
s=sub_ref_adata.obs.groupby('cell_ontology_class').cell_ontology_class.transform('count')
sub_ref_adata_final = sub_ref_adata[sub_ref_adata.obs[s>=n].groupby('cell_ontology_class').head(n).index]
# Normalize and log1p transform data
data_for_scanpy = ad.AnnData(X=sub_ref_adata_final.X)
sc.pp.normalize_total(data_for_scanpy, target_sum=10000)
sc.pp.log1p(
data_for_scanpy,
base=None,
layer=None,
copy=False,
)
sub_ref_adata_final.layers["log_normalized"] = data_for_scanpy.X
# Add treatment and disease columns
n_cells = sub_ref_adata_final.n_obs
treatment = np.random.choice(["ctrl", "stim"], size=n_cells, p=[0.5, 0.5])
disease = np.random.choice(["healthy", "diseased"], size=n_cells, p=[0.5, 0.5])
sub_ref_adata_final.obs["treatment"] = treatment
sub_ref_adata_final.obs["disease"] = disease
# Write out data
sub_ref_adata_final.write("${OUT}/TS_Blood_filtered.h5ad", compression='gzip')
HEREDOC
echo "> Converting to h5mu"
viash run src/convert/from_h5ad_to_h5mu/config.vsh.yaml --engine docker -- \
--input "${OUT}/TS_Blood_filtered.h5ad" \
--output "${OUT}/TS_Blood_filtered.h5mu" \
--modality "rna"
rm "${OUT}/tmp_TS_Blood_filtered.h5ad"
echo "> Downloading pretrained CellTypist model and sample test data"
wget https://celltypist.cog.sanger.ac.uk/models/Pan_Immune_CellTypist/v2/Immune_All_Low.pkl \
-O "${OUT}/celltypist_model_Immune_All_Low.pkl"
wget https://celltypist.cog.sanger.ac.uk/Notebook_demo_data/demo_2000_cells.h5ad \
-O "${OUT}/demo_2000_cells.h5ad"
viash run src/convert/from_h5ad_to_h5mu/config.vsh.yaml --engine docker -- \
--input "${OUT}/demo_2000_cells.h5ad" \
--output "${OUT}/demo_2000_cells.h5mu" \
--modality "rna"
echo "> Fetching OnClass data and models"
OUT_ONTOLOGY="${OUT}/ontology"
[ -d "$OUT_ONTOLOGY" ] || mkdir -p "$OUT_ONTOLOGY"
wget https://figshare.com/ndownloader/files/28394466 -O "${OUT_ONTOLOGY}/OnClass_data_public_minimal.tar.gz"
tar -xzvf "${OUT_ONTOLOGY}/OnClass_data_public_minimal.tar.gz" -C "${OUT_ONTOLOGY}" --strip-components=2
rm "${OUT_ONTOLOGY}/allen.ontology"
rm "${OUT_ONTOLOGY}/OnClass_data_public_minimal.tar.gz"
wget https://figshare.com/ndownloader/files/28394541 -O "${OUT}/OnClass_models.tar.gz"
tar -xzvf "${OUT}/OnClass_models.tar.gz" -C "${OUT}" --strip-components=1
rm "${OUT}/OnClass_models.tar.gz"
rm "${OUT}/tmp_pretrained_models_Blood_ts.tar.gz"
find "${OUT}/Pretrained_model" ! -name "example_file_model*" -type f -exec rm -f {} +
mv "${OUT}/Pretrained_model" "${OUT}/onclass_model"
echo "> Creating simple SCVI model"
viash run src/integrate/scvi/config.vsh.yaml --engine docker -- \
--input "${OUT}/TS_Blood_filtered.h5mu" \
--obs_batch "donor_id" \
--var_gene_names "ensemblid" \
--output "${OUT}/scvi_output.h5mu" \
--output_model "${OUT}/scvi_model" \
--max_epochs 5 \
--n_obs_min_count 10 \
--n_var_min_count 10
echo "> Creating SCVI model with covariates"
viash run src/integrate/scvi/config.vsh.yaml --engine docker -- \
--input "${OUT}/TS_Blood_filtered.h5mu" \
--obs_batch "donor_id" \
--var_gene_names "ensemblid" \
--obs_categorical_covariate "assay" \
--obs_categorical_covariate "donor_assay" \
--output "${OUT}/scvi_covariate_output.h5mu" \
--output_model "${OUT}/scvi_covariate_model" \
--max_epochs 5 \
--n_obs_min_count 10 \
--n_var_min_count 10
echo "> Creating simple SCANVI model"
viash run src/annotate/scanvi/config.vsh.yaml --engine docker -- \
--input "${OUT}/TS_Blood_filtered.h5mu" \
--var_gene_names "ensemblid" \
--obs_labels "cell_ontology_class" \
--scvi_model "${OUT}/scvi_model" \
--output "${OUT}/scanvi_output.h5mu" \
--output_model "${OUT}/scanvi_model" \
--max_epochs 5
echo "> Creating SCANVI model with covariates"
viash run src/annotate/scanvi/config.vsh.yaml --engine docker -- \
--input "${OUT}/TS_Blood_filtered.h5mu" \
--var_gene_names "ensemblid" \
--obs_labels "cell_ontology_class" \
--scvi_model "${OUT}/scvi_covariate_model" \
--output "${OUT}/scanvi_covariate_output.h5mu" \
--output_model "${OUT}/scanvi_covariate_model" \
--max_epochs 5
rm "${OUT}/scanvi_output.h5mu"
rm "${OUT}/scanvi_covariate_output.h5mu"
rm "${OUT}/scvi_output.h5mu"
rm "${OUT}/scvi_covariate_output.h5mu"
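# Pretrained_model was already renamed to onclass_model above; -f keeps `set -e` from aborting when it is gone
rm -rf "${OUT}/Pretrained_model/"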
echo "> Creating Pseudobulk Data for DGEA"
viash run src/differential_expression/create_pseudobulk/config.vsh.yaml --engine docker -- \
--input "${OUT}/TS_Blood_filtered.h5mu" \
--obs_grouping "cell_type" \
--obs_sample_conditions "donor_id" \
--obs_sample_conditions "treatment" \
--obs_sample_conditions "disease" \
--min_num_cells_per_sample 5 \
--output "${OUT}/TS_Blood_filtered_pseudobulk.h5mu"


@@ -0,0 +1,151 @@
#!/bin/bash
set -eo pipefail
# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)
# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"
ID=pbmc_1k_protein_v3
OUT=resources_test/$ID/$ID
DIR=$(dirname "$OUT")
# ideally, this would be a versioned pipeline run
[ -d "$DIR" ] || mkdir -p "$DIR"
# dataset page:
# https://www.10xgenomics.com/resources/datasets/1-k-pbm-cs-from-a-healthy-donor-gene-expression-and-cell-surface-protein-3-standard-3-0-0
# download metrics summary
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_metrics_summary.csv \
-O "${OUT}_metrics_summary.csv"
# download counts h5 file
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5 \
-O "${OUT}_filtered_feature_bc_matrix.h5"
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_raw_feature_bc_matrix.h5 \
-O "${OUT}_raw_feature_bc_matrix.h5"
# download counts matrix tar gz file
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_filtered_feature_bc_matrix.tar.gz \
-O "${OUT}_filtered_feature_bc_matrix.tar.gz"
# extract matrix tar gz
mkdir -p "${OUT}_filtered_feature_bc_matrix"
tar -xvf "${OUT}_filtered_feature_bc_matrix.tar.gz" \
-C "${OUT}_filtered_feature_bc_matrix" \
--strip-components 1
rm "${OUT}_filtered_feature_bc_matrix.tar.gz"
# convert 10x h5 to h5mu
# (the params file holds only the converter arguments, written as YAML keys)
cat > /tmp/params.yaml << HERE
id: "$ID"
input: "${OUT}_filtered_feature_bc_matrix.h5"
input_metrics_summary: "${OUT}_metrics_summary.csv"
output: "${ID}_filtered_feature_bc_matrix.h5mu"
HERE
nextflow run https://packages.viash-hub.com/vsh/openpipeline \
-latest \
-r v4.0.4 \
-main-script target/nextflow/convert/from_10xh5_to_h5mu/main.nf \
-profile docker \
-c ./src/configs/labels_ci.config \
-params-file /tmp/params.yaml \
--publish_dir "$DIR" \
-resume
# run single sample
nextflow \
run . \
-main-script target/nextflow/workflows/rna/rna_singlesample/main.nf \
-c src/workflows/utils/labels_ci.config \
-profile docker \
--id pbmc_1k_protein_v3_uss \
--input "${OUT}_filtered_feature_bc_matrix.h5mu" \
--output "`basename $OUT`_uss.h5mu" \
--publishDir `dirname $OUT` \
-resume
# add the sample ID to the mudata object
nextflow \
run . \
-main-script target/nextflow/metadata/add_id/main.nf \
-c src/workflows/utils/labels_ci.config \
-profile docker \
--id pbmc_1k_protein_v3_uss \
--input "${OUT}_uss.h5mu" \
--input_id "pbmc_1k_protein_v3_uss" \
--output "`basename $OUT`_uss_with_id.h5mu" \
--output_compression "gzip" \
--publishDir `dirname $OUT` \
-resume
# run multisample
nextflow \
run . \
-main-script target/nextflow/workflows/rna/rna_multisample/main.nf \
-c src/workflows/utils/labels_ci.config \
-profile docker \
--id pbmc_1k_protein_v3_ums \
--input "${OUT}_uss_with_id.h5mu" \
--output "`basename $OUT`_ums.h5mu" \
--publishDir `dirname $OUT` \
-resume
rm "${OUT}_uss_with_id.h5mu"
# run dimred
nextflow \
run . \
-main-script target/nextflow/workflows/multiomics/dimensionality_reduction/main.nf \
-c src/workflows/utils/labels_ci.config \
-profile docker \
--id pbmc_1k_protein_v3_mms \
--input "${OUT}_ums.h5mu" \
--output "`basename $OUT`_mms.h5mu" \
--publishDir `dirname $OUT` \
--obs_covariates sample_id \
-resume
# run integration
nextflow \
run . \
-main-script target/nextflow/workflows/integration/harmony_leiden/main.nf \
-c src/workflows/utils/labels_ci.config \
-profile docker \
--id pbmc_1k_protein_v3_mms_integration \
--input "${OUT}_mms.h5mu" \
--output "`basename $OUT`_mms.h5mu" \
--publishDir `dirname $OUT` \
--obs_covariates sample_id \
-resume
python <<HEREDOC
import mudata as mu
mudata = mu.read_h5mu("${DIR}/pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5mu")
mudata.mod["rna"].write_h5ad("${DIR}/pbmc_1k_protein_v3_filtered_feature_bc_matrix_rna.h5ad")
HEREDOC
aws s3 sync \
"$DIR" \
s3://openpipelines-bio/openpipeline_incubator/resources_test/"$ID" \
--exclude "*.yaml" \
--delete \
--dryrun


@@ -0,0 +1,166 @@
#!/bin/bash
OUT_DIR=resources_test/qc_sample_data
OUT_DIR_SPATIAL=resources_test/spatial_qc_sample_data
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"
[ ! -d "$OUT_DIR_SPATIAL" ] && mkdir -p "$OUT_DIR_SPATIAL"
# fetch/create h5mu from somewhere
cat > /tmp/params_create_h5mu.yaml <<EOF
param_list:
  - id: sample_one
    input_id: sample_one
    input: s3://openpipelines-data/10x_5k_anticmv/5k_human_antiCMV_T_TBNK_connect_qc.h5mu
  - id: sample_two
    input_id: sample_two
    input: s3://openpipelines-data/10x_5k_anticmv/5k_human_antiCMV_T_TBNK_connect_qc.h5mu
output: '\$id.qc.h5mu'
output_compression: gzip
publish_dir: "$OUT_DIR"
EOF
# add the sample ID to the mudata object
nextflow run openpipelines-bio/openpipeline \
-latest \
-r 2.1.2 \
-main-script target/nextflow/metadata/add_id/main.nf \
-c src/configs/labels_ci.config \
-profile docker \
-params-file /tmp/params_create_h5mu.yaml \
-resume
cat > /tmp/params_subset.yaml <<EOF
param_list:
  - id: sample_one
    input: resources_test/qc_sample_data/sample_one.qc.h5mu
  - id: sample_two
    input: resources_test/qc_sample_data/sample_two.qc.h5mu
output: '\$id.qc.h5mu'
number_of_observations: 10000
output_compression: gzip
publish_dir: "$OUT_DIR"
EOF
# subset h5mus
nextflow run openpipelines-bio/openpipeline \
-latest \
-r 2.1.2 \
-main-script target/nextflow/filter/subset_h5mu/main.nf \
-c src/configs/labels_ci.config \
-profile docker \
-params-file /tmp/params_subset.yaml \
-resume
cat > /tmp/add_metadata_obs.py <<EOF
import mudata as mu
import glob
import numpy as np
import pandas as pd
import os
# Directory containing the h5mu files
out_dir = "$(pwd)/resources_test/qc_sample_data"
# List of h5mu files
h5mu_files = glob.glob(os.path.join(out_dir, "*.h5mu"))
print(f"Found {len(h5mu_files)} h5mu files: {h5mu_files}")
# Metadata values to randomly assign
donor_ids = ["donor_1", "donor_2", "donor_3"]
cell_types = ["CD4+ T cell", "CD8+ T cell", "B cell", "NK cell", "Monocyte"]
batches = ["batch_A", "batch_B"]
conditions = ["treated", "control"]
# Sort the files so that each one gets the same seed on every run
for file_index, h5mu_file in enumerate(sorted(h5mu_files)):
    print(f"Processing {h5mu_file}...")
    # Load MuData object
    mdata = mu.read_h5mu(h5mu_file)
    rna = mdata.mod["rna"]
    n_obs = rna.n_obs
    # Generate random metadata. hash() of a str is salted per Python process,
    # so seed by file index to get a different but reproducible seed per file.
    np.random.seed(42 + file_index)
    # Create metadata
    rna.obs["donor_id"] = np.random.choice(donor_ids, size=n_obs)
    rna.obs["cell_type"] = np.random.choice(cell_types, size=n_obs)
    rna.obs["batch"] = np.random.choice(batches, size=n_obs)
    rna.obs["condition"] = np.random.choice(conditions, size=n_obs)
    # Add a continuous variable too
    rna.obs["quality_score"] = np.random.uniform(0, 1, size=n_obs)
    # Save the modified MuData object
    mu.write_h5mu(h5mu_file, mdata)
    print(f"Added metadata to {h5mu_file}")
print("All files processed successfully!")
EOF
# Execute the Python script
python /tmp/add_metadata_obs.py
# generate cellbender out for testing
cat > /tmp/params_cellbender.yaml <<EOF
param_list:
  - id: sample_one
    input: resources_test/qc_sample_data/sample_one.qc.h5mu
  - id: sample_two
    input: resources_test/qc_sample_data/sample_two.qc.h5mu
output: '\$id.qc.cellbender.h5mu'
epochs: 5
output_compression: gzip
publish_dir: "$OUT_DIR"
EOF
nextflow run openpipelines-bio/openpipeline \
-latest \
-r 2.1.2 \
-main-script target/nextflow/correction/cellbender_remove_background/main.nf \
-c src/configs/labels_ci.config \
-profile docker \
-params-file /tmp/params_cellbender.yaml \
-resume
# fetch spatial sample data from s3
aws s3 sync \
--profile di \
s3://openpipelines-bio/openpipeline_incubator/resources_test/spatial_qc_sample_data \
"$OUT_DIR_SPATIAL"
# generate json for testing
viash run src/ingestion_qc/h5mu_to_qc_json/config.vsh.yaml --engine docker -- \
--input "$OUT_DIR"/sample_one.qc.cellbender.h5mu \
--input "$OUT_DIR"/sample_two.qc.cellbender.h5mu \
--ingestion_method cellranger_multi \
--obs_metadata "donor_id;cell_type;batch;condition" \
--output "$OUT_DIR"/sc_dataset.json \
--output_reporting_json "$OUT_DIR"/sc_report_structure.json
viash run src/ingestion_qc/h5mu_to_qc_json/config.vsh.yaml --engine docker -- \
--input "$OUT_DIR_SPATIAL"/xenium_tiny.qc.h5mu \
--input "$OUT_DIR_SPATIAL"/xenium_tiny.qc.h5mu \
--ingestion_method xenium \
--min_num_nonzero_vars 1 \
--output "$OUT_DIR_SPATIAL"/xenium_dataset.json \
--output_reporting_json "$OUT_DIR_SPATIAL"/xenium_report_structure.json
# remove all state yaml files
rm "$OUT_DIR"/*.yaml
rm "$OUT_DIR_SPATIAL"/*.yaml
# copy to s3
aws s3 sync \
"$OUT_DIR" \
s3://openpipelines-bio/openpipeline_incubator/"$OUT_DIR" \
--delete \
--dryrun
aws s3 sync \
"$OUT_DIR_SPATIAL" \
s3://openpipelines-bio/openpipeline_incubator/"$OUT_DIR_SPATIAL" \
--delete \
--dryrun


@@ -0,0 +1,74 @@
#!/bin/bash
set -eo pipefail
# ensure that the command below is run from the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)
cd "$REPO_ROOT"
# settings
ID=reference_gencodev41_chr1
OUT=resources_test/$ID
mkdir -p "$OUT"
wget "https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip" -O "$OUT/ERCC92.zip"
# Download JASPAR files for reference building
# Source of the code below: https://support.10xgenomics.com/single-cell-atac/software/release-notes/references#GRCh38-2020-A-2.0.0
motifs_url="https://jaspar.elixir.no/download/data/2024/CORE/JASPAR2024_CORE_non-redundant_pfms_jaspar.txt"
motifs_in="${OUT}/JASPAR2024_CORE_non-redundant_pfms_jaspar.txt"
if [ ! -f "$motifs_in" ]; then
curl -sS "$motifs_url" > "$motifs_in"
fi
# Change motif headers so the human-readable motif name precedes the motif
# identifier. So ">MA0004.1 Arnt" -> ">Arnt_MA0004.1".
motifs_modified="${OUT}/$(basename "$motifs_in").modified"
awk '{
if ( substr($1, 1, 1) == ">" ) {
print ">" $2 "_" substr($1,2)
} else {
print
}
}' "$motifs_in" > "$motifs_modified"
cat > /tmp/params.yaml << HERE
param_list:
  - id: "$ID"
    genome_fasta: "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/GRCh38.primary_assembly.genome.fa.gz"
    transcriptome_gtf: "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz"
    target: ["bd_rhapsody", "cellranger_arc"]
    output_fasta: "reference.fa.gz"
    output_gtf: "reference.gtf.gz"
    non_nuclear_contigs: null
    output_cellranger_arc: "reference_cellranger.tar.gz"
    output_bd_rhapsody: "reference_bd_rhapsody.tar.gz"
    bdrhap_extra_star_params: "--genomeSAindexNbases 12 --genomeSAsparseD 2"
    motifs_file: "$motifs_modified"
    subset_regex: "chr1"
HERE
nextflow run https://packages.viash-hub.com/vsh/openpipeline \
-latest \
-r v4.0.4 \
-main-script target/nextflow/workflows/ingestion/make_reference/main.nf \
-profile docker \
-c ./src/configs/labels_ci.config \
-params-file /tmp/params.yaml \
--publish_dir $OUT \
-resume
rm "$motifs_modified"
rm "$motifs_in"
rm "$OUT/ERCC92.zip"
aws s3 sync \
"$OUT" \
s3://openpipelines-bio/openpipeline_incubator/resources_test/"$ID" \
--exclude "*.yaml" \
--delete \
--dryrun


@@ -0,0 +1,37 @@
#!/bin/bash
OUT_DIR=resources_test/spatial_qc_sample_data
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"
# fetch/create h5mu from somewhere
cat > /tmp/qc.yaml <<EOF
param_list:
  - id: xenium_tiny
    input: s3://openpipelines-bio/openpipeline_spatial/resources_test/xenium/xenium_tiny.h5mu
  - id: Lung5_Rep2_tiny
    input: s3://openpipelines-bio/openpipeline_spatial/resources_test/cosmx/Lung5_Rep2_tiny.h5mu
var_name_mitochondrial_genes: mitochondrial
var_name_ribosomal_genes: ribosomal
output: '\$id.qc.h5mu'
output_compression: gzip
publish_dir: "$OUT_DIR"
EOF
nextflow run openpipelines-bio/openpipeline \
-latest \
-r 2.1.0 \
-main-script target/nextflow/workflows/qc/qc/main.nf \
-profile docker \
-params-file /tmp/qc.yaml \
-resume \
-config src/configs/labels_ci.config
# copy to s3
aws s3 sync \
--profile di \
resources_test/spatial_qc_sample_data \
s3://openpipelines-bio/openpipeline_incubator/resources_test/spatial_qc_sample_data \
--delete --dryrun \
--exclude "*" --include "*.h5mu"


@@ -0,0 +1,11 @@
name: Dorien Roosen
info:
  role: Core Team Member
  links:
    email: dorien@data-intuitive.com
    github: dorien-er
    linkedin: dorien-roosen
  organizations:
    - name: Data Intuitive
      href: https://www.data-intuitive.com
      role: Data Scientist


@@ -0,0 +1,11 @@
name: Jakub Majercik
info:
  role: Contributor
  links:
    email: jakub@data-intuitive.com
    github: jakubmajercik
    linkedin: jakubmajercik
  organizations:
    - name: Data Intuitive
      href: https://www.data-intuitive.com
      role: Bioinformatics Engineer


@@ -0,0 +1,15 @@
name: Robrecht Cannoodt
info:
role: Core Team Member
links:
email: robrecht@data-intuitive.com
github: rcannood
orcid: "0000-0003-3641-729X"
linkedin: robrechtcannoodt
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Science Engineer
- name: Open Problems
href: https://openproblems.bio
role: Core Member

View File

@@ -0,0 +1,6 @@
name: Weiwei Schultz
info:
role: Contributor
organizations:
- name: Janssen R&D US
role: Associate Director Data Sciences

View File

@@ -0,0 +1,36 @@
profiles {
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
}
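
The `tempDir` lookup in the config above walks a chain of environment variables with the Groovy Elvis operator and falls back to `/tmp`. The same precedence can be sketched in shell with nested `${VAR:-default}` expansions, which also skip variables that are set but empty. The function name `resolve_temp_dir` is a hypothetical helper, and unlike the config this sketch does not convert the result to an absolute path:

```shell
#!/bin/bash
# Shell analogue of the Elvis chain: the first variable that is set and
# non-empty wins, with /tmp as the last resort.
resolve_temp_dir() {
  printf '%s\n' "${NXF_TEMP:-${VIASH_TEMP:-${TEMPDIR:-${TMPDIR:-/tmp}}}}"
}

resolve_temp_dir
```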

66
src/configs/labels.config Normal file
View File

@@ -0,0 +1,66 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
maxMemory = null
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: verylowmem { memory = { get_memory( 4.GB * task.attempt ) } }
withLabel: lowmem { memory = { get_memory( 8.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 64.GB * task.attempt ) } }
withLabel: veryhighmem { memory = { get_memory( 75.GB * task.attempt ) } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
def get_memory(to_compare) {
if (!process.containsKey("maxMemory") || !process.maxMemory) {
return to_compare
}
try {
if (process.containsKey("maxRetries") && process.maxRetries && task.attempt == (process.maxRetries as int)) {
return process.maxMemory
}
else if (to_compare.compareTo(process.maxMemory as nextflow.util.MemoryUnit) == 1) {
return process.maxMemory as nextflow.util.MemoryUnit
}
else {
return to_compare
}
} catch (all) {
println "Error processing memory resources. Please check that process.maxMemory '${process.maxMemory}' and process.maxRetries '${process.maxRetries}' are valid!"
System.exit(1)
}
}
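
The defaults in the config above combine two rules: a task is retried only when its exit status falls in the range 137 to 140 (137 is 128 + 9, i.e. SIGKILL, which is what an out-of-memory kill typically produces), and the default memory request grows with the attempt number. A rough shell sketch of both rules, using hypothetical function names and whole gigabytes instead of Nextflow memory units:

```shell
#!/bin/bash
# Mirror of the errorStrategy closure: exit statuses 137 through 140 are
# retried, anything else terminates the run.
error_strategy() {
  local exit_status
  exit_status=$1
  if [ "$exit_status" -ge 137 ] && [ "$exit_status" -le 140 ]; then
    echo retry
  else
    echo terminate
  fi
}

# Default memory request per attempt: 2 GB multiplied by the attempt number.
default_memory_gb() {
  echo $(( 2 * $1 ))
}

for attempt in 1 2 3; do
  echo "attempt $attempt: $(default_memory_gb "$attempt") GB"
done
error_strategy 137
error_strategy 1
```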

View File

@@ -0,0 +1,105 @@
process {
withLabel: lowmem { memory = 13.Gb }
withLabel: lowcpu { cpus = 4 }
withLabel: midmem { memory = 13.Gb }
withLabel: midcpu { cpus = 4 }
withLabel: highmem { memory = 13.Gb }
withLabel: highcpu { cpus = 4 }
withLabel: veryhighmem { memory = 13.Gb }
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
}
env.NUMBA_CACHE_DIR = '/tmp'
trace {
enabled = true
overwrite = true
}
dag {
overwrite = true
}
process.maxForks = 1
profiles {
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
docker {
docker.fixOwnership = true
docker.enabled = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
local {
// This config is for local processing.
process {
maxMemory = 25.GB
withLabel: verylowcpu { cpus = 2 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 6 }
withLabel: highcpu { cpus = 12 }
withLabel: lowmem { memory = { get_memory( 8.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 12.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 20.GB * task.attempt ) } }
}
}
}
def get_memory(to_compare) {
if (!process.containsKey("maxMemory") || !process.maxMemory) {
return to_compare
}
try {
if (process.containsKey("maxRetries") && process.maxRetries && task.attempt == (process.maxRetries as int)) {
return process.maxMemory
}
else if (to_compare.compareTo(process.maxMemory as nextflow.util.MemoryUnit) == 1) {
return process.maxMemory as nextflow.util.MemoryUnit
}
else {
return to_compare
}
} catch (all) {
println "Error processing memory resources. Please check that process.maxMemory '${process.maxMemory}' and process.maxRetries '${process.maxRetries}' are valid!"
System.exit(1)
}
}
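
The intent of `get_memory` appears to be: pass the request through when no ceiling is configured, jump to the ceiling on the final attempt, and otherwise clamp the request to the ceiling. A minimal sketch of that decision order in shell, working in whole gigabytes; the function name and argument order are assumptions made for illustration only:

```shell
#!/bin/bash
# Sketch of the capping rule in whole gigabytes.
# Arguments: requested_gb attempt [max_gb] [max_retries]
get_memory_gb() {
  local requested attempt max retries
  requested=$1
  attempt=$2
  max=${3:-}
  retries=${4:-}
  if [ -z "$max" ]; then
    # No ceiling configured: use the request unchanged.
    echo "$requested"
  elif [ -n "$retries" ] && [ "$attempt" -eq "$retries" ]; then
    # Final attempt: go straight to the ceiling.
    echo "$max"
  elif [ "$requested" -gt "$max" ]; then
    # Request exceeds the ceiling: clamp it.
    echo "$max"
  else
    echo "$requested"
  fi
}

# With a 25 GB ceiling, a 20 GB label on its second attempt asks for 40 GB.
get_memory_gb 40 2 25 3
```

With the `local` profile's 25 GB ceiling, this means a `highmem` task never requests more than 25 GB, regardless of how far `task.attempt` scales the request.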

View File

@@ -0,0 +1,372 @@
argument_groups:
- name: Input files
arguments:
- type: file
name: --input
required: false
description: |
The FASTQ files to be analyzed. FASTQ files should conform to the naming conventions of bcl2fastq and mkfastq:
`[Sample Name]_S[Sample Index]_L00[Lane Number]_[Read Type]_001.fastq.gz`
example: [ mysample_S1_L001_R1_001.fastq.gz, mysample_S1_L001_R2_001.fastq.gz ]
multiple: true
- name: Library arguments
arguments:
- type: string
name: --library_id
required: false
description: |
The Illumina sample name to analyze. This must exactly match the 'Sample Name' part
of the FASTQ files specified in the `--input` argument.
example: ["mysample1"]
multiple: true
- type: string
name: --library_type
required: false
description: |
The underlying feature type of the library.
choices: ["Gene Expression", "VDJ", "VDJ-T", "VDJ-B", "VDJ-T-GD", "Antibody Capture",
"CRISPR Guide Capture", "Multiplexing Capture", "Antigen Capture", "Custom"]
example: "Gene Expression"
multiple: true
- type: string
name: --library_subsample
required: false
description: |
The rate at which reads from the provided FASTQ files are sampled.
Must be strictly greater than 0 and less than or equal to 1.
example: "0.5"
multiple: true
- type: string
name: --library_lanes
required: false
description: Lanes associated with this sample. Defaults to using all lanes.
example: "1-4"
multiple: true
- type: string
name: "--library_chemistry"
description: |
Only applicable to FRP. Library-specific assay configuration. By default,
the assay configuration is detected automatically. Typically, users will
not need to specify a chemistry.
- name: Sample parameters
# Corresponds to the [samples] section
arguments:
- type: string
name: --sample_ids
alternatives: "--cell_multiplex_sample_id"
multiple: true
description: |
A name to identify a multiplexed sample. Must be alphanumeric with hyphens and/or underscores,
and less than 64 characters. Required for Cell Multiplexing libraries.
- type: string
multiple: true
name: --sample_description
alternatives: [--cell_multiplex_description]
description: A description for the sample.
- type: integer
multiple: true
name: --sample_expect_cells
example: 3000
description: |
Expected number of recovered cells, used as input to cell calling algorithm.
- type: integer
name: "--sample_force_cells"
example: 3000
multiple: true
required: false
description: |
Force pipeline to use this number of cells, bypassing cell detection.
- name: "Feature Barcode library specific arguments"
# Corresponds to the [feature] section
arguments:
- name: "--feature_reference"
type: file
description: |
Path to the Feature reference CSV file, declaring Feature Barcode constructs and associated barcodes.
Required only for Antibody Capture or CRISPR Guide Capture libraries.
See https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis#feature-ref for more information.
example: "feature_reference.csv"
required: false
- name: "--feature_r1_length"
type: integer
required: false
description: |
Limit the length of the input Read 1 sequence of Feature Barcode libraries to the first N bases,
where N is the user-supplied value. Note that the length includes the Barcode and UMI
sequences so do not set this below 26.
- name: "--feature_r2_length"
type: integer
required: false
description: |
Limit the length of the input Read 2 sequence of Feature Barcode libraries to the first N bases,
where N is a user-supplied value. Trimming occurs before sequencing metrics are computed
and therefore, limiting the length of Read 2 may affect Q30 scores.
- name: "--min_crispr_umi"
type: integer
min: 1
required: false
description: |
Set the minimum number of CRISPR guide RNA UMIs required for protospacer detection.
If a lower or higher sensitivity is desired for detection, this value can be customized
according to specific experimental needs. Applicable only to datasets that include a
CRISPR Guide Capture library.
- name: Gene expression arguments
# Corresponds to the [gene-expression] section
description: Arguments relevant to the analysis of gene expression data.
arguments:
- name: "--gex_reference"
type: file
description: "Genome reference index built by Cell Ranger mkref."
example: "reference_genome.tar.gz"
required: true
- type: boolean
name: "--gex_secondary_analysis"
default: false
description: Whether or not to run the secondary analysis e.g. clustering.
- type: boolean
name: "--gex_generate_bam"
default: false
description: Whether to generate a BAM file.
- type: file
name: "--tenx_cloud_token_path"
description: The 10x Cloud Analysis user token used to enable cell annotation.
- type: string
name: "--cell_annotation_model"
description: |
Cell annotation model to use. If auto, uses the default model for the species.
If not given, does not run cell annotation.
choices: ["auto", "human_pca_v1_beta", "mouse_pca_v1_beta"]
- type: integer
name: --gex_expect_cells
example: 3000
description: |
Expected number of recovered cells, used as input to cell calling algorithm.
- type: integer
name: "--gex_force_cells"
example: 3000
description: |
Force pipeline to use this number of cells, bypassing cell detection.
- type: boolean
name: "--gex_include_introns"
default: true
description: |
Whether or not to include intronic reads in counts.
This option does not apply to Fixed RNA Profiling analysis.
- name: "--gex_r1_length"
type: integer
required: false
description: |
Limit the length of the input Read 1 sequence of Gene Expression libraries to the first N bases,
where N is the user-supplied value. Note that the length includes the Barcode and UMI
sequences so do not set this below 26.
- name: "--gex_r2_length"
type: integer
required: false
description: |
Limit the length of the input Read 2 sequence of Gene Expression libraries to the first N bases,
where N is a user-supplied value. Trimming occurs before sequencing metrics are computed
and therefore, limiting the length of Read 2 may affect Q30 scores.
- type: string
name: --gex_chemistry
default: auto
description: |
Assay configuration. Either specify a single value which will be applied to all libraries,
or a number of values that is equal to the number of libraries. The latter is only applicable
to Fixed RNA Profiling.
- auto: Chemistry autodetection (default)
- threeprime: Single Cell 3'
- SC3Pv1, SC3Pv2, SC3Pv3(-polyA), SC3Pv4(-polyA): Single Cell 3' v1, v2, v3, or v4
- SC3Pv3HT(-polyA): Single Cell 3' v3.1 HT
- SC-FB: Single Cell Antibody-only 3' v2 or 5'
- fiveprime: Single Cell 5'
- SC5P-PE: Paired-end Single Cell 5'
- SC5P-PE-v3: Paired-end Single Cell 5' v3
- SC5P-R2: R2-only Single Cell 5'
- SC5P-R2-v3: R2-only Single Cell 5' v3
- SCP5-PE-v3: Single Cell 5' paired-end v3 (GEM-X)
- SC5PHT : Single Cell 5' v2 HT
- SFRP: Fixed RNA Profiling (Singleplex)
- MFRP: Fixed RNA Profiling (Multiplex, Probe Barcode on R2)
- MFRP-R1: Fixed RNA Profiling (Multiplex, Probe Barcode on R1)
- MFRP-RNA: Fixed RNA Profiling (Multiplex, RNA, Probe Barcode on R2)
- MFRP-Ab: Fixed RNA Profiling (Multiplex, Antibody, Probe Barcode at R2:69)
- MFRP-Ab-R2pos50: Fixed RNA Profiling (Multiplex, Antibody, Probe Barcode at R2:50)
- MFRP-RNA-R1: Fixed RNA Profiling (Multiplex, RNA, Probe Barcode on R1)
- MFRP-Ab-R1: Fixed RNA Profiling (Multiplex, Antibody, Probe Barcode on R1)
- ARC-v1 for analyzing the Gene Expression portion of Multiome data. If Cell Ranger auto-detects ARC-v1 chemistry, an error is triggered.
See https://kb.10xgenomics.com/hc/en-us/articles/115003764132-How-does-Cell-Ranger-auto-detect-chemistry- for more information.
choices: [ auto, threeprime, fiveprime, SC3Pv1, SC3Pv2, SC3Pv3, SC3Pv3-polyA, SC3Pv4, SC3Pv4-polyA, SC3Pv3LT, SC3Pv3HT, SC3Pv3HT-polyA,
SC5P-PE, SC5P-PE-v3, SC5P-R2, SC-FB, SC5P-R2-v3, SCP5-PE-v3, SC5PHT, MFRP, MFRP-R1, MFRP-RNA, MFRP-Ab,
SFRP, MFRP-Ab-R2pos50, MFRP-RNA-R1, MFRP-Ab-R1, ARC-v1]
- name: "VDJ related parameters"
# The [vdj] section
arguments:
- name: "--vdj_reference"
type: file
description: "VDJ reference index built by Cell Ranger mkref."
example: "reference_vdj.tar.gz"
required: false
- name: "--vdj_inner_enrichment_primers"
type: file
description: |
V(D)J Immune Profiling libraries: if inner enrichment primers other than those provided
in the 10x Genomics kits are used, they need to be specified here as a
text file with one primer per line.
example: "enrichment_primers.txt"
required: false
- name: "--vdj_r1_length"
type: integer
required: false
description: |
Limit the length of the input Read 1 sequence of V(D)J libraries to the first N bases, where N is the user-supplied value.
Note that the length includes the Barcode and UMI sequences so do not set this below 26.
- name: "--vdj_r2_length"
type: integer
required: false
description: |
Limit the length of the input Read 2 sequence of V(D)J libraries to the first N bases, where N is a user-supplied value.
Trimming occurs before sequencing metrics are computed and therefore, limiting the length of Read 2 may affect Q30 scores.
- name: "--vdj_denovo"
type: boolean
required: false
description: |
Run in reference-free mode (i.e., do not use annotations). This option is not supported for multiplexed experiments.
- name: 3' Cell multiplexing parameters (CellPlex Multiplexing)
# cell_multiplex_oligo_ids adds to [samples] section
# min_assignment_confidence, cmo_set barcode_sample_assignment are added to [gene-expression]
arguments:
- type: string
name: --cell_multiplex_oligo_ids
alternatives: [--cmo_ids]
multiple: true
description: |
The Cell Multiplexing oligo IDs used to multiplex this sample. If multiple CMOs were used for a sample,
separate IDs with a pipe (e.g., CMO301|CMO302). Required for Cell Multiplexing libraries.
- type: double
name: --min_assignment_confidence
description: |
The minimum estimated likelihood to call a sample as tagged with a Cell Multiplexing Oligo (CMO) instead of "Unassigned".
Users may wish to tolerate a higher rate of mis-assignment in order to obtain more singlets to include in their analysis,
or a lower rate of mis-assignment at the cost of obtaining fewer singlets.
- type: file
direction: input
required: false
name: "--cmo_set"
description: |
Path to a custom CMO set CSV file, declaring CMO constructs and associated barcodes. If the default CMO reference IDs that are built into
the Cell Ranger software are required, this option does not need to be used.
- type: file
direction: input
required: false
name: "--barcode_sample_assignment"
description: |
Path to a barcode-sample assignment CSV file that specifies the barcodes that belong to each sample.
- name: Hashtag multiplexing parameters
# Is added to [samples]
arguments:
- name: --hashtag_ids
type: string
multiple: true
description: |
The hashtag IDs used to multiplex this sample. If multiple antibody hashtags were used for the same sample,
you can separate IDs with a pipe.
- name: On-chip multiplexing parameters
# Is added to [samples]
arguments:
- name: --ocm_barcode_ids
type: string
multiple: true
# Note: choices is not an option here because multiple values can be added using pipe
description: |
The OCM barcode IDs used to multiplex this sample. Must be one of OB1, OB2, OB3, OB4.
If multiple OCM Barcodes were used for the same sample, you can separate IDs
with a pipe (e.g., OB1|OB2).
- name: Flex multiplexing parameters
# probe_set, filter_probes and emptydrops_minimum_umis end up in [gene-expression]
# probe_barcode_ids ends up in [samples]
arguments:
- type: file
name: "--probe_set"
description: |
A probe set reference CSV file. It specifies the sequences used as a reference for probe alignment and the gene ID associated with each probe.
It must include 4 columns (probe file format 1.0.0): gene_id, probe_seq, probe_id, included, and an optional 5th column, region (probe file format 1.0.1).
- gene_id: The Ensembl gene identifier targeted by the probe.
- probe_seq: The nucleotide sequence of the probe, which is complementary to the transcript sequence.
- probe_id: The probe identifier, whose format is described in Probe identifiers.
- included: A TRUE or FALSE flag specifying whether the probe is included in the filtered counts matrix output or excluded by the probe filter.
See filter-probes option of cellranger multi. All probes of a gene must be marked TRUE in the included column for that gene to be included.
- region: Present only in v1.0.1 probe set reference CSV. The gene boundary targeted by the probe. Accepted values are spliced or unspliced.
The file also contains a number of required metadata fields in the header in the format #key=value:
- panel_name: The name of the probe set.
- panel_type: Always predesigned for predesigned probe sets.
- reference_genome: The reference genome build used for probe design.
- reference_version: The version of the Cell Ranger reference transcriptome used for probe design.
- probe_set_file_format: The version of the probe set file format specification that this file conforms to.
- type: boolean # Null is also a valid option because passing this argument to cellranger (true or false) requires --probe_set
name: "--filter_probes"
description: |
If 'false', include all non-deprecated probes listed in the probe set reference CSV file.
If 'true' or not set, probes that are predicted to have off-target activity to homologous genes are excluded from analysis.
Not filtering will result in UMI counts from all non-deprecated probes,
including those with predicted off-target activity, to be used in the analysis.
Probes whose ID is prefixed with DEPRECATED are always excluded from the analysis.
- type: string
name: "--probe_barcode_ids"
multiple: true
description: |
The Fixed RNA Probe Barcode ID used for this sample, and for multiplex GEX + Antibody Capture libraries,
the corresponding Antibody Multiplexing Barcode IDs. 10x recommends specifying both barcodes (e.g., BC001+AB001)
when an Antibody Capture library is present. The barcode pair order is BC+AB and they
are separated with a "+" (no spaces). Alternatively, you can specify the Probe Barcode ID alone and
Cell Ranger's barcode pairing auto-detection algorithm will automatically match to the corresponding Antibody
Multiplexing Barcode.
- type: integer
name: --emptydrops_minimum_umis
min: 1
description: |
For singleplex Flex experiments, use this option to adjust the UMI cutoff during the second step of cell calling.
Cell Ranger will still perform the full cell calling process but will only evaluate barcodes with UMIs above
the threshold you specify.
- name: Antigen Capture (BEAM) library arguments
# These end up in the [antigen-specificity] section
description: |
These arguments are recommended if an Antigen Capture (BEAM) library is present.
It is needed to calculate the antigen specificity score.
arguments:
- type: string
name: --control_id
multiple: true
description: |
A user-defined ID for any negative controls used in the T/BCR Antigen Capture assay. Must match id specified in the feature reference CSV.
May only include ASCII characters and must not use whitespace, slash, quote, or comma characters.
Each ID must be unique and must not collide with a gene identifier from the transcriptome.
- type: string
multiple: true
name: --mhc_allele
description: |
The MHC allele for TCR Antigen Capture libraries. Must match mhc_allele name specified in the Feature Reference CSV.
- name: "General arguments"
description: |
These arguments are applicable to all library types.
arguments:
- name: "--check_library_compatibility"
type: boolean
default: true
description: |
Optional. This option allows users to disable the check that evaluates 10x Barcode overlap between
libraries when multiple libraries are specified (e.g., Gene Expression + Antibody Capture). Setting
this option to false will disable the check across all library combinations. We recommend running
this check (default), however if the pipeline errors out, users can bypass the check to generate
outputs for troubleshooting.
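
The `--input` description above requires FASTQ names of the form `[Sample Name]_S[Sample Index]_L00[Lane Number]_[Read Type]_001.fastq.gz`. A quick pre-flight check of that convention could look like the sketch below. The helper name `is_valid_fastq_name` is hypothetical, and the set of read types (R1, R2, I1, I2) is an assumption based on common bcl2fastq output rather than something stated in the argument description:

```shell
#!/bin/bash
# Return success when a file name follows the bcl2fastq/mkfastq pattern.
# Assumption: the read type is one of R1, R2, I1 or I2.
is_valid_fastq_name() {
  local name pattern
  name=$(basename "$1")
  pattern='^.+_S[0-9]+_L00[0-9]_[RI][12]_001\.fastq\.gz$'
  printf '%s\n' "$name" | grep -Eq "$pattern"
}

for f in mysample_S1_L001_R1_001.fastq.gz mysample_R1.fastq.gz; do
  if is_valid_fastq_name "$f"; then
    echo "ok:  $f"
  else
    echo "bad: $f"
  fi
done
```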

View File

@@ -0,0 +1,101 @@
name: "cellranger_multi_qc"
namespace: "single_cell"
# scope: "private"
description: "A pipeline for running Cell Ranger multi followed by QC."
authors:
- __merge__: /src/authors/jakub_majercik.yaml
roles: [ author, maintainer ]
- __merge__: /src/authors/weiwei_schultz.yaml
roles: [ contributor ]
__merge__: /src/single_cell/cellranger_multi_qc/cellranger_multi.yaml
argument_groups:
- name: Outputs
arguments:
- name: "--output_raw"
type: file
direction: output
description: "The raw output folder."
required: true
example: output_dir/
- name: "--output_h5mu"
type: file
direction: output
description: |
Locations for the output files. Must contain a wildcard (*) character,
which will be replaced with the sample name.
example: "*.h5mu"
required: true
- name: "--uns_metrics"
type: string
description: Name of the .uns slot under which to store QC metrics (if any).
default: "metrics_cellranger"
- name: "--output_ingestion_qc_report"
type: file
direction: output
required: false
multiple: true
description: |
Ingestion QC report in HTML format. Generated when --create_sample_qc_report is true.
example: "*.sample_qc_report.html"
- name: "--output_processed_h5mu"
type: file
direction: output
required: false
description: |
Folder containing the QC-processed h5mu files. Generated when
--create_sample_qc_report is true.
example: "processed_h5mu/"
- name: "--output_multiqc_report"
type: file
direction: output
required: false
description: |
MultiQC report in HTML format. Generated when
--create_multiqc_report is true.
example: "multiqc_report.html"
- name: "QC reports"
description: "Options for generating optional QC reports."
arguments:
- name: "--create_sample_qc_report"
type: boolean
default: true
description: |
Whether to generate an ingestion QC report.
- name: "--create_multiqc_report"
type: boolean
default: true
description: |
Whether to run FastQC on the input FASTQ files and aggregate results
into a MultiQC report.
- name: "--run_cellbender"
type: boolean
default: false
description: |
Whether to run CellBender for ambient RNA removal as part of the
sample QC report generation. Only used when --create_sample_qc_report is true.
dependencies:
- name: workflows/ingestion/cellranger_multi
repository: openpipeline
- name: workflows/generate_qc_report
repository: openpipeline_qc
- name: fastqc
repository: biobox
- name: multiqc
repository: biobox
resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf
test_resources:
- type: nextflow_script
path: test.nf
entrypoint: test_wf
- path: /resources_test/10x_5k_anticmv/raw/
- path: /resources_test/10x_5k_fixed/raw/
- path: /resources_test/reference_gencodev41_chr1
runners:
- type: nextflow

View File

@@ -0,0 +1,38 @@
#!/bin/bash
set -eo pipefail
# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)
# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"
COMMON_ARGS=(
run .
-main-script src/single_cell/cellranger_multi_qc/test.nf
-resume
-profile docker
-c src/configs/labels_ci.config
-c src/configs/integration_tests.config
)
# test_wf: GEX + AB, sample QC report only
nextflow "${COMMON_ARGS[@]}" \
-entry test_wf \
--publish_dir test_output/cellranger_multi_qc/test_wf
# test_wf_ab_only: AB-only input, all report steps skipped
nextflow "${COMMON_ARGS[@]}" \
-entry test_wf_ab_only \
--publish_dir test_output/cellranger_multi_qc/test_wf_ab_only
# test_wf_both_reports: GEX + AB, both MultiQC and sample QC reports
nextflow "${COMMON_ARGS[@]}" \
-entry test_wf_both_reports \
--publish_dir test_output/cellranger_multi_qc/test_wf_both_reports
# test_wf_multiqc_only: GEX + AB, MultiQC report only
nextflow "${COMMON_ARGS[@]}" \
-entry test_wf_multiqc_only \
--publish_dir test_output/cellranger_multi_qc/test_wf_multiqc_only
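
The four invocations above differ only in the entry name, which also determines the publish directory. As a possible alternative, they could be driven by a loop over the entry names. The sketch below is a dry run that prints each command instead of calling `nextflow`; the helper `build_args` is hypothetical and the `-c` config arguments are omitted for brevity:

```shell
#!/bin/bash
set -eo pipefail

# Shared arguments, kept in an array so each element stays a single word.
COMMON_ARGS=(
  run .
  -main-script src/single_cell/cellranger_multi_qc/test.nf
  -resume
  -profile docker
)

# Print the full argument list for one entry point, one argument per line.
build_args() {
  local entry
  entry=$1
  printf '%s\n' "${COMMON_ARGS[@]}" -entry "$entry" \
    --publish_dir "test_output/cellranger_multi_qc/$entry"
}

for entry in test_wf test_wf_ab_only test_wf_both_reports test_wf_multiqc_only; do
  # Dry run: show the command rather than executing it.
  echo "nextflow $(build_args "$entry" | tr '\n' ' ')"
done
```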

View File

@@ -0,0 +1,81 @@
workflow run_wf {
take:
input_ch
main:
output_ch = input_ch
| map { id, state ->
[id, state + [_meta: [join_id: id]]]
}
| fastqc.run(
runIf: { id, state -> state.create_multiqc_report && state.library_type?.contains("Gene Expression") },
fromState: { id, state ->
[
input: state.input,
outdir: "${id}_fastqc"
]
},
toState: { id, output, state ->
state + [output_fastqc: output.outdir]
}
)
| cellranger_multi.run(
fromState: { id, state -> state },
toState: { id, output, state ->
state + [
output_raw: output.output_raw,
output_h5mu: output.output_h5mu
]
}
)
| multiqc.run(
runIf: { id, state -> state.create_multiqc_report && state.library_type?.contains("Gene Expression") },
fromState: { id, state ->
[
input: [state.output_fastqc, state.output_raw],
output_report: state.output_multiqc_report
]
},
toState: { id, output, state ->
state + [_multiqc_produced: true, output_multiqc_report: output.output_report]
}
)
| generate_qc_report.run(
runIf: { id, state -> state.create_sample_qc_report && state.library_type?.contains("Gene Expression") },
fromState: { id, state ->
[
id: id,
input: state.output_h5mu,
ingestion_method: "cellranger_multi",
run_cellbender: state.run_cellbender,
output_qc_report: state.output_ingestion_qc_report,
output_processed_h5mu: state.output_processed_h5mu
]
},
toState: { id, output, state ->
state + [
_qc_report_produced: true,
output_ingestion_qc_report: output.output_qc_report,
output_processed_h5mu: output.output_processed_h5mu
]
}
)
| map { id, state ->
def out = [output_raw: state.output_raw, output_h5mu: state.output_h5mu]
if (state._meta) out._meta = state._meta
if (state._multiqc_produced) out.output_multiqc_report = state.output_multiqc_report
if (state._qc_report_produced) {
out.output_ingestion_qc_report = state.output_ingestion_qc_report
out.output_processed_h5mu = state.output_processed_h5mu
}
[id, out]
}
emit:
output_ch
}
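
Each optional step in the workflow above is guarded by the same kind of `runIf` condition: the relevant report flag must be true and the list of library types must contain "Gene Expression". This is why AB-only input skips FastQC, MultiQC and the sample QC report. A small shell sketch of that predicate, with `should_run_report` as a hypothetical name:

```shell
#!/bin/bash
# Succeeds only when the flag is "true" and one of the remaining arguments
# is exactly "Gene Expression" (list membership, not substring matching).
should_run_report() {
  local enabled library_type
  enabled=$1
  shift
  [ "$enabled" = "true" ] || return 1
  for library_type in "$@"; do
    if [ "$library_type" = "Gene Expression" ]; then
      return 0
    fi
  done
  return 1
}

if should_run_report true "Gene Expression" "Antibody Capture"; then
  echo "GEX + AB: report steps run"
fi
if ! should_run_report true "Antibody Capture"; then
  echo "AB only: report steps skipped"
fi
```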

View File

@@ -0,0 +1,10 @@
manifest {
nextflowVersion = '!>=20.12.1-edge'
}
params {
rootDir = java.nio.file.Paths.get("$projectDir/../../../").toAbsolutePath().normalize().toString()
}
// include common settings
includeConfig("${params.rootDir}/src/configs/labels.config")

View File

@@ -0,0 +1,210 @@
nextflow.enable.dsl=2
include { cellranger_multi_qc } from params.rootDir + "/target/nextflow/single_cell/cellranger_multi_qc/main.nf"
params.resources_test = "s3://openpipelines-bio/openpipeline_incubator/resources_test/"
workflow test_wf {
resources_test = file(params.resources_test)
output_ch = Channel.fromList([
[
id: "sample_anticmv",
input: [
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_1_subset_S1_L001_R1_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_1_subset_S1_L001_R2_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R1_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R2_001.fastq.gz")
],
gex_reference: resources_test.resolve("reference_gencodev41_chr1/reference_cellranger.tar.gz"),
feature_reference: resources_test.resolve("10x_5k_anticmv/raw/feature_reference.csv"),
library_id: ["5k_human_antiCMV_T_TBNK_connect_GEX_1_subset", "5k_human_antiCMV_T_TBNK_connect_AB_subset"],
library_type: ["Gene Expression", "Antibody Capture"],
output_raw: "sample_anticmv_raw/",
output_h5mu: "sample_anticmv.h5mu",
create_sample_qc_report: true,
output_ingestion_qc_report: "sample_anticmv_qc_report_*.html",
output_processed_h5mu: "sample_anticmv_processed"
]
])
| map { state -> [state.id, state] }
| cellranger_multi_qc
| view { output ->
assert output.size() == 2 : "Outputs should contain two elements; [id, state]"
def id = output[0]
assert id == "combined" : "Output ID should be 'combined'. Found: ${id}"
def state = output[1]
assert state instanceof Map : "State should be a map. Found: ${state}"
assert state.containsKey("output_raw") : "State should contain key 'output_raw'."
assert state.output_raw.isDirectory() : "'output_raw' should be a directory."
assert state.containsKey("output_h5mu") : "State should contain key 'output_h5mu'."
assert state.output_h5mu.isFile() : "'output_h5mu' should be a file."
assert state.output_h5mu.toString().endsWith(".h5mu") : "output_h5mu should end with '.h5mu'. Found: ${state.output_h5mu}"
assert state.containsKey("output_ingestion_qc_report") : "State should contain key 'output_ingestion_qc_report'."
assert state.output_ingestion_qc_report instanceof List : "'output_ingestion_qc_report' should be a list."
assert state.output_ingestion_qc_report.every { it.isFile() } : "All QC report files should exist."
assert state.containsKey("output_processed_h5mu") : "State should contain key 'output_processed_h5mu'."
assert state.output_processed_h5mu.isDirectory() : "'output_processed_h5mu' should be a directory."
"Output: $output"
}
}
workflow test_wf_ab_only {
resources_test = file(params.resources_test)
output_ch = Channel.fromList([
[
id: "sample_ab_only",
input: [
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R1_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R2_001.fastq.gz")
],
gex_reference: resources_test.resolve("reference_gencodev41_chr1/reference_cellranger.tar.gz"),
feature_reference: resources_test.resolve("10x_5k_anticmv/raw/feature_reference.csv"),
library_id: ["5k_human_antiCMV_T_TBNK_connect_AB_subset"],
library_type: ["Antibody Capture"],
output_raw: "sample_ab_only_raw/",
output_h5mu: "sample_ab_only.h5mu",
create_sample_qc_report: true,
create_multiqc_report: true,
output_ingestion_qc_report: "sample_ab_only_qc_report_*.html",
output_processed_h5mu: "sample_ab_only_processed"
]
])
| map { state -> [state.id, state] }
| cellranger_multi_qc
| view { output ->
assert output.size() == 2 : "Outputs should contain two elements; [id, state]"
def id = output[0]
assert id == "sample_ab_only" : "Output ID should be 'sample_ab_only'. Found: ${id}"
def state = output[1]
assert state instanceof Map : "State should be a map. Found: ${state}"
assert state.containsKey("output_raw") : "State should contain key 'output_raw'."
assert state.output_raw.isDirectory() : "'output_raw' should be a directory."
assert state.containsKey("output_h5mu") : "State should contain key 'output_h5mu'."
assert state.output_h5mu.isFile() : "'output_h5mu' should be a file."
assert !state.containsKey("output_ingestion_qc_report") : "State should NOT contain 'output_ingestion_qc_report' for AB-only input."
assert !state.containsKey("output_multiqc_report") : "State should NOT contain 'output_multiqc_report' for AB-only input."
"Output: $output"
}
}
workflow test_wf_both_reports {
resources_test = file(params.resources_test)
output_ch = Channel.fromList([
[
id: "sample_both_reports",
input: [
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_1_subset_S1_L001_R1_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_1_subset_S1_L001_R2_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R1_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R2_001.fastq.gz")
],
gex_reference: resources_test.resolve("reference_gencodev41_chr1/reference_cellranger.tar.gz"),
feature_reference: resources_test.resolve("10x_5k_anticmv/raw/feature_reference.csv"),
library_id: ["5k_human_antiCMV_T_TBNK_connect_GEX_1_subset", "5k_human_antiCMV_T_TBNK_connect_AB_subset"],
library_type: ["Gene Expression", "Antibody Capture"],
output_raw: "sample_both_reports_raw/",
output_h5mu: "sample_both_reports.h5mu",
create_sample_qc_report: true,
create_multiqc_report: true,
output_ingestion_qc_report: "sample_both_reports_qc_report_*.html",
output_processed_h5mu: "sample_both_reports_processed"
]
])
| map { state -> [state.id, state] }
| cellranger_multi_qc
| view { output ->
assert output.size() == 2 : "Outputs should contain two elements; [id, state]"
def id = output[0]
assert id == "combined" : "Output ID should be 'combined'. Found: ${id}"
def state = output[1]
assert state instanceof Map : "State should be a map. Found: ${state}"
assert state.containsKey("output_raw") : "State should contain key 'output_raw'."
assert state.output_raw.isDirectory() : "'output_raw' should be a directory."
assert state.containsKey("output_h5mu") : "State should contain key 'output_h5mu'."
assert state.output_h5mu.isFile() : "'output_h5mu' should be a file."
assert state.output_h5mu.toString().endsWith(".h5mu") : "output_h5mu should end with '.h5mu'."
assert state.containsKey("output_multiqc_report") : "State should contain key 'output_multiqc_report'."
assert state.output_multiqc_report.isFile() : "'output_multiqc_report' should be a file."
assert state.containsKey("output_ingestion_qc_report") : "State should contain key 'output_ingestion_qc_report'."
assert state.output_ingestion_qc_report instanceof List : "'output_ingestion_qc_report' should be a list."
assert state.output_ingestion_qc_report.every { it.isFile() } : "All QC report files should exist."
assert state.containsKey("output_processed_h5mu") : "State should contain key 'output_processed_h5mu'."
assert state.output_processed_h5mu.isDirectory() : "'output_processed_h5mu' should be a directory."
"Output: $output"
}
}
workflow test_wf_multiqc_only {
resources_test = file(params.resources_test)
output_ch = Channel.fromList([
[
id: "sample_multiqc_only",
input: [
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_1_subset_S1_L001_R1_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_1_subset_S1_L001_R2_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R1_001.fastq.gz"),
resources_test.resolve("10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_subset_S2_L004_R2_001.fastq.gz")
],
gex_reference: resources_test.resolve("reference_gencodev41_chr1/reference_cellranger.tar.gz"),
feature_reference: resources_test.resolve("10x_5k_anticmv/raw/feature_reference.csv"),
library_id: ["5k_human_antiCMV_T_TBNK_connect_GEX_1_subset", "5k_human_antiCMV_T_TBNK_connect_AB_subset"],
library_type: ["Gene Expression", "Antibody Capture"],
output_raw: "sample_multiqc_only_raw/",
output_h5mu: "sample_multiqc_only.h5mu",
create_sample_qc_report: false,
create_multiqc_report: true
]
])
| map { state -> [state.id, state] }
| cellranger_multi_qc
| view { output ->
assert output.size() == 2 : "Outputs should contain two elements; [id, state]"
def id = output[0]
assert id == "sample_multiqc_only" : "Output ID should be 'sample_multiqc_only'. Found: ${id}"
def state = output[1]
assert state instanceof Map : "State should be a map. Found: ${state}"
assert state.containsKey("output_raw") : "State should contain key 'output_raw'."
assert state.output_raw.isDirectory() : "'output_raw' should be a directory."
assert state.containsKey("output_h5mu") : "State should contain key 'output_h5mu'."
assert state.output_h5mu.isFile() : "'output_h5mu' should be a file."
assert state.output_h5mu.toString().endsWith(".h5mu") : "output_h5mu should end with '.h5mu'."
assert state.containsKey("output_multiqc_report") : "State should contain key 'output_multiqc_report'."
assert state.output_multiqc_report.isFile() : "'output_multiqc_report' should be a file."
assert !state.containsKey("output_ingestion_qc_report") : "State should NOT contain 'output_ingestion_qc_report' when only MultiQC is enabled."
assert !state.containsKey("output_processed_h5mu") : "State should NOT contain 'output_processed_h5mu' when only MultiQC is enabled."
"Output: $output"
}
}


@@ -0,0 +1,386 @@
name: "process_integrate_annotate"
# scope: private
namespace: "single_cell"
description: |
A pipeline to process, integrate and annotate single cell (multi-)omics data.
Available integration methods:
- Harmony
- scVI
Available annotation methods:
- CellTypist
- scANVI (with scArches)
authors:
- __merge__: /src/authors/dorien_roosen.yaml
roles: [ author, maintainer ]
- __merge__: /src/authors/weiwei_schultz.yaml
roles: [ contributor ]
argument_groups:
- name: Input (query) data arguments
description: The input query dataset(s) to be annotated
arguments:
- name: "--id"
required: true
type: string
description: ID of the sample.
example: foo
- name: "--input"
required: true
type: file
description: Input query dataset(s) to be annotated
example: input.h5mu
- name: "--modality"
default: "rna"
type: string
description: Modality to be processed. Should match the modality in the --reference dataset, if provided.
- name: "--input_layer"
type: string
description: "The layer in the input data containing the raw counts, if .X is not to be used."
required: false
- name: "--input_var_gene_names"
type: string
required: false
description: |
The name of the adata .var column containing gene names; when not provided, the .var index will be used.
- name: "--input_reference_gene_overlap"
type: integer
default: 100
min: 1
description: |
The minimum number of genes present in both the reference and query datasets.
- name: Reference data arguments
description: Dataset to be used as a reference for label transfer and to train annotation algorithms on.
arguments:
- name: "--reference"
type: file
required: false
example: reference.h5mu
description: |
The reference dataset in .h5mu format to be used as a reference mapper and to train annotation algorithms on.
- name: "--reference_layer_raw_counts"
type: string
description: "The layer in the reference dataset containing the raw counts, if .X is not to be used."
required: false
- name: "--reference_layer_lognormalized_counts"
type: string
default: log_normalized
description: "The layer in the reference dataset containing the log-normalized counts, if .X is not to be used."
- name: "--reference_var_gene_names"
type: string
required: false
description: |
The name of the adata .var column containing gene names if the .var index is not to be used.
- name: "--reference_obs_batch"
type: string
required: false
description: |
The .obs column of the reference dataset containing the batch information.
- name: "--reference_obs_label"
type: string
example: cell_type
required: false
description: The `.obs` key of the target labels to transfer.
- name: "--reference_obs_label_unlabeled_category"
type: string
default: "Unkown"
description: "Value in the --reference_obs_label field that indicates unlabeled observations"
- name: "--reference_var_input"
type: string
required: false
description: |
.var column containing highly variable genes. By default, do not subset genes.
- name: Methods
description: The available annotation and integration methods to integrate and/or annotate the query dataset(s) with.
arguments:
- name: "--integration_methods"
type: string
multiple: true
required: false
choices: [harmony, scvi]
example: harmony;scvi
description: Integration methods to be executed.
- name: "--annotation_methods"
type: string
multiple: true
required: false
choices: [celltypist, scanvi_scarches]
example: celltypist;scanvi_scarches
description: Annotation methods to be executed.
- name: "Pre-processing options: RNA filtering"
description: Pre-processing options for filtering RNA data
arguments:
- name: "--rna_min_counts"
example: 200
type: integer
description: Minimum number of counts captured per cell.
- name: "--rna_max_counts"
example: 5000000
type: integer
description: Maximum number of counts captured per cell.
- name: "--rna_min_genes_per_cell"
type: integer
example: 200
description: Minimum number of non-zero values per cell.
- name: "--rna_max_genes_per_cell"
example: 1500000
type: integer
description: Maximum number of non-zero values per cell.
- name: "--rna_min_cells_per_gene"
example: 3
type: integer
description: Minimum number of non-zero values per gene.
- name: "--rna_min_fraction_mito"
example: 0
type: double
description: Minimum fraction of UMIs that are mitochondrial.
- name: "--rna_max_fraction_mito"
type: double
example: 0.2
description: Maximum fraction of UMIs that are mitochondrial.
- name: "Pre-processing options: Highly variable features detection"
description: Pre-processing options for detecting highly variable features
arguments:
- name: "--n_hvg"
type: integer
description: |
Number of highly-variable features to keep.
Only relevant if HVGs need to be calculated across query and reference datasets (e.g. for --annotation_methods scvi_knn and harmony_knn).
For reference mapping-based methods, the HVGs specified in --reference_var_input will be used.
default: 2000
- name: "Pre-processing options: Mitochondrial & Ribosomal Gene Detection"
description: Pre-processing options for detecting mitochondrial and ribosomal genes
arguments:
- name: "--var_name_mitochondrial_genes"
type: string
required: false
description: |
In which .var slot to store a boolean array corresponding to the mitochondrial genes.
- name: "--var_name_ribosomal_genes"
type: string
required: false
description: |
In which .var slot to store a boolean array corresponding to the ribosomal genes.
- name: "--obs_name_mitochondrial_fraction"
type: string
required: false
description: |
When specified, write the fraction of counts originating from mitochondrial genes
(based on --mitochondrial_gene_regex) to an .obs column with the specified name.
Requires --var_name_mitochondrial_genes.
- name: "--obs_name_ribosomal_fraction"
type: string
required: false
description: |
When specified, write the fraction of counts originating from ribosomal genes
(based on --ribosomal_gene_regex) to an .obs column with the specified name.
Requires --var_name_ribosomal_genes.
- name: --mitochondrial_gene_regex
type: string
description: |
Regex string that identifies mitochondrial genes from --input_var_gene_names.
By default will detect human and mouse mitochondrial genes from a gene symbol.
required: false
default: "^[mM][tT]-"
- name: --ribosomal_gene_regex
type: string
description: |
Regex string that identifies ribosomal genes from --input_var_gene_names.
By default will detect human and mouse ribosomal genes from a gene symbol.
required: false
default: "^[Mm]?[Rr][Pp][LlSs]"
- name: "Pre-processing options: QC metrics calculation options"
description: Pre-processing options for calculating QC metrics
arguments:
- name: "--var_qc_metrics"
description: |
Keys to select a boolean (containing only True or False) column from .var.
For each cell, calculate the proportion of total values for genes which are labeled 'True',
compared to the total sum of the values for all genes. Defaults to the combined values specified for
--var_name_mitochondrial_genes and --highly_variable_features_var_output.
type: string
multiple: True
multiple_sep: ','
required: false
example: "ercc,highly_variable"
- name: Harmony integration options
description: Specifications for harmony integration.
arguments:
- name: "--harmony_theta"
type: double
description: |
Diversity clustering penalty parameter. Specify for each variable in group.by.vars.
theta=0 does not encourage any diversity. Larger values of theta
result in more diverse clusters.
default: 2
example: [0, 1, 2]
multiple: true
- name: "--harmony_obs_covariates"
type: string
description: "The .obs field(s) that define the covariate(s) to regress out."
example: ["batch", "sample"]
required: true
multiple: true
default: "sample_id"
- name: scVI, scANVI and scArches training options
# TODO - possibly provide separate training options for scVI, scANVI and scArches
description: Training arguments for scVI, scANVI and scArches. Relevant for --integration_methods 'scvi' and --annotation_methods 'scanvi_scarches'.
arguments:
- name: "--early_stopping"
required: false
type: boolean
description: "Whether to perform early stopping with respect to the validation set."
- name: "--early_stopping_monitor"
choices: ["elbo_validation", "reconstruction_loss_validation", "kl_local_validation"]
default: "elbo_validation"
type: string
description: "Metric logged during validation set epoch."
- name: "--early_stopping_patience"
type: integer
min: 1
default: 45
description: "Number of validation epochs with no improvement after which training will be stopped."
- name: "--early_stopping_min_delta"
min: 0
type: double
default: 0.0
description: "Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than min_delta will count as no improvement."
- name: "--max_epochs"
type: integer
description: "Number of passes through the dataset. Defaults to (20000 / number of cells) * 400 or 400, whichever is smaller."
required: false
- name: "--reduce_lr_on_plateau"
description: "Whether to monitor validation loss and reduce learning rate when validation set `lr_scheduler_metric` plateaus."
type: boolean
default: True
- name: "--lr_factor"
description: "Factor to reduce learning rate."
type: double
default: 0.6
min: 0
- name: "--lr_patience"
description: "Number of epochs with no improvement after which learning rate will be reduced."
type: double
default: 30
min: 0
- name: CellTypist reference model
description: The CellTypist reference model to use for annotation. If not provided, the reference dataset will be used for model training.
arguments:
- name: "--celltypist_model"
type: file
description: "Pretrained model in pkl format. If not provided, the model will be trained on the reference data and --reference should be provided."
required: false
example: pretrained_model.pkl
- name: CellTypist annotation options
description: Specifications for CellTypist annotation.
arguments:
- name: "--celltypist_feature_selection"
type: boolean
description: "Whether to perform feature selection."
default: false
- name: "--celltypist_majority_voting"
type: boolean
description: "Whether to refine the predicted labels by running the majority voting classifier after over-clustering."
default: false
- name: "--celltypist_C"
type: double
description: "Inverse of regularization strength in logistic regression."
default: 1.0
- name: "--celltypist_max_iter"
type: integer
description: "Maximum number of iterations before reaching the minimum of the cost function."
default: 1000
- name: "--celltypist_use_SGD"
type: boolean_true
description: "Whether to use the stochastic gradient descent algorithm."
- name: "--celltypist_min_prop"
type: double
description: |
For the dominant cell type within a subcluster, the minimum proportion of cells required to
support naming of the subcluster by this cell type. Ignored if majority_voting is set to False.
A subcluster that fails to pass this proportion threshold will be assigned 'Heterogeneous'.
default: 0
- name: Clustering options
description: Arguments for Leiden clustering. Only relevant for --integration_methods `harmony` and `scvi`, and --annotation_methods `scanvi_scarches`.
arguments:
- name: "--leiden_resolution"
type: double
description: Control the coarseness of the clustering. Higher values lead to more clusters.
default: [1]
multiple: true
- name: Neighbor classifier arguments
description: Arguments related to calculating the n nearest neighbors. Only relevant for --annotation_methods `scanvi_scarches`.
arguments:
- name: "--knn_weights"
type: string
default: "uniform"
choices: ["uniform", "distance"]
description: |
Weight function used in prediction. Possible values are:
`uniform` (all points in each neighborhood are weighted equally) or
`distance` (weight points by the inverse of their distance)
- name: "--knn_n_neighbors"
type: integer
default: 15
min: 5
required: false
description: |
The number of neighbors to use in k-neighbor graph structure used for fast approximate nearest neighbor search with PyNNDescent.
Larger values will result in more accurate search results at the cost of computation time.
- name: Outputs
description: The output file to write the annotated dataset to.
arguments:
- name: "--output"
type: file
direction: output
required: true
description: |
The output file.
example: output.h5mu
dependencies:
- name: workflows/multiomics/process_samples
alias: process_samples_workflow
repository: openpipeline
- name: annotate/celltypist
repository: openpipeline
alias: celltypist_annotation
- name: workflows/annotation/scanvi_scarches
repository: openpipeline
alias: scanvi_scarches_annotation
- name: workflows/integration/harmony_leiden
repository: openpipeline
alias: harmony_integration
- name: workflows/integration/scvi_leiden
repository: openpipeline
alias: scvi_integration
resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf
test_resources:
- type: nextflow_script
path: test.nf
entrypoint: test_wf
- path: /resources_test/pbmc_1k_protein_v3/pbmc_1k_protein_v3_mms.h5mu
- path: /resources_test/annotation_test_data/TS_Blood_filtered.h5mu
- path: /resources_test/annotation_test_data/celltypist_model_Immune_All_Low.pkl
runners:
- type: nextflow
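
The default values of --mitochondrial_gene_regex and --ribosomal_gene_regex in the config above can be tried out on a handful of gene symbols. What follows is a minimal sketch, assuming POSIX extended regex matching via `grep -E` (the pipeline components may evaluate the patterns with a different regex engine). The gene list and the function names `match_mito` and `match_ribo` are made up for illustration and are not part of the repository.

```shell
#!/bin/bash
# Hypothetical gene symbols, chosen only to exercise the two default patterns.
genes=$'MT-CO1\nmt-Nd1\nRPL13\nRps6\nMRPL12\nGAPDH\nMTOR'

# Default of --mitochondrial_gene_regex: "MT-" prefix in any letter case.
match_mito() {
  grep -E '^[mM][tT]-'
}

# Default of --ribosomal_gene_regex: optional leading M, then RPL or RPS in any letter case.
match_ribo() {
  grep -E '^[Mm]?[Rr][Pp][LlSs]'
}

echo "mitochondrial:"
printf '%s\n' "$genes" | match_mito
echo "ribosomal:"
printf '%s\n' "$genes" | match_ribo
```

As written, the ribosomal pattern also matches symbols such as MRPL12 because of the optional leading M, while MTOR matches neither pattern since it lacks the hyphen.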


@@ -0,0 +1,37 @@
#!/bin/bash
set -eo pipefail
# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)
# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"
nextflow \
run . \
-main-script src/single_cell/process_integrate_annotate/test.nf \
-entry test_wf \
-resume \
-profile docker \
-c src/configs/labels_ci.config \
-c src/configs/integration_tests.config \
--publish_dir test
nextflow \
run . \
-main-script src/single_cell/process_integrate_annotate/test.nf \
-profile docker,no_publish \
-resume \
-entry test_wf_2 \
-c src/configs/labels_ci.config \
-c src/configs/integration_tests.config
nextflow \
run . \
-main-script src/single_cell/process_integrate_annotate/test.nf \
-profile docker,no_publish \
-resume \
-entry test_wf_3 \
-c src/configs/labels_ci.config \
-c src/configs/integration_tests.config


@@ -0,0 +1,210 @@
workflow run_wf {
take:
input_ch
main:
output_ch = input_ch
| map { id, state ->
def new_state = state + [ "query_processed": state.output, "_meta": ["join_id": id] ]
[id, new_state]
}
// Make sure parameters are filled out correctly
| map { id, state ->
def new_state = [:]
// Check that at least one of annotation_methods or integration_methods is not empty
if (!state.annotation_methods && !state.integration_methods) {
throw new RuntimeException("At least one of --annotation_methods or --integration_methods must be provided")
}
// Check CellTypist arguments
if (state.annotation_methods && state.annotation_methods.contains("celltypist") &&
(!state.celltypist_model && !state.reference)) {
throw new RuntimeException("Celltypist was selected as an annotation method. Either --celltypist_model or --reference must be provided.")
}
if (state.annotation_methods && state.annotation_methods.contains("celltypist") && state.celltypist_model && state.reference ) {
System.err.println(
"Warning: --celltypist_model is set and a --reference was provided. \
The pre-trained Celltypist model will be used for annotation, the reference will be ignored."
)
}
[id, state + new_state]
}
| process_samples_workflow.run(
fromState: [
"input": "input",
"id": "id",
"rna_layer": "input_layer",
"rna_min_counts": "rna_min_counts",
"rna_max_counts": "rna_max_counts",
"rna_min_genes_per_cell": "rna_min_genes_per_cell",
"rna_max_genes_per_cell": "rna_max_genes_per_cell",
"rna_min_cells_per_gene": "rna_min_cells_per_gene",
"rna_min_fraction_mito": "rna_min_fraction_mito",
"rna_max_fraction_mito": "rna_max_fraction_mito",
"rna_min_fraction_ribo": "rna_min_fraction_ribo",
"rna_max_fraction_ribo": "rna_max_fraction_ribo",
"var_name_mitochondrial_genes": "var_name_mitochondrial_genes",
"var_name_ribosomal_genes": "var_name_ribosomal_genes",
"var_gene_names": "input_var_gene_names",
"mitochondrial_gene_regex": "mitochondrial_gene_regex",
"ribosomal_gene_regex": "ribosomal_gene_regex",
"var_qc_metrics": "var_qc_metrics"
],
args: [
"pca_overwrite": "true",
"add_id_obs_output": "sample_id",
"highly_variable_features_var_output": "filter_with_hvg_query"
],
toState: ["query_processed": "output"],
)
// Integration methods
| harmony_integration.run(
runIf: { id, state ->
state.integration_methods && state.integration_methods.contains("harmony")
},
fromState: [
"id": "id",
"input": "query_processed",
"modality": "modality",
"theta": "harmony_theta",
"leiden_resolution": "leiden_resolution",
"obs_covariates": "harmony_obs_covariates"
],
args: [
"layer": "log_normalized",
"embedding": "X_pca",
"obsm_integrated": "X_harmony_integrated",
"uns_neighbors": "harmony_integration_neighbors",
"obsp_neighbor_distances": "harmony_integration_neighbor_distances",
"obsp_neighbor_connectivities": "harmony_integration_neighbor_connectivities",
"obs_cluster": "harmony_integration_leiden",
"obsm_umap": "X_harmony_umap"
],
toState: [ "query_processed": "output" ]
)
| scvi_integration.run(
runIf: { id, state ->
state.integration_methods && state.integration_methods.contains("scvi")
},
fromState: [
"id": "id",
"input": "query_processed",
"layer": "input_layer",
"modality": "modality",
"leiden_resolution": "leiden_resolution",
"early_stopping": "early_stopping",
"early_stopping_monitor": "early_stopping_monitor",
"early_stopping_patience": "early_stopping_patience",
"early_stopping_min_delta": "early_stopping_min_delta",
"max_epochs": "max_epochs",
"reduce_lr_on_plateau": "reduce_lr_on_plateau",
"lr_factor": "lr_factor",
"lr_patience": "lr_patience"
],
args: [
"obsm_output": "X_scvi_integrated",
"obs_batch": "sample_id",
"var_input": "filter_with_hvg_query",
"uns_neighbors": "scvi_integration_neighbors",
"obsp_neighbor_distances": "scvi_integration_neighbor_distances",
"obsp_neighbor_connectivities": "scvi_integration_neighbor_connectivities",
"obs_cluster": "scvi_integration_leiden",
"obsm_umap": "X_scvi_umap"
],
toState: [ "query_processed": "output", "scvi_model": "output_model" ]
)
// Annotation methods
| celltypist_annotation.run(
runIf: { id, state -> state.annotation_methods && state.annotation_methods.contains("celltypist") && state.celltypist_model },
fromState: [
"input": "query_processed",
"modality": "modality",
"input_var_gene_names": "input_var_gene_names",
"input_reference_gene_overlap": "input_reference_gene_overlap",
"model": "celltypist_model",
"majority_voting": "celltypist_majority_voting"
],
args: [
// log normalized counts are expected for celltypist
"input_layer": "log_normalized",
"output_obs_predictions": "celltypist_pred",
"output_obs_probability": "celltypist_proba"
],
toState: [ "query_processed": "output" ]
)
| celltypist_annotation.run(
runIf: { id, state -> state.annotation_methods && state.annotation_methods.contains("celltypist") && !state.celltypist_model },
fromState: [
"input": "query_processed",
"modality": "modality",
"input_var_gene_names": "input_var_gene_names",
"input_reference_gene_overlap": "input_reference_gene_overlap",
"reference": "reference",
"reference_layer": "reference_layer_lognormalized_counts",
"reference_obs_target": "reference_obs_label",
"reference_var_gene_names": "reference_var_gene_names",
"reference_obs_batch": "reference_obs_batch",
"reference_var_input": "reference_var_input",
"feature_selection": "celltypist_feature_selection",
"C": "celltypist_C",
"max_iter": "celltypist_max_iter",
"use_SGD": "celltypist_use_SGD",
"min_prop": "celltypist_min_prop",
"majority_voting": "celltypist_majority_voting"
],
args: [
// log normalized counts are expected for celltypist
"input_layer": "log_normalized",
"output_obs_predictions": "celltypist_pred",
"output_obs_probability": "celltypist_proba"
],
toState: [ "query_processed": "output" ]
)
| scanvi_scarches_annotation.run(
runIf: { id, state -> state.annotation_methods && state.annotation_methods.contains("scanvi_scarches")},
fromState: [
"id": "id",
"input": "query_processed",
"modality": "modality",
"layer": "input_layer",
"input_var_gene_names": "input_var_gene_names",
"reference": "reference",
"reference_obs_target": "reference_obs_label",
"reference_obs_batch_label": "reference_obs_batch",
"reference_var_hvg": "reference_var_input",
"reference_var_gene_names": "reference_var_gene_names",
"unlabeled_category": "reference_obs_label_unlabeled_category",
"early_stopping": "early_stopping",
"early_stopping_monitor": "early_stopping_monitor",
"early_stopping_patience": "early_stopping_patience",
"early_stopping_min_delta": "early_stopping_min_delta",
"max_epochs": "max_epochs",
"reduce_lr_on_plateau": "reduce_lr_on_plateau",
"lr_factor": "lr_factor",
"lr_patience": "lr_patience",
"leiden_resolution": "leiden_resolution",
"knn_weights": "knn_weights",
"knn_n_neighbors": "knn_n_neighbors"
],
args: [
"input_obs_batch_label": "sample_id",
"output_obs_predictions": "scanvi_knn_pred",
"output_obs_probability": "scanvi_knn_proba"
],
toState: [ "query_processed": "output" ]
)
| map {id, state ->
def new_state = state + ["output": state.query_processed]
[id, new_state]
}
| setState(["output", "_meta"])
emit:
output_ch
}
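
run_wf forwards max_epochs unchanged to the scVI integration and scANVI annotation steps. When it is left unset, the component config describes the default as (20000 / number of cells) * 400, capped at 400. A rough sketch of that arithmetic follows. The function name `default_max_epochs` is hypothetical, and Bash integer division truncates, so the result may differ slightly from whatever rounding the underlying training library applies.

```shell
#!/bin/bash
# Hypothetical helper: reproduces the documented default for --max_epochs.
default_max_epochs() {
  local n_cells=$1
  # (20000 / n_cells) * 400, reordered so the integer division happens last.
  local epochs=$(( 20000 * 400 / n_cells ))
  # Cap at 400: the smaller of the two values is used.
  if (( epochs > 400 )); then
    epochs=400
  fi
  echo "$epochs"
}

for n in 1000 20000 50000 100000; do
  echo "n_cells=$n -> max_epochs=$(default_max_epochs "$n")"
done
```

Under this formula, datasets of up to 20000 cells get the full 400 epochs, and larger datasets get proportionally fewer.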


@@ -0,0 +1,10 @@
manifest {
nextflowVersion = '!>=20.12.1-edge'
}
params {
rootDir = java.nio.file.Paths.get("$projectDir/../../../").toAbsolutePath().normalize().toString()
}
// include common settings
includeConfig("${params.rootDir}/src/configs/labels.config")


@@ -0,0 +1,151 @@
nextflow.enable.dsl=2
include { process_integrate_annotate } from params.rootDir + "/target/nextflow/single_cell/process_integrate_annotate/main.nf"
params.resources_test = "s3://openpipelines-bio/openpipeline_incubator/resources_test/"
workflow test_wf {
resources_test = file(params.resources_test)
output_ch = Channel.fromList(
[
[
id: "simple_annotation_test",
input: resources_test.resolve("pbmc_1k_protein_v3/pbmc_1k_protein_v3_mms.h5mu"),
reference: resources_test.resolve("annotation_test_data/TS_Blood_filtered.h5mu"),
reference_var_gene_names: "ensemblid",
reference_layer_lognormalized_counts: "log_normalized",
reference_obs_batch: "donor_assay",
reference_obs_label: "cell_type",
max_epochs: "5",
annotation_methods: "celltypist;scanvi_scarches"
],
[
id: "simple_integration_test",
input: resources_test.resolve("pbmc_1k_protein_v3/pbmc_1k_protein_v3_mms.h5mu"),
integration_methods: "harmony;scvi"
],
[
id: "simple_execution_test",
input: resources_test.resolve("pbmc_1k_protein_v3/pbmc_1k_protein_v3_mms.h5mu"),
reference: resources_test.resolve("annotation_test_data/TS_Blood_filtered.h5mu"),
reference_var_gene_names: "ensemblid",
reference_layer_lognormalized_counts: "log_normalized",
reference_obs_batch: "donor_assay",
reference_obs_label: "cell_type",
max_epochs: "5",
annotation_methods: "scanvi_scarches",
integration_methods: "harmony"
]
])
| view {"State at start: $it"}
| map{ state -> [state.id, state] }
| process_integrate_annotate
| view {"After AaaS: $it"}
| view { output ->
assert output.size() == 2 : "Outputs should contain two elements; [id, state]"
// check id
def id = output[0]
assert id == "merged" : "Output ID should be `merged`"
// check output
def state = output[1]
assert state instanceof Map : "State should be a map. Found: ${state}"
assert state.containsKey("output") : "Output should contain key 'output'."
assert state.output.isFile() : "'output' should be a file."
assert state.output.toString().endsWith(".h5mu") : "Output file should end with '.h5mu'. Found: ${state.output}"
"Output: $output"
}
}
workflow test_wf_2 {
resources_test = file(params.resources_test)
output_ch = Channel.fromList(
[
[
id: "pbmc_with_more_params",
input: resources_test.resolve("pbmc_1k_protein_v3/pbmc_1k_protein_v3_mms.h5mu"),
rna_min_counts: 2,
rna_max_counts: 1000000,
rna_min_genes_per_cell: 1,
rna_max_genes_per_cell: 1000000,
rna_min_cells_per_gene: 1,
rna_min_fraction_mito: 0.0,
rna_max_fraction_mito: 1.0,
prot_min_counts: 3,
prot_max_counts: 1000000,
prot_min_proteins_per_cell: 1,
prot_max_proteins_per_cell: 1000000,
prot_min_cells_per_protein: 1,
var_name_mitochondrial_genes: 'mitochondrial',
obs_name_mitochondrial_fraction: 'fraction_mitochondrial',
add_id_to_obs: true,
add_id_make_observation_keys_unique: true,
add_id_obs_output: "sample_id",
reference: resources_test.resolve("annotation_test_data/TS_Blood_filtered.h5mu"),
reference_var_gene_names: "ensemblid",
reference_layer_lognormalized_counts: "log_normalized",
reference_obs_batch: "donor_assay",
reference_obs_label: "cell_type",
annotation_methods: "celltypist",
integration_methods: "scvi"
]
])
| view {"State at start: $it"}
| map { state -> [state.id, state] }
| process_integrate_annotate
| view {"After AaaS: $it"}
| view { output ->
assert output.size() == 2 : "Outputs should contain two elements; [id, state]"
// check id
def id = output[0]
assert id == "merged" : "Output ID should be `merged`"
// check output
def state = output[1]
assert state instanceof Map : "State should be a map. Found: ${state}"
assert state.containsKey("output") : "Output should contain key 'output'."
assert state.output.isFile() : "'output' should be a file."
assert state.output.toString().endsWith(".h5mu") : "Output file should end with '.h5mu'. Found: ${state.output}"
"Output: $output"
}
}
workflow test_wf_3 {
resources_test = file(params.resources_test)
output_ch = Channel.fromList(
[
[
id: "celltypist_model",
input: resources_test.resolve("pbmc_1k_protein_v3/pbmc_1k_protein_v3_mms.h5mu"),
celltypist_model: resources_test.resolve("annotation_test_data/celltypist_model_Immune_All_Low.pkl"),
annotation_methods: "celltypist",
input_var_gene_names: "gene_symbol"
]
])
| view {"State at start: $it"}
| map{ state -> [state.id, state] }
| process_integrate_annotate
| view {"After AaaS: $it"}
| view { output ->
assert output.size() == 2 : "Outputs should contain two elements; [id, state]"
// check id
def id = output[0]
assert id == "merged" : "Output ID should be `merged`"
// check output
def state = output[1]
assert state instanceof Map : "State should be a map. Found: ${state}"
assert state.containsKey("output") : "Output should contain key 'output'."
assert state.output.isFile() : "'output' should be a file."
assert state.output.toString().endsWith(".h5mu") : "Output file should end with '.h5mu'. Found: ${state.output}"
"Output: $output"
}
}

0
target/.build.yaml Normal file


@@ -0,0 +1,388 @@
name: "fastqc"
version: "v0.4.2"
authors:
- name: "Theodoro Gasperin Terra Camargo"
roles:
- "author"
- "maintainer"
info:
links:
email: "theodorogtc@gmail.com"
github: "tgaspe"
linkedin: "theodoro-gasperin-terra-camargo"
argument_groups:
- name: "Inputs"
arguments:
- type: "file"
name: "--input"
description: "FASTQ file(s) to be analyzed.\n"
info: null
example:
- "input.fq"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: true
multiple_sep: ";"
- name: "Outputs"
description: "At least one of the output options (--html, --zip, --summary, --data)\
\ must be used.\n"
arguments:
- type: "file"
name: "--outdir"
description: "Output directory where the results will be saved.\n"
info: null
example:
- "results"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--html"
description: "Create the HTML report of the results. \n'*' wild card must be provided\
\ in the output file name. \nWild card will be replaced by the input file basename.\n\
e.g. \n --input \"sample_1.fq\"\n --html \"*.html\"\n would create an output\
\ html file named sample_1.html\n"
info: null
example:
- "*.html"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: true
multiple_sep: ";"
- type: "file"
name: "--zip"
description: "Create the zip file(s) containing: html report, data, images, icons,\
\ summary, etc.\n'*' wild card must be provided in the output file name.\nWild\
\ card will be replaced by the input basename.\ne.g. \n --input \"sample_1.fq\"\
\n --zip \"*.zip\"\n would create an output zip file named sample_1.zip\n"
info: null
example:
- "*.zip"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: true
multiple_sep: ";"
- type: "file"
name: "--summary"
description: "Create the summary file(s).\n'*' wild card must be provided in the\
\ output file name.\nWild card will be replaced by the input basename.\ne.g.\
\ \n --input \"sample_1.fq\"\n --summary \"*_summary.txt\"\n would create\
\ an output summary.txt file named sample_1_summary.txt\n"
info: null
example:
- "*_summary.txt"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: true
multiple_sep: ";"
- type: "file"
name: "--data"
description: "Create the data file(s).\n'*' wild card must be provided in the\
\ output file name.\nWild card will be replaced by the input basename.\ne.g.\
\ \n --input \"sample_1.fq\"\n --data \"*_data.txt\"\n would create an\
\ output data.txt file named sample_1_data.txt\n"
info: null
example:
- "*_data.txt"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: true
multiple_sep: ";"
- name: "Options"
arguments:
- type: "boolean_true"
name: "--casava"
description: "Files come from raw casava output. Files in the same sample\ngroup\
\ (differing only by the group number) will be analysed\nas a set rather than\
\ individually. Sequences with the filter\nflag set in the header will be excluded\
\ from the analysis.\nFiles must have the same names given to them by casava\n\
(including being gzipped and ending with .gz) otherwise they\nwon't be grouped\
\ together correctly.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--nano"
description: "Files come from nanopore sequences and are in fast5 format. In\n\
this mode you can pass in directories to process and the program\nwill take\
\ in all fast5 files within those directories and produce\na single output file\
\ from the sequences found in all files.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--nofilter"
description: "If running with --casava then don't remove reads flagged by\ncasava\
\ as poor quality when performing the QC analysis.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--nogroup"
description: "Disable grouping of bases for reads >50bp. \nAll reports will show\
\ data for every base in the read. \nWARNING: Using this option will cause fastqc\
\ to crash \nand burn if you use it on really long reads, and your \nplots may\
\ end up a ridiculous size. You have been warned!\n"
info: null
direction: "input"
- type: "integer"
name: "--min_length"
description: "Sets an artificial lower limit on the length of the \nsequence to\
\ be shown in the report. As long as you \nset this to a value greater than or equal\
\ to your longest \nread length then this will be the sequence length used \n\
to create your read groups. This can be useful for making\ndirectly comparable\
\ statistics from datasets with somewhat \nvariable read lengths.\n"
info: null
example:
- 0
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--format"
alternatives:
- "-f"
description: "Bypasses the normal sequence file format detection and \nforces\
\ the program to use the specified format. \nValid formats are bam, sam, bam_mapped,\
\ sam_mapped, and fastq.\n"
info: null
example:
- "bam"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--contaminants"
alternatives:
- "-c"
description: "Specifies a non-default file which contains the list \nof contaminants\
\ to screen overrepresented sequences against. \nThe file must contain sets\
\ of named contaminants in the form\nname[tab]sequence. Lines prefixed with\
\ a hash will be ignored.\n"
info: null
example:
- "contaminants.txt"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--adapters"
alternatives:
- "-a"
description: "Specifies a non-default file which contains the list of \nadapter\
\ sequences which will be explicitly searched against \nthe library. The file\
\ must contain sets of named adapters \nin the form name[tab]sequence. Lines\
\ prefixed with a hash will be ignored.\n"
info: null
example:
- "adapters.txt"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--limits"
alternatives:
- "-l"
description: "Specifies a non-default file which contains \na set of criteria\
\ which will be used to determine \nthe warn/error limits for the various modules.\
\ \nThis file can also be used to selectively remove \nsome modules from the\
\ output altogether. The format \nneeds to mirror the default limits.txt file\
\ found in \nthe Configuration folder.\n"
info: null
example:
- "limits.txt"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--kmers"
alternatives:
- "-k"
description: "Specifies the length of Kmer to look for in the Kmer \ncontent module.\
\ Specified Kmer length must be between \n2 and 10. Default length is 7 if not\
\ specified.\n"
info: null
example:
- 7
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean_true"
name: "--quiet"
alternatives:
- "-q"
description: "Suppress all progress messages on stdout and only report errors.\n"
info: null
direction: "input"
resources:
- type: "bash_script"
path: "script.sh"
is_executable: true
description: "FastQC - A high throughput sequence QC analysis tool."
test_resources:
- type: "bash_script"
path: "test.sh"
is_executable: true
info: null
status: "enabled"
scope:
image: "public"
target: "public"
requirements:
commands:
- "ps"
keywords:
- "Quality control"
- "BAM"
- "SAM"
- "FASTQ"
license: "GPL-3.0, Apache-2.0"
links:
repository: "https://github.com/s-andrews/FastQC"
homepage: "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/"
documentation: "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/"
issue_tracker: "https://github.com/s-andrews/FastQC/issues"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "biocontainers/fastqc:v0.11.9_cv8"
target_registry: "images.viash-hub.com"
target_tag: "v0.4.2"
namespace_separator: "/"
setup:
- type: "docker"
run:
- "echo \"fastqc: $(fastqc --version | sed -n 's/^FastQC //p')\" > /var/software_versions.txt\n"
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/fastqc/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/nextflow/fastqc"
executable: "target/nextflow/fastqc/main.nf"
viash_version: "0.9.4"
git_commit: "02b470d967226478af69c37eae2b1256be1b78fd"
git_remote: "https://github.com/viash-hub/biobox"
package_config:
name: "biobox"
version: "v0.4.2"
summary: "A curated collection of high-quality, standalone bioinformatics components\
\ built with [Viash](https://viash.io).\n"
description: "`biobox` offers a suite of reliable bioinformatics components, similar\
\ to [nf-core/modules](https://github.com/nf-core/modules) and [snakemake-wrappers/bio](https://github.com/snakemake/snakemake-wrappers/tree/master/bio),\
\ but built using the [Viash](https://viash.io) framework.\n\nThis approach emphasizes\
\ **reusability**, **reproducibility**, and adherence to **best practices**. Key\
\ features of `biobox` components include:\n\n* **Standalone & Nextflow Ready:**\
\ Run components directly via the command line or seamlessly integrate them into\
\ Nextflow workflows.\n* **High Quality Standards:**\n * Comprehensive documentation\
\ for components and parameters.\n * Full exposure of underlying tool arguments.\n\
\ * Containerized (Docker) for dependency management and reproducibility.\n\
\ * Unit tested for verified functionality.\n"
info: null
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".requirements.commands += ['ps']\n"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v0.4.2'"
keywords:
- "bioinformatics"
- "modules"
- "sequencing"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/viash-hub/biobox"
issue_tracker: "https://github.com/viash-hub/biobox/issues"
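
The `--html`, `--zip`, `--summary` and `--data` descriptions above all rely on the same rule: the `*` in the output file name is replaced by the input file's basename. The sketch below shows one way that rule could work. How `script.sh` actually derives the basename is not shown in this diff, so the suffix list and both function names are assumptions made for illustration.

```python
from pathlib import PurePath

# Assumption: these suffixes count as extension, not as part of the basename.
KNOWN_SUFFIXES = (".gz", ".fastq", ".fq", ".bam", ".sam")


def input_basename(input_path):
    """Strip the directory and any trailing known suffixes from an input path."""
    name = PurePath(input_path).name
    stripped = True
    while stripped:
        stripped = False
        for suffix in KNOWN_SUFFIXES:
            if name.endswith(suffix) and len(name) > len(suffix):
                name = name[: -len(suffix)]
                stripped = True
    return name


def expand_pattern(pattern, input_path):
    """Replace the '*' wild card in an output pattern with the input basename."""
    if "*" not in pattern:
        raise ValueError(f"'*' wild card must be provided in the output file name: {pattern}")
    return pattern.replace("*", input_basename(input_path))
```

With these assumptions, `expand_pattern("*.html", "sample_1.fq")` gives `sample_1.html`, matching the example in the `--html` description.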

File diff suppressed because it is too large


@@ -0,0 +1,126 @@
manifest {
name = 'fastqc'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v0.4.2'
description = 'FastQC - A high throughput sequence QC analysis tool.'
author = 'Theodoro Gasperin Terra Camargo'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
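
The `withLabel` entries above come in two families: `gb`/`tb` labels use decimal units (powers of 10) and `gib`/`tib` labels use binary units (powers of 2). The sketch below recomputes the byte counts from the label names as a cross-check; the function name `mem_label_to_bytes` is hypothetical and nothing in the generated config calls it.

```python
import re

# Bytes per unit: decimal (SI) versus binary (IEC).
UNIT_BYTES = {
    "gb": 10**9,
    "tb": 10**12,
    "gib": 2**30,
    "tib": 2**40,
}

# Longer unit names come first so "gib" is not read as "gb".
LABEL_RE = re.compile(r"^mem(\d+)(gib|tib|gb|tb)$")


def mem_label_to_bytes(label):
    """Return the number of bytes a label such as 'mem16gib' stands for."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"Not a memory label: {label}")
    amount, unit = match.groups()
    return int(amount) * UNIT_BYTES[unit]
```

For example, `mem1gb` is 1000000000 bytes while `mem1gib` is 1073741824 bytes, so the binary labels are roughly 7% larger than their decimal namesakes.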


@@ -0,0 +1,175 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "fastqc",
"description": "FastQC - A high throughput sequence QC analysis tool.",
"type": "object",
"$defs": {
"inputs": {
"title": "Inputs",
"type": "object",
"description": "No description",
"properties": {
"input": {
"type": "array",
"items": {
"type": "string"
},
"format": "path",
"exists": true,
"description": "FASTQ file(s) to be analyzed.\n",
"help_text": "Type: `file`, multiple: `True`, required, direction: `input`, example: `[\"input.fq\"]`. "
}
}
},
"outputs": {
"title": "Outputs",
"type": "object",
"description": "At least one of the output options (--html, --zip, --summary, --data) must be used.\n",
"properties": {
"outdir": {
"type": "string",
"format": "path",
"description": "Output directory where the results will be saved.\n",
"help_text": "Type: `file`, multiple: `False`, default: `\"$id.$key.outdir\"`, direction: `output`, example: `\"results\"`. ",
"default": "$id.$key.outdir"
},
"html": {
"type": "array",
"items": {
"type": "string"
},
"format": "path",
"description": "Create the HTML report of the results",
"help_text": "Type: `file`, multiple: `True`, default: `\"$id.$key.html_*.html\"`, direction: `output`, example: `[\"*.html\"]`. ",
"default": "$id.$key.html_*.html"
},
"zip": {
"type": "array",
"items": {
"type": "string"
},
"format": "path",
"description": "Create the zip file(s) containing: html report, data, images, icons, summary, etc.\n'*' wild card must be provided in the output file name.\nWild card will be replaced by the input basename.\ne.g",
"help_text": "Type: `file`, multiple: `True`, default: `\"$id.$key.zip_*.zip\"`, direction: `output`, example: `[\"*.zip\"]`. ",
"default": "$id.$key.zip_*.zip"
},
"summary": {
"type": "array",
"items": {
"type": "string"
},
"format": "path",
"description": "Create the summary file(s).\n'*' wild card must be provided in the output file name.\nWild card will be replaced by the input basename.\ne.g",
"help_text": "Type: `file`, multiple: `True`, default: `\"$id.$key.summary_*.txt\"`, direction: `output`, example: `[\"*_summary.txt\"]`. ",
"default": "$id.$key.summary_*.txt"
},
"data": {
"type": "array",
"items": {
"type": "string"
},
"format": "path",
"description": "Create the data file(s).\n'*' wild card must be provided in the output file name.\nWild card will be replaced by the input basename.\ne.g",
"help_text": "Type: `file`, multiple: `True`, default: `\"$id.$key.data_*.txt\"`, direction: `output`, example: `[\"*_data.txt\"]`. ",
"default": "$id.$key.data_*.txt"
}
}
},
"options": {
"title": "Options",
"type": "object",
"description": "No description",
"properties": {
"casava": {
"type": "boolean",
"description": "Files come from raw casava output",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"nano": {
"type": "boolean",
"description": "Files come from nanopore sequences and are in fast5 format",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"nofilter": {
"type": "boolean",
"description": "If running with --casava then don't remove reads flagged by\ncasava as poor quality when performing the QC analysis.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"nogroup": {
"type": "boolean",
"description": "Disable grouping of bases for reads >50bp",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"min_length": {
"type": "integer",
"description": "Sets an artificial lower limit on the length of the \nsequence to be shown in the report",
"help_text": "Type: `integer`, multiple: `False`, example: `0`. "
},
"format": {
"type": "string",
"description": "Bypasses the normal sequence file format detection and \nforces the program to use the specified format",
"help_text": "Type: `string`, multiple: `False`, example: `\"bam\"`. "
},
"contaminants": {
"type": "string",
"format": "path",
"description": "Specifies a non-default file which contains the list \nof contaminants to screen overrepresented sequences against",
"help_text": "Type: `file`, multiple: `False`, direction: `input`, example: `\"contaminants.txt\"`. "
},
"adapters": {
"type": "string",
"format": "path",
"description": "Specifies a non-default file which contains the list of \nadapter sequences which will be explicitly searched against \nthe library",
"help_text": "Type: `file`, multiple: `False`, direction: `input`, example: `\"adapters.txt\"`. "
},
"limits": {
"type": "string",
"format": "path",
"description": "Specifies a non-default file which contains \na set of criteria which will be used to determine \nthe warn/error limits for the various modules",
"help_text": "Type: `file`, multiple: `False`, direction: `input`, example: `\"limits.txt\"`. "
},
"kmers": {
"type": "integer",
"description": "Specifies the length of Kmer to look for in the Kmer \ncontent module",
"help_text": "Type: `integer`, multiple: `False`, example: `7`. "
},
"quiet": {
"type": "boolean",
"description": "Suppress all progress messages on stdout and only report errors.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
}
}
},
"nextflow input-output arguments": {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type": "string",
"description": "Path to an output directory.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"output/\"`. "
}
}
}
},
"allOf": [
{
"$ref": "#/$defs/inputs"
},
{
"$ref": "#/$defs/outputs"
},
{
"$ref": "#/$defs/options"
},
{
"$ref": "#/$defs/nextflow input-output arguments"
}
]
}
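
The schema above groups parameters under `$defs` and combines the groups through `allOf` references. The sketch below shows how a consumer might flatten those groups into one parameter table and collect the declared defaults. It runs on a hand-trimmed excerpt of the schema rather than the full file, and the function names are hypothetical.

```python
# Hand-trimmed excerpt of the schema above, kept only to make the sketch runnable.
SCHEMA_EXCERPT = {
    "$defs": {
        "inputs": {"properties": {"input": {"type": "array"}}},
        "outputs": {
            "properties": {"outdir": {"type": "string", "default": "$id.$key.outdir"}}
        },
        "options": {"properties": {"quiet": {"type": "boolean", "default": False}}},
        "nextflow input-output arguments": {
            "properties": {"publish_dir": {"type": "string"}}
        },
    },
    "allOf": [
        {"$ref": "#/$defs/inputs"},
        {"$ref": "#/$defs/outputs"},
        {"$ref": "#/$defs/options"},
        {"$ref": "#/$defs/nextflow input-output arguments"},
    ],
}

REF_PREFIX = "#/$defs/"


def collect_properties(schema):
    """Merge the properties of every $defs group referenced from allOf."""
    merged = {}
    for entry in schema.get("allOf", []):
        ref = entry["$ref"]
        if not ref.startswith(REF_PREFIX):
            raise ValueError(f"Unsupported reference: {ref}")
        group = schema["$defs"][ref[len(REF_PREFIX):]]
        merged.update(group.get("properties", {}))
    return merged


def default_params(schema):
    """Return only the parameters that declare a default value."""
    return {
        name: spec["default"]
        for name, spec in collect_properties(schema).items()
        if "default" in spec
    }
```

Only local `#/$defs/...` references are resolved here, which is all this particular schema uses.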


@@ -0,0 +1,496 @@
name: "multiqc"
version: "v0.4.2"
authors:
- name: "Dorien Roosen"
roles:
- "author"
- "maintainer"
info:
links:
email: "dorien@data-intuitive.com"
github: "dorien-er"
linkedin: "dorien-roosen"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Input"
arguments:
- type: "file"
name: "--input"
description: "File paths to be searched for analysis results to be included in\
\ the report.\n"
info: null
example:
- "data/results"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: true
multiple_sep: ";"
- name: "Output"
arguments:
- type: "file"
name: "--output_report"
description: "Filepath of the generated report.\n"
info: null
example:
- "multiqc_report.html"
must_exist: false
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--output_data"
description: "Output directory for parsed data files. If not provided, parsed\
\ data will not be published.\n"
info: null
example:
- "multiqc_data"
must_exist: false
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--output_plots"
description: "Output directory for generated plots. If not provided, plots will\
\ not be published.\n"
info: null
example:
- "multiqc_plots"
must_exist: false
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
- name: "Modules and analyses to run"
arguments:
- type: "string"
name: "--include_modules"
description: "Use only these modules"
info: null
example:
- "fastqc"
- "cutadapt"
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "string"
name: "--exclude_modules"
description: "Do not use these modules"
info: null
example:
- "fastqc"
- "cutadapt"
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "string"
name: "--ignore_analysis"
info: null
example:
- "run_one/*"
- "run_two/*"
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "string"
name: "--ignore_samples"
info: null
example:
- "sample_1*"
- "sample_3*"
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "boolean_true"
name: "--ignore_symlinks"
description: "Ignore symlinked directories and files"
info: null
direction: "input"
- name: "Sample name handling"
arguments:
- type: "boolean_true"
name: "--dirs"
description: "Prepend directory to sample names to avoid clashing filenames"
info: null
direction: "input"
- type: "integer"
name: "--dirs_depth"
description: "Prepend n directories to sample names. Negative number to take from\
\ start of path."
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean_true"
name: "--full_names"
description: "Do not clean the sample names (leave as full file name)"
info: null
direction: "input"
- type: "boolean_true"
name: "--fn_as_s_name"
description: "Use the log filename as the sample name"
info: null
direction: "input"
- type: "file"
name: "--replace_names"
description: "TSV file to rename sample names during report generation"
info: null
example:
- "replace_names.tsv"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Report Customisation"
arguments:
- type: "string"
name: "--title"
description: "Report title. Printed as page header, used for filename if not otherwise\
\ specified.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--comment"
description: "Custom comment, will be printed at the top of the report.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--template"
description: "Report template to use.\n"
info: null
required: false
choices:
- "default"
- "gathered"
- "geo"
- "highcharts"
- "sections"
- "simple"
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--sample_names"
description: "TSV file containing alternative sample names for renaming buttons\
\ in the report.\n"
info: null
example:
- "sample_names.tsv"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--sample_filters"
description: "TSV file containing show/hide patterns for the report\n"
info: null
example:
- "sample_filters.tsv"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--custom_css_file"
description: "Custom CSS file to add to the final report\n"
info: null
example:
- "custom_style_sheet.css"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean_true"
name: "--profile_runtime"
description: "Add analysis of how long MultiQC takes to run to the report\n"
info: null
direction: "input"
- name: "MultiQC behaviour"
arguments:
- type: "boolean_true"
name: "--verbose"
description: "Increase output verbosity.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--quiet"
description: "Only show log warnings\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--strict"
description: "Don't catch exceptions, run additional code checks to help development.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--development"
description: "Development mode. Do not compress and minimise JS, export uncompressed\
\ plot data.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--require_logs"
description: "Require all explicitly requested modules to have log files. If not,\
\ MultiQC will exit with an error.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--no_megaqc_upload"
description: "Don't upload generated report to MegaQC, even if MegaQC options\
\ are found.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--no_ansi"
description: "Disable coloured log output.\n"
info: null
direction: "input"
- type: "string"
name: "--cl_config"
description: "YAML formatted string that allows customizing MultiQC behaviour\
\ like input file detection.\n"
info: null
example:
- "qualimap_config: { general_stats_coverage: [20,40,200] }"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Output format"
arguments:
- type: "boolean_true"
name: "--flat"
description: "Use only flat plots (static images).\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--interactive"
description: "Use only interactive plots (in-browser Javascript).\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--data_dir"
description: "Force the parsed data directory to be created.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--no_data_dir"
description: "Prevent the parsed data directory from being created.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--zip_data_dir"
description: "Compress the data directory.\n"
info: null
direction: "input"
- type: "string"
name: "--data_format"
description: "Output parsed data in a different format than the default 'txt'.\n"
info: null
required: false
choices:
- "tsv"
- "csv"
- "json"
- "yaml"
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean_true"
name: "--pdf"
description: "Creates PDF report with the 'simple' template. Requires Pandoc to\
\ be installed.\n"
info: null
direction: "input"
resources:
- type: "bash_script"
path: "script.sh"
is_executable: true
description: "MultiQC aggregates results from bioinformatics analyses across many\
\ samples into a single report.\nIt searches a given directory for analysis logs\
\ and compiles an HTML report. It's a general use tool, perfect for summarising the\
\ output from numerous bioinformatics tools.\n"
test_resources:
- type: "bash_script"
path: "test.sh"
is_executable: true
- type: "file"
path: "test_data"
info:
keywords:
- "QC"
- "html report"
- "aggregate analysis"
links:
homepage: "https://multiqc.info/"
documentation: "https://multiqc.info/docs/"
repository: "https://github.com/MultiQC/MultiQC"
references:
doi: "10.1093/bioinformatics/btw354"
licence: "GPL v3 or later"
status: "enabled"
scope:
image: "public"
target: "public"
requirements:
commands:
- "ps"
license: "MIT"
links:
repository: "https://github.com/viash-hub/biobox"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "quay.io/biocontainers/multiqc:1.21--pyhdfd78af_0"
target_registry: "images.viash-hub.com"
target_tag: "v0.4.2"
namespace_separator: "/"
setup:
- type: "docker"
run:
- "multiqc --version | sed 's/multiqc, version\\s\\(.*\\)/multiqc: \"\\1\"/' >\
\ /var/software_versions.txt\n"
test_setup:
- type: "apt"
packages:
- "jq"
interactive: false
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/multiqc/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/nextflow/multiqc"
executable: "target/nextflow/multiqc/main.nf"
viash_version: "0.9.4"
git_commit: "02b470d967226478af69c37eae2b1256be1b78fd"
git_remote: "https://github.com/viash-hub/biobox"
package_config:
name: "biobox"
version: "v0.4.2"
summary: "A curated collection of high-quality, standalone bioinformatics components\
\ built with [Viash](https://viash.io).\n"
description: "`biobox` offers a suite of reliable bioinformatics components, similar\
\ to [nf-core/modules](https://github.com/nf-core/modules) and [snakemake-wrappers/bio](https://github.com/snakemake/snakemake-wrappers/tree/master/bio),\
\ but built using the [Viash](https://viash.io) framework.\n\nThis approach emphasizes\
\ **reusability**, **reproducibility**, and adherence to **best practices**. Key\
\ features of `biobox` components include:\n\n* **Standalone & Nextflow Ready:**\
\ Run components directly via the command line or seamlessly integrate them into\
\ Nextflow workflows.\n* **High Quality Standards:**\n * Comprehensive documentation\
\ for components and parameters.\n * Full exposure of underlying tool arguments.\n\
\ * Containerized (Docker) for dependency management and reproducibility.\n\
\ * Unit tested for verified functionality.\n"
info: null
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".requirements.commands += ['ps']\n"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v0.4.2'"
keywords:
- "bioinformatics"
- "modules"
- "sequencing"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/viash-hub/biobox"
issue_tracker: "https://github.com/viash-hub/biobox/issues"

File diff suppressed because it is too large


@@ -0,0 +1,126 @@
manifest {
name = 'multiqc'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v0.4.2'
description = 'MultiQC aggregates results from bioinformatics analyses across many samples into a single report.\nIt searches a given directory for analysis logs and compiles a HTML report. It\'s a general use tool, perfect for summarising the output from numerous bioinformatics tools.\n'
author = 'Dorien Roosen'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
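
The `withLabel` selectors above follow a regular pattern: a 1-2-5 decimal series for `gb`/`tb` (powers of 10), a power-of-two series for `gib`/`tib` (powers of 1024), and a 1-2-5 series for CPUs. The following is a minimal Python sketch that regenerates those values so the decimal and binary variants can be compared. The function names are illustrative assumptions and not part of Viash or Nextflow.

```python
# Sketch: regenerate the label values listed in the process block.
# Label names and byte counts mirror the config; helper names are hypothetical.

DECIMAL_STEPS = [1, 2, 5, 10, 20, 50, 100, 200, 500]
BINARY_STEPS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
CPU_STEPS = DECIMAL_STEPS + [1000]


def memory_labels():
    """Map each memory label (e.g. 'mem8gib') to its size in bytes."""
    labels = {}
    # Decimal units: 1 GB = 10**9 bytes, 1 TB = 10**12 bytes.
    for unit, factor in (("gb", 10**9), ("tb", 10**12)):
        for n in DECIMAL_STEPS:
            labels[f"mem{n}{unit}"] = n * factor
    # Binary units: 1 GiB = 2**30 bytes, 1 TiB = 2**40 bytes.
    for unit, factor in (("gib", 2**30), ("tib", 2**40)):
        for n in BINARY_STEPS:
            labels[f"mem{n}{unit}"] = n * factor
    return labels


def cpu_labels():
    """Map each CPU label (e.g. 'cpu10') to its core count."""
    return {f"cpu{n}": n for n in CPU_STEPS}


if __name__ == "__main__":
    mem = memory_labels()
    # The binary label is about 7% larger than its decimal namesake.
    print(mem["mem1gb"], mem["mem1gib"])
```

Under this reading, `mem8gb` and `mem8gib` request different amounts of memory, which can matter when a scheduler enforces hard limits.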


@@ -0,0 +1,334 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "multiqc",
"description": "MultiQC aggregates results from bioinformatics analyses across many samples into a single report.\nIt searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.\n",
"type": "object",
"$defs": {
"input": {
"title": "Input",
"type": "object",
"description": "No description",
"properties": {
"input": {
"type": "array",
"items": {
"type": "string"
},
"format": "path",
"exists": true,
"description": "File paths to be searched for analysis results to be included in the report.\n",
"help_text": "Type: `file`, multiple: `True`, required, direction: `input`, example: `[\"data/results\"]`. "
}
}
},
"ouput": {
"title": "Output",
"type": "object",
"description": "No description",
"properties": {
"output_report": {
"type": "string",
"format": "path",
"description": "Filepath of the generated report.\n",
"help_text": "Type: `file`, multiple: `False`, default: `\"$id.$key.output_report.html\"`, direction: `output`, example: `\"multiqc_report.html\"`. ",
"default": "$id.$key.output_report.html"
},
"output_data": {
"type": "string",
"format": "path",
"description": "Output directory for parsed data files",
"help_text": "Type: `file`, multiple: `False`, default: `\"$id.$key.output_data\"`, direction: `output`, example: `\"multiqc_data\"`. ",
"default": "$id.$key.output_data"
},
"output_plots": {
"type": "string",
"format": "path",
"description": "Output directory for generated plots",
"help_text": "Type: `file`, multiple: `False`, default: `\"$id.$key.output_plots\"`, direction: `output`, example: `\"multiqc_plots\"`. ",
"default": "$id.$key.output_plots"
}
}
},
"modules and analyses to run": {
"title": "Modules and analyses to run",
"type": "object",
"description": "No description",
"properties": {
"include_modules": {
"type": "array",
"items": {
"type": "string"
},
"description": "Use only these modules",
"help_text": "Type: `string`, multiple: `True`, example: `[\"fastqc\";\"cutadapt\"]`. "
},
"exclude_modules": {
"type": "array",
"items": {
"type": "string"
},
"description": "Do not use these modules",
"help_text": "Type: `string`, multiple: `True`, example: `[\"fastqc\";\"cutadapt\"]`. "
},
"ignore_analysis": {
"type": "array",
"items": {
"type": "string"
},
"description": "Ignore analysis files matching these glob patterns",
"help_text": "Type: `string`, multiple: `True`, example: `[\"run_one/*\";\"run_two/*\"]`. "
},
"ignore_samples": {
"type": "array",
"items": {
"type": "string"
},
"description": "Ignore sample names matching these glob patterns",
"help_text": "Type: `string`, multiple: `True`, example: `[\"sample_1*\";\"sample_3*\"]`. "
},
"ignore_symlinks": {
"type": "boolean",
"description": "Ignore symlinked directories and files",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
}
}
},
"sample name handling": {
"title": "Sample name handling",
"type": "object",
"description": "No description",
"properties": {
"dirs": {
"type": "boolean",
"description": "Prepend directory to sample names to avoid clashing filenames",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"dirs_depth": {
"type": "integer",
"description": "Prepend n directories to sample names",
"help_text": "Type: `integer`, multiple: `False`. "
},
"full_names": {
"type": "boolean",
"description": "Do not clean the sample names (leave as full file name)",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"fn_as_s_name": {
"type": "boolean",
"description": "Use the log filename as the sample name",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"replace_names": {
"type": "string",
"format": "path",
"description": "TSV file to rename sample names during report generation",
"help_text": "Type: `file`, multiple: `False`, direction: `input`, example: `\"replace_names.tsv\"`. "
}
}
},
"report customisation": {
"title": "Report Customisation",
"type": "object",
"description": "No description",
"properties": {
"title": {
"type": "string",
"description": "Report title",
"help_text": "Type: `string`, multiple: `False`. "
},
"comment": {
"type": "string",
"description": "Custom comment, will be printed at the top of the report.\n",
"help_text": "Type: `string`, multiple: `False`. "
},
"template": {
"type": "string",
"description": "Report template to use.\n",
"help_text": "Type: `string`, multiple: `False`, choices: ``default`, `gathered`, `geo`, `highcharts`, `sections`, `simple``. ",
"enum": [
"default",
"gathered",
"geo",
"highcharts",
"sections",
"simple"
]
},
"sample_names": {
"type": "string",
"format": "path",
"description": "TSV file containing alternative sample names for renaming buttons in the report.\n",
"help_text": "Type: `file`, multiple: `False`, direction: `input`, example: `\"sample_names.tsv\"`. "
},
"sample_filters": {
"type": "string",
"format": "path",
"description": "TSV file containing show/hide patterns for the report\n",
"help_text": "Type: `file`, multiple: `False`, direction: `input`, example: `\"sample_filters.tsv\"`. "
},
"custom_css_file": {
"type": "string",
"format": "path",
"description": "Custom CSS file to add to the final report\n",
"help_text": "Type: `file`, multiple: `False`, direction: `input`, example: `\"custom_style_sheet.css\"`. "
},
"profile_runtime": {
"type": "boolean",
"description": "Add analysis of how long MultiQC takes to run to the report\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
}
}
},
"multiqc behaviour": {
"title": "MultiQC behaviour",
"type": "object",
"description": "No description",
"properties": {
"verbose": {
"type": "boolean",
"description": "Increase output verbosity.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"quiet": {
"type": "boolean",
"description": "Only show log warnings\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"strict": {
"type": "boolean",
"description": "Don't catch exceptions, run additional code checks to help development.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"development": {
"type": "boolean",
"description": "Development mode",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"require_logs": {
"type": "boolean",
"description": "Require all explicitly requested modules to have log files",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"no_megaqc_upload": {
"type": "boolean",
"description": "Don't upload generated report to MegaQC, even if MegaQC options are found.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"no_ansi": {
"type": "boolean",
"description": "Disable coloured log output.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"cl_config": {
"type": "string",
"description": "YAML-formatted string for customizing MultiQC behaviour, such as input file detection.\n",
"help_text": "Type: `string`, multiple: `False`, example: `\"qualimap_config: { general_stats_coverage: [20,40,200] }\"`. "
}
}
},
"output format": {
"title": "Output format",
"type": "object",
"description": "No description",
"properties": {
"flat": {
"type": "boolean",
"description": "Use only flat plots (static images).\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"interactive": {
"type": "boolean",
"description": "Use only interactive plots (in-browser Javascript).\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"data_dir": {
"type": "boolean",
"description": "Force the parsed data directory to be created.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"no_data_dir": {
"type": "boolean",
"description": "Prevent the parsed data directory from being created.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"zip_data_dir": {
"type": "boolean",
"description": "Compress the data directory.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"data_format": {
"type": "string",
"description": "Output parsed data in a different format than the default 'txt'.\n",
"help_text": "Type: `string`, multiple: `False`, choices: ``tsv`, `csv`, `json`, `yaml``. ",
"enum": [
"tsv",
"csv",
"json",
"yaml"
]
},
"pdf": {
"type": "boolean",
"description": "Creates PDF report with the 'simple' template",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
}
}
},
"nextflow input-output arguments": {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type": "string",
"description": "Path to an output directory.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"output/\"`. "
}
}
}
},
"allOf": [
{
"$ref": "#/$defs/input"
},
{
"$ref": "#/$defs/ouput"
},
{
"$ref": "#/$defs/modules and analyses to run"
},
{
"$ref": "#/$defs/sample name handling"
},
{
"$ref": "#/$defs/report customisation"
},
{
"$ref": "#/$defs/multiqc behaviour"
},
{
"$ref": "#/$defs/output format"
},
{
"$ref": "#/$defs/nextflow input-output arguments"
}
]
}
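
The schema above groups parameters under `$defs` and stitches the groups together with `allOf` plus local `$ref` pointers. As a rough illustration of how a consumer might flatten those groups and check `enum` constraints, here is a minimal Python sketch using only the standard library. The `SCHEMA` excerpt copies two properties from the schema, while `resolve_ref`, `collect_properties`, and `check_params` are hypothetical helpers; this is not a full JSON Schema validator and not how Nextflow or Viash validate parameters.

```python
# Sketch: flatten `allOf` + `$ref` groups and check enum-constrained parameters.
# SCHEMA is a trimmed excerpt; all function names are illustrative.

SCHEMA = {
    "$defs": {
        "report customisation": {
            "properties": {
                "template": {
                    "type": "string",
                    "enum": ["default", "gathered", "geo",
                             "highcharts", "sections", "simple"],
                },
            },
        },
        "output format": {
            "properties": {
                "data_format": {
                    "type": "string",
                    "enum": ["tsv", "csv", "json", "yaml"],
                },
            },
        },
    },
    "allOf": [
        {"$ref": "#/$defs/report customisation"},
        {"$ref": "#/$defs/output format"},
    ],
}


def resolve_ref(schema, ref):
    """Follow a local JSON pointer such as '#/$defs/output format'."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local refs are supported: {ref}")
    node = schema
    for part in ref[2:].split("/"):
        # Undo JSON pointer escaping (~1 is '/', ~0 is '~').
        node = node[part.replace("~1", "/").replace("~0", "~")]
    return node


def collect_properties(schema):
    """Merge the `properties` of every group referenced from `allOf`."""
    props = {}
    for entry in schema.get("allOf", []):
        props.update(resolve_ref(schema, entry["$ref"]).get("properties", {}))
    return props


def check_params(schema, params):
    """Return a list of problems; an empty list means nothing was flagged."""
    props = collect_properties(schema)
    problems = []
    for name, value in params.items():
        spec = props.get(name)
        if spec is None:
            problems.append(f"unknown parameter: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} is not one of {spec['enum']}")
    return problems


if __name__ == "__main__":
    print(check_params(SCHEMA, {"template": "simple", "data_format": "xml"}))
```

The same pointer walk also works for draft-07 schemas that use `definitions` instead of `$defs`, since the pointer is followed key by key.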


@@ -0,0 +1,187 @@
name: "move_files_to_directory"
version: "v0.2.0"
authors:
- name: "Dorien Roosen"
roles:
- "maintainer"
info:
links:
email: "dorien@data-intuitive.com"
github: "dorien-er"
linkedin: "dorien-roosen"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Arguments"
arguments:
- type: "file"
name: "--input"
description: "Paths of the files that will be copied into the output directory."
info: null
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: true
multiple_sep: ";"
- type: "file"
name: "--output"
description: "Path to output directory"
info: null
must_exist: true
create_parent: true
required: true
direction: "output"
multiple: false
multiple_sep: ";"
resources:
- type: "bash_script"
path: "script.sh"
is_executable: true
summary: "Publish one or multiple files to the same directory"
description: "This component copies one or multiple files to the same destination\
\ directory, creating the output directory if it doesn't exist."
test_resources:
- type: "bash_script"
path: "test.sh"
is_executable: true
info: null
status: "enabled"
scope:
image: "public"
target: "public"
requirements:
commands:
- "ps"
license: "MIT"
links:
repository: "https://github.com/viash-hub/craftbox"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "debian:latest"
target_registry: "images.viash-hub.com"
target_tag: "v0.2.0"
namespace_separator: "/"
setup:
- type: "apt"
packages:
- "procps"
interactive: false
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/move_files_to_directory/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/nextflow/move_files_to_directory"
executable: "target/nextflow/move_files_to_directory/main.nf"
viash_version: "0.9.4"
git_commit: "1c1b0a4a1aff891ab678072b0ba915ac3ac71610"
git_remote: "https://github.com/viash-hub/craftbox"
git_tag: "v0.1.0-8-g1c1b0a4"
package_config:
name: "craftbox"
version: "v0.2.0"
summary: "A collection of custom-tailored scripts and applied utilities built with\
\ Viash.\n"
description: "`craftbox` is a curated collection of custom scripts and utilities\
\ designed to tackle context-specific tasks.\n\nEmphasizing the Viash principles,\
\ `craftbox` components aim for **reusability**, **reproducibility**, and adherence\
\ to **best practices**. Key features generally include:\n\n* **Standalone & Nextflow\
\ Ready:** Components are built to run directly via the command line or be smoothly\
\ integrated into Nextflow workflows.\n* **Custom Implementations:** Contains\
\ scripts and tools developed for particular tasks that may not be found in broader\
\ collections.\n* **High Quality Standards (promoted by Viash):**\n * Clear\
\ documentation for components and their parameters.\n * Full exposure of underlying\
\ script/tool arguments for fine-grained control.\n * Containerized (Docker)\
\ to ensure dependency management and a consistent, reproducible runtime environment.\n\
\ * Unit tested where applicable to ensure components function as expected.\n"
info: null
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".requirements.commands := ['ps']\n"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v0.2.0'"
keywords:
- "scripts"
- "custom"
- "implementations"
- "utilities"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/viash-hub/craftbox"
issue_tracker: "https://github.com/viash-hub/craftbox/issues"


@@ -0,0 +1,126 @@
manifest {
name = 'move_files_to_directory'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v0.2.0'
description = 'This component copies one or multiple files to the same destination directory, creating the output directory if it doesn\'t exist.'
author = 'Dorien Roosen'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
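
The `tempDir` assignment near the top of this config chains Groovy Elvis operators (`?:`) so that the first non-empty variable among `NXF_TEMP`, `VIASH_TEMP`, `TEMPDIR`, and `TMPDIR` wins, with `/tmp` as the fallback. The following Python sketch mirrors that lookup order for illustration; `detect_temp_dir` is a hypothetical name, and the `env` argument exists only to make the behaviour easy to test.

```python
# Sketch: the same environment-variable fallback chain as the Groovy snippet.
import os
from pathlib import Path

# Lookup order copied from the config.
TEMP_VARS = ("NXF_TEMP", "VIASH_TEMP", "TEMPDIR", "TMPDIR")


def detect_temp_dir(env=None):
    """Return the first configured temp directory as an absolute path."""
    env = os.environ if env is None else env
    for name in TEMP_VARS:
        value = env.get(name)
        # Like Groovy's `?:`, skip both unset and empty values.
        if value:
            return Path(value).absolute()
    return Path("/tmp").absolute()


if __name__ == "__main__":
    print(detect_temp_dir())
```

For example, with `VIASH_TEMP=/scratch` and `TMPDIR=/var/tmp` both set and `NXF_TEMP` unset, the sketch picks `/scratch`, because `VIASH_TEMP` comes earlier in the chain.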


@@ -0,0 +1,81 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"title": "move_files_to_directory",
"description": "This component copies one or multiple files to the same destination directory, creating the output directory if it doesn\u0027t exist.",
"type": "object",
"definitions": {
"arguments" : {
"title": "Arguments",
"type": "object",
"description": "No description",
"properties": {
"input": {
"type":
"string",
"description": "Type: List of `file`, required, multiple_sep: `\";\"`. Paths of the files that will be copied into the output directory",
"help_text": "Type: List of `file`, required, multiple_sep: `\";\"`. Paths of the files that will be copied into the output directory."
}
,
"output": {
"type":
"string",
"description": "Type: `file`, required, default: `$id.$key.output`. Path to output directory",
"help_text": "Type: `file`, required, default: `$id.$key.output`. Path to output directory"
,
"default":"$id.$key.output"
}
}
},
"nextflow input-output arguments" : {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type":
"string",
"description": "Type: `string`, required, example: `output/`. Path to an output directory",
"help_text": "Type: `string`, required, example: `output/`. Path to an output directory."
}
,
"param_list": {
"type":
"string",
"description": "Type: `string`, example: `my_params.yaml`. Allows inputting multiple parameter sets to initialise a Nextflow channel",
"help_text": "Type: `string`, example: `my_params.yaml`. Allows inputting multiple parameter sets to initialise a Nextflow channel. A `param_list` can either be a list of maps, a csv file, a json file, a yaml file, or simply a yaml blob.\n\n* A list of maps (as-is) where the keys of each map correspond to the arguments of the pipeline. Example: in a `nextflow.config` file: `param_list: [ [\u0027id\u0027: \u0027foo\u0027, \u0027input\u0027: \u0027foo.txt\u0027], [\u0027id\u0027: \u0027bar\u0027, \u0027input\u0027: \u0027bar.txt\u0027] ]`.\n* A csv file should have column names which correspond to the different arguments of this pipeline. Example: `--param_list data.csv` with columns `id,input`.\n* A json or a yaml file should be a list of maps, each of which has keys corresponding to the arguments of the pipeline. Example: `--param_list data.json` with contents `[ {\u0027id\u0027: \u0027foo\u0027, \u0027input\u0027: \u0027foo.txt\u0027}, {\u0027id\u0027: \u0027bar\u0027, \u0027input\u0027: \u0027bar.txt\u0027} ]`.\n* A yaml blob can also be passed directly as a string. Example: `--param_list \"[ {\u0027id\u0027: \u0027foo\u0027, \u0027input\u0027: \u0027foo.txt\u0027}, {\u0027id\u0027: \u0027bar\u0027, \u0027input\u0027: \u0027bar.txt\u0027} ]\"`.\n\nWhen passing a csv, json or yaml file, relative paths are resolved relative to the location of the parameter file. No such resolution is performed when `param_list` is a list of maps (as-is) or a yaml blob.",
"hidden": true
}
}
}
},
"allOf": [
{
"$ref": "#/definitions/arguments"
},
{
"$ref": "#/definitions/nextflow input-output arguments"
}
]
}
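
The `param_list` help text in this schema says that a csv or json file yields one parameter set per row or list entry, and that relative paths inside such a file are interpreted relative to the file's own location. The Python sketch below illustrates that behaviour using only the standard library. It is an assumption-laden illustration rather than Viash's actual implementation: `load_param_list` and `PATH_KEYS` are hypothetical names, the set of path-valued keys is assumed to be just `input`, and yaml files are left out because parsing them needs a third-party package.

```python
# Sketch: load a param_list file and resolve relative paths against its folder.
import csv
import json
from pathlib import Path

# Assumption: only these keys hold file paths that need resolving.
PATH_KEYS = ("input",)


def load_param_list(param_file, path_keys=PATH_KEYS):
    """Read a csv or json param_list and return a list of parameter dicts."""
    param_file = Path(param_file)
    if param_file.suffix == ".csv":
        # Column names become the argument names of each parameter set.
        with param_file.open(newline="") as handle:
            entries = [dict(row) for row in csv.DictReader(handle)]
    elif param_file.suffix == ".json":
        # The file is expected to contain a list of maps.
        entries = json.loads(param_file.read_text())
    else:
        raise ValueError(f"unsupported param_list format: {param_file.suffix}")

    base = param_file.parent
    for entry in entries:
        for key in path_keys:
            if key in entry and not Path(entry[key]).is_absolute():
                # Relative paths are taken relative to the parameter file.
                entry[key] = str(base / entry[key])
    return entries
```

Calling `load_param_list("runs/data.csv")` on a file with columns `id,input` and a row `foo,foo.txt` would, under these assumptions, give `[{"id": "foo", "input": "runs/foo.txt"}]`.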


@@ -0,0 +1,659 @@
name: "cellbender_remove_background"
namespace: "correction"
version: "v4.0.0"
argument_groups:
- name: "Inputs"
arguments:
- type: "file"
name: "--input"
alternatives:
- "-i"
description: "Input h5mu file. Data file on which to run tool. Data must be un-filtered:\
\ it should include empty droplets."
info: null
example:
- "input.h5mu"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--modality"
description: "List of modalities to process."
info: null
default:
- "rna"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Outputs"
arguments:
- type: "file"
name: "--output"
alternatives:
- "-o"
description: "Full count matrix as an h5mu file, with background RNA removed.\
\ This file contains all the original droplet barcodes."
info: null
example:
- "output.h5mu"
must_exist: true
create_parent: true
required: true
direction: "output"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--layer_output"
description: "Output layer"
info: null
default:
- "cellbender_corrected"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_background_fraction"
info: null
default:
- "cellbender_background_fraction"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_cell_probability"
info: null
default:
- "cellbender_cell_probability"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_cell_size"
info: null
default:
- "cellbender_cell_size"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_droplet_efficiency"
description: "Name of the column in the .obs dataframe to store the droplet efficiencies\
\ in.\n"
info: null
default:
- "cellbender_droplet_efficiency"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_latent_scale"
info: null
default:
- "cellbender_latent_scale"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--var_ambient_expression"
info: null
default:
- "cellbender_ambient_expression"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obsm_gene_expression_encoding"
info: null
default:
- "cellbender_gene_expression_encoding"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_compression"
description: "Compression format to use for the output AnnData and/or MuData objects.\n\
By default no compression is applied.\n"
info: null
example:
- "gzip"
required: false
choices:
- "gzip"
- "lzf"
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Arguments"
arguments:
- type: "boolean"
name: "--expected_cells_from_qc"
description: "Will use the Cell Ranger QC to determine the estimated number of\
\ cells"
info: null
default:
- false
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--expected_cells"
description: "Number of cells expected in the dataset (a rough estimate within\
\ a factor of 2 is sufficient)."
info: null
example:
- 1000
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--total_droplets_included"
description: "The number of droplets from the rank-ordered UMI plot\nthat will\
\ have their cell probabilities inferred as an\noutput. Include the droplets\
\ which might contain cells.\nDroplets beyond TOTAL_DROPLETS_INCLUDED should\
\ be\n'surely empty' droplets.\n"
info: null
example:
- 25000
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--force_cell_umi_prior"
description: "Ignore CellBender's heuristic prior estimation, and use this prior\
\ for UMI counts in cells."
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--force_empty_umi_prior"
description: "Ignore CellBender's heuristic prior estimation, and use this prior\
\ for UMI counts in empty droplets."
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--model"
description: "Which model is being used for count data.\n\n* 'naive' subtracts\
\ the estimated ambient profile.\n* 'simple' does not model either ambient RNA\
\ or random barcode swapping (for debugging purposes -- not recommended).\n\
* 'ambient' assumes background RNA is incorporated into droplets.\n* 'swapping'\
\ assumes background RNA comes from random barcode swapping (via PCR chimeras).\n\
* 'full' uses a combined ambient and swapping model.\n"
info: null
default:
- "full"
required: false
choices:
- "naive"
- "simple"
- "ambient"
- "swapping"
- "full"
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--epochs"
description: "Number of epochs to train."
info: null
default:
- 150
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--low_count_threshold"
description: "Droplets with UMI counts below this number are completely \nexcluded\
\ from the analysis. This can help identify the correct \nprior for empty droplet\
\ counts in the rare case where empty \ncounts are extremely high (over 200).\n"
info: null
default:
- 5
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--z_dim"
description: "Dimension of latent variable z.\n"
info: null
default:
- 64
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--z_layers"
description: "Dimension of hidden layers in the encoder for z.\n"
info: null
default:
- 512
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "double"
name: "--training_fraction"
description: "Training detail: the fraction of the data used for training.\nThe\
\ rest is never seen by the inference algorithm. Speeds up learning.\n"
info: null
default:
- 0.9
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "double"
name: "--empty_drop_training_fraction"
description: "Training detail: the fraction of the training data each epoch that\
\ \nis drawn (randomly sampled) from surely empty droplets.\n"
info: null
default:
- 0.2
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--ignore_features"
description: "Integer indices of features to ignore entirely. In the output\n\
count matrix, the counts for these features will be unchanged.\n"
info: null
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "double"
name: "--fpr"
description: "Target 'delta' false positive rate in [0, 1). Use 0 for a cohort\n\
of samples which will be jointly analyzed for differential expression.\nA false\
\ positive is a true signal count that is erroneously removed.\nMore background\
\ removal is accompanied by more signal removal at\nhigh values of FPR. You\
\ can specify multiple values, which will\ncreate multiple output files.\n"
info: null
default:
- 0.01
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "string"
name: "--exclude_feature_types"
description: "Feature types to ignore during the analysis. These features will\n\
be left unchanged in the output file.\n"
info: null
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "double"
name: "--projected_ambient_count_threshold"
description: "Controls how many features are included in the analysis, which\n\
can lead to a large speedup. If a feature is expected to have less\nthan PROJECTED_AMBIENT_COUNT_THRESHOLD\
\ counts total in all cells\n(summed), then that gene is excluded, and it will\
\ be unchanged\nin the output count matrix. For example, \nPROJECTED_AMBIENT_COUNT_THRESHOLD\
\ = 0 will include all features\nwhich have even a single count in any empty\
\ droplet.\n"
info: null
default:
- 0.1
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "double"
name: "--learning_rate"
description: "Training detail: lower learning rate for inference.\nA OneCycle\
\ learning rate schedule is used, where the\nupper learning rate is ten times\
\ this value. (For this\nvalue, probably do not exceed 1e-3).\n"
info: null
default:
- 1.0E-4
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "double"
name: "--final_elbo_fail_fraction"
description: "Training is considered to have failed if \n(best_test_ELBO - final_test_ELBO)/(best_test_ELBO\
\ - initial_test_ELBO) > FINAL_ELBO_FAIL_FRACTION.\nTraining will automatically\
\ re-run if --num-training-tries > 1.\nBy default, will not fail training based\
\ on final_training_ELBO.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "double"
name: "--epoch_elbo_fail_fraction"
description: "Training is considered to have failed if \n(previous_epoch_test_ELBO\
\ - current_epoch_test_ELBO)/(previous_epoch_test_ELBO - initial_train_ELBO)\
\ > EPOCH_ELBO_FAIL_FRACTION.\nTraining will automatically re-run if --num-training-tries\
\ > 1.\nBy default, will not fail training based on epoch_training_ELBO.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--num_training_tries"
description: "Number of times to attempt to train the model. At each subsequent\
\ attempt,\nthe learning rate is multiplied by LEARNING_RATE_RETRY_MULT.\n"
info: null
default:
- 1
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "double"
name: "--learning_rate_retry_mult"
description: "Learning rate is multiplied by this amount each time a new training\n\
attempt is made. (This parameter is only used if training fails based\non EPOCH_ELBO_FAIL_FRACTION\
\ or FINAL_ELBO_FAIL_FRACTION and\nNUM_TRAINING_TRIES is > 1.) \n"
info: null
default:
- 0.2
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--posterior_batch_size"
description: "Training detail: size of batches when creating the posterior.\n\
Reduce this to avoid running out of GPU memory creating the posterior\n(will\
\ be slower).\n"
info: null
default:
- 128
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--posterior_regulation"
description: "Posterior regularization method. (For experts: not required for\
\ normal usage,\nsee documentation). \n\n* PRq is approximate quantile-targeting.\n\
* PRmu is approximate mean-targeting aggregated over genes (behavior of v0.2.0).\n\
* PRmu_gene is approximate mean-targeting per gene.\n"
info: null
required: false
choices:
- "PRq"
- "PRmu"
- "PRmu_gene"
direction: "input"
multiple: false
multiple_sep: ";"
- type: "double"
name: "--alpha"
description: "Tunable parameter alpha for the PRq posterior regularization method\n\
(not normally used: see documentation).\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "double"
name: "--q"
description: "Tunable parameter q for the CDF threshold estimation method (not\n\
normally used: see documentation).\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--estimator"
description: "Output denoised count estimation method. (For experts: not required\n\
for normal usage, see documentation).\n"
info: null
default:
- "mckp"
required: false
choices:
- "map"
- "mean"
- "cdf"
- "sample"
- "mckp"
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean_true"
name: "--estimator_multiple_cpu"
description: "Including the flag --estimator-multiple-cpu will use more than one\n\
CPU to compute the MCKP output count estimator in parallel (does nothing\nfor\
\ other estimators).\n"
info: null
direction: "input"
- type: "boolean"
name: "--constant_learning_rate"
description: "Including the flag --constant-learning-rate will use the ClippedAdam\n\
optimizer instead of the OneCycleLR learning rate schedule, which is\nthe default.\
\ Learning is faster with the OneCycleLR schedule.\nHowever, training can easily\
\ be continued from a checkpoint for more\nepochs than the initial command specified\
\ when using ClippedAdam. On\nthe other hand, if using the OneCycleLR schedule\
\ with 150 epochs\nspecified, it is not possible to pick up from that final\
\ checkpoint\nand continue training until 250 epochs.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean_true"
name: "--debug"
description: "Including the flag --debug will log extra messages useful for debugging.\n"
info: null
direction: "input"
- type: "boolean_true"
name: "--cuda"
description: "Including the flag --cuda will run the inference on a\nGPU.\n"
info: null
direction: "input"
resources:
- type: "python_script"
path: "script.py"
is_executable: true
- type: "file"
path: "setup_logger.py"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "Eliminating technical artifacts from high-throughput single-cell RNA\
\ sequencing data.\n\nThis module removes counts due to ambient RNA molecules and\
\ random barcode swapping from (raw) UMI-based scRNA-seq count matrices. \nAt the\
\ moment, only the count matrices produced by the CellRanger count pipeline are supported.\
\ Support for additional tools and protocols \nwill be added in the future. A quick\
\ start tutorial can be found here.\n\nFleming et al. 2022, bioRxiv.\n"
test_resources:
- type: "python_script"
path: "test.py"
is_executable: true
- type: "file"
path: "pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5mu"
info: null
status: "enabled"
scope:
image: "public"
target: "public"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
label:
- "midcpu"
- "midmem"
- "gpu"
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu22.04"
target_registry: "images.viash-hub.com"
target_tag: "v4.0.0"
namespace_separator: "/"
setup:
- type: "docker"
run:
- "apt update && DEBIAN_FRONTEND=noninteractive apt install -y make build-essential\
\ libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget ca-certificates\
\ curl llvm libncurses5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev\
\ liblzma-dev mecab-ipadic-utf8 git \\\n&& curl https://pyenv.run | bash \\\n\
&& pyenv update \\\n&& pyenv install $PYTHON_VERSION \\\n&& pyenv global $PYTHON_VERSION\
\ \\\n&& apt-get clean\n"
env:
- "PYENV_ROOT=\"/root/.pyenv\""
- "PATH=\"$PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH\""
- "PYTHON_VERSION=3.7.16"
- type: "python"
user: false
packages:
- "lxml~=4.8.0"
- "mudata~=0.2.1"
- "cellbender~=0.3.0"
upgrade: true
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/correction/cellbender_remove_background/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/nextflow/correction/cellbender_remove_background"
executable: "target/nextflow/correction/cellbender_remove_background/main.nf"
viash_version: "0.9.4"
git_commit: "de02293c9e13198622b988dac952b2c8c70a1e35"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
package_config:
name: "openpipeline"
version: "v4.0.0"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.0'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"
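
The `--final_elbo_fail_fraction`, `--num_training_tries` and `--learning_rate_retry_mult` descriptions in the config above define a failure criterion and a retry schedule. The following is a minimal sketch of that arithmetic, assuming the formulas apply exactly as written in those descriptions. The function names are hypothetical and this is not CellBender's own implementation.

```python
def final_elbo_failed(best_test_elbo, final_test_elbo, initial_test_elbo, fail_fraction):
    """Apply the criterion from the --final_elbo_fail_fraction description."""
    if fail_fraction is None:
        # No threshold given: training is never failed on this basis.
        return False
    denominator = best_test_elbo - initial_test_elbo
    if denominator == 0:
        # Assumption for this sketch only: no improvement, nothing to compare.
        return False
    return (best_test_elbo - final_test_elbo) / denominator > fail_fraction


def retry_learning_rates(learning_rate, retry_mult, num_tries):
    """Learning rate per attempt: multiplied by retry_mult at each new attempt."""
    return [learning_rate * retry_mult**attempt for attempt in range(num_tries)]


# Illustrative ELBO values (not taken from a real run).
failed = final_elbo_failed(-500.0, -550.0, -1000.0, fail_fraction=0.05)
# Defaults from the config: learning rate 1.0E-4, retry multiplier 0.2.
rates = retry_learning_rates(1.0e-4, 0.2, num_tries=3)
```

With the defaults, a second and third attempt would train at roughly 2e-5 and 4e-6, so each retry is considerably more conservative than the last.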


@@ -0,0 +1,125 @@
manifest {
name = 'correction/cellbender_remove_background'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.0'
description = 'Eliminating technical artifacts from high-throughput single-cell RNA sequencing data.\n\nThis module removes counts due to ambient RNA molecules and random barcode swapping from (raw) UMI-based scRNA-seq count matrices. \nAt the moment, only the count matrices produced by the CellRanger count pipeline are supported. Support for additional tools and protocols \nwill be added in the future. A quick start tutorial can be found here.\n\nFleming et al. 2022, bioRxiv.\n'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")
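
The `tempDir` detection near the top of this config walks a chain of environment variables and falls back to `/tmp`. A rough Python equivalent of that fallback chain is sketched below; the helper name `detect_temp_dir` and the optional `env` argument are hypothetical and exist only to make the logic easy to test.

```python
import os
from pathlib import Path


def detect_temp_dir(env=None):
    """Return the first non-empty temp directory setting, else /tmp."""
    env = os.environ if env is None else env
    for name in ("NXF_TEMP", "VIASH_TEMP", "TEMPDIR", "TMPDIR"):
        value = env.get(name)
        # Unset and empty values both fall through, as with Groovy's ?: operator.
        if value:
            return Path(value).absolute()
    return Path("/tmp").absolute()


temp_dir = detect_temp_dir({"VIASH_TEMP": "/scratch", "TMPDIR": "/var/tmp"})
```

The order matters: `NXF_TEMP` wins over `VIASH_TEMP`, which wins over the generic `TEMPDIR` and `TMPDIR` variables.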


@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt;
// uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
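
The memory labels and the `errorStrategy` closure above combine into a simple retry rule. The sketch below mirrors that rule in Python, assuming the closures behave exactly as written; `memory_gb` and `error_strategy` are hypothetical helper names, not Nextflow APIs.

```python
def memory_gb(base_gb, attempt, max_retries=3, limit_gb=None):
    """Memory requested for a given attempt, following the withLabel closures."""
    # Once the attempt count reaches maxRetries and a limit is configured,
    # request the configured limit instead of scaling further.
    if limit_gb and max_retries and attempt >= max_retries:
        return limit_gb
    return base_gb * attempt


def error_strategy(exit_status):
    """Retry only for exit codes 137 through 140 (inclusive), else terminate."""
    return "retry" if 137 <= exit_status <= 140 else "terminate"


# 'midmem' starts at 25 GB; 240 GB is the limit from the commented-out example.
midmem_schedule = [memory_gb(25, attempt, limit_gb=240) for attempt in (1, 2, 3)]
```

Because `resourceLimits` is commented out by default, `limit_gb` is effectively unset in a stock run, and the request simply grows linearly with the attempt number.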


@@ -0,0 +1,335 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "cellbender_remove_background",
"description": "Eliminating technical artifacts from high-throughput single-cell RNA sequencing data.\n\nThis module removes counts due to ambient RNA molecules and random barcode swapping from (raw) UMI-based scRNA-seq count matrices. \nAt the moment, only the count matrices produced by the CellRanger count pipeline are supported. Support for additional tools and protocols \nwill be added in the future. A quick start tutorial can be found here.\n\nFleming et al. 2022, bioRxiv.\n",
"type": "object",
"$defs": {
"inputs": {
"title": "Inputs",
"type": "object",
"description": "No description",
"properties": {
"input": {
"type": "string",
"format": "path",
"exists": true,
"description": "Input h5mu file",
"help_text": "Type: `file`, multiple: `False`, required, direction: `input`, example: `\"input.h5mu\"`. "
},
"modality": {
"type": "string",
"description": "List of modalities to process.",
"help_text": "Type: `string`, multiple: `False`, default: `\"rna\"`. ",
"default": "rna"
}
}
},
"outputs": {
"title": "Outputs",
"type": "object",
"description": "No description",
"properties": {
"output": {
"type": "string",
"format": "path",
"description": "Full count matrix as an h5mu file, with background RNA removed",
"help_text": "Type: `file`, multiple: `False`, required, default: `\"$id.$key.output.h5mu\"`, direction: `output`, example: `\"output.h5mu\"`. ",
"default": "$id.$key.output.h5mu"
},
"layer_output": {
"type": "string",
"description": "Output layer",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_corrected\"`. ",
"default": "cellbender_corrected"
},
"obs_background_fraction": {
"type": "string",
"description": "",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_background_fraction\"`. ",
"default": "cellbender_background_fraction"
},
"obs_cell_probability": {
"type": "string",
"description": "",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_cell_probability\"`. ",
"default": "cellbender_cell_probability"
},
"obs_cell_size": {
"type": "string",
"description": "",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_cell_size\"`. ",
"default": "cellbender_cell_size"
},
"obs_droplet_efficiency": {
"type": "string",
"description": "Name of the column in the .obs dataframe to store the droplet efficiencies in.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_droplet_efficiency\"`. ",
"default": "cellbender_droplet_efficiency"
},
"obs_latent_scale": {
"type": "string",
"description": "",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_latent_scale\"`. ",
"default": "cellbender_latent_scale"
},
"var_ambient_expression": {
"type": "string",
"description": "",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_ambient_expression\"`. ",
"default": "cellbender_ambient_expression"
},
"obsm_gene_expression_encoding": {
"type": "string",
"description": "",
"help_text": "Type: `string`, multiple: `False`, default: `\"cellbender_gene_expression_encoding\"`. ",
"default": "cellbender_gene_expression_encoding"
},
"output_compression": {
"type": "string",
"description": "Compression format to use for the output AnnData and/or MuData objects.\nBy default no compression is applied.\n",
"help_text": "Type: `string`, multiple: `False`, example: `\"gzip\"`, choices: ``gzip`, `lzf``. ",
"enum": [
"gzip",
"lzf"
]
}
}
},
"arguments": {
"title": "Arguments",
"type": "object",
"description": "No description",
"properties": {
"expected_cells_from_qc": {
"type": "boolean",
"description": "Will use the Cell Ranger QC to determine the estimated number of cells",
"help_text": "Type: `boolean`, multiple: `False`, default: `false`. ",
"default": false
},
"expected_cells": {
"type": "integer",
"description": "Number of cells expected in the dataset (a rough estimate within a factor of 2 is sufficient).",
"help_text": "Type: `integer`, multiple: `False`, example: `1000`. "
},
"total_droplets_included": {
"type": "integer",
"description": "The number of droplets from the rank-ordered UMI plot\nthat will have their cell probabilities inferred as an\noutput",
"help_text": "Type: `integer`, multiple: `False`, example: `25000`. "
},
"force_cell_umi_prior": {
"type": "integer",
"description": "Ignore CellBender's heuristic prior estimation, and use this prior for UMI counts in cells.",
"help_text": "Type: `integer`, multiple: `False`. "
},
"force_empty_umi_prior": {
"type": "integer",
"description": "Ignore CellBender's heuristic prior estimation, and use this prior for UMI counts in empty droplets.",
"help_text": "Type: `integer`, multiple: `False`. "
},
"model": {
"type": "string",
"description": "Which model is being used for count data.\n\n* 'naive' subtracts the estimated ambient profile.\n* 'simple' does not model either ambient RNA or random barcode swapping (for debugging purposes -- not recommended).\n* 'ambient' assumes background RNA is incorporated into droplets.\n* 'swapping' assumes background RNA comes from random barcode swapping (via PCR chimeras).\n* 'full' uses a combined ambient and swapping model.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"full\"`, choices: ``naive`, `simple`, `ambient`, `swapping`, `full``. ",
"enum": [
"naive",
"simple",
"ambient",
"swapping",
"full"
],
"default": "full"
},
"epochs": {
"type": "integer",
"description": "Number of epochs to train.",
"help_text": "Type: `integer`, multiple: `False`, default: `150`. ",
"default": 150
},
"low_count_threshold": {
"type": "integer",
"description": "Droplets with UMI counts below this number are completely \nexcluded from the analysis",
"help_text": "Type: `integer`, multiple: `False`, default: `5`. ",
"default": 5
},
"z_dim": {
"type": "integer",
"description": "Dimension of latent variable z.\n",
"help_text": "Type: `integer`, multiple: `False`, default: `64`. ",
"default": 64
},
"z_layers": {
"type": "array",
"items": {
"type": "integer"
},
"description": "Dimension of hidden layers in the encoder for z.\n",
"help_text": "Type: `integer`, multiple: `True`, default: `[512]`. ",
"default": [
512
]
},
"training_fraction": {
"type": "number",
"description": "Training detail: the fraction of the data used for training.\nThe rest is never seen by the inference algorithm",
"help_text": "Type: `double`, multiple: `False`, default: `0.9`. ",
"default": 0.9
},
"empty_drop_training_fraction": {
"type": "number",
"description": "Training detail: the fraction of the training data each epoch that \nis drawn (randomly sampled) from surely empty droplets.\n",
"help_text": "Type: `double`, multiple: `False`, default: `0.2`. ",
"default": 0.2
},
"ignore_features": {
"type": "array",
"items": {
"type": "integer"
},
"description": "Integer indices of features to ignore entirely",
"help_text": "Type: `integer`, multiple: `True`. "
},
"fpr": {
"type": "array",
"items": {
"type": "number"
},
"description": "Target 'delta' false positive rate in [0, 1)",
"help_text": "Type: `double`, multiple: `True`, default: `[0.01]`. ",
"default": [
0.01
]
},
"exclude_feature_types": {
"type": "array",
"items": {
"type": "string"
},
"description": "Feature types to ignore during the analysis",
"help_text": "Type: `string`, multiple: `True`. "
},
"projected_ambient_count_threshold": {
"type": "number",
"description": "Controls how many features are included in the analysis, which\ncan lead to a large speedup",
"help_text": "Type: `double`, multiple: `False`, default: `0.1`. ",
"default": 0.1
},
"learning_rate": {
"type": "number",
"description": "Training detail: lower learning rate for inference.\nA OneCycle learning rate schedule is used, where the\nupper learning rate is ten times this value",
"help_text": "Type: `double`, multiple: `False`, default: `1.0E-4`. ",
"default": 0.00010
},
"final_elbo_fail_fraction": {
"type": "number",
"description": "Training is considered to have failed if \n(best_test_ELBO - final_test_ELBO)/(best_test_ELBO - initial_test_ELBO) > FINAL_ELBO_FAIL_FRACTION.\nTraining will automatically re-run if --num-training-tries > 1.\nBy default, will not fail training based on final_training_ELBO.\n",
"help_text": "Type: `double`, multiple: `False`. "
},
"epoch_elbo_fail_fraction": {
"type": "number",
"description": "Training is considered to have failed if \n(previous_epoch_test_ELBO - current_epoch_test_ELBO)/(previous_epoch_test_ELBO - initial_train_ELBO) > EPOCH_ELBO_FAIL_FRACTION.\nTraining will automatically re-run if --num-training-tries > 1.\nBy default, will not fail training based on epoch_training_ELBO.\n",
"help_text": "Type: `double`, multiple: `False`. "
},
"num_training_tries": {
"type": "integer",
"description": "Number of times to attempt to train the model",
"help_text": "Type: `integer`, multiple: `False`, default: `1`. ",
"default": 1
},
"learning_rate_retry_mult": {
"type": "number",
"description": "Learning rate is multiplied by this amount each time a new training\nattempt is made",
"help_text": "Type: `double`, multiple: `False`, default: `0.2`. ",
"default": 0.2
},
"posterior_batch_size": {
"type": "integer",
"description": "Training detail: size of batches when creating the posterior.\nReduce this to avoid running out of GPU memory creating the posterior\n(will be slower).\n",
"help_text": "Type: `integer`, multiple: `False`, default: `128`. ",
"default": 128
},
"posterior_regulation": {
"type": "string",
"description": "Posterior regularization method",
"help_text": "Type: `string`, multiple: `False`, choices: ``PRq`, `PRmu`, `PRmu_gene``. ",
"enum": [
"PRq",
"PRmu",
"PRmu_gene"
]
},
"alpha": {
"type": "number",
"description": "Tunable parameter alpha for the PRq posterior regularization method\n(not normally used: see documentation).\n",
"help_text": "Type: `double`, multiple: `False`. "
},
"q": {
"type": "number",
"description": "Tunable parameter q for the CDF threshold estimation method (not\nnormally used: see documentation).\n",
"help_text": "Type: `double`, multiple: `False`. "
},
"estimator": {
"type": "string",
"description": "Output denoised count estimation method",
"help_text": "Type: `string`, multiple: `False`, default: `\"mckp\"`, choices: ``map`, `mean`, `cdf`, `sample`, `mckp``. ",
"enum": [
"map",
"mean",
"cdf",
"sample",
"mckp"
],
"default": "mckp"
},
"estimator_multiple_cpu": {
"type": "boolean",
"description": "Including the flag --estimator-multiple-cpu will use more than one\nCPU to compute the MCKP output count estimator in parallel (does nothing\nfor other estimators).\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"constant_learning_rate": {
"type": "boolean",
"description": "Including the flag --constant-learning-rate will use the ClippedAdam\noptimizer instead of the OneCycleLR learning rate schedule, which is\nthe default",
"help_text": "Type: `boolean`, multiple: `False`. "
},
"debug": {
"type": "boolean",
"description": "Including the flag --debug will log extra messages useful for debugging.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"cuda": {
"type": "boolean",
"description": "Including the flag --cuda will run the inference on a\nGPU.\n",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
}
}
},
"nextflow input-output arguments": {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type": "string",
"description": "Path to an output directory.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"output/\"`. "
}
}
}
},
"allOf": [
{
"$ref": "#/$defs/inputs"
},
{
"$ref": "#/$defs/outputs"
},
{
"$ref": "#/$defs/arguments"
},
{
"$ref": "#/$defs/nextflow input-output arguments"
}
]
}

View File

@@ -0,0 +1,12 @@
def setup_logger():
    import logging
    from sys import stdout

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler(stdout)
    logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    console_handler.setFormatter(logFormatter)
    logger.addHandler(console_handler)

    return logger
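
The helper above configures the root logger, so each call attaches another stdout handler. The sketch below is illustrative only: it repeats the function so the block runs standalone, and `count_handlers_added` is a hypothetical name introduced here to show that calling the helper more than once would duplicate log lines.

```python
import logging
from sys import stdout


def setup_logger():
    # Same behavior as the helper above: root logger, INFO level, stdout handler.
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler(stdout)
    log_formatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    console_handler.setFormatter(log_formatter)
    logger.addHandler(console_handler)
    return logger


def count_handlers_added(calls):
    # Hypothetical helper: how many handlers the root logger gains after `calls` calls.
    root = logging.getLogger()
    before = len(root.handlers)
    for _ in range(calls):
        setup_logger()
    return len(root.handlers) - before


if __name__ == "__main__":
    logger = setup_logger()
    logger.info("Reading input file")
```

Calling the helper once per script, as the components presumably do, avoids the duplication.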

View File

@@ -0,0 +1,274 @@
name: "add_id"
namespace: "metadata"
version: "v4.0.0"
authors:
- name: "Dries Schaumont"
roles:
- "maintainer"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Arguments"
arguments:
- type: "file"
name: "--input"
alternatives:
- "-i"
description: "Path to the input .h5mu."
info: null
example:
- "sample_path"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--input_id"
description: "The input id."
info: null
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_output"
description: "Name of the .obs column where to store the id."
info: null
default:
- "sample_id"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--output"
alternatives:
- "-o"
description: "Name of output MuData file.\n"
info: null
example:
- "output.h5mu"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
- type: "boolean_true"
name: "--make_observation_keys_unique"
description: "Join the id to the .obs index (.obs_names)."
info: null
direction: "input"
- type: "string"
name: "--output_compression"
description: "Compression format to use for the output AnnData and/or Mudata objects.\n\
By default no compression is applied.\n"
info: null
example:
- "gzip"
required: false
choices:
- "gzip"
- "lzf"
direction: "input"
multiple: false
multiple_sep: ";"
resources:
- type: "python_script"
path: "script.py"
is_executable: true
- type: "file"
path: "setup_logger.py"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "Add id of .obs. Also allows to make .obs_names (the .obs index) unique\
\ \nby prefixing the values with a unique id per .h5mu file.\n"
test_resources:
- type: "python_script"
path: "test.py"
is_executable: true
- type: "file"
path: "e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
info: null
status: "enabled"
scope:
image: "public"
target: "public"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
label:
- "singlecpu"
- "lowmem"
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "python:3.11-slim"
target_registry: "images.viash-hub.com"
target_tag: "v4.0.0"
namespace_separator: "/"
setup:
- type: "apt"
packages:
- "procps"
interactive: false
- type: "python"
user: false
packages:
- "anndata~=0.12.7"
- "awkward"
- "mudata~=0.3.2"
script:
- "exec(\"try:\\n import zarr; from importlib.metadata import version\\nexcept\
\ ModuleNotFoundError:\\n exit(0)\\nelse: assert int(version(\\\"zarr\\\"\
).partition(\\\".\\\")[0]) > 2\")"
upgrade: true
test_setup:
- type: "apt"
packages:
- "git"
interactive: false
- type: "python"
user: false
packages:
- "viashpy==0.8.0"
github:
- "openpipelines-bio/core#subdirectory=packages/python/openpipeline_testutils"
upgrade: true
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/metadata/add_id/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/nextflow/metadata/add_id"
executable: "target/nextflow/metadata/add_id/main.nf"
viash_version: "0.9.4"
git_commit: "de02293c9e13198622b988dac952b2c8c70a1e35"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
package_config:
name: "openpipeline"
version: "v4.0.0"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.0'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"

File diff suppressed because it is too large

View File

@@ -0,0 +1,126 @@
manifest {
name = 'metadata/add_id'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.0'
description = 'Add id of .obs. Also allows to make .obs_names (the .obs index) unique \nby prefixing the values with a unique id per .h5mu file.\n'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")
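
The `tempDir` detection near the top of this config is a fallback chain over environment variables. As a rough illustration, the same lookup order can be sketched in Python; `detect_temp_dir` is a hypothetical name, and the `env` parameter exists only so the sketch can be exercised without touching the real environment.

```python
import os
from pathlib import Path


def detect_temp_dir(env=None):
    """Return the first non-empty temp directory setting, else /tmp."""
    env = os.environ if env is None else env
    # Same order as the Groovy elvis chain: NXF_TEMP, VIASH_TEMP, TEMPDIR, TMPDIR.
    for name in ("NXF_TEMP", "VIASH_TEMP", "TEMPDIR", "TMPDIR"):
        value = env.get(name)
        if value:  # unset and empty values both fall through, like `?:`
            return Path(value).absolute()
    return Path("/tmp").absolute()


if __name__ == "__main__":
    print(detect_temp_dir())
```

The resulting path is what the `mount_temp` profile passes to `docker.temp`, `podman.temp` and `charliecloud.temp`.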

View File

@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt
// uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
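
The retry and memory-scaling rules in this labels config amount to simple arithmetic. The sketch below restates them in Python for illustration only: `memory_for_attempt` and `error_strategy` are hypothetical names (not Nextflow or Viash APIs), and memory is expressed as a plain number of GB rather than a Nextflow memory object.

```python
def memory_for_attempt(base_gb, attempt, max_retries=3, limit_gb=None):
    """Mirror the closure used by the lowmem/midmem/highmem/veryhighmem labels.

    If a resource limit is configured and the retry budget has been reached,
    request the limit itself; otherwise scale the base amount by the attempt.
    """
    if limit_gb is not None and max_retries and attempt >= max_retries:
        return limit_gb
    return base_gb * attempt


def error_strategy(exit_status):
    """Retry only for exit codes 137..140 (inclusive), as in the config."""
    return "retry" if 137 <= exit_status <= 140 else "terminate"


if __name__ == "__main__":
    # lowmem (4 GB base) over three attempts with a 240 GB limit configured
    for attempt in (1, 2, 3):
        print(attempt, memory_for_attempt(4, attempt, limit_gb=240))
```

With no `resourceLimits` set, which is the default here because that line is commented out, the request simply grows linearly with the attempt number.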

View File

@@ -0,0 +1,75 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "add_id",
"description": "Add id of .obs. Also allows to make .obs_names (the .obs index) unique \nby prefixing the values with an unique id per .h5mu file.\n",
"type": "object",
"$defs": {
"arguments": {
"title": "Arguments",
"type": "object",
"description": "No description",
"properties": {
"input": {
"type": "string",
"format": "path",
"exists": true,
"description": "Path to the input .h5mu.",
"help_text": "Type: `file`, multiple: `False`, required, direction: `input`, example: `\"sample_path\"`. "
},
"input_id": {
"type": "string",
"description": "The input id.",
"help_text": "Type: `string`, multiple: `False`, required. "
},
"obs_output": {
"type": "string",
"description": "Name of the .obs column where to store the id.",
"help_text": "Type: `string`, multiple: `False`, default: `\"sample_id\"`. ",
"default": "sample_id"
},
"output": {
"type": "string",
"format": "path",
"description": "Name of output MuData file.\n",
"help_text": "Type: `file`, multiple: `False`, default: `\"$id.$key.output.h5mu\"`, direction: `output`, example: `\"output.h5mu\"`. ",
"default": "$id.$key.output.h5mu"
},
"make_observation_keys_unique": {
"type": "boolean",
"description": "Join the id to the .obs index (.obs_names).",
"help_text": "Type: `boolean_true`, multiple: `False`, default: `false`. ",
"default": false
},
"output_compression": {
"type": "string",
"description": "Compression format to use for the output AnnData and/or Mudata objects.\nBy default no compression is applied.\n",
"help_text": "Type: `string`, multiple: `False`, example: `\"gzip\"`, choices: ``gzip`, `lzf``. ",
"enum": [
"gzip",
"lzf"
]
}
}
},
"nextflow input-output arguments": {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type": "string",
"description": "Path to an output directory.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"output/\"`. "
}
}
}
},
"allOf": [
{
"$ref": "#/$defs/arguments"
},
{
"$ref": "#/$defs/nextflow input-output arguments"
}
]
}
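
To make the `add_id` arguments concrete, here is a minimal sketch of the behavior the descriptions suggest. It is not the component's `script.py`, which is not shown in this diff: the function name `add_id`, the plain-list representation of `.obs_names`, and the `_` separator are all assumptions made for illustration.

```python
def add_id(obs_names, input_id, make_observation_keys_unique=False, sep="_"):
    """Return (new_obs_names, id_column) for one sample.

    id_column is what would be stored in the --obs_output column;
    obs_names are only prefixed when make_observation_keys_unique is set.
    The separator is an assumption, not taken from the component.
    """
    id_column = [input_id] * len(obs_names)
    if make_observation_keys_unique:
        obs_names = [f"{input_id}{sep}{name}" for name in obs_names]
    return list(obs_names), id_column


if __name__ == "__main__":
    names, column = add_id(["AAAC-1", "TTTG-1"], "sample1", True)
    print(names, column)
```

Prefixing matters when several samples are concatenated later, since barcodes such as `AAAC-1` can recur across samples.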

View File

@@ -0,0 +1,12 @@
def setup_logger():
    import logging
    from sys import stdout

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler(stdout)
    logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    console_handler.setFormatter(logFormatter)
    logger.addHandler(console_handler)

    return logger

View File

@@ -0,0 +1,330 @@
name: "grep_annotation_column"
namespace: "metadata"
version: "v4.0.0"
authors:
- name: "Dries Schaumont"
roles:
- "maintainer"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Inputs"
description: "Arguments related to the input dataset."
arguments:
- type: "file"
name: "--input"
alternatives:
- "-i"
description: "Path to the input .h5mu."
info: null
example:
- "sample_path"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--input_column"
description: "Column to query. If not specified, use .var_names or .obs_names,\
\ depending on the value of --matrix"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--input_layer"
description: "Input data to use when calculating fraction of observations that\
\ match with the query. \nOnly used when --output_fraction_column is provided.\
\ If not specified, .X is used.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--modality"
description: "Which modality to get the annotation matrix from.\n"
info: null
example:
- "rna"
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--matrix"
description: "Matrix to fetch the column from that will be searched."
info: null
example:
- "var"
required: false
choices:
- "var"
- "obs"
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Outputs"
description: "Arguments related to how the output will be written."
arguments:
- type: "file"
name: "--output"
alternatives:
- "-o"
description: "Location of the output MuData file.\n"
info: null
example:
- "output.h5mu"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_match_column"
description: "Name of the column to write the result to."
info: null
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_fraction_column"
description: "For the opposite axis, name of the column to write the fraction\
\ of \nobservations that matches to the pattern.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_compression"
description: "Compression format to use for the output AnnData and/or Mudata objects.\n\
By default no compression is applied.\n"
info: null
example:
- "gzip"
required: false
choices:
- "gzip"
- "lzf"
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Query options"
description: "Options related to the query"
arguments:
- type: "string"
name: "--regex_pattern"
description: "Regex to use to match with the input column."
info: null
example:
- "^[mM][tT]-"
required: true
direction: "input"
multiple: false
multiple_sep: ";"
resources:
- type: "python_script"
path: "script.py"
is_executable: true
- type: "file"
path: "setup_logger.py"
- type: "file"
path: "compress_h5mu.py"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "Perform a regex lookup on a column from the annotation matrices .obs\
\ or .var.\nThe annotation matrix can originate from either a modality, or all modalities\
\ (global .var or .obs).\n"
test_resources:
- type: "python_script"
path: "test.py"
is_executable: true
- type: "file"
path: "e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
info: null
status: "enabled"
scope:
image: "public"
target: "public"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
label:
- "singlecpu"
- "lowmem"
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "python:3.11-slim"
target_registry: "images.viash-hub.com"
target_tag: "v4.0.0"
namespace_separator: "/"
setup:
- type: "apt"
packages:
- "procps"
interactive: false
- type: "python"
user: false
packages:
- "anndata~=0.12.7"
- "awkward"
- "mudata~=0.3.2"
script:
- "exec(\"try:\\n import zarr; from importlib.metadata import version\\nexcept\
\ ModuleNotFoundError:\\n exit(0)\\nelse: assert int(version(\\\"zarr\\\"\
).partition(\\\".\\\")[0]) > 2\")"
upgrade: true
test_setup:
- type: "apt"
packages:
- "git"
interactive: false
- type: "python"
user: false
packages:
- "viashpy==0.8.0"
github:
- "openpipelines-bio/core#subdirectory=packages/python/openpipeline_testutils"
upgrade: true
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/metadata/grep_annotation_column/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/nextflow/metadata/grep_annotation_column"
executable: "target/nextflow/metadata/grep_annotation_column/main.nf"
viash_version: "0.9.4"
git_commit: "de02293c9e13198622b988dac952b2c8c70a1e35"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
package_config:
name: "openpipeline"
version: "v4.0.0"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.0'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"
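
The `grep_annotation_column` arguments describe a regex match over one annotation column plus an optional per-observation fraction. The sketch below illustrates that idea on plain Python lists. It is not the component's implementation: the use of `re.search`, the function names, and the reading of "fraction" as the share of counts falling in matching features are assumptions.

```python
import re


def grep_annotation_column(values, regex_pattern):
    """Return one boolean per value: does the pattern match it?"""
    compiled = re.compile(regex_pattern)
    return [bool(compiled.search(str(value))) for value in values]


def fraction_matching(counts_row, matches):
    """Share of one observation's counts that fall in matching features."""
    total = sum(counts_row)
    if total == 0:
        return 0.0
    matched = sum(count for count, hit in zip(counts_row, matches) if hit)
    return matched / total


if __name__ == "__main__":
    var_names = ["mt-Nd1", "Actb", "MT-CO1"]
    is_mito = grep_annotation_column(var_names, "^[mM][tT]-")
    print(is_mito, fraction_matching([2, 6, 2], is_mito))
```

The example pattern `^[mM][tT]-` from the config is the usual way to flag mitochondrial genes by name prefix.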

View File

@@ -0,0 +1,87 @@
import shutil
from anndata import AnnData
from mudata import write_h5ad
from h5py import File as H5File
from h5py import Group, Dataset
from pathlib import Path
from typing import Union, Literal
from functools import partial


def compress_h5mu(
    input_path: Union[str, Path],
    output_path: Union[str, Path],
    compression: Union[Literal["gzip"], Literal["lzf"]],
):
    input_path, output_path = str(input_path), str(output_path)

    def copy_attributes(in_object, out_object):
        for key, value in in_object.attrs.items():
            out_object.attrs[key] = value

    def visit_path(
        output_h5: H5File,
        compression: Union[Literal["gzip"], Literal["lzf"]],
        name: str,
        object: Union[Group, Dataset],
    ):
        if isinstance(object, Group):
            new_group = output_h5.create_group(name)
            copy_attributes(object, new_group)
        elif isinstance(object, Dataset):
            # Compression only works for non-scalar Dataset objects
            # Scalar objects don't have a shape defined
            if not object.compression and object.shape not in [None, ()]:
                new_dataset = output_h5.create_dataset(
                    name, data=object, compression=compression
                )
                copy_attributes(object, new_dataset)
            else:
                output_h5.copy(object, name)
        else:
            raise NotImplementedError(
                f"Could not copy element {name}, "
                f"type has not been implemented yet: {type(object)}"
            )

    with (
        H5File(input_path, "r") as input_h5,
        H5File(output_path, "w", userblock_size=512) as output_h5,
    ):
        copy_attributes(input_h5, output_h5)
        input_h5.visititems(partial(visit_path, output_h5, compression))

    with open(input_path, "rb") as input_bytes:
        # Mudata puts metadata like this in the first 512 bytes:
        # MuData (format-version=0.1.0;creator=muon;creator-version=0.2.0)
        # See mudata/_core/io.py, read_h5mu() function
        starting_metadata = input_bytes.read(100)
        # The metadata is padded with extra null bytes up until 512 bytes
        truncate_location = starting_metadata.find(b"\x00")
        starting_metadata = starting_metadata[:truncate_location]
    with open(output_path, "br+") as f:
        nbytes = f.write(starting_metadata)
        f.write(b"\0" * (512 - nbytes))


def write_h5ad_to_h5mu_with_compression(
    output_file: Union[str, Path],
    h5mu: Union[str, Path],
    modality_name: str,
    modality_data: AnnData,
    output_compression=None,
):
    output_file = Path(output_file)
    h5mu = Path(h5mu)
    output_file_uncompressed = (
        output_file.with_name(output_file.stem + "_uncompressed.h5mu")
        if output_compression
        else output_file
    )
    shutil.copyfile(h5mu, output_file_uncompressed)
    write_h5ad(filename=output_file_uncompressed, mod=modality_name, data=modality_data)
    if output_compression:
        compress_h5mu(
            output_file_uncompressed, output_file, compression=output_compression
        )
        output_file_uncompressed.unlink()
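
The last step of `compress_h5mu` copies the MuData text header into the 512-byte HDF5 userblock of the new file. The standalone sketch below isolates that byte handling so it can be tried without h5py. `pad_h5mu_header` is a hypothetical name, and unlike the helper above it keeps the full header when no null byte is found instead of slicing at `-1`, which is a deliberate variation for illustration.

```python
def pad_h5mu_header(raw_header: bytes, block_size: int = 512) -> bytes:
    """Trim a header at its first null byte and pad it back to block_size."""
    end = raw_header.find(b"\x00")
    # find() returns -1 when there is no null byte; keep everything in that case.
    text = raw_header if end == -1 else raw_header[:end]
    if len(text) > block_size:
        raise ValueError("header does not fit in the userblock")
    return text + b"\x00" * (block_size - len(text))


if __name__ == "__main__":
    header = b"MuData (format-version=0.1.0;creator=muon;creator-version=0.2.0)"
    block = pad_h5mu_header(header + b"\x00" * 36)
    print(len(block), block[: len(header)])
```

Keeping the block at exactly 512 bytes matters because the output file is created with `userblock_size=512`, so the HDF5 content starts right after it.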

View File

@@ -0,0 +1,126 @@
manifest {
name = 'metadata/grep_annotation_column'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.0'
description = 'Perform a regex lookup on a column from the annotation matrices .obs or .var.\nThe annotation matrix can originate from either a modality, or all modalities (global .var or .obs).\n'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")

View File

@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt.
// Uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
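
Note: the memory labels above scale the request with the attempt number and, once a `resourceLimits.memory` value is configured, request that limit as soon as the attempt number reaches `maxRetries`. A minimal Python sketch of that ternary follows. The function name `label_memory_gb`, the plain gigabyte integers, and the 240 GB limit are illustrative assumptions; the sketch models only the closure, not any additional clamping Nextflow itself may apply.

```python
from typing import Optional


def label_memory_gb(
    base_gb: int,
    attempt: int,
    max_retries: Optional[int] = 3,
    limit_gb: Optional[int] = None,
) -> int:
    """Mirror the ternary used by the lowmem/midmem/highmem/veryhighmem labels."""
    # With both a limit and a retry budget set, request the full limit once
    # the attempt number reaches max_retries.
    if limit_gb and max_retries and attempt >= max_retries:
        return limit_gb
    # Otherwise scale the base amount linearly with the attempt number.
    return base_gb * attempt


# midmem (25 GB base) with a hypothetical 240 GB limit
requests = [label_memory_gb(25, attempt, limit_gb=240) for attempt in (1, 2, 3)]
```

Without a limit the request keeps growing linearly, which is why the config suggests setting `resourceLimits` on shared infrastructure.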

@@ -0,0 +1,117 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "grep_annotation_column",
"description": "Perform a regex lookup on a column from the annotation matrices .obs or .var.\nThe annotation matrix can originate from either a modality, or all modalities (global .var or .obs).\n",
"type": "object",
"$defs": {
"inputs": {
"title": "Inputs",
"type": "object",
"description": "Arguments related to the input dataset.",
"properties": {
"input": {
"type": "string",
"format": "path",
"exists": true,
"description": "Path to the input .h5mu.",
"help_text": "Type: `file`, multiple: `False`, required, direction: `input`, example: `\"sample_path\"`. "
},
"input_column": {
"type": "string",
"description": "Column to query",
"help_text": "Type: `string`, multiple: `False`. "
},
"input_layer": {
"type": "string",
"description": "Input data to use when calculating fraction of observations that match with the query",
"help_text": "Type: `string`, multiple: `False`. "
},
"modality": {
"type": "string",
"description": "Which modality to get the annotation matrix from.\n",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"rna\"`. "
},
"matrix": {
"type": "string",
"description": "Matrix to fetch the column from that will be searched.",
"help_text": "Type: `string`, multiple: `False`, example: `\"var\"`, choices: ``var`, `obs``. ",
"enum": [
"var",
"obs"
]
}
}
},
"outputs": {
"title": "Outputs",
"type": "object",
"description": "Arguments related to how the output will be written.",
"properties": {
"output": {
"type": "string",
"format": "path",
"description": "Location of the output MuData file.\n",
"help_text": "Type: `file`, multiple: `False`, default: `\"$id.$key.output.h5mu\"`, direction: `output`, example: `\"output.h5mu\"`. ",
"default": "$id.$key.output.h5mu"
},
"output_match_column": {
"type": "string",
"description": "Name of the column to write the result to.",
"help_text": "Type: `string`, multiple: `False`, required. "
},
"output_fraction_column": {
"type": "string",
"description": "For the opposite axis, name of the column to write the fraction of \nobservations that matches to the pattern.\n",
"help_text": "Type: `string`, multiple: `False`. "
},
"output_compression": {
"type": "string",
"description": "Compression format to use for the output AnnData and/or Mudata objects.\nBy default no compression is applied.\n",
"help_text": "Type: `string`, multiple: `False`, example: `\"gzip\"`, choices: ``gzip`, `lzf``. ",
"enum": [
"gzip",
"lzf"
]
}
}
},
"query options": {
"title": "Query options",
"type": "object",
"description": "Options related to the query",
"properties": {
"regex_pattern": {
"type": "string",
"description": "Regex to use to match with the input column.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"^[mM][tT]-\"`. "
}
}
},
"nextflow input-output arguments": {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type": "string",
"description": "Path to an output directory.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"output/\"`. "
}
}
}
},
"allOf": [
{
"$ref": "#/$defs/inputs"
},
{
"$ref": "#/$defs/outputs"
},
{
"$ref": "#/$defs/query options"
},
{
"$ref": "#/$defs/nextflow input-output arguments"
}
]
}
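
Note: the schema above describes a regex lookup on an annotation column whose result is written to `output_match_column`. The sketch below shows that idea with the standard library only, using the example pattern from `regex_pattern`. The gene names are made up, and the use of `re.search` (rather than `re.match` or a pandas string method) is an assumption about how the component matches.

```python
import re

# Hypothetical values of the queried column (for example gene symbols in .var)
gene_names = ["MT-CO1", "mt-Nd1", "ACTB", "MTOR"]

# Example pattern taken from the schema: names starting with "MT-" in any case
pattern = re.compile(r"^[mM][tT]-")

# One boolean per entry; this is the kind of column the component would store
is_match = [bool(pattern.search(name)) for name in gene_names]
```

Because the pattern is anchored with `^` and requires the hyphen, `MTOR` is not flagged even though it starts with `MT`.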

@@ -0,0 +1,12 @@
def setup_logger():
    import logging
    from sys import stdout

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler(stdout)
    logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    console_handler.setFormatter(logFormatter)
    logger.addHandler(console_handler)

    return logger

@@ -0,0 +1,390 @@
name: "calculate_qc_metrics"
namespace: "qc"
version: "v4.0.0"
authors:
- name: "Dries Schaumont"
roles:
- "author"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Inputs"
arguments:
- type: "file"
name: "--input"
description: "Input h5mu file"
info: null
example:
- "input.h5mu"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--modality"
description: "Which modality from the input MuData file to process. \n"
info: null
default:
- "rna"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--layer"
description: "Layer from modality to use as input data. If not provided the .X\
\ attribute is used.\n"
info: null
example:
- "raw_counts"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Metrics added to .obs"
arguments:
- type: "string"
name: "--var_qc_metrics"
description: "Keys to select a boolean (containing only True or False) column\
\ from .var.\nFor each cell, calculate the proportion of total values for genes\
\ which are labeled 'True', \ncompared to the total sum of the values for all\
\ genes.\n"
info: null
example:
- "ercc,highly_variable,mitochondrial"
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "boolean"
name: "--var_qc_metrics_fill_na_value"
description: "Fill any 'NA' values found in the columns specified with --var_qc_metrics\
\ with 'True' or 'False'.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "integer"
name: "--top_n_vars"
description: "Number of top vars to be used to calculate cumulative proportions.\n\
If not specified, proportions are not calculated. `--top_n_vars 20;50` finds\n\
cumulative proportion to the 20th and 50th most expressed vars.\n"
info: null
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- type: "string"
name: "--output_obs_num_nonzero_vars"
description: "Name of column in .obs describing, for each observation, the number\
\ of stored values\n(including explicit zeroes). In other words, the name of\
\ the column that counts\nfor each row the number of columns that contain data.\n"
info: null
default:
- "num_nonzero_vars"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_obs_total_counts_vars"
description: "Name of the column for .obs describing, for each observation (row),\n\
the sum of the stored values in the columns.\n"
info: null
default:
- "total_counts"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Metrics added to .var"
arguments:
- type: "string"
name: "--output_var_num_nonzero_obs"
description: "Name of column describing, for each feature, the number of stored\
\ values\n(including explicit zeroes). In other words, the name of the column\
\ that counts\nfor each column the number of rows that contain data.\n"
info: null
default:
- "num_nonzero_obs"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_var_total_counts_obs"
description: "Name of the column in .var describing, for each feature (column),\n\
the sum of the stored values in the rows.\n"
info: null
default:
- "total_counts"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_var_obs_mean"
description: "Name of the column in .var providing, for each feature, the mean of the\
\ values across observations.\n"
info: null
default:
- "obs_mean"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_var_pct_dropout"
description: "Name of the column in .var providing for each feature the percentage\
\ of\nobservations the feature does not appear on (i.e. is missing). Same as\
\ `--output_var_num_nonzero_obs`\nbut percentage based.\n"
info: null
default:
- "pct_dropout"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Outputs"
arguments:
- type: "file"
name: "--output"
description: "Output h5mu file."
info: null
example:
- "output.h5mu"
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_compression"
description: "Compression format to use for the output AnnData and/or Mudata objects.\n\
By default no compression is applied.\n"
info: null
example:
- "gzip"
required: false
choices:
- "gzip"
- "lzf"
direction: "input"
multiple: false
multiple_sep: ";"
resources:
- type: "python_script"
path: "script.py"
is_executable: true
- type: "file"
path: "setup_logger.py"
- type: "file"
path: "compress_h5mu.py"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "Add basic quality control metrics to an .h5mu file.\n\nThe metrics are\
\ comparable to what scanpy.pp.calculate_qc_metrics outputs,\nalthough they have\
\ slightly different names:\n\nVar metrics (name in this component -> name in scanpy):\n\
\ - pct_dropout -> pct_dropout_by_{expr_type}\n - num_nonzero_obs -> n_cells_by_{expr_type}\n\
\ - obs_mean -> mean_{expr_type}\n - total_counts -> total_{expr_type}\n\n Obs\
\ metrics:\n - num_nonzero_vars -> n_genes_by_{expr_type}\n - pct_{var_qc_metrics}\
\ -> pct_{expr_type}_{qc_var}\n - total_counts_{var_qc_metrics} -> total_{expr_type}_{qc_var}\n\
\ - pct_of_counts_in_top_{top_n_vars}_vars -> pct_{expr_type}_in_top_{n}_{var_type}\n\
\ - total_counts -> total_{expr_type}\n \n"
test_resources:
- type: "python_script"
path: "test.py"
is_executable: true
- type: "file"
path: "pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5mu"
info: null
status: "enabled"
scope:
image: "public"
target: "public"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
label:
- "singlecpu"
- "midmem"
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "python:3.11-slim"
target_registry: "images.viash-hub.com"
target_tag: "v4.0.0"
namespace_separator: "/"
setup:
- type: "apt"
packages:
- "procps"
interactive: false
- type: "python"
user: false
packages:
- "anndata~=0.12.7"
- "awkward"
- "mudata~=0.3.2"
- "scipy"
script:
- "exec(\"try:\\n import zarr; from importlib.metadata import version\\nexcept\
\ ModuleNotFoundError:\\n exit(0)\\nelse: assert int(version(\\\"zarr\\\"\
).partition(\\\".\\\")[0]) > 2\")"
upgrade: true
test_setup:
- type: "apt"
packages:
- "git"
interactive: false
- type: "python"
user: false
packages:
- "viashpy==0.8.0"
github:
- "openpipelines-bio/core#subdirectory=packages/python/openpipeline_testutils"
upgrade: true
- type: "python"
user: false
packages:
- "scanpy"
upgrade: true
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/qc/calculate_qc_metrics/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/nextflow/qc/calculate_qc_metrics"
executable: "target/nextflow/qc/calculate_qc_metrics/main.nf"
viash_version: "0.9.4"
git_commit: "de02293c9e13198622b988dac952b2c8c70a1e35"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
package_config:
name: "openpipeline"
version: "v4.0.0"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.0'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"
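
Note: the "Metrics added to .obs" arguments above describe per-observation totals, proportions for boolean `.var` columns, and cumulative proportions for the top N features. The sketch below illustrates those definitions on a small dense matrix. It is not the component's implementation: the function name `obs_metrics` is hypothetical, non-zero values are counted instead of stored sparse entries, and reporting proportions on a 0 to 100 scale is an assumption.

```python
from typing import Dict, List


def obs_metrics(
    counts: List[List[float]],
    var_flags: Dict[str, List[bool]],
    top_n_vars: List[int],
) -> List[Dict[str, float]]:
    """Compute per-row (per-observation) QC metrics for a dense count matrix."""
    rows = []
    for row in counts:
        total = sum(row)
        metrics = {
            "num_nonzero_vars": sum(1 for value in row if value != 0),
            "total_counts": total,
        }
        # Share of the row total that falls in features flagged True
        for name, flags in var_flags.items():
            flagged = sum(value for value, flag in zip(row, flags) if flag)
            metrics[f"total_counts_{name}"] = flagged
            metrics[f"pct_{name}"] = 100 * flagged / total if total else 0.0
        # Cumulative share of the row total held by the n largest values
        ranked = sorted(row, reverse=True)
        for n in top_n_vars:
            top = sum(ranked[:n])
            key = f"pct_of_counts_in_top_{n}_vars"
            metrics[key] = 100 * top / total if total else 0.0
        rows.append(metrics)
    return rows


counts = [[4, 0, 6], [1, 1, 2]]
result = obs_metrics(counts, {"mitochondrial": [True, False, False]}, [2])
```

The column names follow the mapping listed in the component description (`pct_{var_qc_metrics}`, `total_counts_{var_qc_metrics}`, `pct_of_counts_in_top_{top_n_vars}_vars`).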

@@ -0,0 +1,87 @@
import shutil
from anndata import AnnData
from mudata import write_h5ad
from h5py import File as H5File
from h5py import Group, Dataset
from pathlib import Path
from typing import Union, Literal
from functools import partial


def compress_h5mu(
    input_path: Union[str, Path],
    output_path: Union[str, Path],
    compression: Union[Literal["gzip"], Literal["lzf"]],
):
    input_path, output_path = str(input_path), str(output_path)

    def copy_attributes(in_object, out_object):
        for key, value in in_object.attrs.items():
            out_object.attrs[key] = value

    def visit_path(
        output_h5: H5File,
        compression: Union[Literal["gzip"], Literal["lzf"]],
        name: str,
        object: Union[Group, Dataset],
    ):
        if isinstance(object, Group):
            new_group = output_h5.create_group(name)
            copy_attributes(object, new_group)
        elif isinstance(object, Dataset):
            # Compression only works for non-scalar Dataset objects
            # Scalar objects don't have a shape defined
            if not object.compression and object.shape not in [None, ()]:
                new_dataset = output_h5.create_dataset(
                    name, data=object, compression=compression
                )
                copy_attributes(object, new_dataset)
            else:
                output_h5.copy(object, name)
        else:
            raise NotImplementedError(
                f"Could not copy element {name}, "
                f"type has not been implemented yet: {type(object)}"
            )

    with (
        H5File(input_path, "r") as input_h5,
        H5File(output_path, "w", userblock_size=512) as output_h5,
    ):
        copy_attributes(input_h5, output_h5)
        input_h5.visititems(partial(visit_path, output_h5, compression))

    with open(input_path, "rb") as input_bytes:
        # Mudata puts metadata like this in the first 512 bytes:
        # MuData (format-version=0.1.0;creator=muon;creator-version=0.2.0)
        # See mudata/_core/io.py, read_h5mu() function
        # Read the whole 512-byte user block so a long header is not cut short
        starting_metadata = input_bytes.read(512)
        # The metadata is padded with extra null bytes up until 512 bytes
        truncate_location = starting_metadata.find(b"\x00")
        # find() returns -1 when there is no padding; keep the block whole then
        if truncate_location != -1:
            starting_metadata = starting_metadata[:truncate_location]
    with open(output_path, "br+") as f:
        nbytes = f.write(starting_metadata)
        f.write(b"\0" * (512 - nbytes))


def write_h5ad_to_h5mu_with_compression(
    output_file: Union[str, Path],
    h5mu: Union[str, Path],
    modality_name: str,
    modality_data: AnnData,
    output_compression=None,
):
    output_file = Path(output_file)
    h5mu = Path(h5mu)
    # Write to a temporary sibling file first when compression is requested
    output_file_uncompressed = (
        output_file.with_name(output_file.stem + "_uncompressed.h5mu")
        if output_compression
        else output_file
    )
    shutil.copyfile(h5mu, output_file_uncompressed)
    write_h5ad(filename=output_file_uncompressed, mod=modality_name, data=modality_data)
    if output_compression:
        compress_h5mu(
            output_file_uncompressed, output_file, compression=output_compression
        )
        output_file_uncompressed.unlink()
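
Note: `compress_h5mu` copies the MuData header from the 512-byte user block of the input file into the user block of the output file. The sketch below isolates that step with plain binary files and the standard library, so it can be run without h5py. The helper names `read_header` and `write_header` and the dummy `payload` bytes are assumptions made for illustration.

```python
import os
import tempfile

USERBLOCK_SIZE = 512


def read_header(path: str) -> bytes:
    """Return the user-block text up to the first null padding byte."""
    with open(path, "rb") as handle:
        block = handle.read(USERBLOCK_SIZE)
    end = block.find(b"\x00")
    # find() returns -1 when the block holds no padding at all
    return block if end == -1 else block[:end]


def write_header(path: str, header: bytes) -> None:
    """Overwrite the user block of an existing file with a padded header."""
    if len(header) > USERBLOCK_SIZE:
        raise ValueError("header does not fit in the user block")
    with open(path, "rb+") as handle:
        written = handle.write(header)
        handle.write(b"\x00" * (USERBLOCK_SIZE - written))


header = b"MuData (format-version=0.1.0;creator=muon;creator-version=0.2.0)"
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.bin")
    target = os.path.join(tmp, "target.bin")
    # Source: padded header followed by stand-in content
    with open(source, "wb") as handle:
        handle.write(header.ljust(USERBLOCK_SIZE, b"\x00") + b"payload")
    # Target: empty user block followed by the same stand-in content
    with open(target, "wb") as handle:
        handle.write(b"\x00" * USERBLOCK_SIZE + b"payload")
    write_header(target, read_header(source))
    with open(target, "rb") as handle:
        copied = handle.read()
```

Opening the target in `rb+` mode matters here: it overwrites the first 512 bytes in place and leaves everything after the user block untouched.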

@@ -0,0 +1,126 @@
manifest {
name = 'qc/calculate_qc_metrics'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.0'
description = 'Add basic quality control metrics to an .h5mu file.\n\nThe metrics are comparable to what scanpy.pp.calculate_qc_metrics outputs,\nalthough they have slightly different names:\n\nVar metrics (name in this component -> name in scanpy):\n - pct_dropout -> pct_dropout_by_{expr_type}\n - num_nonzero_obs -> n_cells_by_{expr_type}\n - obs_mean -> mean_{expr_type}\n - total_counts -> total_{expr_type}\n\n Obs metrics:\n - num_nonzero_vars -> n_genes_by_{expr_type}\n - pct_{var_qc_metrics} -> pct_{expr_type}_{qc_var}\n - total_counts_{var_qc_metrics} -> total_{expr_type}_{qc_var}\n - pct_of_counts_in_top_{top_n_vars}_vars -> pct_{expr_type}_in_top_{n}_{var_type}\n - total_counts -> total_{expr_type}\n \n'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")
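
Note: the `tempDir` detection in the config above walks a chain of environment variables and falls back to `/tmp`. A minimal Python sketch of the same fallback order follows. The function name `detect_temp_dir` is hypothetical, and passing the environment as a mapping is only done to make the behaviour easy to try out.

```python
import os
from pathlib import Path
from typing import Mapping


def detect_temp_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return the first configured temp directory, else /tmp."""
    for name in ("NXF_TEMP", "VIASH_TEMP", "TEMPDIR", "TMPDIR"):
        value = env.get(name)
        # Like Groovy's ?: operator, skip unset and empty values
        if value:
            return Path(value).absolute()
    return Path("/tmp").absolute()


from_tmpdir = detect_temp_dir({"TMPDIR": "/scratch"})
from_nxf = detect_temp_dir({"NXF_TEMP": "/fast", "TMPDIR": "/scratch"})
fallback = detect_temp_dir({})
```

The order matters: a Nextflow-specific `NXF_TEMP` takes precedence over the generic `TMPDIR`.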

@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt.
// Uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
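
Note: the `errorStrategy` closure in this config retries a task only when its exit status falls in the inclusive range 137 to 140 and terminates the run otherwise. The sketch below restates that decision in Python; the function name `error_strategy` is an illustrative assumption. Exit status 137 corresponds to 128 + 9, a process killed by SIGKILL, which is what an out-of-memory kill typically produces.

```python
def error_strategy(exit_status: int) -> str:
    """Return 'retry' for exit codes 137..140 (inclusive), else 'terminate'."""
    # Groovy's 137..140 range includes both ends, hence range(137, 141)
    return "retry" if exit_status in range(137, 141) else "terminate"


decisions = {code: error_strategy(code) for code in (1, 137, 140, 141)}
```

Combined with `maxRetries = 3` and the attempt-scaled memory labels, a retried task is resubmitted with a larger memory request.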

@@ -0,0 +1,156 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "calculate_qc_metrics",
"description": "Add basic quality control metrics to an .h5mu file.\n\nThe metrics are comparable to what scanpy.pp.calculate_qc_metrics outputs,\nalthough they have slightly different names:\n\nVar metrics (name in this component -> name in scanpy):\n - pct_dropout -> pct_dropout_by_{expr_type}\n - num_nonzero_obs -> n_cells_by_{expr_type}\n - obs_mean -> mean_{expr_type}\n - total_counts -> total_{expr_type}\n\n Obs metrics:\n - num_nonzero_vars -> n_genes_by_{expr_type}\n - pct_{var_qc_metrics} -> pct_{expr_type}_{qc_var}\n - total_counts_{var_qc_metrics} -> total_{expr_type}_{qc_var}\n - pct_of_counts_in_top_{top_n_vars}_vars -> pct_{expr_type}_in_top_{n}_{var_type}\n - total_counts -> total_{expr_type}\n \n",
"type": "object",
"$defs": {
"inputs": {
"title": "Inputs",
"type": "object",
"description": "No description",
"properties": {
"input": {
"type": "string",
"format": "path",
"exists": true,
"description": "Input h5mu file",
"help_text": "Type: `file`, multiple: `False`, required, direction: `input`, example: `\"input.h5mu\"`. "
},
"modality": {
"type": "string",
"description": "Which modality from the input MuData file to process",
"help_text": "Type: `string`, multiple: `False`, default: `\"rna\"`. ",
"default": "rna"
},
"layer": {
"type": "string",
"description": "Layer from modality to use as input data",
"help_text": "Type: `string`, multiple: `False`, example: `\"raw_counts\"`. "
}
}
},
"outputs": {
"title": "Outputs",
"type": "object",
"description": "No description",
"properties": {
"output": {
"type": "string",
"format": "path",
"description": "Output h5mu file.",
"help_text": "Type: `file`, multiple: `False`, default: `\"$id.$key.output.h5mu\"`, direction: `output`, example: `\"output.h5mu\"`. ",
"default": "$id.$key.output.h5mu"
},
"output_compression": {
"type": "string",
"description": "Compression format to use for the output AnnData and/or Mudata objects.\nBy default no compression is applied.\n",
"help_text": "Type: `string`, multiple: `False`, example: `\"gzip\"`, choices: ``gzip`, `lzf``. ",
"enum": [
"gzip",
"lzf"
]
}
}
},
"metrics added to .obs": {
"title": "Metrics added to .obs",
"type": "object",
"description": "No description",
"properties": {
"var_qc_metrics": {
"type": "array",
"items": {
"type": "string"
},
"description": "Keys to select a boolean (containing only True or False) column from .var.\nFor each cell, calculate the proportion of total values for genes which are labeled 'True', \ncompared to the total sum of the values for all genes.\n",
"help_text": "Type: `string`, multiple: `True`, example: `[\"ercc,highly_variable,mitochondrial\"]`. "
},
"var_qc_metrics_fill_na_value": {
"type": "boolean",
"description": "Fill any 'NA' values found in the columns specified with --var_qc_metrics with 'True' or 'False'.\n",
"help_text": "Type: `boolean`, multiple: `False`. "
},
"top_n_vars": {
"type": "array",
"items": {
"type": "integer"
},
"description": "Number of top vars to be used to calculate cumulative proportions.\nIf not specified, proportions are not calculated",
"help_text": "Type: `integer`, multiple: `True`. "
},
"output_obs_num_nonzero_vars": {
"type": "string",
"description": "Name of column in .obs describing, for each observation, the number of stored values\n(including explicit zeroes)",
"help_text": "Type: `string`, multiple: `False`, default: `\"num_nonzero_vars\"`. ",
"default": "num_nonzero_vars"
},
"output_obs_total_counts_vars": {
"type": "string",
"description": "Name of the column for .obs describing, for each observation (row),\nthe sum of the stored values in the columns.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"total_counts\"`. ",
"default": "total_counts"
}
}
},
"metrics added to .var": {
"title": "Metrics added to .var",
"type": "object",
"description": "No description",
"properties": {
"output_var_num_nonzero_obs": {
"type": "string",
"description": "Name of column describing, for each feature, the number of stored values\n(including explicit zeroes)",
"help_text": "Type: `string`, multiple: `False`, default: `\"num_nonzero_obs\"`. ",
"default": "num_nonzero_obs"
},
"output_var_total_counts_obs": {
"type": "string",
"description": "Name of the column in .var describing, for each feature (column),\nthe sum of the stored values in the rows.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"total_counts\"`. ",
"default": "total_counts"
},
"output_var_obs_mean": {
"type": "string",
"description": "Name of the column in .var providing, for each feature, the mean of the values across observations.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"obs_mean\"`. ",
"default": "obs_mean"
},
"output_var_pct_dropout": {
"type": "string",
"description": "Name of the column in .var providing for each feature the percentage of\nobservations the feature does not appear on (i.e",
"help_text": "Type: `string`, multiple: `False`, default: `\"pct_dropout\"`. ",
"default": "pct_dropout"
}
}
},
"nextflow input-output arguments": {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type": "string",
"description": "Path to an output directory.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"output/\"`. "
}
}
}
},
"allOf": [
{
"$ref": "#/$defs/inputs"
},
{
"$ref": "#/$defs/outputs"
},
{
"$ref": "#/$defs/metrics added to .obs"
},
{
"$ref": "#/$defs/metrics added to .var"
},
{
"$ref": "#/$defs/nextflow input-output arguments"
}
]
}
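
Note: the "Metrics added to .var" properties above describe per-feature counts, totals, means and a dropout percentage. The sketch below computes them column by column for a small dense matrix. It is an illustration under assumptions, not the component's code: the function name `var_metrics` is hypothetical, non-zero values stand in for stored sparse entries, and `pct_dropout` is reported on a 0 to 100 scale.

```python
from typing import Dict, List


def var_metrics(counts: List[List[float]]) -> List[Dict[str, float]]:
    """Compute per-column (per-feature) QC metrics for a dense count matrix."""
    n_obs = len(counts)
    result = []
    # zip(*counts) transposes the matrix so each tuple is one feature
    for column in zip(*counts):
        nonzero = sum(1 for value in column if value != 0)
        total = sum(column)
        result.append(
            {
                "num_nonzero_obs": nonzero,
                "total_counts": total,
                "obs_mean": total / n_obs,
                # Percentage of observations in which the feature is absent
                "pct_dropout": 100 * (1 - nonzero / n_obs),
            }
        )
    return result


per_feature = var_metrics([[4, 0, 6], [1, 1, 2]])
```

In this toy matrix the second feature is present in one of two observations, so half of the observations count as dropouts for it.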

@@ -0,0 +1,12 @@
def setup_logger():
import logging
from sys import stdout
logger = logging.getLogger()
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler(stdout)
logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
console_handler.setFormatter(logFormatter)
logger.addHandler(console_handler)
return logger
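
A minimal usage sketch for the helper above. Because `logging.getLogger()` returns the root logger, each call to `setup_logger()` attaches one more `StreamHandler`, so calling it repeatedly would print every message several times. The `get_logger_once` wrapper is hypothetical and not part of the component; it only illustrates one way to guard against that.

```python
import logging
from sys import stdout


def setup_logger():
    # Same logic as the helper above: configure the root logger for stdout.
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler(stdout)
    log_formatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    console_handler.setFormatter(log_formatter)
    logger.addHandler(console_handler)
    return logger


def get_logger_once():
    # Hypothetical guard: only configure the root logger if it has no handler yet.
    root = logging.getLogger()
    if not root.handlers:
        return setup_logger()
    return root


logger = get_logger_once()
logger.info("Reading input file")
logger.debug("Not emitted when the level is INFO")
```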


@@ -0,0 +1,415 @@
name: "qc"
namespace: "workflows/qc"
version: "v4.0.0"
authors:
- name: "Dries Schaumont"
roles:
- "author"
- "maintainer"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Inputs"
arguments:
- type: "string"
name: "--id"
description: "ID of the sample."
info: null
example:
- "foo"
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--input"
alternatives:
- "-i"
description: "Path to the sample."
info: null
example:
- "input.h5mu"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--modality"
description: "Which modality to process."
info: null
default:
- "rna"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--layer"
description: "Layer to calculate qc metrics for."
info: null
example:
- "raw_counts"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Mitochondrial & Ribosomal Gene Detection"
arguments:
- type: "string"
name: "--var_gene_names"
description: ".var column name to be used to detect mitochondrial/ribosomal genes\
\ instead of .var_names (default if not set).\nGene names matching with the\
\ regex value from --mitochondrial_gene_regex or --ribosomal_gene_regex will\
\ be \nidentified as mitochondrial or ribosomal genes, respectively.\n"
info: null
example:
- "gene_symbol"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--var_name_mitochondrial_genes"
description: "In which .var slot to store a boolean array corresponding to the mitochondrial\
\ genes.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_name_mitochondrial_fraction"
description: ".obs slot to store the fraction of reads found to be mitochondrial.\
\ Defaults to 'fraction_' followed by the value of --var_name_mitochondrial_genes.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--mitochondrial_gene_regex"
description: "Regex string that identifies mitochondrial genes from --var_gene_names.\n\
By default will detect human and mouse mitochondrial genes from a gene symbol.\n"
info: null
default:
- "^[mM][tT]-"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--var_name_ribosomal_genes"
description: "In which .var slot to store a boolean array corresponding to the ribosomal\
\ genes.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_name_ribosomal_fraction"
description: "When specified, write the fraction of counts originating from ribosomal\
\ genes \n(based on --ribosomal_gene_regex) to an .obs column with the specified\
\ name.\nRequires --var_name_ribosomal_genes.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--ribosomal_gene_regex"
description: "Regex string that identifies ribosomal genes from --var_gene_names.\n\
By default will detect human and mouse ribosomal genes from a gene symbol.\n"
info: null
default:
- "^[Mm]?[Rr][Pp][LlSs]"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "QC metrics calculation options"
arguments:
- type: "string"
name: "--var_qc_metrics"
description: "Keys to select a boolean (containing only True or False) column\
\ from .var.\nFor each cell, calculate the proportion of total values for genes\
\ which are labeled 'True', \ncompared to the total sum of the values for all\
\ genes. Defaults to the value from\n--var_name_mitochondrial_genes.\n"
info: null
example:
- "ercc,highly_variable"
required: false
direction: "input"
multiple: true
multiple_sep: ","
- type: "integer"
name: "--top_n_vars"
description: "Number of top vars to be used to calculate cumulative proportions.\n\
If not specified, proportions are not calculated. `--top_n_vars 20,50` finds\n\
cumulative proportion to the 20th and 50th most expressed vars.\n"
info: null
default:
- 50
- 100
- 200
- 500
required: false
direction: "input"
multiple: true
multiple_sep: ","
- type: "string"
name: "--output_obs_num_nonzero_vars"
description: "Name of column in .obs describing, for each observation, the number\
\ of stored values\n(including explicit zeroes). In other words, the name of\
\ the column that counts\nfor each row the number of columns that contain data.\n"
info: null
default:
- "num_nonzero_vars"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_obs_total_counts_vars"
description: "Name of the column in .obs describing, for each observation (row),\n\
the sum of the stored values in the columns.\n"
info: null
default:
- "total_counts"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_var_num_nonzero_obs"
description: "Name of column describing, for each feature, the number of stored\
\ values\n(including explicit zeroes). In other words, the name of the column\
\ that counts\nfor each column the number of rows that contain data.\n"
info: null
default:
- "num_nonzero_obs"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_var_total_counts_obs"
description: "Name of the column in .var describing, for each feature (column),\n\
the sum of the stored values in the rows.\n"
info: null
default:
- "total_counts"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_var_obs_mean"
description: "Name of the column in .var providing, for each feature, the mean of\
\ its values across observations.\n"
info: null
default:
- "obs_mean"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_var_pct_dropout"
description: "Name of the column in .var providing for each feature the percentage\
\ of\nobservations the feature does not appear on (i.e. is missing). Same as\
\ `--output_var_num_nonzero_obs`\nbut percentage based.\n"
info: null
default:
- "pct_dropout"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Outputs"
arguments:
- type: "file"
name: "--output"
description: "Destination path to the output."
info: null
example:
- "output.h5mu"
must_exist: true
create_parent: true
required: true
direction: "output"
multiple: false
multiple_sep: ";"
resources:
- type: "nextflow_script"
path: "main.nf"
is_executable: true
entrypoint: "run_wf"
- type: "file"
path: "utils"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "A pipeline to add basic QC statistics to a MuData object."
test_resources:
- type: "nextflow_script"
path: "test.nf"
is_executable: true
entrypoint: "test_wf"
- type: "file"
path: "concat_test_data"
- type: "file"
path: "pbmc_1k_protein_v3"
info:
test_dependencies:
- name: "qc_test"
namespace: "test_workflows/qc"
status: "enabled"
scope:
image: "public"
target: "public"
dependencies:
- name: "metadata/grep_annotation_column"
repository:
type: "local"
- name: "qc/calculate_qc_metrics"
repository:
type: "local"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "nextflow"
id: "nextflow"
directives:
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "native"
id: "native"
build_info:
config: "src/workflows/qc/qc/config.vsh.yaml"
runner: "nextflow"
engine: "native"
output: "target/nextflow/workflows/qc/qc"
executable: "target/nextflow/workflows/qc/qc/main.nf"
viash_version: "0.9.4"
git_commit: "de02293c9e13198622b988dac952b2c8c70a1e35"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
dependencies:
- "target/nextflow/metadata/grep_annotation_column"
- "target/nextflow/qc/calculate_qc_metrics"
package_config:
name: "openpipeline"
version: "v4.0.0"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.0'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"
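
The default values of `--mitochondrial_gene_regex` and `--ribosomal_gene_regex` in the config above can be tried out on a few gene symbols. This is a sketch only: the gene list is illustrative, and using `re.match` is an assumption about how the patterns are applied (both are anchored with `^`, so matching and searching give the same result).

```python
import re

# Default patterns copied from the argument definitions above.
MITO_REGEX = re.compile(r"^[mM][tT]-")
RIBO_REGEX = re.compile(r"^[Mm]?[Rr][Pp][LlSs]")

# Illustrative gene symbols (hypothetical input, not taken from the test data).
gene_symbols = ["MT-CO1", "mt-Nd1", "MTOR", "RPL13", "Rps6", "MRPL12", "RPS6KA1", "ACTB"]

is_mito = [bool(MITO_REGEX.match(gene)) for gene in gene_symbols]
is_ribo = [bool(RIBO_REGEX.match(gene)) for gene in gene_symbols]

for gene, mito, ribo in zip(gene_symbols, is_mito, is_ribo):
    print(f"{gene:8s} mitochondrial={mito} ribosomal={ribo}")
```

Note that the ribosomal pattern accepts an optional leading `M`, so symbols such as `MRPL12` match, and that any symbol starting with `RPS` or `RPL` matches, including `RPS6KA1`. A stricter pattern can be supplied through the same arguments if that is not desired.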

File diff suppressed because it is too large


@@ -0,0 +1,126 @@
manifest {
name = 'workflows/qc/qc'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.0'
description = 'A pipeline to add basic QC statistics to a MuData object.'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")
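
The memory labels above come in two families: decimal labels (`gb`, `tb`) are multiples of 1000, while binary labels (`gib`, `tib`) are multiples of 1024. The following sketch recomputes the byte counts from the label names; `label_to_bytes` is a hypothetical helper written for illustration and is not part of the generated config.

```python
# Unit suffixes used by the memory labels and their size in bytes.
UNITS = {
    "gib": 1024**3,
    "tib": 1024**4,
    "gb": 1000**3,
    "tb": 1000**4,
}


def label_to_bytes(label):
    """Translate a label such as 'mem8gib' into its byte count."""
    if not label.startswith("mem"):
        raise ValueError(f"not a memory label: {label}")
    size = label[len("mem"):]
    for suffix, factor in UNITS.items():
        if size.endswith(suffix):
            return int(size[: -len(suffix)]) * factor
    raise ValueError(f"unrecognised memory label: {label}")


for label in ("mem1gb", "mem1gib", "mem500tb", "mem512tib"):
    print(f"{label}: memory = {label_to_bytes(label)}.B")
```

The difference grows with size: `mem1gib` is about 7% larger than `mem1gb`, and `mem1tib` is about 10% larger than `mem1tb`.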


@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory assigned to a task increases with each attempt.
// Uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
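
The retry logic in the config above can be restated in Python to make the behaviour per attempt easier to follow. This is a sketch of what the closures evaluate to, with hypothetical function and parameter names; it does not model anything else Nextflow may do with resource limits.

```python
def memory_gb(base_gb, attempt, max_retries=3, limit_gb=None):
    """Mirror the memory closures: base * attempt, or the limit on late attempts."""
    if limit_gb and max_retries and attempt >= max_retries:
        return limit_gb
    return base_gb * attempt


def error_strategy(exit_status):
    """Retry only for exit codes 137 to 140 (inclusive), otherwise terminate."""
    return "retry" if 137 <= exit_status <= 140 else "terminate"


# 'lowmem' (base 4 GB) without a resource limit: memory grows linearly.
print([memory_gb(4, attempt) for attempt in (1, 2, 3, 4)])
# Same label with resourceLimits.memory set to 240 GB: later attempts jump to the limit.
print([memory_gb(4, attempt, limit_gb=240) for attempt in (1, 2, 3, 4)])
print(error_strategy(137), error_strategy(1))
```

The design gives cheap first attempts and, when a limit is configured, falls back to the full limit once the attempt number reaches `maxRetries` instead of continuing to scale linearly.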


@@ -0,0 +1,190 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "qc",
"description": "A pipeline to add basic QC statistics to a MuData object.",
"type": "object",
"$defs": {
"inputs": {
"title": "Inputs",
"type": "object",
"description": "No description",
"properties": {
"id": {
"type": "string",
"description": "ID of the sample.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"foo\"`. "
},
"input": {
"type": "string",
"format": "path",
"exists": true,
"description": "Path to the sample.",
"help_text": "Type: `file`, multiple: `False`, required, direction: `input`, example: `\"input.h5mu\"`. "
},
"modality": {
"type": "string",
"description": "Which modality to process.",
"help_text": "Type: `string`, multiple: `False`, default: `\"rna\"`. ",
"default": "rna"
},
"layer": {
"type": "string",
"description": "Layer to calculate qc metrics for.",
"help_text": "Type: `string`, multiple: `False`, example: `\"raw_counts\"`. "
}
}
},
"outputs": {
"title": "Outputs",
"type": "object",
"description": "No description",
"properties": {
"output": {
"type": "string",
"format": "path",
"description": "Destination path to the output.",
"help_text": "Type: `file`, multiple: `False`, required, default: `\"$id.$key.output.h5mu\"`, direction: `output`, example: `\"output.h5mu\"`. ",
"default": "$id.$key.output.h5mu"
}
}
},
"mitochondrial & ribosomal gene detection": {
"title": "Mitochondrial & Ribosomal Gene Detection",
"type": "object",
"description": "No description",
"properties": {
"var_gene_names": {
"type": "string",
"description": ".var column name to be used to detect mitochondrial/ribosomal genes instead of .var_names (default if not set).\nGene names matching with the regex value from --mitochondrial_gene_regex or --ribosomal_gene_regex will be \nidentified as mitochondrial or ribosomal genes, respectively.\n",
"help_text": "Type: `string`, multiple: `False`, example: `\"gene_symbol\"`. "
},
"var_name_mitochondrial_genes": {
"type": "string",
"description": "In which .var slot to store a boolean array corresponding to the mitochondrial genes.\n",
"help_text": "Type: `string`, multiple: `False`. "
},
"obs_name_mitochondrial_fraction": {
"type": "string",
"description": ".obs slot to store the fraction of reads found to be mitochondrial",
"help_text": "Type: `string`, multiple: `False`. "
},
"mitochondrial_gene_regex": {
"type": "string",
"description": "Regex string that identifies mitochondrial genes from --var_gene_names.\nBy default will detect human and mouse mitochondrial genes from a gene symbol.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"^[mM][tT]-\"`. ",
"default": "^[mM][tT]-"
},
"var_name_ribosomal_genes": {
"type": "string",
"description": "In which .var slot to store a boolean array corresponding to the ribosomal genes.\n",
"help_text": "Type: `string`, multiple: `False`. "
},
"obs_name_ribosomal_fraction": {
"type": "string",
"description": "When specified, write the fraction of counts originating from ribosomal genes \n(based on --ribosomal_gene_regex) to an .obs column with the specified name.\nRequires --var_name_ribosomal_genes.\n",
"help_text": "Type: `string`, multiple: `False`. "
},
"ribosomal_gene_regex": {
"type": "string",
"description": "Regex string that identifies ribosomal genes from --var_gene_names.\nBy default will detect human and mouse ribosomal genes from a gene symbol.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"^[Mm]?[Rr][Pp][LlSs]\"`. ",
"default": "^[Mm]?[Rr][Pp][LlSs]"
}
}
},
"qc metrics calculation options": {
"title": "QC metrics calculation options",
"type": "object",
"description": "No description",
"properties": {
"var_qc_metrics": {
"type": "array",
"items": {
"type": "string"
},
"description": "Keys to select a boolean (containing only True or False) column from .var.\nFor each cell, calculate the proportion of total values for genes which are labeled 'True', \ncompared to the total sum of the values for all genes",
"help_text": "Type: `string`, multiple: `True`, example: `[\"ercc,highly_variable\"]`. "
},
"top_n_vars": {
"type": "array",
"items": {
"type": "integer"
},
"description": "Number of top vars to be used to calculate cumulative proportions.\nIf not specified, proportions are not calculated",
"help_text": "Type: `integer`, multiple: `True`, default: `[50,100,200,500]`. ",
"default": [
50,
100,
200,
500
]
},
"output_obs_num_nonzero_vars": {
"type": "string",
"description": "Name of column in .obs describing, for each observation, the number of stored values\n(including explicit zeroes)",
"help_text": "Type: `string`, multiple: `False`, default: `\"num_nonzero_vars\"`. ",
"default": "num_nonzero_vars"
},
"output_obs_total_counts_vars": {
"type": "string",
"description": "Name of the column in .obs describing, for each observation (row),\nthe sum of the stored values in the columns.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"total_counts\"`. ",
"default": "total_counts"
},
"output_var_num_nonzero_obs": {
"type": "string",
"description": "Name of column describing, for each feature, the number of stored values\n(including explicit zeroes)",
"help_text": "Type: `string`, multiple: `False`, default: `\"num_nonzero_obs\"`. ",
"default": "num_nonzero_obs"
},
"output_var_total_counts_obs": {
"type": "string",
"description": "Name of the column in .var describing, for each feature (column),\nthe sum of the stored values in the rows.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"total_counts\"`. ",
"default": "total_counts"
},
"output_var_obs_mean": {
"type": "string",
"description": "Name of the column in .var providing, for each feature, the mean of its values across observations.\n",
"help_text": "Type: `string`, multiple: `False`, default: `\"obs_mean\"`. ",
"default": "obs_mean"
},
"output_var_pct_dropout": {
"type": "string",
"description": "Name of the column in .var providing for each feature the percentage of\nobservations the feature does not appear on (i.e. is missing).",
"help_text": "Type: `string`, multiple: `False`, default: `\"pct_dropout\"`. ",
"default": "pct_dropout"
}
}
},
"nextflow input-output arguments": {
"title": "Nextflow input-output arguments",
"type": "object",
"description": "Input/output parameters for Nextflow itself. Please note that both publishDir and publish_dir are supported but at least one has to be configured.",
"properties": {
"publish_dir": {
"type": "string",
"description": "Path to an output directory.",
"help_text": "Type: `string`, multiple: `False`, required, example: `\"output/\"`. "
}
}
}
},
"allOf": [
{
"$ref": "#/$defs/inputs"
},
{
"$ref": "#/$defs/outputs"
},
{
"$ref": "#/$defs/mitochondrial & ribosomal gene detection"
},
{
"$ref": "#/$defs/qc metrics calculation options"
},
{
"$ref": "#/$defs/nextflow input-output arguments"
}
]
}
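
The `top_n_vars` and `var_qc_metrics` options in the schema above both describe per-cell proportions of the total counts. The sketch below computes them for a single cell; the function names are hypothetical, and the component may report the values on a different scale (for example as percentages) or under different column names.

```python
def top_n_proportion(cell_counts, n):
    """Share of a cell's total counts contributed by its n most expressed vars."""
    total = sum(cell_counts)
    if total == 0:
        return 0.0
    top = sorted(cell_counts, reverse=True)[:n]
    return sum(top) / total


def flagged_fraction(cell_counts, flags):
    """Share of a cell's total counts coming from vars flagged True in a boolean column."""
    total = sum(cell_counts)
    if total == 0:
        return 0.0
    return sum(count for count, flag in zip(cell_counts, flags) if flag) / total


# One cell with five features, and a boolean .var column (e.g. a mitochondrial flag).
cell = [10, 0, 5, 3, 2]
is_flagged = [True, False, False, False, True]

print(top_n_proportion(cell, 2))
print(flagged_fraction(cell, is_flagged))
```

With `--top_n_vars 20,50`, the first function would be evaluated twice per cell, once with `n=20` and once with `n=50`.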


@@ -0,0 +1 @@
process.errorStrategy = 'ignore'


@@ -0,0 +1,36 @@
profiles {
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
}
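
The temporary-directory detection at the top of the config above is a chain of fallbacks over environment variables. A Python sketch of the same lookup order follows, assuming a POSIX system; `detect_temp_dir` is a hypothetical name used only for illustration.

```python
import os
from pathlib import Path


def detect_temp_dir(env=None):
    """Return the first configured temp dir, checking variables in priority order."""
    env = os.environ if env is None else env
    for name in ("NXF_TEMP", "VIASH_TEMP", "TEMPDIR", "TMPDIR"):
        value = env.get(name)
        # Like Groovy's ?: operator, unset and empty values both fall through.
        if value:
            return Path(value).absolute()
    return Path("/tmp").absolute()


print(detect_temp_dir({"VIASH_TEMP": "/scratch/viash", "TMPDIR": "/var/tmp"}))
print(detect_temp_dir({}))
```

The resulting path is what the `mount_temp` profile passes to `docker.temp`, `podman.temp` and `charliecloud.temp`.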


@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory assigned to a task increases with each attempt.
// Uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}


@@ -0,0 +1,33 @@
process {
withLabel: lowmem { memory = 13.Gb }
withLabel: lowcpu { cpus = 4 }
withLabel: midmem { memory = 13.Gb }
withLabel: midcpu { cpus = 4 }
withLabel: highmem { memory = 13.Gb }
withLabel: highcpu { cpus = 4 }
withLabel: veryhighmem { memory = 13.Gb }
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
}
env.NUMBA_CACHE_DIR = '/tmp'
trace {
enabled = true
overwrite = true
}
dag {
overwrite = true
}
process.maxForks = 1


@@ -0,0 +1,336 @@
name: "move_mudata_obs_to_tiledb"
namespace: "tiledb"
version: "v4.0.4"
authors:
- name: "Dries Schaumont"
roles:
- "author"
- "maintainer"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Input database"
description: "Open a TileDB-SOMA database by URI or as a local directory."
arguments:
- type: "string"
name: "--input_uri"
description: "A URI pointing to a TileDB-SOMA database. Mutually exclusive with\
\ 'input_dir'"
info: null
example:
- "s3://bucket/path"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--input_dir"
description: "Path to a TileDB-SOMA database as a local directory"
info: null
example:
- "./tiledb_database"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--s3_region"
description: "Region where the TileDB-SOMA database is hosted.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--endpoint"
description: "Custom endpoint to use to connect to S3\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean"
name: "--s3_no_sign_request"
description: "Do not sign S3 requests. Credentials will not be loaded if this\
\ argument is provided.\n"
info: null
default:
- false
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_modality"
description: "TileDB-SOMA measurement to add the output to.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_index_name_input"
description: "Name of the index that is used to describe the cells (observations).\n"
info: null
default:
- "cell_id"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "MuData input"
arguments:
- type: "file"
name: "--input_mudata"
description: "MuData object to take the columns from. The observations and their\
\ order should\nmatch between the database and the input modality.\n"
info: null
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--modality"
description: "Modality to take the .obs from.\n"
info: null
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obs_input"
description: "Columns from .obs to copy. The keys should not be present yet in\
\ the database.\n"
info: null
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- name: "TileDB-SOMA output"
arguments:
- type: "file"
name: "--output_tiledb"
description: "Output to a directory instead of adding to the existing database.\n"
info: null
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
resources:
- type: "python_script"
path: "script.py"
is_executable: true
- type: "file"
path: "setup_logger.py"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "Move .obs columns from a MuData modality to an existing TileDB database.\n\
The .obs keys should not exist in the database yet, and the observations from the\
\ modality and \ntheir order should match what is already present in the TileDB\
\ database.\n"
test_resources:
- type: "python_script"
path: "test.py"
is_executable: true
- type: "file"
path: "tiledb"
- type: "file"
path: "pbmc_1k_protein_v3"
info: null
status: "enabled"
scope:
image: "private"
target: "private"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
docker_run_args:
- "--env"
- "AWS_ACCESS_KEY_ID"
- "--env"
- "AWS_SECRET_ACCESS_KEY"
- "--env"
- "AWS_DEFAULT_REGION"
- type: "nextflow"
id: "nextflow"
directives:
label:
- "highmem"
- "midcpu"
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "python:3.12"
target_registry: "images.viash-hub.com"
target_tag: "v4.0.4"
namespace_separator: "/"
setup:
- type: "python"
user: false
packages:
- "anndata~=0.12.7"
- "awkward"
- "mudata~=0.3.2"
- "tiledbsoma"
- "boto3"
- "awscli"
script:
- "exec(\"try:\\n import zarr; from importlib.metadata import version\\nexcept\
\ ModuleNotFoundError:\\n exit(0)\\nelse: assert int(version(\\\"zarr\\\"\
).partition(\\\".\\\")[0]) > 2\")"
upgrade: true
test_setup:
- type: "apt"
packages:
- "git"
interactive: false
- type: "python"
user: false
packages:
- "viashpy==0.8.0"
github:
- "openpipelines-bio/core#subdirectory=packages/python/openpipeline_testutils"
upgrade: true
- type: "python"
user: false
packages:
- "moto[server]"
upgrade: true
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/tiledb/move_mudata_obs_to_tiledb/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/_private/nextflow/tiledb/move_mudata_obs_to_tiledb"
executable: "target/_private/nextflow/tiledb/move_mudata_obs_to_tiledb/main.nf"
viash_version: "0.9.4"
git_commit: "fb7dc76676aa63d06ae1421bbdd6312ad4f67312"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
package_config:
name: "openpipeline"
version: "v4.0.4"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.4'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"
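The `--obs_input` argument in the config above is declared with `multiple: true` and `multiple_sep: ";"`, so several column names are passed as one delimited string. A minimal Python sketch of that splitting, assuming plain separator-based parsing; `split_multiple` and the column names are hypothetical and are not part of Viash or of this component:

```python
def split_multiple(raw, sep=";"):
    """Split a delimited argument value into a list.

    Assumption for this sketch: empty items are dropped and a missing
    value yields an empty list.
    """
    if raw is None:
        return []
    return [item for item in raw.split(sep) if item]


# Hypothetical .obs column names, for illustration only.
obs_columns = split_multiple("cell_type;leiden_clusters")
print(obs_columns)
```

Keeping the separator as a parameter mirrors the `multiple_sep` field, which is configured per argument.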


@@ -0,0 +1,126 @@
manifest {
name = 'tiledb/move_mudata_obs_to_tiledb'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.4'
description = 'Move .obs columns from a MuData modality to an existing tileDB database.\nThe .obs keys should not exist in the database yet; and the observations from the modality and \ntheir order should match with what is already present in the tiledb database.\n'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")
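The `withLabel` block above defines two families of memory labels: decimal ones (`mem1gb` is 10^9 bytes) and binary ones (`mem1gib` is 2^30 bytes). The following Python sketch reproduces the byte values from the label names; `label_to_bytes` is a hypothetical helper written for illustration, not something shipped with Viash or Nextflow:

```python
import re

# Bytes per unit: gb/tb are powers of ten, gib/tib are powers of two.
_UNIT_BYTES = {"gb": 10**9, "tb": 10**12, "gib": 2**30, "tib": 2**40}


def label_to_bytes(label):
    """Translate a label such as 'mem4gib' into a number of bytes."""
    match = re.fullmatch(r"mem(\d+)(gb|tb|gib|tib)", label)
    if match is None:
        raise ValueError(f"not a memory label: {label!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_BYTES[unit]


print(label_to_bytes("mem1gb"))
print(label_to_bytes("mem1gib"))
```

The roughly 7% gap between `mem1gb` and `mem1gib` grows to about 10% between the `tb` and `tib` labels, which matters when sizing jobs close to a node's physical memory.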


@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt
// uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
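The memory closures in the label config above scale the request with the attempt number and switch to the configured limit once the attempt count reaches `maxRetries`. A small Python sketch of that decision, working in plain GB numbers; `label_memory_gb` and its parameter names are hypothetical, and the sketch models only the closure itself, not any additional capping Nextflow may apply:

```python
def label_memory_gb(base_gb, attempt, max_retries=3, limit_gb=None):
    """Memory request for one attempt, mirroring the closure's ternary.

    If a limit is configured and the attempt number has reached
    max_retries, request the limit; otherwise request base * attempt.
    """
    if limit_gb and max_retries and attempt >= max_retries:
        return limit_gb
    return base_gb * attempt


# 'highmem' uses a 50 GB base; 240 is the example limit from the comment above.
for attempt in (1, 2, 3):
    print(attempt, label_memory_gb(50, attempt, limit_gb=240))
```

Without a limit the request simply keeps growing linearly, so the 50 GB base becomes 150 GB on the third attempt.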


@@ -0,0 +1,12 @@
def setup_logger():
import logging
from sys import stdout
logger = logging.getLogger()
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler(stdout)
logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
console_handler.setFormatter(logFormatter)
logger.addHandler(console_handler)
return logger
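A short usage sketch for the helper above. The function is repeated here only so the example runs on its own; the log message is made up for illustration:

```python
import logging
from sys import stdout


def setup_logger():
    # Same logic as the helper above: INFO-level root logger writing to stdout.
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler(stdout)
    console_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    )
    logger.addHandler(console_handler)
    return logger


logger = setup_logger()
logger.info("Opening database")  # illustrative message only
```

Because the helper configures the root logger and always adds a new handler, calling it more than once in the same process would emit each message once per call, so it is best called a single time at script start.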


@@ -0,0 +1,313 @@
name: "move_mudata_obsm_to_tiledb"
namespace: "tiledb"
version: "v4.0.4"
authors:
- name: "Dries Schaumont"
roles:
- "author"
- "maintainer"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Input database"
description: "Open a tileDB-SOMA database by URI."
arguments:
- type: "string"
name: "--input_uri"
description: "A URI pointing to a TileDB-SOMA database."
info: null
example:
- "s3://bucket/path"
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--s3_region"
description: "Region where the TileDB-SOMA database is hosted.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--endpoint"
description: "Custom endpoint to use to connect to S3\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "boolean"
name: "--s3_no_sign_request"
description: "Do not sign S3 requests. Credentials will not be loaded if this\
\ argument is provided.\n"
info: null
default:
- false
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--output_modality"
description: "TileDB-SOMA measurement to add the output to.\n"
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "MuData input"
arguments:
- type: "file"
name: "--input_mudata"
description: "MuData object to take the columns from. The observations and their\
\ order should\nmatch between the database and the input modality.\n"
info: null
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--modality"
description: "Modality to take the .obsm from.\n"
info: null
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--obsm_input"
description: "Keys from .obsm to copy. The keys should not be present yet in the\
\ database.\n"
info: null
required: false
direction: "input"
multiple: true
multiple_sep: ";"
- name: "TileDB-SOMA output"
arguments:
- type: "file"
name: "--output_tiledb"
description: "Output to a directory instead of adding to the existing database.\n"
info: null
must_exist: true
create_parent: true
required: false
direction: "output"
multiple: false
multiple_sep: ";"
resources:
- type: "python_script"
path: "script.py"
is_executable: true
- type: "file"
path: "setup_logger.py"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "Move .obsm items from a MuData modality to an existing tileDB database.\n\
The .obsm keys should not exist in the database yet; and the observations from the\
\ modality and \ntheir order should match with what is already present in the tiledb\
\ database.\n"
test_resources:
- type: "python_script"
path: "test.py"
is_executable: true
- type: "file"
path: "tiledb"
- type: "file"
path: "pbmc_1k_protein_v3"
info: null
status: "enabled"
scope:
image: "private"
target: "private"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
docker_run_args:
- "--env"
- "AWS_ACCESS_KEY_ID"
- "--env"
- "AWS_SECRET_ACCESS_KEY"
- "--env"
- "AWS_DEFAULT_REGION"
- type: "nextflow"
id: "nextflow"
directives:
label:
- "highmem"
- "midcpu"
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "python:3.12"
target_registry: "images.viash-hub.com"
target_tag: "v4.0.4"
namespace_separator: "/"
setup:
- type: "python"
user: false
packages:
- "anndata~=0.12.7"
- "awkward"
- "mudata~=0.3.2"
- "tiledbsoma"
- "boto3"
- "awscli"
script:
- "exec(\"try:\\n import zarr; from importlib.metadata import version\\nexcept\
\ ModuleNotFoundError:\\n exit(0)\\nelse: assert int(version(\\\"zarr\\\"\
).partition(\\\".\\\")[0]) > 2\")"
upgrade: true
test_setup:
- type: "apt"
packages:
- "git"
interactive: false
- type: "python"
user: false
packages:
- "viashpy==0.8.0"
github:
- "openpipelines-bio/core#subdirectory=packages/python/openpipeline_testutils"
upgrade: true
- type: "python"
user: false
packages:
- "moto[server]"
upgrade: true
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/tiledb/move_mudata_obsm_to_tiledb/config.vsh.yaml"
runner: "nextflow"
engine: "docker|native"
output: "target/_private/nextflow/tiledb/move_mudata_obsm_to_tiledb"
executable: "target/_private/nextflow/tiledb/move_mudata_obsm_to_tiledb/main.nf"
viash_version: "0.9.4"
git_commit: "fb7dc76676aa63d06ae1421bbdd6312ad4f67312"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
package_config:
name: "openpipeline"
version: "v4.0.4"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.4'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"
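The docker `setup` section in the config above ends with a one-line `exec(...)` script that passes when zarr is absent and otherwise asserts that its major version is greater than 2. An expanded Python sketch of the same check follows. Note the swap: it looks up package metadata rather than importing zarr, and the function names are hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a dotted version string."""
    return int(version_string.partition(".")[0])


def zarr_is_acceptable():
    """True when zarr is not installed or its major version is above 2."""
    try:
        installed = version("zarr")
    except PackageNotFoundError:
        return True  # nothing installed, nothing to check
    return major_version(installed) > 2


print(zarr_is_acceptable())
```

Returning a boolean instead of asserting makes the check reusable; the build script itself needs the failing assertion so that the image build stops.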


@@ -0,0 +1,126 @@
manifest {
name = 'tiledb/move_mudata_obsm_to_tiledb'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.4'
description = 'Move .obsm items from a MuData modality to an existing tileDB database.\nThe .obsm keys should not exist in the database yet; and the observations from the modality and \ntheir order should match with what is already present in the tiledb database.\n'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")
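The `tempDir` detection in the config above walks a chain of environment variables and falls back to `/tmp`. The same fallback order can be sketched in Python, assuming a POSIX system; `detect_temp_dir` is a hypothetical name, and empty values are skipped just as Groovy's `?:` operator skips them:

```python
import os
from pathlib import Path

# Checked in this order, matching the config above.
_TEMP_VARS = ("NXF_TEMP", "VIASH_TEMP", "TEMPDIR", "TMPDIR")


def detect_temp_dir(env=None):
    """Return the first non-empty temp directory setting, else /tmp."""
    env = os.environ if env is None else env
    for name in _TEMP_VARS:
        value = env.get(name)
        if value:
            return Path(value).absolute()
    return Path("/tmp").absolute()


print(detect_temp_dir())
```

Passing the environment as an argument keeps the function easy to exercise with different settings without touching the real process environment.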


@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt
// uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}
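The `errorStrategy` closure in the config above retries only when the exit status falls in the inclusive range 137 to 140 and terminates otherwise. A minimal Python sketch of that rule; `error_strategy` is a hypothetical function name used for illustration:

```python
def error_strategy(exit_status):
    """Return 'retry' for exit codes 137..140 (inclusive), else 'terminate'."""
    if exit_status is not None and 137 <= exit_status <= 140:
        return "retry"
    return "terminate"


# 137 is 128 + 9, i.e. the process was killed with SIGKILL,
# which is what an out-of-memory kill commonly looks like.
for code in (1, 137, 140, 141):
    print(code, error_strategy(code))
```

Combined with `maxRetries = 3` and the attempt-scaled memory requests, a task killed for memory reasons is resubmitted with a larger allocation rather than failing the run outright.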


@@ -0,0 +1,12 @@
def setup_logger():
import logging
from sys import stdout
logger = logging.getLogger()
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler(stdout)
logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
console_handler.setFormatter(logFormatter)
logger.addHandler(console_handler)
return logger


@@ -0,0 +1,233 @@
name: "split_modalities"
namespace: "workflows/multiomics"
version: "v4.0.4"
authors:
- name: "Dries Schaumont"
roles:
- "author"
- "maintainer"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Inputs"
arguments:
- type: "string"
name: "--id"
description: "ID of the sample."
info: null
example:
- "foo"
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--input"
alternatives:
- "-i"
description: "Path to the sample."
info: null
example:
- "input.h5mu"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Outputs"
arguments:
- type: "file"
name: "--output"
alternatives:
- "-o"
description: "Output directory containing multiple h5mu files."
info: null
example:
- "/path/to/output"
must_exist: true
create_parent: true
required: true
direction: "output"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--output_types"
description: "A csv containing the base filename and modality type per output\
\ file."
info: null
example:
- "types.csv"
must_exist: true
create_parent: true
required: true
direction: "output"
multiple: false
multiple_sep: ";"
resources:
- type: "nextflow_script"
path: "main.nf"
is_executable: true
entrypoint: "run_wf"
- type: "file"
path: "utils"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "A pipeline to split a multimodal mudata file into several unimodal\
\ mudata files."
test_resources:
- type: "nextflow_script"
path: "test.nf"
is_executable: true
entrypoint: "test_wf"
- type: "file"
path: "pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5mu"
info:
test_dependencies:
- name: "split_modalities_test"
namespace: "test_workflows/multiomics"
status: "enabled"
scope:
image: "private"
target: "private"
dependencies:
- name: "dataflow/split_modalities"
alias: "split_modalities_component"
repository:
type: "local"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "nextflow"
id: "nextflow"
directives:
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "native"
id: "native"
build_info:
config: "src/workflows/multiomics/split_modalities/config.vsh.yaml"
runner: "nextflow"
engine: "native"
output: "target/_private/nextflow/workflows/multiomics/split_modalities"
executable: "target/_private/nextflow/workflows/multiomics/split_modalities/main.nf"
viash_version: "0.9.4"
git_commit: "fb7dc76676aa63d06ae1421bbdd6312ad4f67312"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
dependencies:
- "target/nextflow/dataflow/split_modalities"
package_config:
name: "openpipeline"
version: "v4.0.4"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.4'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"

View File

@@ -0,0 +1,126 @@
manifest {
name = 'workflows/multiomics/split_modalities'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.4'
description = 'A pipeline to split a multimodal MuData file into several unimodal MuData files.'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")

View File

@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt.
// Uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
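// Worked example of the memory closures above. This is a sketch, assuming the
// optional global limit is uncommented as `resourceLimits = [ memory: 240.Gb ]`
// and `maxRetries = 3` is left as set in this file.
// A task labelled `midmem` would then request:
//   attempt 1: 25.GB * 1 = 25 GB
//   attempt 2: 25.GB * 2 = 50 GB
//   attempt 3 and later: task.attempt >= task.maxRetries, so the closure
//                        returns task.resourceLimits.memory (240 GB)
// If `resourceLimits` stays commented out, the first condition is falsy and the
// request keeps scaling linearly: 25 GB, 50 GB, 75 GB, 100 GB for attempts 1 to 4
// (the initial run plus up to three retries).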
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}

View File

@@ -0,0 +1,36 @@
profiles {
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
}

View File

@@ -0,0 +1,48 @@
process {
// Default resources for components that hardly do any processing
memory = { 2.GB * task.attempt }
cpus = 1
// Retry for exit codes that have something to do with memory issues
errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 3
// The memory a task is assigned increases with each attempt.
// Uncomment the line below and adjust the value to set a global upper limit on the memory.
// resourceLimits = [ memory: 240.Gb ]
// CPU resources
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 4 }
withLabel: midcpu { cpus = 10 }
withLabel: highcpu { cpus = 20 }
// Memory resources
withLabel: lowmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 4.GB * task.attempt } }
withLabel: midmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 25.GB * task.attempt } }
withLabel: highmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 50.GB * task.attempt } }
withLabel: veryhighmem { memory = { task?.resourceLimits?.memory && task?.maxRetries && task.attempt >= task.maxRetries ? task.resourceLimits.memory : 75.GB * task.attempt } }
// Disk space
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
// NOTE: The above labels intentionally do not have an effect by default.
// The user should set the disk space requirements by adding the following
// to the compute environment:
//
// withLabel: lowdisk { disk = { 20.GB * task.attempt } }
// withLabel: middisk { disk = { 100.GB * task.attempt } }
// withLabel: highdisk { disk = { 200.GB * task.attempt } }
// withLabel: veryhighdisk { disk = { 500.GB * task.attempt } }
}

View File

@@ -0,0 +1,33 @@
process {
withLabel: lowmem { memory = 13.Gb }
withLabel: lowcpu { cpus = 4 }
withLabel: midmem { memory = 13.Gb }
withLabel: midcpu { cpus = 4 }
withLabel: highmem { memory = 13.Gb }
withLabel: highcpu { cpus = 4 }
withLabel: veryhighmem { memory = 13.Gb }
withLabel: lowdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: middisk {
disk = {process.disk ? process.disk : null}
}
withLabel: highdisk {
disk = {process.disk ? process.disk : null}
}
withLabel: veryhighdisk {
disk = {process.disk ? process.disk : null}
}
}
env.NUMBA_CACHE_DIR = '/tmp'
trace {
enabled = true
overwrite = true
}
dag {
overwrite = true
}
process.maxForks = 1

View File

@@ -0,0 +1,250 @@
name: "log_normalize"
namespace: "workflows/rna"
version: "v4.0.4"
authors:
- name: "Dries Schaumont"
roles:
- "author"
info:
role: "Core Team Member"
links:
email: "dries@data-intuitive.com"
github: "DriesSchaumont"
orcid: "0000-0002-4389-0440"
linkedin: "dries-schaumont"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Inputs"
arguments:
- type: "file"
name: "--input"
description: "MuData file to transform."
info: null
example:
- "dataset.h5mu"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--modality"
description: "Modality to process."
info: null
default:
- "rna"
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- type: "string"
name: "--layer"
description: "Input layer containing raw counts. If not specified, .X is used."
info: null
required: false
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Transformation options"
arguments:
- type: "integer"
name: "--target_sum"
description: "Normalize total counts to the specified amount. If not set, after\
\ normalization each observation (cell) \nwill have a total count equal to the\
\ median of total counts for observations (cells) before normalization.\n"
info: null
required: false
min: 1
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Output slots"
arguments:
- type: "string"
name: "--output_layer"
description: "Layer to write the log-transformed counts to.\n"
info: null
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Output"
arguments:
- type: "file"
name: "--output"
description: "Destination path to the output."
info: null
example:
- "output.h5mu"
must_exist: true
create_parent: true
required: true
direction: "output"
multiple: false
multiple_sep: ";"
resources:
- type: "nextflow_script"
path: "main.nf"
is_executable: true
entrypoint: "run_wf"
- type: "file"
path: "utils"
- type: "file"
path: "nextflow_labels.config"
dest: "nextflow_labels.config"
description: "Performs normalization and subsequent log-transformation of raw count\
\ data."
test_resources:
- type: "nextflow_script"
path: "test.nf"
is_executable: true
entrypoint: "test_wf"
- type: "file"
path: "pbmc_1k_protein_v3"
info: null
status: "enabled"
scope:
image: "private"
target: "private"
dependencies:
- name: "transform/normalize_total"
repository:
type: "local"
- name: "transform/log1p"
repository:
type: "local"
- name: "transform/delete_layer"
repository:
type: "local"
license: "MIT"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
runners:
- type: "nextflow"
id: "nextflow"
directives:
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
script:
- "includeConfig(\"nextflow_labels.config\")"
debug: false
container: "docker"
engines:
- type: "native"
id: "native"
build_info:
config: "src/workflows/rna/log_normalize/config.vsh.yaml"
runner: "nextflow"
engine: "native"
output: "target/_private/nextflow/workflows/rna/log_normalize"
executable: "target/_private/nextflow/workflows/rna/log_normalize/main.nf"
viash_version: "0.9.4"
git_commit: "fb7dc76676aa63d06ae1421bbdd6312ad4f67312"
git_remote: "https://github.com/openpipelines-bio/openpipeline"
dependencies:
- "target/nextflow/transform/normalize_total"
- "target/nextflow/transform/log1p"
- "target/nextflow/transform/delete_layer"
package_config:
name: "openpipeline"
version: "v4.0.4"
summary: "Best-practice workflows for single-cell multi-omics analyses.\n"
description: "OpenPipelines are extensible single cell analysis pipelines for reproducible\
\ and large-scale single cell processing using [Viash](https://viash.io) and [Nextflow](https://www.nextflow.io/).\n\
\nIn terms of workflows, the following has been made available, but keep in mind\
\ that\nindividual tools and functionality can be executed as standalone components\
\ as well.\n\n * Demultiplexing: conversion of raw sequencing data to FASTQ objects.\n\
\ * Ingestion: Read mapping and generating a count matrix.\n * Single sample\
\ processing: cell filtering and doublet detection.\n * Multisample processing:\
\ Count transformation, normalization, QC metric calculations.\n * Integration:\
\ Clustering, integration and batch correction using single and multimodal methods.\n\
\ * Downstream analysis workflows\n"
info:
test_resources:
- type: "s3"
path: "s3://openpipelines-data"
dest: "resources_test"
nextflow_labels_ci:
- path: "src/workflows/utils/labels_ci.config"
description: "Adds the correct memory and CPU labels when running on the Viash\
\ Hub CI."
viash_version: "0.9.4"
source: "src"
target: "target"
config_mods:
- ".resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}\n\
.runners[.type == 'nextflow'].config.script := 'includeConfig(\"nextflow_labels.config\"\
)'"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'v4.0.4'"
keywords:
- "single-cell"
- "multimodal"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/openpipelines-bio/openpipeline"
docker_registry: "ghcr.io"
homepage: "https://openpipelines.bio"
documentation: "https://openpipelines.bio/fundamentals"
issue_tracker: "https://github.com/openpipelines-bio/openpipeline/issues"

View File

@@ -0,0 +1,126 @@
manifest {
name = 'workflows/rna/log_normalize'
mainScript = 'main.nf'
nextflowVersion = '!>=20.12.1-edge'
version = 'v4.0.4'
description = 'Performs normalization and subsequent log-transformation of raw count data.'
author = 'Dries Schaumont'
}
process.container = 'nextflow/bash:latest'
// detect tempdir
tempDir = java.nio.file.Paths.get(
System.getenv('NXF_TEMP') ?:
System.getenv('VIASH_TEMP') ?:
System.getenv('TEMPDIR') ?:
System.getenv('TMPDIR') ?:
'/tmp'
).toAbsolutePath()
profiles {
no_publish {
process {
withName: '.*' {
publishDir = [
enabled: false
]
}
}
}
mount_temp {
docker.temp = tempDir
podman.temp = tempDir
charliecloud.temp = tempDir
}
docker {
docker.enabled = true
// docker.userEmulation = true
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
docker.enabled = false
podman.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
podman {
podman.enabled = true
docker.enabled = false
singularity.enabled = false
shifter.enabled = false
charliecloud.enabled = false
}
shifter {
shifter.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
charliecloud.enabled = false
}
charliecloud {
charliecloud.enabled = true
docker.enabled = false
singularity.enabled = false
podman.enabled = false
shifter.enabled = false
}
}
process{
withLabel: mem1gb { memory = 1000000000.B }
withLabel: mem2gb { memory = 2000000000.B }
withLabel: mem5gb { memory = 5000000000.B }
withLabel: mem10gb { memory = 10000000000.B }
withLabel: mem20gb { memory = 20000000000.B }
withLabel: mem50gb { memory = 50000000000.B }
withLabel: mem100gb { memory = 100000000000.B }
withLabel: mem200gb { memory = 200000000000.B }
withLabel: mem500gb { memory = 500000000000.B }
withLabel: mem1tb { memory = 1000000000000.B }
withLabel: mem2tb { memory = 2000000000000.B }
withLabel: mem5tb { memory = 5000000000000.B }
withLabel: mem10tb { memory = 10000000000000.B }
withLabel: mem20tb { memory = 20000000000000.B }
withLabel: mem50tb { memory = 50000000000000.B }
withLabel: mem100tb { memory = 100000000000000.B }
withLabel: mem200tb { memory = 200000000000000.B }
withLabel: mem500tb { memory = 500000000000000.B }
withLabel: mem1gib { memory = 1073741824.B }
withLabel: mem2gib { memory = 2147483648.B }
withLabel: mem4gib { memory = 4294967296.B }
withLabel: mem8gib { memory = 8589934592.B }
withLabel: mem16gib { memory = 17179869184.B }
withLabel: mem32gib { memory = 34359738368.B }
withLabel: mem64gib { memory = 68719476736.B }
withLabel: mem128gib { memory = 137438953472.B }
withLabel: mem256gib { memory = 274877906944.B }
withLabel: mem512gib { memory = 549755813888.B }
withLabel: mem1tib { memory = 1099511627776.B }
withLabel: mem2tib { memory = 2199023255552.B }
withLabel: mem4tib { memory = 4398046511104.B }
withLabel: mem8tib { memory = 8796093022208.B }
withLabel: mem16tib { memory = 17592186044416.B }
withLabel: mem32tib { memory = 35184372088832.B }
withLabel: mem64tib { memory = 70368744177664.B }
withLabel: mem128tib { memory = 140737488355328.B }
withLabel: mem256tib { memory = 281474976710656.B }
withLabel: mem512tib { memory = 562949953421312.B }
withLabel: cpu1 { cpus = 1 }
withLabel: cpu2 { cpus = 2 }
withLabel: cpu5 { cpus = 5 }
withLabel: cpu10 { cpus = 10 }
withLabel: cpu20 { cpus = 20 }
withLabel: cpu50 { cpus = 50 }
withLabel: cpu100 { cpus = 100 }
withLabel: cpu200 { cpus = 200 }
withLabel: cpu500 { cpus = 500 }
withLabel: cpu1000 { cpus = 1000 }
}
includeConfig("nextflow_labels.config")

Some files were not shown because too many files have changed in this diff