openpipeline/README.qmd

---
format: gfm
---
```{r setup, include=FALSE}
project <- yaml::read_yaml("_viash.yaml")
license <- paste0(project$links$repository, "/blob/main/LICENSE")
```
# OpenPipeline

Extensible single cell analysis pipelines for reproducible and large-scale single cell processing using Viash and Nextflow.

[![ViashHub](https://img.shields.io/badge/ViashHub-`r project$name`-7a4baa.svg)](https://www.viash-hub.com/packages/`r project$name`)
[![GitHub](https://img.shields.io/badge/GitHub-viash--hub%2F`r project$name`-blue.svg)](`r project$links$repository`)
[![GitHub License](https://img.shields.io/github/license/`r project$organization`/`r project$name`.svg)](`r license`)
[![GitHub Issues](https://img.shields.io/github/issues/`r project$organization`/`r project$name`.svg)](`r project$links$issue_tracker`)
[![Viash version](https://img.shields.io/badge/Viash-v`r gsub("-", "--", project$viash_version)`-blue.svg)](https://viash.io)

## Documentation

Please find more in-depth documentation on [the website](https://openpipelines.bio/).

## Functionality Overview

Openpipelines execute a list of predefined tasks. These descrete steps are also provided as standalone components that can be executed individually, with a standardized interface. This is especially useful when a particular step wraps a tool that you do not necessarily always need to execute in a workflow context.

In terms of workflows, the following functionality is provided:

* Demultiplexing: conversion of raw sequencing data to FASTQ objects.
* [Ingestion](https://openpipelines.bio/fundamentals/architecture.html#sec-ingestion): Read mapping and generating a count matrix.
* [Single sample processing](https://openpipelines.bio/fundamentals/architecture.html#sec-single-sample): cell filtering and doublet detection.
* [Multisample processing](https://openpipelines.bio/fundamentals/architecture.html#sec-multisample-processing): Count transformation, normalization, QC metric calulations.
* [Integration](https://openpipelines.bio/fundamentals/architecture.html#sec-intergration): Clustering, integration and batch correction using single and multimodal methods.
* Downstream analysis workflows


```{mermaid lang='mermaid'}
flowchart LR
  demultiplexing["Step 1: Demultiplexing"]
  ingestion["Step 2: Ingestion"]
  process_samples["Step 3: Process Samples"]
  integration["Step 4: Integration"]
  downstream["Step 5: Downstream"]
  demultiplexing-->ingestion-->process_samples-->integration-->downstream
```

## Guided execution using Viash Hub (CLI and Seqera cloud)

Openpipelines is now available on [Viash Hub](https://www.viash-hub.com/packages/openpipeline/latest). Viash Hub provides a list of components and workflows, together with a graphical interface that guides you through the steps of running a workflow or standalone component. Intstructions are provided for using a local viash or nextflow executable (requires using a linux based OS), but connecting to a Seqera cloud instance is also supported.

## Execution using the nextflow executable

Executing a workflow  is a bit more involved and requires familiarity with the command line interface (CLI).

### Setup

In order to use the workflows in this package on your local computer, you'll need to do the following:

* Install [nextflow](https://www.nextflow.io/docs/latest/install.html)
* Install a nextflow compatible executor. This workflow provides a profile for [docker](https://docs.docker.com/get-started/).

### Location of the workflow scripts

Nextflow workflow scripts, schema's and configuration files can be found in the `target/nextflow` folder.
On the `main` branch however, only the source code that needs to be build into the functionning workflows and components can be found.
Instead, please refer to the `main_build` branch or any of the tags to find the `target` folders.
Components and workflows are organized into namespaces, which can be nested. Workflows are located at `target/nextflow/workflows`,
while components that execute individual workflow steps are

A reference of workflows and modules is also provided in the [documentation](https://openpipelines.bio/components/).

### Retrieving a list of a workflow parameters

A list of workflows arguments can be consulted in multiple ways:

* On [Viash Hub](https://www.viash-hub.com/packages/openpipeline/latest)
* In the [reference documentation](https://openpipelines.bio/components/)
* The config YAML file lists the argument for each workflow and component
* In the `target/nextflow` folder, a nextflow schema JSON file (`nextflow_schema.json`) is provided next to each workflow `.nf` file.
* Using nextflow on the CLI:

```bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/demux/main.nf \
    --help
```

### Resource usage tuning

Nextflow's labels can be used to specify the amount of resources a process can use. This workflow uses the following labels for CPU, memory and disk:

* `lowmem`, `lowmem`, `midmem`, `highmem`, `veryhighmem`
* `lowcpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
* `lowdisk`, `middisk`, `highdisk`, `veryhighdisk`

The defaults for these labels can be found at `src/workflows/utils/labels.config`. Nextflow checks that the specified resources for a process do not exceed what is available on the machine and will not start if it does. Create your own config file to tune the labels to your needs, for example:

```
// Resource labels
withLabel: verylowcpu { cpus = 2 }
withLabel: lowcpu { cpus = 8 }
withLabel: midcpu { cpus = 16 }
withLabel: highcpu { cpus = 16 }

withLabel: verylowmem { memory = 4.GB }
withLabel: lowmem { memory = 8.GB }
withLabel: midmem { memory = 16.GB }
withLabel: highmem { memory = 32.GB }
```

When starting nextflow using the CLI, you can use `-c` to provide the file to nextflow and overwrite the defaults.

### Demultiplexing example

Here, generating FASTQ files from raw sequencing data is demonstrated, based on data generated using 10X genomic's protocols. However, BD genomics data is also supported by Openpipeline. If you wish to try it out yourself, test data is available at `s3://openpipelines-data/cellranger_tiny_bcl/bcl`.

```bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/demux/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    --publish_dir "<path to output directory>" \
    --id "cellranger_tiny_bcl" \
    --input "s3://openpipelines-data/cellranger_tiny_bcl/bcl" \
    --sample_sheet "s3://openpipelines-data/cellranger_tiny_bcl/bcl/sample_sheet.csv" \
    --demultiplexer "mkfastq"
```

### Mapping and read counting

FASTQ files can be mapped to a reference genome and the resulting mapped reads can be counted in order to generate a count matrix. Both `BD Rhapsody` and `Cell Ranger` are supported. Here, we demonstrate using Cell Ranger multi on test data available at `s3://openpipelines-data/10x_5k_anticmv`.


In order to facilitate passing multiple argument values, the parameters can be specified using a YAML file.
```yaml
input:
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_*.fastq.gz"
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_*.fastq.gz"
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_VDJ_*.fastq.gz"
gex_reference: "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz"
vdj_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz"
feature_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/feature_reference.csv"
library_id:
    - "5k_human_antiCMV_T_TBNK_connect_GEX_1_subset"
    - "5k_human_antiCMV_T_TBNK_connect_AB_subset"
    - "5k_human_antiCMV_T_TBNK_connect_VDJ_subset"
library_type:
    - "Gene Expression"
    - "Antibody Capture"
    - "VDJ"
```
You can pass this file to nextflow using `-params-file`

```bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/cellranger_multi/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    -params-file "<path to your parameter YAML file>" \
    --publish_dir "<path to output directory>"
```

### Filtering, normalization, clustering, dimensionality reduction and QC calculations (w/o integration)

Once you have an MuData object for each of your samples, you can process it into a multisample file that is ready for integration and other downstream analyses. This can be done using the `process_samples` workflow. Here is an example, but please keep in mind that the exact parameters that need to be provided differ depending on you data. A lot of functionality for this pipeline can be customized, including the name of the output slots where data is being stored.

```yaml
param_list:
    - id: "sample_1"
      input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
      rna_min_counts: 2
    - id: "sample_2"
      input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
      rna_min_counts: 1
rna_max_counts: 1000000
rna_min_genes_per_cell: 1
rna_max_genes_per_cell: 1000000
rna_min_cells_per_gene: 1
rna_min_fraction_mito: 0.0
rna_max_fraction_mito: 1.0
```

In order to provide multiple samples to the pipeline, `param_list` is used. Using `param_list` it is possible to specify arguments per sample. However, it is still possible to define arguments for all samples together by listing those outside the `param_list` block.

```bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/multiomics/process_samples/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    -params-file "<path to your parameter YAML file>"
    --publish_dir "<path to output directory>"
```


## Executing standalone components using the Viash executable

Another option to execute individual modules on the CLI is to use `viash run`. All you need to do is download viash, clone the Openpipeline repository and point viash to a config file. However, keep in mind that using `viash run` for workflows is currently not supported. Please see `viash run --help` for more information on how to use the command, but here is an example:


```bash
viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml --help
```