README.md



# OpenPipeline

Extensible single cell analysis pipelines for reproducible and
large-scale single cell processing using Viash and Nextflow.

[![ViashHub](https://img.shields.io/badge/ViashHub-openpipeline-7a4baa.svg)](https://www.viash-hub.com/packages/openpipeline)
[![GitHub](https://img.shields.io/badge/GitHub-viash--hub%2Fopenpipeline-blue.svg)](https://github.com/openpipelines-bio/openpipeline)
[![GitHub
License](https://img.shields.io/github/license/openpipelines-bio/openpipeline.svg)](https://github.com/openpipelines-bio/openpipeline/blob/main/LICENSE)
[![GitHub
Issues](https://img.shields.io/github/issues/openpipelines-bio/openpipeline.svg)](https://github.com/openpipelines-bio/openpipeline/issues)
[![Viash
version](https://img.shields.io/badge/Viash-v0.9.3-blue.svg)](https://viash.io)

## Documentation

Please find more in-depth documentation on [the
website](https://openpipelines.bio/).

## Functionality Overview

Openpipelines execute a list of predefined tasks. These descrete steps
are also provided as standalone components that can be executed
individually, with a standardized interface. This is especially useful
when a particular step wraps a tool that you do not necessarily always
need to execute in a workflow context.

In terms of workflows, the following functionality is provided:

- Demultiplexing: conversion of raw sequencing data to FASTQ objects.
- [Ingestion](https://openpipelines.bio/fundamentals/architecture.html#sec-ingestion):
  Read mapping and generating a count matrix.
- [Single sample
  processing](https://openpipelines.bio/fundamentals/architecture.html#sec-single-sample):
  cell filtering and doublet detection.
- [Multisample
  processing](https://openpipelines.bio/fundamentals/architecture.html#sec-multisample-processing):
  Count transformation, normalization, QC metric calulations.
- [Integration](https://openpipelines.bio/fundamentals/architecture.html#sec-intergration):
  Clustering, integration and batch correction using single and
  multimodal methods.
- Downstream analysis workflows

``` mermaid lang="mermaid"
flowchart LR
  demultiplexing["Step 1: Demultiplexing"]
  ingestion["Step 2: Ingestion"]
  process_samples["Step 3: Process Samples"]
  integration["Step 4: Integration"]
  downstream["Step 5: Downstream"]
  demultiplexing-->ingestion-->process_samples-->integration-->downstream
```

## Guided execution using Viash Hub (CLI and Seqera cloud)

Openpipelines is now available on [Viash
Hub](https://www.viash-hub.com/packages/openpipeline/latest). Viash Hub
provides a list of components and workflows, together with a graphical
interface that guides you through the steps of running a workflow or
standalone component. Intstructions are provided for using a local viash
or nextflow executable (requires using a linux based OS), but connecting
to a Seqera cloud instance is also supported.

## Execution using the nextflow executable

Executing a workflow is a bit more involved and requires familiarity
with the command line interface (CLI).

### Setup

In order to use the workflows in this package on your local computer,
you’ll need to do the following:

- Install [nextflow](https://www.nextflow.io/docs/latest/install.html)
- Install a nextflow compatible executor. This workflow provides a
  profile for [docker](https://docs.docker.com/get-started/).

### Location of the workflow scripts

Nextflow workflow scripts, schema’s and configuration files can be found
in the `target/nextflow` folder. On the `main` branch however, only the
source code that needs to be build into the functionning workflows and
components can be found. Instead, please refer to the `main_build`
branch or any of the tags to find the `target` folders. Components and
workflows are organized into namespaces, which can be nested. Workflows
are located at `target/nextflow/workflows`, while components that
execute individual workflow steps are

A reference of workflows and modules is also provided in the
[documentation](https://openpipelines.bio/components/).

### Retrieving a list of a workflow parameters

A list of workflows arguments can be consulted in multiple ways:

- On [Viash Hub](https://www.viash-hub.com/packages/openpipeline/latest)
- In the [reference
  documentation](https://openpipelines.bio/components/)
- The config YAML file lists the argument for each workflow and
  component
- In the `target/nextflow` folder, a nextflow schema JSON file
  (`nextflow_schema.json`) is provided next to each workflow `.nf` file.
- Using nextflow on the CLI:

``` bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/demux/main.nf \
    --help
```

### Resource usage tuning

Nextflow’s labels can be used to specify the amount of resources a
process can use. This workflow uses the following labels for CPU, memory
and disk:

- `lowmem`, `lowmem`, `midmem`, `highmem`, `veryhighmem`
- `lowcpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
- `lowdisk`, `middisk`, `highdisk`, `veryhighdisk`

The defaults for these labels can be found at
`src/workflows/utils/labels.config`. Nextflow checks that the specified
resources for a process do not exceed what is available on the machine
and will not start if it does. Create your own config file to tune the
labels to your needs, for example:

    // Resource labels
    withLabel: verylowcpu { cpus = 2 }
    withLabel: lowcpu { cpus = 8 }
    withLabel: midcpu { cpus = 16 }
    withLabel: highcpu { cpus = 16 }

    withLabel: verylowmem { memory = 4.GB }
    withLabel: lowmem { memory = 8.GB }
    withLabel: midmem { memory = 16.GB }
    withLabel: highmem { memory = 32.GB }

When starting nextflow using the CLI, you can use `-c` to provide the
file to nextflow and overwrite the defaults.

### Demultiplexing example

Here, generating FASTQ files from raw sequencing data is demonstrated,
based on data generated using 10X genomic’s protocols. However, BD
genomics data is also supported by Openpipeline. If you wish to try it
out yourself, test data is available at
`s3://openpipelines-data/cellranger_tiny_bcl/bcl`.

``` bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/demux/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    --publish_dir "<path to output directory>" \
    --id "cellranger_tiny_bcl" \
    --input "s3://openpipelines-data/cellranger_tiny_bcl/bcl" \
    --sample_sheet "s3://openpipelines-data/cellranger_tiny_bcl/bcl/sample_sheet.csv" \
    --demultiplexer "mkfastq"
```

### Mapping and read counting

FASTQ files can be mapped to a reference genome and the resulting mapped
reads can be counted in order to generate a count matrix. Both
`BD Rhapsody` and `Cell Ranger` are supported. Here, we demonstrate
using Cell Ranger multi on test data available at
`s3://openpipelines-data/10x_5k_anticmv`.

In order to facilitate passing multiple argument values, the parameters
can be specified using a YAML file.

``` yaml
input:
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_*.fastq.gz"
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_*.fastq.gz"
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_VDJ_*.fastq.gz"
gex_reference: "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz"
vdj_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz"
feature_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/feature_reference.csv"
library_id:
    - "5k_human_antiCMV_T_TBNK_connect_GEX_1_subset"
    - "5k_human_antiCMV_T_TBNK_connect_AB_subset"
    - "5k_human_antiCMV_T_TBNK_connect_VDJ_subset"
library_type:
    - "Gene Expression"
    - "Antibody Capture"
    - "VDJ"
```

You can pass this file to nextflow using `-params-file`

``` bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/cellranger_multi/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    -params-file "<path to your parameter YAML file>" \
    --publish_dir "<path to output directory>"
```

### Filtering, normalization, clustering, dimensionality reduction and QC calculations (w/o integration)

Once you have an MuData object for each of your samples, you can process
it into a multisample file that is ready for integration and other
downstream analyses. This can be done using the `process_samples`
workflow. Here is an example, but please keep in mind that the exact
parameters that need to be provided differ depending on you data. A lot
of functionality for this pipeline can be customized, including the name
of the output slots where data is being stored.

``` yaml
param_list:
    - id: "sample_1"
      input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
      rna_min_counts: 2
    - id: "sample_2"
      input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
      rna_min_counts: 1
rna_max_counts: 1000000
rna_min_genes_per_cell: 1
rna_max_genes_per_cell: 1000000
rna_min_cells_per_gene: 1
rna_min_fraction_mito: 0.0
rna_max_fraction_mito: 1.0
```

In order to provide multiple samples to the pipeline, `param_list` is
used. Using `param_list` it is possible to specify arguments per sample.
However, it is still possible to define arguments for all samples
together by listing those outside the `param_list` block.

``` bash
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/multiomics/process_samples/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    -params-file "<path to your parameter YAML file>"
    --publish_dir "<path to output directory>"
```

## Executing standalone components using the Viash executable

Another option to execute individual modules on the CLI is to use
`viash run`. All you need to do is download viash, clone the
Openpipeline repository and point viash to a config file. However, keep
in mind that using `viash run` for workflows is currently not supported.
Please see `viash run --help` for more information on how to use the
command, but here is an example:

``` bash
viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml --help
```