# OpenPipeline

Extensible single cell analysis pipelines for reproducible and
large-scale single cell processing using Viash and Nextflow.

[Viash Hub](https://www.viash-hub.com/packages/openpipeline)
[GitHub](https://github.com/openpipelines-bio/openpipeline)
[License](https://github.com/openpipelines-bio/openpipeline/blob/main/LICENSE)
[Issues](https://github.com/openpipelines-bio/openpipeline/issues)
[Viash](https://viash.io)

## Documentation

Please find more in-depth documentation on [the
website](https://openpipelines.bio/).

## Functionality Overview

OpenPipeline workflows execute a list of predefined tasks. These discrete
steps are also provided as standalone components that can be executed
individually, with a standardized interface. This is especially useful
when a step wraps a tool that you do not always need to run in a
workflow context.

In terms of workflows, the following functionality is provided:

- Demultiplexing: conversion of raw sequencing data to FASTQ objects.
- [Ingestion](https://openpipelines.bio/fundamentals/architecture.html#sec-ingestion):
  read mapping and generating a count matrix.
- [Single sample
  processing](https://openpipelines.bio/fundamentals/architecture.html#sec-single-sample):
  cell filtering and doublet detection.
- [Multisample
  processing](https://openpipelines.bio/fundamentals/architecture.html#sec-multisample-processing):
  count transformation, normalization, QC metric calculations.
- [Integration](https://openpipelines.bio/fundamentals/architecture.html#sec-intergration):
  clustering, integration and batch correction using single and
  multimodal methods.
- Downstream analysis workflows

``` mermaid
flowchart LR
  demultiplexing["Step 1: Demultiplexing"]
  ingestion["Step 2: Ingestion"]
  process_samples["Step 3: Process Samples"]
  integration["Step 4: Integration"]
  downstream["Step 5: Downstream"]
  demultiplexing-->ingestion-->process_samples-->integration-->downstream
```

## Guided execution using Viash Hub (CLI and Seqera cloud)

OpenPipeline is now available on [Viash
Hub](https://www.viash-hub.com/packages/openpipeline/latest). Viash Hub
provides a list of components and workflows, together with a graphical
interface that guides you through the steps of running a workflow or
standalone component. Instructions are provided for using a local viash
or nextflow executable (requires a Linux-based OS), but connecting
to a Seqera cloud instance is also supported.

## Execution using the nextflow executable

Executing a workflow directly with the nextflow executable is a bit more
involved and requires familiarity with the command line interface (CLI).

### Setup

To use the workflows in this package on your local computer, you will
need to do the following:

- Install [nextflow](https://www.nextflow.io/docs/latest/install.html)
- Install a nextflow compatible executor. This workflow provides a
  profile for [docker](https://docs.docker.com/get-started/).

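
Before launching anything, it can help to confirm that both prerequisites
are reachable from your shell. The snippet below is a minimal sketch;
`check_tools` is a hypothetical helper written for this example and is not
part of OpenPipeline.

``` bash
# Report which of the given executables can be found on PATH.
# Returns 0 when all are present and 1 when at least one is missing.
# NOTE: `check_tools` is a hypothetical helper, not an OpenPipeline command.
check_tools() {
  local missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the two prerequisites from the setup list.
check_tools nextflow docker || echo "Install the missing tools before continuing."
```

The check only looks for the executables. It does not verify versions or
whether the docker daemon is actually running.
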
### Location of the workflow scripts

Nextflow workflow scripts, schemas and configuration files can be found
in the `target/nextflow` folder. The `main` branch, however, only
contains the source code that still needs to be built into functioning
workflows and components. To find the `target` folders, please refer to
the `main_build` branch or any of the tags instead. Components and
workflows are organized into namespaces, which can be nested. Workflows
are located at `target/nextflow/workflows`, while components that
execute individual workflow steps are

A reference of workflows and modules is also provided in the
[documentation](https://openpipelines.bio/components/).

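
If you have a local checkout of the `main_build` branch or of a tag, the
available workflow entry points can be listed from the shell. This is a
minimal sketch that assumes every workflow ships a `main.nf` file, as in
the paths used elsewhere in this README; `list_workflows` is a
hypothetical helper name.

``` bash
# Print every workflow entry point (main.nf) below a build directory.
# The default path follows the target/nextflow/workflows layout described above.
# NOTE: `list_workflows` is a hypothetical helper, not an OpenPipeline command.
list_workflows() {
  local root="${1:-target/nextflow/workflows}"
  if [ ! -d "$root" ]; then
    echo "no such directory: $root (is a build branch or tag checked out?)" >&2
    return 1
  fi
  find "$root" -type f -name main.nf | sort
}

# Usage, from the root of a checkout that contains the target folder:
#   list_workflows
#   list_workflows target/nextflow/workflows/ingestion
```

Each printed path can be passed to nextflow with `-main-script`.
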
### Retrieving a list of workflow parameters

The arguments of a workflow can be consulted in multiple ways:

- On [Viash Hub](https://www.viash-hub.com/packages/openpipeline/latest)
- In the [reference
  documentation](https://openpipelines.bio/components/)
- In the config YAML file, which lists the arguments for each workflow
  and component
- In the `target/nextflow` folder, where a nextflow schema JSON file
  (`nextflow_schema.json`) is provided next to each workflow `.nf` file.
- Using nextflow on the CLI:

``` bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/ingestion/demux/main.nf \
  --help
```

### Resource usage tuning

Nextflow’s labels can be used to specify the amount of resources a
process can use. This workflow uses the following labels for CPU, memory
and disk:

- `verylowmem`, `lowmem`, `midmem`, `highmem`, `veryhighmem`
- `verylowcpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
- `lowdisk`, `middisk`, `highdisk`, `veryhighdisk`

The defaults for these labels can be found at
`src/workflows/utils/labels.config`. Nextflow checks that the resources
specified for a process do not exceed what is available on the machine
and will not start if they do. Create your own config file to tune the
labels to your needs, for example:

    process {
      // Resource labels; withLabel selectors belong inside the process scope
      withLabel: verylowcpu { cpus = 2 }
      withLabel: lowcpu { cpus = 8 }
      withLabel: midcpu { cpus = 16 }
      withLabel: highcpu { cpus = 16 }

      withLabel: verylowmem { memory = 4.GB }
      withLabel: lowmem { memory = 8.GB }
      withLabel: midmem { memory = 16.GB }
      withLabel: highmem { memory = 32.GB }
    }

When starting nextflow using the CLI, you can use `-c` to provide the
file to nextflow and override the defaults.

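
As a minimal sketch of that workflow, the snippet below writes a small
config file and prints the command that would use it. The file name
`my_resources.config` and the resource values are assumptions chosen for
illustration; adjust them to your machine.

``` bash
# Write a custom resource config. File name and values are illustrative assumptions.
config_file="my_resources.config"
cat > "$config_file" <<'EOF'
process {
  withLabel: lowcpu { cpus = 4 }
  withLabel: lowmem { memory = 8.GB }
}
EOF

# Print the command instead of running it, so it can be inspected first.
echo nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/ingestion/demux/main.nf \
  -c "$config_file" \
  --help
```

Only the labels you list are changed. All other labels keep the defaults
that ship with the workflow.
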
### Demultiplexing example

This example demonstrates generating FASTQ files from raw sequencing
data, based on data generated using 10x Genomics protocols. BD
genomics data is also supported by OpenPipeline. If you wish to try it
out yourself, test data is available at
`s3://openpipelines-data/cellranger_tiny_bcl/bcl`.

``` bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/ingestion/demux/main.nf \
  -c "<path to resource config file>" \
  -profile docker \
  --publish_dir "<path to output directory>" \
  --id "cellranger_tiny_bcl" \
  --input "s3://openpipelines-data/cellranger_tiny_bcl/bcl" \
  --sample_sheet "s3://openpipelines-data/cellranger_tiny_bcl/bcl/sample_sheet.csv" \
  --demultiplexer "mkfastq"
```

### Mapping and read counting

FASTQ files can be mapped to a reference genome, and the resulting mapped
reads can be counted to generate a count matrix. Both
`BD Rhapsody` and `Cell Ranger` are supported. Here, we demonstrate
using Cell Ranger multi on test data available at
`s3://openpipelines-data/10x_5k_anticmv`.

To make it easier to pass multiple argument values, the parameters
can be specified using a YAML file.

``` yaml
input:
  - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_*.fastq.gz"
  - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_*.fastq.gz"
  - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_VDJ_*.fastq.gz"
gex_reference: "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz"
vdj_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz"
feature_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/feature_reference.csv"
library_id:
  - "5k_human_antiCMV_T_TBNK_connect_GEX_1_subset"
  - "5k_human_antiCMV_T_TBNK_connect_AB_subset"
  - "5k_human_antiCMV_T_TBNK_connect_VDJ_subset"
library_type:
  - "Gene Expression"
  - "Antibody Capture"
  - "VDJ"
```

You can pass this file to nextflow using `-params-file`:

``` bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/ingestion/cellranger_multi/main.nf \
  -c "<path to resource config file>" \
  -profile docker \
  -params-file "<path to your parameter YAML file>" \
  --publish_dir "<path to output directory>"
```

### Filtering, normalization, clustering, dimensionality reduction and QC calculations (w/o integration)

Once you have a MuData object for each of your samples, you can process
them into a multisample file that is ready for integration and other
downstream analyses. This can be done using the `process_samples`
workflow. Here is an example, but please keep in mind that the exact
parameters that need to be provided differ depending on your data. Much
of this pipeline's functionality can be customized, including the names
of the output slots where data is stored.

``` yaml
param_list:
  - id: "sample_1"
    input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
    rna_min_counts: 2
  - id: "sample_2"
    input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
    rna_min_counts: 1
rna_max_counts: 1000000
rna_min_genes_per_cell: 1
rna_max_genes_per_cell: 1000000
rna_min_cells_per_gene: 1
rna_min_fraction_mito: 0.0
rna_max_fraction_mito: 1.0
```

To provide multiple samples to the pipeline, `param_list` is
used. Using `param_list`, it is possible to specify arguments per sample.
However, it is still possible to define arguments for all samples
together by listing those outside the `param_list` block.

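
With many samples, writing the `param_list` entries by hand becomes
tedious. The snippet below is a minimal sketch that generates them,
assuming one `.h5mu` file per sample in a single local directory and
using the file name (without extension) as the sample id.
`make_param_list` is a hypothetical helper written for this example.

``` bash
# Emit a param_list YAML block with one entry per .h5mu file in a directory.
# The sample id is derived from the file name; this naming rule is an assumption.
# NOTE: `make_param_list` is a hypothetical helper, not an OpenPipeline command.
make_param_list() {
  local dir="$1"
  echo "param_list:"
  for f in "$dir"/*.h5mu; do
    # Skip the unexpanded pattern when the directory has no .h5mu files.
    [ -e "$f" ] || continue
    printf '  - id: "%s"\n' "$(basename "$f" .h5mu)"
    printf '    input: "%s"\n' "$f"
  done
}

# Usage:
#   make_param_list ./samples > params.yaml
```

Per-sample arguments and arguments shared by all samples still need to be
added to the generated file before using it.
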
``` bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/multiomics/process_samples/main.nf \
  -c "<path to resource config file>" \
  -profile docker \
  -params-file "<path to your parameter YAML file>" \
  --publish_dir "<path to output directory>"
```

## Executing standalone components using the Viash executable

Another way to execute individual modules on the CLI is to use
`viash run`. All you need to do is download viash, clone the
OpenPipeline repository and point viash to a config file. However, keep
in mind that using `viash run` for workflows is currently not supported.
Please see `viash run --help` for more information on how to use the
command, but here is an example:

``` bash
viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml --help
```