# openpipelines 2.x.x (Unreleased)

## BREAKING CHANGES

* Added cell multiplexing support to the `from_cellranger_multi_to_h5mu` component and the `cellranger_multi` workflow. These components now output multiple .h5mu files. The `output` and `output_h5mu` arguments respectively now require a value containing a wildcard character `*`, which will be replaced by the sample ID to form the final output file names. Additionally, a `sample_csv` argument is added to the `from_cellranger_multi_to_h5mu` component which describes the sample name per output file (PR #803).

* `demux/bcl_convert`: update BCL convert from 3.10 to 4.2 (PR #774).

* `demux/cellranger_mkfastq`, `mapping/cellranger_count`, `mapping/cellranger_multi` and `reference/build_cellranger_reference`: update cellranger to `8.0.1` (PR #774 and PR #811).

* Removed `--disable_library_compatibility_check` in favour of `--check_library_compatibility` for the `mapping/cellranger_multi` component and the `ingestion/cellranger_multi` workflow (PR #818).

* `lianapy`: bumped version to `1.3.0` (PR #827 and PR #862). Additionally, `groupby` is now a required argument.

* `concat`: this component was deprecated and has now been removed, use `concatenate_h5mu` instead (PR #796).

* The `workflows` folder in the root of the project no longer contains symbolic links to the built workflows in `target`. Using any workflow that was previously linked in this directory now results in an error that indicates the location of the workflow to use instead (PR #796).

* `XGBoost`: bump version to `2.0.3` (PR #646).

* Several components: update anndata to `0.10.8` and mudata to `0.2.3` (PR #645).

* `filter/filter_with_hvg`: this component was deprecated and has now been removed. Use `feature_annotation/highly_variable_features_scanpy` instead (PR #843).

* `dataflow/concat`: this component was deprecated and has now been removed. Use `dataflow/concatenate_h5mu` instead (PR #857).
* `convert/from_h5mu_to_seurat`: bump seurat to latest version (PR #850).

* `workflows/ingestion/bd_rhapsody`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).

* `mapping/bd_rhapsody`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the component (PR #846).

* `reference/make_bdrhap_reference`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the component (PR #846).

* `reference/build_star_reference`: Rename `mapping/star_build_reference` to `reference/build_star_reference` (PR #846).

* `reference/cellranger_mkgtf`: Rename `reference/mkgtf` to `reference/cellranger_mkgtf` (PR #846).

* `labels_transfer/xgboost`: Align interface with new annotation workflow:
  - Store label probabilities instead of uncertainties
  - Take `.h5mu` format as an input instead of `.h5ad`

* `reference/build_cellranger_arc_reference`: a default value of "output" is now specified for the argument `--genome`, in line with the `reference/build_cellranger_reference` component. Additionally, providing a value for `--organism` is no longer required and its default value of `Homo Sapiens` has been removed (PR #864).

## NEW FUNCTIONALITY

* Added `demux/cellranger_atac_mkfastq` component: demultiplex raw sequencing data for ATAC experiments (PR #726).

* `process_samples`, `process_batches` and `rna_multisample` workflows: added functionality to scale the log-normalized gene expression data to unit variance and zero mean. The scaled data will be output to a different layer and the representation with reduced dimensions will be created and stored in addition to the non-scaled data (PR #733).

* `transform/scaling`: add `--input_layer` and `--output_layer` arguments (PR #733).

* CI: added checking of mudata contents for multiple workflows (PR #783).

* Added multiple arguments to the `cellranger_multi` workflow in order to maintain feature parity with the `mapping/cellranger_multi` component (PR #803).
* `convert/from_cellranger_to_h5mu`: add support for antigen analysis.

* Added `reference/build_cellranger_arc_reference` component: build reference file compatible with ATAC and ATAC+GEX experiments (PR #726).

* `demux/bcl_convert`: add support for no lane splitting (PR #804).

* `reference/cellranger_mkgtf` component: Added cellranger mkgtf as a standalone component (PR #771).

* `scgpt/cross_check_genes` component: Added a gene-model cross check component for scGPT (PR #758).

* `scgpt/embedding` component: Added scGPT embedding component (PR #761).

* `scgpt/tokenize_pad` component: Added scGPT padding and tokenization component (PR #754).

* `scgpt/binning` component: Added a scGPT pre-processing binning component (PR #765).

* `workflows/integration/scgpt_leiden`: Added a workflow with scGPT integration followed by Leiden clustering (PR #794).

* `scgpt/cell_type_annotation` component: Added scGPT cell type annotation component (PR #798).

* `resources_test_scripts/scGPT.sh`: Added script to include scGPT test resources (PR #800).

* `transform/clr` component: Added the option to set the `axis` along which to apply CLR. Possible to override on workflow level as well (PR #767).

* `annotate/celltypist` component: Added a CellTypist annotation component (PR #825).

* `dataflow/split_h5mu` component: Added a component to split a single h5mu file into multiple h5mu files based on the values of an .obs column (PR #824).

* `workflows/test_workflows/ingestion` components & `workflows/ingestion`: Added standalone components for integration testing of ingestion workflows (PR #801).

* `workflows/ingestion/make_reference`: Add additional arguments passed through to the STAR and BD Rhapsody reference components (PR #846).

* `annotate/random_forest_annotation` component: Added a random forest cell type annotation component (PR #848).
* `dataflow/concatenate_h5mu`: data from `.uns`, both originating from the global and per-modality slots, is now retained in the final concatenated output object. Additionally, added the `uns_merge_mode` argument in order to tune the behavior when conflicting keys are detected across samples (PR #859).

* `dimred/densmap` component: Added a densMAP dimensionality reduction component (PR #748).

* `annotate/scanvi` component: Added a component to annotate cells using scANVI (PR #833).

* `transform/bpcells_regress_out` component: Added a component to regress out effects of confounding variables in the count matrix using BPCells (PR #863).

* `transform/regress_out`: Allow providing 'input' and 'output' layers for scanpy regress_out functionality (PR #863).

* `workflows/ingestion/make_reference`: add possibility to build CellRanger ARC references. Added `--motifs_file`, `--non_nuclear_contigs` and `--output_cellranger_arc` arguments (PR #864).

* Test resources (reference_gencodev41_chr1): switch reference genome for CellRanger to ARC variant (PR #864).

* Added `transform/tfidf` component: normalize ATAC data with TF-IDF (PR #870).

* Added `dimred/lsi` component (PR #552).

* `metadata/copy_obs` component: Added a component to copy an .obs column from a MuData object to another (PR #874).

* `annotate/onclass` component: Added a component to annotate cell types using OnClass (PR #844).

* `annotate/svm` component: Added a component to annotate cell types using support vector machine (SVM) (PR #845).

* `metadata/duplicate_var` component: Added a component to make a copy from one .var field or index to another .var field within the same MuData object (PR #877).
* `filter/subset_obsp` component: Added a component to subset an .obsp matrix by column based on the value of an .obs field. The resulting subset is moved to an .obsm field (PR #888).

* `labels_transfer/knn` component: Enable using additional distance functions for KNN classification (PR #830) and allow performing KNN classification based on a pre-calculated neighborhood graph (PR #890).

## MINOR CHANGES

* `resources_test_scripts/cellranger_atac_tiny_bcl.sh` script: generate counts from fastq files using CellRanger atac count (PR #726).

* `neighbors/find_neighbors` component: Modified to include results of KNN in the output file (PR #748). Two new optional arguments were added to set the .obsm slots in which the KNN results are saved:
  - `obsm_knn_indices`
  - `obsm_knn_distances`

* `cellbender_remove_background_v0_2`: update base image to `nvcr.io/nvidia/pytorch:23.12-py3` (PR #646).

* Bump scvelo to `0.3.2` (PR #828).

* Pin numpy<2 for several components (PR #815).

* Added `resources_test_scripts/cellranger_atac_tiny_bcl.sh` script: download tiny bcl file with an ATAC experiment, download a motifs file, demultiplex bcl files to reads in fastq format (PR #726).

* `mapping/cellranger_multi` component now outputs logs on failure of the `cellranger multi` process (PR #766).

* Bump `viash-actions` to `v6` (PR #821).

* `reference/make_reference`: Do not try to extract genome fasta and transcriptome gtf if they are not gzipped (PR #856).

* Changes related to syncing the test resources (PR #867):
  - Add `.info.test_resources` to `_viash.yaml` to specify where test resources need to be synced from.
  - `download/sync_test_resources`: Use `.info.test_resources` in `_viash.yaml` to detect where test resources need to be synced from.
  - Update CI to use `project/sync-and-cache` instead of `project/sync-and-cache-s3`.

## BUG FIXES

* Fix failing tests for `ingestion/cellranger_postprocessing`, `ingestion/conversion` and `multiomics/process_batches` (PR #869).
* `convert/from_10xh5_to_h5mu`: add .uns slot to mdata root when metrics file is provided (PR #887).

* Use `params.resources_test` in test workflows in order to point to an alternative location (e.g. a cache).

* Fix ingestion components not working when optional arguments are unset (PR #894).

## DOCUMENTATION

* Update authorship of components (PR #835).

# openpipelines 1.0.3

## BUG FIXES

* `qc/calculate_qc_metrics`: increase total counts accuracy with low precision floating dtypes as input layer (PR # , backported from PR #852).

# openpipelines 1.0.2

## BUG FIXES

* `dataflow/concatenate_h5mu`: fix writing out multidimensional annotation dataframes (e.g. `.varm`) that had their data type (dtype) changed as a result of adding more observations after concatenation, causing `TypeError`. One notable example of this happening is when one of the samples does not have a multidimensional annotation dataframe which is present in another sample, causing the values to be filled with `NA` (PR #842, backported from PR #837).

# openpipelines 1.0.1

## BUG FIXES

* Bump viash to `0.8.6` (PR #816, backported from PR #815). This changes the at-runtime generated nextflow process from an in-memory to an on-disk temporary file, which should cause fewer issues with Nextflow Fusion.

# openpipelines 1.0.0-rc6

## BUG FIXES

* `dataflow/concatenate_h5mu`: fix regression bug where observations are no longer linked to the correct metadata after concatenation (PR #807).

* `transform/normalize_total` component: pass the `target_sum` argument to `sc.pp.normalize_total()` (PR #823).

# openpipelines 1.0.0-rc5

## BUG FIXES

* `cluster/leiden`: prevent leiden component from hanging when a child process is killed (e.g. when there is not enough memory available) (PR #805).

# openpipelines 1.0.0-rc4

## BREAKING CHANGES

* `query/cellxgene_census`: Refactored the interface, documentation and internal workings of this component (PR #621).
  - Renamed arguments to align with standard OpenPipelines notations and cellxgene census API:
    - `--input_database` became `--input_uri`
    - `--cellxgene_release` became `--census_version`
    - `--cell_query` became `--obs_value_filter`
    - `--cells_filter_columns` became `--cell_filter_grouping`
    - `--min_cells_filter_columns` became `--cell_filter_minimum_count`
    - `--modality` became `--output_modality`
  - Removed `--dataset_id` since it was no longer being used.
  - Added `--add_dataset_meta` to add metadata to the output MuData object.
  - Documentation of the component and its arguments was improved.

## BUG FIXES

* `mapping/cellranger_multi`: Fix the regex for the fastq input files to allow dropping the lane from the input file names (e.g. `_L001`) (PR #778).

* `workflows/rna/rna_singlesample`: Fix argument passing `top_n_vars` and `obs_name_mitochondrial_fraction` to the `qc` subworkflow (PR #779).

# openpipelines 1.0.0-rc3

## BREAKING CHANGES

* Docker image names now use `/` instead of `_` between the name of the component and the namespace (PR #712).

## BUG FIXES

* `rna_singlesample`: fixed a bug where selecting the column for the filtering with mitochondrial fractions using `obs_name_mitochondrial_fraction` was done with the wrong column name, causing `ValueError` (PR #743).

* Fix publishing in `process_samples` and `process_batches` (PR #759).

## NEW FUNCTIONALITY

* `dimred/tsne` component: Added a tSNE dimensionality reduction component (PR #742).

# openpipelines 1.0.0-rc2

## BUG FIXES

* Cellranger multi: Fix using a relative input path for `--vdj_inner_enrichment_primers` (PR #717).

* `dataflow/split_modalities`: remove unused `compression` argument. Use `output_compression` instead (PR #714).

* `metadata/grep_annotation_column`: fix calculating fraction when an input observation has no counts, which caused the result to be out of bounds.

* Fix `--output` argument not working for several workflows (PR #740).
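
To illustrate the zero-count case addressed by the `metadata/grep_annotation_column` fix above, here is a minimal Python sketch. It is not the component's actual implementation: the function name `matched_fraction` and the choice to report `0.0` for an observation without counts are assumptions made for this example.

```python
def matched_fraction(matched_counts, total_counts):
    """Return, per observation, the fraction of counts coming from matched features.

    Hypothetical helper: observations without any counts get 0.0 (an assumption)
    instead of triggering a division by zero.
    """
    if len(matched_counts) != len(total_counts):
        raise ValueError("Both inputs must describe the same observations.")
    fractions = []
    for matched, total in zip(matched_counts, total_counts):
        # Guard against observations that have no counts at all.
        fractions.append(matched / total if total > 0 else 0.0)
    return fractions


# The second observation has no counts and still yields a value within [0, 1].
print(matched_fraction([5, 0, 2], [10, 0, 8]))  # prints [0.5, 0.0, 0.25]
```

Whether an empty observation should map to `0.0` or to a missing value is a design decision; the sketch only shows that the result stays within bounds.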
## MINOR CHANGES

* `metadata/grep_annotation_column`: Added more logging output (PR #697).

* `metadata/add_id` and `metadata/grep_annotation_column`: Bump python to 3.11 (PR #697).

* Bump viash to 0.8.5 (PR #697).

* `dataflow/split_modalities`: add more logging output and bump python to 3.12 (PR #714).

* `correction/cellbender`: Update nextflow resource labels from `singlecpu` and `lowmem` to `midcpu` and `midmem` (PR #736).

# openpipelines 1.0.0rc1

## BREAKING CHANGES

* Change separator for arguments with multiple inputs from `:` to `;` (PR #700 and #707). Now, _all_ arguments with `multiple: true` will use `;` as the separator. This change was made to be able to deal with file paths that contain `:`, e.g. `s3://my-bucket/my:file.txt`. Furthermore, the `;` separator will become the default separator for all arguments with `multiple: true` in Viash >= 0.9.0.

* This project now uses viash version 0.8.4 to build components and workflows. Changes related to this version update should be _mostly_ backwards compatible with respect to the results and execution of the pipelines. From a development perspective, drastic updates have been made to the development workflow.

Development related changes:

* Bump viash version to 0.8.4 (PR #598, PR #638 and PR #706) in the project configuration.

* All pipelines no longer use the anonymous workflow. Instead, these workflows were given a name which was added to the viash config as the entrypoint to the pipeline (PR #598).

* Removed the `workflows` folder and moved its contents to new locations:

  1. The `resources_test_scripts` folder now resides in the root of the project (PR #605).

  2. All workflows have been moved to the `src/workflows` folder (PR #605). This implies that workflows must now be built using `viash (ns) build`, just like with components.

  3. Adjust GitHub Actions to account for new workflow paths (PR #605).

  4. In order to be backwards compatible, the `workflows` folder now contains symbolic links to the built workflows in `target`. This is not a problem when using the repository for pipeline execution. However, if a developer wishes to contribute to the project, symlink support should be enabled in git using `git config core.symlinks true`. Alternatively, use `git clone -c core.symlinks=true git@github.com:openpipelines-bio/openpipeline.git` when cloning the repository. This avoids the symlinks being resolved (PR #628).

     4bis. With PR #668, the workflows have been renamed. This does not hamper the backwards compatibility of the symlinks that have been described in 4, because they still use the original location which includes the original name.
       * `multiomics/rna_singlesample` has been renamed to `rna/rna_singlesample`,
       * `multiomics/rna_multisample` has been renamed to `rna/rna_multisample`,
       * `multiomics/prot_multisample` became `prot/prot_multisample`,
       * `multiomics/prot_singlesample` became `prot/prot_singlesample`,
       * `multiomics/full_pipeline` was moved to `multiomics/process_samples`,
       * `multiomics/multisample` has been renamed to `multiomics/process_batches`,
       * `multiomics/integration/initialize_integration` changed to `multiomics/dimensionality_reduction`,
       * finally, all workflows at `multiomics/integration/*` were moved to `integration/*`

  5. Removed the `workflows/utils` folder. Functionality that was provided by the `DataflowHelper` and `WorkflowHelper` is now being provided by viash when the workflow is being built (PR #605).

End-user facing changes:

* The `concat` component has been deprecated and will be removed in a future release. Its functionality has been copied to the `concatenate_h5mu` component because the name is in conflict with the `concat` operator from nextflow (PR #598).

* `prot_singlesample`, `rna_singlesample`, `prot_multisample` and `rna_multisample`: QC statistics are now only calculated once where needed.
  This means that the mitochondrial gene detection is performed in the `rna_singlesample` pipeline and the other count based statistics are calculated during the `prot_multisample` and `rna_multisample` pipelines. In both cases, the `qc` pipeline is being used, but only parts of that workflow are activated by parametrization. Previously the count based statistics were calculated in both the `singlesample` and `multisample` pipelines, with the results from the multisample pipelines overwriting the previous results. The breaking part of this change is that the QC statistics are no longer added to the results of the singlesample workflows. This is _not_ an issue when using the `full_pipeline` because in this case the singlesample and multisample workflows are executed in tandem. If you wish to execute the singlesample workflows separately and still include count based statistics, please run the `qc` pipeline on the result of the singlesample workflow (PR #604).

* `filter/filter_with_hvg` has been renamed to `feature_annotation/highly_variable_features_scanpy`, along with the following changes (PR #667).
  - `--do_filter` was removed
  - `--n_top_genes` has been renamed to `--n_top_features`

* `full_pipeline`, `multisample` and `rna_multisample`: Renamed arguments (PR #667).
  - `--filter_with_hvg_var_output` became `--highly_variable_features_var_output`
  - `--filter_with_hvg_obs_batch_key` became `--highly_variable_features_obs_batch_key`

* `rna_multisample`: Renamed arguments (PR #667).
  - `--filter_with_hvg_n_top_genes` became `--highly_variable_features_n_top_features`
  - `--filter_with_hvg_flavor` became `--highly_variable_features_flavor`

* Renamed `obsm_metrics` to `uns_metrics` for the `cellranger_mapping` workflow because the cellranger metrics are stored in `.uns` and not `.obsm` (PR #610).
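
The motivation for the separator change listed at the top of these breaking changes can be shown with a short Python sketch. The helper `split_multiple` and the second path `local/sample_2.h5mu` are hypothetical; only `s3://my-bucket/my:file.txt` comes from the entry itself, and the real parsing is handled by Viash rather than by code like this.

```python
def split_multiple(value, sep=";"):
    """Split the raw value of a `multiple: true` argument into its items.

    Hypothetical helper for illustration only.
    """
    return [item for item in value.split(sep) if item]


# With the old `:` separator, a path that itself contains `:` is torn apart.
old_style = "s3://my-bucket/my:file.txt:local/sample_2.h5mu"
print(split_multiple(old_style, sep=":"))
# prints ['s3', '//my-bucket/my', 'file.txt', 'local/sample_2.h5mu']

# With the new `;` separator, both paths survive intact.
new_style = "s3://my-bucket/my:file.txt;local/sample_2.h5mu"
print(split_multiple(new_style))
# prints ['s3://my-bucket/my:file.txt', 'local/sample_2.h5mu']
```

When passing such values on a command line, the `;` usually needs quoting so the shell does not interpret it as a command separator.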
## MAJOR CHANGES

* `mapping/cellranger_mkfastq`: update from cellranger `6.0.2` to `7.0.1` (PR #675).

## NEW FUNCTIONALITY

* `multisample` pipeline: This workflow now works when provided multiple unimodal files or multiple multimodal files, in addition to the previously supported single multimodal file (PR #606). The modalities are processed independently from each other:
  - As before, a single multimodal file is split into several unimodal MuData objects, each modality being stored in a file.
  - (New) When multiple unimodal files are provided, they can be used as is.
  - (New) Mosaic input (i.e. multiple uni- or multimodal files) is split into unimodal files. Providing the same modality twice is not supported however, meaning the modalities should be unique. For example, using `input: ["data1.h5mu", "data2.h5mu"]` with `data1.h5mu` providing data for `rna` and `atac` and `data2.h5mu` for `rna` and `prot` will not work, because the `rna` modality is present in both input files.

* `multisample` workflow: throw an error when argument values for the merge component or the `initialize_integration` workflow differ between the inputs (PR #606).

* Added a `split_modalities` workflow in order to split a multimodal mudata file into several unimodal mudata files. Its behavior is identical to the `split_modalities` component, but it also provides functionality to make sure everything works when nextflow's `-stub` option is enabled (PR #606).

* All workflows now use `dependencies` to handle includes from other workflows (PR #606).

* `qc/calculate_qc_metrics`: allow setting the output column names and disabling the calculation of several metrics (PR #644).

* `rna_multisample`, `prot_multisample` and `qc` workflows: allow setting the output column names and disabling the calculation of several metrics (PR #606).

* `cluster/leiden`: Allow calculating multiple resolutions in parallel (PR #645).
* `rna_multisample` workflow: added `--modality` argument (PR #607).

* `multisample` workflow: in addition to using multimodal files as input, this workflow now also accepts a list of files. The list of files must be the unimodal equivalents of a split multimodal file. The modalities in the list must be unique and after processing the modalities will be merged into multimodal files (PR #606).

* Added `filter/intersect_obs` component which removes observations that are not shared between modalities (PR #589).

* Re-enable `convert/from_h5mu_to_seurat` component (PR #616).

* Added the `gdo_singlesample` pipeline with basic count filtering (PR #672).

* `process_samples` pipeline: the `--rna_layer`, `--prot_layer` and `--gdo_layer` arguments can now be used to specify an alternative layer to .X where the raw data are stored. To enable this feature, the following changes were required:
  - Added `transform/move_layer` component.
  - `filter/filter_with_scrublet`: added `--layer` argument.
  - `transform/clr`: added `--input_layer` argument.
  - `metadata/grep_annotation_column`: added `--input_layer` argument.
  - `rna/rna_singlesample`, `rna/rna_multisample`, `prot/prot_singlesample` and `prot/prot_multisample`: add `--layer` argument.
  - `process_batches`: Added `rna_layer` and `prot_layer` arguments.

* Enable dataset functionality for nf-tower (PR #701).

* Added `annotate/score_genes` and `annotate/score_genes_cell_cycle` to calculate scanpy gene scores (PR #703).

## MINOR CHANGES

* Refactored `rna_multisample` (PR #607), `cellranger_multi` (PR #609), `cellranger_mapping` (PR #610) and other (PR #606) pipelines to use `fromState` and `toState` functionality.

* `metadata/add_id`: add more runtime logging (PR #663).

* `cluster/leiden`: Bump python to 3.11 and leidenalg to 0.10.0 (PR #645).
* `mapping/htseq_count_to_h5mu` and `multi_star`: update polars and gtfparse (PR #642).

* Pin `from_h5mu_to_seurat` to Seurat version 4 (PR #630).

* `velocity/scvelo`: bump scvelo to 0.3.1 and python to 3.10 (PR #640).

* Updated the Viash YAML schemas to the latest version of Viash (PR #620).

* `build_cellranger_reference` and `build_bdrhap_reference`: Bump go version to `1.21.4` when building seqkit for testing the component (PR #624 and PR #637).

* `correction/cellbender_remove_background`: Remove `muon` as a test dependency (PR #636).

* (Automatic testing) Update viashpy to 0.6.0 (PR #665).

* `integrate/scarches`, `integrate/scvi`, `velocity/scvelo` and `integrate/totalvi`: pin jax, jaxlib to `<0.4.23` (PR #699).

* `integrate/scvi`: Unpin `numba` and pin scvi-tools to `1.0.3` (PR #699).

* `integrate/totalvi`: Enable GPU-accelerated computing, unpin `torchmetrics` and pin jax, jaxlib to `<0.4.23` (PR #699).

## BUG FIXES

* `transform/log1p`: fix `--input_layer` argument not functioning (PR #678).

* `dataflow/concat` and `dataflow/concatenate_h5mu`: Fix an issue where using `--mode move` on samples with non-overlapping features would cause `var_names` to become unaligned to the data (PR #653).

* `filter/filter_with_scrublet`: (Testing) Fix duplicate test function names (PR #641).

* `dataflow/concatenate_h5mu` and `dataflow/concat`: Fix `TypeError` when using mode 'move' and a column with conflicting metadata does not exist across all samples (PR #631).

* `dataflow/concatenate_h5mu` and `dataflow/concat`: Fix an issue where joining columns with different datatypes caused `TypeError` (PR #619).

* `qc/calculate_qc_metrics`: Resolved an issue where statistics based on the input columns selected with `--var_qc_metrics` were incorrect when these input columns were encoded in `pd.BooleanDtype()` (PR #685).

* `move_obsm_to_obs`: fix setting output columns when they already exist (PR #690).
* `workflows/dimensionality_reduction` workflow: the nearest neighbour calculation no longer recalculates the PCA when `obm_input` `--obsm_pca` is not set to `X_pca`.

* `feature_annotation/highly_variable_scanpy`: fix .X being used to remove observations with 0 counts when `--layer` has been specified.

* `filter/filter_with_counts`: fix `--layer` argument not being used.

* `transform/normalize_total`: fix incorrect layer being written to the output when the input layer is not `.X`.

* `src/workflows/qc`: fix input layer not being taken into account when calculating the fraction of mitochondrial genes (always used .X).

* `convert/from_cellranger_multi_to_h5mu`: fix metric values not represented as percentages being divided by 100 (#704).

# openpipelines 0.12.1

## BUG FIXES

* `rna_singlesample`: Fix filtering parameter values `min_counts`, `max_counts`, `min_genes_per_cell`, `max_genes_per_cell` and `min_cells_per_gene` not being passed to the `filter_with_counts` component (PR #614).

* `prot_singlesample`: Fix filtering parameter values `min_counts`, `max_counts`, `min_proteins_per_cell`, `max_proteins_per_cell` and `min_cells_per_protein` not being passed to the `filter_with_counts` component (PR #614).

# openpipelines 0.12.0

## BREAKING CHANGES

The detection of mitochondrial genes has been revisited in order to remove the interdependency with the count filtering and the QC metric calculation. Implementing this change involved breaking some existing functionality:

* `filter/filter_with_counts`: removed `--var_gene_names`, `--mitochondrial_gene_regex`, `--var_name_mitochondrial_genes`, `--min_fraction_mito` and `--max_fraction_mito` (PR #585).

* `workflows/prot_singlesample`: removed `--min_fraction_mito` and `--max_fraction_mito` because regex-based detection of mitochondrial genes is not possible (PR #585).
* The fraction of counts that originated from mitochondrial genes used to be written to an .obs column with a name formed by `pct_` followed by the name of the mitochondrial gene column. The `--obs_name_mitochondrial_fraction` argument is introduced to change the destination column and the default prefix has changed from `pct_` to `fraction_` (PR #585).

## NEW FUNCTIONALITY

* `workflows/qc`: A pipeline to add basic qc statistics to a MuData object (PR #585).

* `workflows/rna_singlesample`: added `--obs_name_mitochondrial_fraction` and made sure that the values from `--max_fraction_mito` and `--min_fraction_mito` are bound between 0 and 1 (PR #585).

* Added `filter/delimit_fraction`: Turns an annotation column containing values between 0 and 1 into a boolean column based on thresholds (PR #585).

* Added `metadata/grep_annotation_column`: Perform a regex lookup on a column from the annotation matrices .obs or .var (PR #585).

* `workflows/full_pipelines`: added `--obs_name_mitochondrial_fraction` argument (PR #585).

* `workflows/prot_multisample`: added `--var_qc_metrics` and `--top_n_vars` arguments (PR #585).

* Added genetic demultiplexing methods `cellsnp`, `demuxlet`, `freebayes`, `freemuxlet`, `scsplit`, `souporcell` and `vireo` (PR #343).

## MINOR CHANGES

* Several components: bump scanpy to 1.9.5 (PR #594).

* Refactored `prot_multisample` and `prot_singlesample` pipelines to use `fromState` and `toState` functionality (PR #585).

# openpipelines 0.11.0

## BREAKING CHANGES

* Nextflow VDSL3: set `simplifyOutput` to `False` by default. This implies that components and workflows will output a hashmap with a sole "output" entry when there is only one output (PR #563).

* `integrate/scvi`: rename `model_output` argument to `output_model` in order to align with the `scvi_leiden` workflow. This also fixes a bug with the workflow where the argument did not function (PR #562).
## MINOR CHANGES

* `dataflow/concat`: reduce memory consumption when using `--other_axis_mode move` by processing only one annotation matrix (`.var`, `.obs`) at a time (PR #569).

* Update viashpy and pin it to `0.5.0` (PR #572 and PR #577).

* `convert/from_h5ad_to_h5mu`, `convert/from_h5mu_to_h5ad`, `dimred/pca`, `dimred/umap`, `filter/filter_with_counts`, `filter/filter_with_hvg`, `filter/remove_modality`, `filter/subset_h5mu`, `integrate/scanorama`, `transform/delete_layer` and `transform/log1p`: update python to `3.9` (PR #572).

* `integrate/scarches`: update base image, `scvi-tools` and `pandas` to `nvcr.io/nvidia/pytorch:23.09-py3`, `~=1.0.3` and `~=2.1.0` respectively (PR #572).

* `integrate/totalvi`: update python to 3.9 and scvi-tools to `~=1.0.3` (PR #572).

* `correction/cellbender_remove_background`: change base image to `nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu22.04` and downgrade MuData to 0.2.1 because it is the oldest version that uses python 3.7 (PR #575).

* Several integration workflows: prevent leiden from being executed when no resolutions are provided (PR #583).

* `dataflow/concat`: bump pandas to ~=2.1.1 and reduce memory consumption by only reading one modality into memory at a time (PR #568).

* `annotate/popv`: bump `jax` and `jaxlib` to `0.4.10`, scanpy to `1.9.4`, scvi to `1.0.3` and pin `ml-dtypes` to < 0.3.0 (PR #565).

* `velocity/scvelo`: pin matplotlib to < 3.8.0 (PR #566).

* `mapping/multi_star`: pin multiqc to 1.15.0 (PR #566).

* `mapping/bd_rhapsody`: pin pandas version to <2 (PR #563).

* `query/cellxgene_census`: replaced label `singlecpu` with label `midcpu`.

* `query/cellxgene_census`: avoid creating MuData object in memory by writing the modality directly to disk (PR #558).

* `integrate/scvi`: use `midcpu` label instead of `singlecpu` (PR #561).

## BUG FIXES

* `transform/clr`: raise an error when CLR fails to return the requested output (PR #579).
* `correction/cellbender_remove_background`: fix missing helper functionality when using Fusion (PR #575).

* `convert/from_bdrhap_to_h5mu`: Avoid `TypeError: Can't implicitly convert non-string objects to strings` by using categorical dtypes when a string column contains NA values (PR #563).

* `qc/calculate_qc_metrics`: fix calculating mitochondrial gene related QC metrics when only or no mitochondrial genes were found (PR #564).

# openpipelines 0.10.1

## MINOR CHANGES

* `integration/scvi_leiden`: Expose hvg selection argument `--var_input` (#543, PR #547).

## BUG FIXES

* `integration/bbknn_leiden`: Set leiden clustering parameter to multiple (#542, PR #545).

* `integration/scvi_leiden`: Fix component name in Viash config (PR #547).

* `integration/harmony_leiden`: Pass the `--uns_neighbors` argument to `umap` (PR #548).

* Add workaround for bug where resources aren't available when using Nextflow fusion by including `setup_logger`, `subset_vars` and `compress_h5mu` in the script itself (PR #549).

# openpipelines 0.10.0

## BREAKING CHANGES

* `workflows/full_pipeline`: removed `--prot_min_fraction_mito` and `--prot_max_fraction_mito` (PR #451).

* `workflows/rna_multisample` and `workflows/prot_multisample`: Removed concatenation from these pipelines. The input for these pipelines is now a single mudata file that contains data for multiple samples. If you wish to use this pipeline on multiple single-sample mudata files, you can use the `dataflow/concat` component on them first. This also implies that the ability to add ids to multiple single-sample mudata files prior to concatenation is no longer required, hence the removal of `--add_id_to_obs`, `--sample_id`, `--add_id_obs_output`, and `--add_id_make_observation_keys_unique` (PR #475).

* The `scvi` pipeline was renamed to `scvi_leiden` because `leiden` clustering was added to the pipeline (PR #499).

* Upgrade `correction/cellbender_remove_background` from CellBender v0.2 to CellBender v0.3.0 (PR #523).
  Between these versions, several arguments related to the slots of the output file have been changed.

## MAJOR CHANGES

* Several components: update anndata to 0.9.3 and mudata to 0.2.3 (PR #423).
* Base resources assigned for a process without any labels are now 1 CPU and 2GB (PR #518).
* Updated to Viash 0.7.5 (PR #513).
* Removed deprecated `variant: vdsl3` tags (PR #513).
* Removed unused `version: dev` (PR #513).
* `multiomics/integration/harmony_leiden`: Refactored data flow (PR #513).
* `ingestion/bd_rhapsody`: Refactored data flow (PR #513).
* `query/cellxgene_census`: increased returned metadata content, revised query option, added filtering strategy and refactored functionality (PR #520).
* Refactor loggers using `setup_logger()` helper function (PR #534).
* Refactor unittest tests to pytest tests (PR #534).

## MINOR CHANGES

* Add resource labels to several components (PR #518).
* `full_pipeline`: default value for `--var_qc_metrics` is now the combined values specified for `--mitochondrial_gene_regex` and `--filter_with_hvg_var_output`.
* `dataflow/concat`: reduce memory consumption by only reading one modality at a time (PR #474).
* Components that use CellRanger, BCL Convert or bcl2fastq: updated from Ubuntu 20.04 to Ubuntu 22.04 (PR #494).
* Components that use CellRanger: updated Picard to 2.27.5 (PR #494).
* `interprete/liana`: Update lianapy to 0.1.9 (PR #497).
* `qc/multiqc`: add unit tests (PR #502).
* `reference/build_cellranger_reference`: add unit tests (PR #506).
* `reference/build_bd_rhapsody_reference`: add unit tests (PR #504).

## NEW FUNCTIONALITY

* Added `compression/compress_h5mu` component (PR #530).
* Resource management: when a process exits with a status code between 137 and 140, retry the process with increased memory requirements. Memory scales by multiplying the base memory assigned to the process with the attempt number (PR #518 and PR #527).
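  The retry behaviour described in this entry can be approximated in a Nextflow configuration. The fragment below is a minimal sketch under stated assumptions, not the project's actual configuration: the 1 CPU and `2.GB` base values come from the unlabelled-process default mentioned in this release, while `maxRetries = 3` and the `'terminate'` fallback are hypothetical values chosen for illustration.

  ```groovy
  process {
    // Base resources for a process without labels (values taken from this release's notes).
    cpus = 1
    // Memory grows with the attempt number: 2 GB, then 4 GB, then 6 GB, ...
    memory = { 2.GB * task.attempt }
    // Retry only when the exit status is in the 137-140 range; otherwise stop (assumed fallback).
    errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    // Assumed retry budget; the changelog does not state the actual value.
    maxRetries = 3
  }
  ```

  Because `memory` is a closure, Nextflow re-evaluates it on every attempt, which is what makes the allocation scale with `task.attempt`.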
* `integrate/scvi`: Add `--n_hidden_nodes`, `--n_dimensions_latent_space`, `--n_hidden_layers`, `--dropout_rate`, `--dispersion`, `--gene_likelihood`, `--use_layer_normalization`, `--use_batch_normalization`, `--encode_covariates`, `--deeply_inject_covariates` and `--use_observed_lib_size` parameters.
* `filter/filter_with_counts`: add `--var_name_mitochondrial_genes` argument to store a boolean array corresponding to the detected mitochondrial genes.
* `full_pipeline` and `rna_singlesample` pipelines: add `--var_name_mitochondrial_genes`, `--var_gene_names` and `--mitochondrial_gene_regex` arguments to specify mitochondrial gene detection behaviour.
* `integrate/scvi`: Add `--obs_labels`, `--obs_size_factor`, `--obs_categorical_covariate` and `--obs_continuous_covariate` arguments (PR #496).
* Added `var_qc_metrics_fill_na_value` argument to `calculate_qc_metrics` (PR #477).
* Added `multiomics/multisample` pipeline to run multisample processing followed by the integration setup. It is considered an entrypoint into the full pipeline which skips the single-sample processing. The idea is to allow a re-run of these steps after a sample has already been processed by the `full_pipeline`. Keep in mind that samples that are provided as input to this pipeline are processed separately and are not concatenated. Hence, the input should be a concatenated sample (PR #475).
* Added `multiomics/integration/bbknn_leiden` workflow (PR #456).
* `workflows/prot_multisample` and `workflows/full_pipelines`: add basic QC statistics to prot modality (PR #485).
* `mapping/cellranger_multi`: Add tests for the mapping of Crispr Guide Capture data (PR #494).
* `convert/from_cellranger_multi_to_h5mu`: add `perturbation_efficiencies_by_feature` and `perturbation_efficiencies_by_target` information to the `.uns` slot of the `gdo` modality (PR #494).
* `convert/from_cellranger_multi_to_h5mu`: add `feature_reference` information to the MuData object. Information is split between the modalities.
  For example, `CRISPR Guide Capture` information is added to the `.uns` slot of the `gdo` modality, while `Antibody Capture` information is added to the `.uns` slot of `prot` (PR #494).
* Added `multiomics/integration/totalvi_leiden` pipeline (PR #500).
* Added totalVI component (PR #386).
* `workflows/full_pipeline`: Add `pca_overwrite` argument (PR #511).
* Add `main_build_viash_hub` action to build, tag, and push components and docker images for viash-hub.com (PR #480).
* `integration/bbknn_leiden`: Update state management to `fromState` / `toState` (PR #538).
* `mapping/cellranger_multi`: Add optional helper input: allow for passing modality specific inputs, from which library type and library id are inferred (PR #693).

## DOCUMENTATION

* `images`: Added images for various concepts, such as a sample, a cell, RNA, ADT, ATAC, VDJ (PR #515).
* `multiomics/rna_singlesample`: Add image for workflow (PR #515).
* `multiomics/rna_multisample`: Add image for workflow (PR #515).
* `multiomics/prot_singlesample`: Add image for workflow (PR #515).
* `multiomics/prot_multisample`: Add image for workflow (PR #515).

## BUG FIXES

* Fix an issue with `workflows/multiomics/scanorama_leiden` where the `--output` argument doesn't work as expected (PR #509).
* Fix an issue with `workflows/full_pipeline` not correctly caching previous runs (PR #460).
* Fix incorrect namespaces of the integration pipelines (PR #464).
* Fix an issue in several workflows where the `--output` argument would not work (PR #476).
* `integration/harmony_leiden` and `integration/scanorama_leiden`: Fix an issue where the prefix of the columns that store the leiden clusters was hardcoded to `leiden`, instead of adapting to the value for `--obs_cluster` (PR #482).
* `velocity/velocyto`: Resolve symbolic link before checking whether the transcriptome is a gzip (PR #484).
* `workflows/integration/scanorama_leiden`: fix an issue where `--obsm_input`, `--obs_batch`, `--batch_size`, `--sigma`, `--approx`, `--alpha` and `--knn` were not working because they were not passed through to the scanorama component (PR #487).
* `workflows/integration/scanorama_leiden`: fix leiden being calculated on the wrong embedding because the `--obsm_input` argument was not correctly set to the output embedding of scanorama (PR #487).
* `mapping/cellranger_multi`: Fix an issue where modalities did not have the proper name (PR #494).
* `metadata/add_uns_to_obs`: Fix `KeyError: 'ouput_compression'` error (PR #501).
* `neighbors/bbknn`: Fix `--input` not being a required argument (PR #518).
* Create `correction/cellbender_remove_background_v0.2` for legacy CellBender v0.2 format (PR #523).
* `integrate/scvi`: Ensure output has the same dimensionality as the input (PR #524).
* `mapping/bd_rhapsody`: Fix `--dryrun` argument not working (PR #534).
* `qc/multiqc`: Fix component not working for multiple inputs (PR #537). Also converted the Bash script to a Python script.
* `neighbors/bbknn`: Fix `--uns_output`, `--obsp_distances` and `--obsp_connectivities` not being processed correctly (PR #538).

# openpipelines 0.9.0

## BREAKING CHANGES

Running the integration in the `full_pipeline` was deemed impractical because a plethora of integration methods exist, which in turn interact with other functionality (like clustering). This generates a large number of possible use cases which one pipeline cannot cover in an easy manner. Instead, each integration method will be split into its own pipeline, and the `full_pipeline` will prepare for integration by performing steps that are required by many integration methods. Therefore, the following changes were performed:

* `workflows/full_pipeline`: `harmony` integration and `leiden` clustering are removed from the pipeline.
* Added `initialize_integration` to run calculations that output information commonly required by the integration methods. This pipeline runs PCA, nearest neighbours and UMAP. This pipeline is run as a subpipeline at the end of `full_pipeline`.
* Added `leiden_harmony` integration pipeline: run harmony integration followed by neighbour calculations and leiden clustering. Also runs umap on the result.
* Removed the `integration` pipeline. The old behavior of the `full_pipeline` can be obtained by running `full_pipeline` followed by the `leiden_harmony` pipeline.
* The `crispr` and `hashing` modalities have been renamed to `gdo` and `hto` respectively (PR #392).
* Updated Viash to 0.7.4 (PR #390).
* `cluster/leiden`: Output is now stored into `.obsm` instead of `.obs` (PR #431).

## NEW FUNCTIONALITY

* `cluster/leiden` and `integration/harmony_leiden`: allow running leiden multiple times with multiple resolutions (PR #431).
* `workflows/full_pipeline`: PCA, nearest neighbours and UMAP are now calculated for the `prot` modality (PR #396).
* `transform/clr`: added `output_layer` argument (PR #396).
* `workflows/integration/scvi`: Run scvi integration followed by neighbour calculations and run umap on the result (PR #396).
* `mapping/cellranger_multi` and `workflows/ingestion/cellranger_multi`: Added `--vdj_inner_enrichment_primers` argument (PR #417).
* `metadata/move_obsm_to_obs`: Move a matrix from an `.obsm` slot into `.obs` (PR #431).
* `integrate/scvi` validity checks for non-normalized input, obs and vars in order to proceed to training (PR #429).
* `schemas`: Added schema files for authors (PR #436).
* `schemas`: Added schema file for Viash configs (PR #436).
* `schemas`: Refactor author import paths (PR #436).
* `schemas`: Added schema file for file format specification files (PR #437).
* `query/cellxgene_census`: Query Cellxgene census component and save the results to a MuData file (PR #433).
## MAJOR CHANGES

* `report/mermaid`: Now uses `mermaid-cli` to generate images instead of creating a request to `mermaid.ink`. New `--output_format`, `--width`, `--height` and `--background_color` arguments were added (PR #419).
* All components that used `python` as base container: use `slim` version to reduce container image size (PR #427).

## MINOR CHANGES

* `integrate/scvi`: update scvi to 1.0.0 (PR #448).
* `mapping/multi_star`: Added `--min_success_rate`, which causes the component to fail when the fraction of successfully processed samples falls below this value (PR #408).
* `correction/cellbender_remove_background` and `transform/clr`: update muon to 0.1.5 (PR #428).
* `ingestion/cellranger_postprocessing`: split integration tests into several workflows (PR #425).
* `schemas`: Add schema file for author yamls (PR #436).
* `mapping/multi_star`, `mapping/star_build_reference` and `mapping/star_align`: update STAR from 2.7.10a to 2.7.10b (PR #441).

## BUG FIXES

* `annotate/popv`: Fix concat issue when the input data has multiple layers (#395, PR #397).
* `annotate/popv`: Fix indexing issue when MuData objects contain non-overlapping modalities (PR #405).
* `mapping/multi_star`: Fix issue where temp dir could not be created when group_id contains slashes (PR #406).
* `mapping/multi_star_to_h5mu`: Use glob to look for count files recursively (PR #408).
* `annotate/popv`: Pin `PopV`, `jax` and `jaxlib` versions (PR #415).
* `integrate/scvi`: the max_epochs is no longer required since it has a default value (PR #396).
* `workflows/full_pipeline`: fix `make_observation_keys_unique` parameter not being correctly passed to the `add_id` component, causing `ValueError: Observations are not unique across samples` during execution of the `concat` component (PR #422).
* `annotate/popv`: now sets `aprox` to `False` to avoid using `annoy` in scanorama because it fails on processors that are missing the AVX-512 instruction sets, causing `Illegal instruction (core dumped)`.
* `workflows/full_pipeline`: Avoid adding sample names to observation ids twice (PR #457).

# openpipelines 0.8.0

## BREAKING CHANGES

* `workflows/full_pipeline`: Renamed inconsistencies in argument naming (#372):
  - `rna_min_vars_per_cell` was renamed to `rna_min_genes_per_cell`
  - `rna_max_vars_per_cell` was renamed to `rna_max_genes_per_cell`
  - `prot_min_vars_per_cell` was renamed to `prot_min_proteins_per_cell`
  - `prot_max_vars_per_cell` was renamed to `prot_max_proteins_per_cell`
* `velocity/scvelo`: bump anndata from <0.8 to 0.9.

## NEW FUNCTIONALITY

* Added an extra label `veryhighmem` mostly for `cellranger_multi` with a large number of samples.
* Added `multiomics/prot_multisample` pipeline.
* Added `clr` functionality to `prot_multisample` pipeline.
* Added `interpret/lianapy`: Enables the use of any combination of ligand-receptor methods and resources, and their consensus.
* `filter/filter_with_scrublet`: Add `--allow_automatic_threshold_detection_fail`: when scrublet fails to detect doublets, the component will now put `NA` in the output columns.
* `workflows/full_pipeline`: Allow not setting the sample ID to the .obs column of the MuData object.
* `workflows/rna_multisample`: Add the ID of the sample to the .obs column of the MuData object.
* `correction/cellbender_remove_background`: add `obsm_latent_gene_encoding` parameter to store the latent gene representation.

## BUG FIXES

* `transform/clr`: fix anndata object instead of matrix being stored as a layer in output `MuData`, resulting in `NoneTypeError` object after reading the `.layers` back in.
* `dataflow/concat` and `dataflow/merge`: fixed a bug where boolean values were cast to their string representation.
* `workflows/full_pipeline`: fix running pipeline with `-stub`.
* Fixed an issue where passing a remote file URI (for example `http://` or `s3://`) as `param_list` caused `No such file` errors.
* `workflows/full_pipeline`: Fix incorrectly named filtering arguments (#372).
* `integrate/scvi`: Fix bug when subsetting using the `var_input` argument (PR #385).

## MINOR CHANGES

* `integrate/scarches`, `integrate/scvi` and `correction/cellbender_remove_background`: Update base container to `nvcr.io/nvidia/pytorch:22.12-py3`.
* `integrate/scvi`: add `gpu` label for nextflow platform.
* `integrate/scvi`: use cuda enabled `jax` install.
* `convert/from_cellranger_multi_to_h5mu`, `dataflow/concat` and `dataflow/merge`: update pandas to 2.0.0.
* `dataflow/concat` and `dataflow/merge`: Boolean and integer columns are now represented by the `BooleanArray` and `IntegerArray` dtypes in order to allow storing `NA` values.
* `interpret/lianapy`: use the latest development release (commit 11156ddd0139a49dfebdd08ac230f0ebf008b7f8) of lianapy in order to fix compatibility with numpy 1.24.x.
* `filter/filter_with_hvg`: Add error when specified input layer cannot be found in input data.
* `workflows/multiomics/full_pipeline`: publish the output from sample merging to allow running different integrations.
* CI: Remove various unused software libraries from runner image in order to avoid `no space left on device` (PR #425, PR #447).

# openpipelines 0.7.1

## NEW FUNCTIONALITY

* `integrate/scvi`: use `nvcr.io/nvidia/pytorch:22.09-py3` as base container to enable GPU acceleration.
* `integrate/scvi`: add `--model_output` to save model.
* `workflows/ingestion/cellranger_mapping`: Added `output_type` to output the filtered Cell Ranger data as h5mu, not the converted raw 10xh5 output.
* Several components: added `--output_compression` argument to set the compression of output .h5mu files.
* `workflows/full_pipeline` and `workflows/integration`: Added `leiden_resolution` argument to control the coarseness of the clustering.
* Added `--rna_theta` and `--rna_harmony_theta` to full and integration pipeline respectively in order to tune the diversity clustering penalty parameter for harmony integration.
* `dimred/pca`: fix `variance` slot containing a second copy of the variance ratio matrix and not the variances.

## BUG FIXES

* `mapping/cellranger_multi`: Fix an issue where using a directory as value for `--input` would cause `AttributeError`.
* `workflows/integration`: `init_pos` is no longer set to the integration layer (e.g. `X_pca_integrated`).

## MINOR CHANGES

* `integration` and `full` workflows: do not run harmony integration when `obs_covariates` is not provided.
* Add `highmem` label to `dimred/pca` component.
* Remove disabled `convert/from_csv_to_h5mu` component.
* Update to Viash 0.7.1.
* Several components: update to scanpy 1.9.2.
* `process_10xh5/filter_10xh5`: speed up build by using `eddelbuettel/r2u:22.04` base container.

## MAJOR CHANGES

* `dataflow/concat`: Renamed `--compression` to `--output_compression`.

# openpipelines 0.7.0

## MAJOR CHANGES

* Removed `bin` folder. As of viash 0.6.4, a `_viash.yaml` file can be included in the root of a repository to set common viash options for the project. These options were previously covered in the `bin/init` script, but this new feature of viash makes its use unnecessary. The `viash` and `nextflow` executables should now be installed in a directory that is included in your `$PATH`.

## MINOR CHANGES

* `filter/do_filter`: raise an error instead of printing a warning when providing a column for `var_filter` or `obs_filter` that doesn't exist.

## BUG FIXES

* `workflows/full_pipeline`: Fix setting .var output column for filter_with_hvg.
* Fix running `mapping/cellranger_multi` without passing all references.
* `filter/filter_with_scrublet`: now sets `use_approx_neighbors` to `False` to avoid using `annoy` because it fails on processors that are missing the AVX-512 instruction sets.
* `workflows`: Updated `WorkflowHelper` to newer version that allows applying defaults when calling a subworkflow from another workflow.
* Several components: pin matplotlib to <3.7 to fix scanpy compatibility (see https://github.com/scverse/scanpy/issues/2411).
* `workflows`: fix a bug where running a subworkflow from a workflow would cause the parent config to be read instead of the subworkflow config.
* `correction/cellbender_remove_background`: Fix description of input for cellbender_remove_background.
* `filter/do_filter`: resolved an issue where the .obs column instead of the .var column was being logged when filtering using the .var column.
* `workflows/rna_singlesample` and `workflows/prot_singlesample`: Correctly set var and obs columns while filtering with counts.
* `filter/do_filter`: removed the default input value for `var_filter` argument.
* `workflows/full_pipeline` and `workflows/integration`: fix PCA not using highly variable genes filter.

# openpipelines 0.6.2

## NEW FUNCTIONALITY

* `workflows/full_pipeline`: added `filter_with_hvg_obs_batch_key` argument for batched detection of highly variable genes.
* `workflows/rna_multisample`: added `filter_with_hvg_obs_batch_key`, `filter_with_hvg_flavor` and `filter_with_hvg_n_top_genes` arguments.
* `qc/calculate_qc_metrics`: Add basic statistics: `pct_dropout`, `num_zero_obs`, `obs_mean` and `total_counts` are added to .var. `num_nonzero_vars`, `pct_{var_qc_metrics}`, `total_counts_{var_qc_metrics}`, `pct_of_counts_in_top_{top_n_vars}_vars` and `total_counts` are included in .obs.
* `workflows/multiomics/rna_multisample` and `workflows/multiomics/full_pipeline`: add `qc/calculate_qc_metrics` component to workflow.
* `workflows/multiomics/prot_singlesample`: Processing unimodal single-sample CITE-seq data.
* `workflows/multiomics/rna_singlesample` and `workflows/multiomics/full_pipeline`: Add filtering arguments to pipeline.

## MINOR CHANGES

* `convert/from_bdrhap_to_h5mu`: bump R version to 4.2.
* `process_10xh5/filter_10xh5`: bump R version to 4.2.
* `dataflow/concat`: include path of file in error message when reading a mudata file fails.
* `mapping/cellranger_multi`: write cellranger console output to a `cellranger_multi.log` file.

## BUG FIXES

* `mapping/htseq_count_to_h5mu`: Fix a bug where reading in the gtf file caused `AttributeError`.
* `dataflow/concat`: the `--input_id` is no longer required when `--mode` is not `move`.
* `filter/filter_with_hvg`: no longer tries to use `--varm_name` to set non-existent metadata when running with `--flavor seurat_v3`, which was causing `KeyError`.
* `filter/filter_with_hvg`: Enforce that `n_top_genes` is set when `flavor` is set to 'seurat_v3'.
* `filter/filter_with_hvg`: Improve error message when trying to use 'cell_ranger' as `flavor` and passing unfiltered data.
* `mapping/cellranger_multi` now applies `gex_chemistry`, `gex_secondary_analysis`, `gex_generate_bam`, `gex_include_introns` and `gex_expect_cells`.

# openpipeline 0.6.1

## NEW FUNCTIONALITY

* `mapping/multi_star`: A parallelized version of running STAR (and HTSeq).
* `mapping/multi_star_to_h5mu`: Convert the output of `multi_star` to a h5mu file.

## BUG FIXES

* `filter/filter_with_counts`: Fix an issue where mitochondrial genes were being detected in .var_names, which contain Ensembl IDs instead of gene symbols in the pipelines. The solution was to create a `--var_gene_names` argument which allows selecting a .var column to check using a regex (`--mitochondrial_gene_regex`).
* `dataflow/concat`, `report/mermaid`, `transform/clr`: Don't forget to exit with code returned by pytest.

# openpipeline 0.6.0

## NEW FUNCTIONALITY

* `workflows/full_pipeline`: add `filter_with_hvg_var_output` argument.
* `dimred/pca`: Add `--overwrite` and `--var_input` arguments.
* `transform/clr`: Perform CLR normalization on CITE-seq data.
* `workflows/ingestion/cellranger_multi`: Run Cell Ranger multi and convert the output to .h5mu.
* `filter/remove_modality`: Remove a single modality from a MuData file.
* `mapping/star_align`: Align `.fastq` files using STAR.
* `mapping/star_align_v273a`: Align `.fastq` files using STAR v2.7.3a.
* `mapping/star_build_reference`: Create a STAR reference index.
* `mapping/cellranger_multi`: Align fastq files using Cell Ranger multi.
* `mapping/samtools_sort`: Sort and (optionally) index alignments.
* `mapping/htseq_count`: Quantify gene expression for subsequent testing for differential expression.
* `mapping/htseq_count_to_h5mu`: Convert one or more HTSeq outputs to a MuData file.
* Added `convert/from_cellranger_multi_to_h5mu` component.

## MAJOR CHANGES

* `convert/from_velocyto_to_h5mu`: Moved to `velocity/velocyto_to_h5mu`. It also now accepts an optional `--input_h5mu` argument, to allow directly reading the RNA velocity data into a `.h5mu` file containing the other modalities.
* `resources_test/cellranger_tiny_fastq`: Include RNA velocity computations as part of the script.
* `mapping/cellranger_mkfastq`: remove `--memory` and `--cpu` arguments, as resource management is automatically provided by Viash.

## MINOR CHANGES

* Several components: use `gzip` compression for writing .h5mu files.
* Default value for `obs_covariates` argument of full pipeline is now `sample_id`.
* Set the `tag` directive of all Nextflow components to '$id'.

## BUG FIXES

* Keep data for modalities that are not specifically enabled when running full pipeline.
* Fix many components thanks to Viash 0.6.4, which causes errors to be thrown when input and output files are defined but not found.

# openpipeline 0.5.1

## BREAKING CHANGES

* `reference/make_reference`: Input files changed from `type: string` to `type: file` to allow Nextflow to cache the input files fetched from URL.
* Several components (except `from_h5ad_to_h5mu`): the `--modality` arguments no longer accept multiple values.
* Remove outdated `resources_test_scripts`.
* `convert/from_h5mu_to_seurat`: Disabled because MuDataSeurat is currently broken, see [PMBio/MuDataSeurat#9](https://github.com/PMBio/MuDataSeurat/issues/9).
* `integrate/harmony`: Disabled because it is currently not functioning and the alternative, harmonypy, is used in the workflows.
* `dataflow/concat`: Renamed `--sample_names` to `--input_id` and moved the ability to add sample id and to join the sample ids with the observation names to `metadata/add_id`.
* Moved `dataflow/concat`, `dataflow/merge` and `dataflow/split_modalities` to a new namespace: `dataflow`.
* Moved `workflows/conversion/conversion` to `workflows/ingestion/conversion`.

## NEW FUNCTIONALITY

* `metadata/add_id`: Add an id to a column in .obs. Also allows joining the id to the .obs_names.
* `workflows/ingestion/make_reference`: A generic component to build a transcriptomics reference into one of many formats.
* `integrate/scvi`: Performs scvi integration.
* `integrate/add_metadata`: Add a csv containing metadata to the .obs or .var field of a mudata file.
* `DataflowHelper.nf`: Added `passthroughMap`. Usage:

  ```groovy
  include { passthroughMap as pmap } from "./DataflowHelper.nf"

  workflow {
    Channel.fromList([["id", [input: "foo"], "passthrough"]])
      | pmap{ id, data ->
        [id, data + [arg: 10]]
      }
  }
  ```

  Note that in the example above, using a regular `map` would result in an exception being thrown, that is, "Invalid method invocation `call` with arguments". An equivalent using a regular `map()` would be:

  ```groovy
  workflow {
    Channel.fromList([["id", [input: "foo"], "passthrough"]])
      | map{ tup ->
        def (id, data) = tup
        [id, data + [arg: 10]] + tup.drop(2)
      }
  }
  ```

* `correction/cellbender_remove_background`: Eliminating technical artifacts from high-throughput single-cell RNA sequencing data.
* `workflows/ingestion/cellranger_postprocessing`: Add post-processing of h5mu files created from Cell Ranger data.
* `annotate/popv`: Performs popular major vote cell typing on single cell sequence data.
## MAJOR CHANGES

* `workflows/utils/DataflowHelper.nf`: Added helper functions `setWorkflowArguments()` and `getWorkflowArguments()` to split the data field of a channel event into a hashmap. Example usage:

  ```groovy
  | setWorkflowArguments(
    pca: [ "input": "input", "obsm_output": "obsm_pca" ],
    integration: [ "obs_covariates": "obs_covariates", "obsm_input": "obsm_pca" ]
  )
  | getWorkflowArguments("pca")
  | pca
  | getWorkflowArguments("integration")
  | integration
  ```

* `mapping/cellranger_count`: Allow passing both directories as well as individual fastq.gz files as inputs.
* `convert/from_10xh5_to_h5mu`: Allow reading in QC metrics, use gene ids as `.obs_names` instead of gene symbols.
* `workflows/conversion`: Update pipeline to use the latest practices and to get it to a working state.

## MINOR CHANGES

* `dimred/umap`: Streamline UMAP parameters by adding `--obsm_output` parameter to allow choosing the output `.obsm` slot.
* `workflows/multiomics/integration`: Added arguments for tuning the various output slots of the integration pipeline, namely `--obsm_pca`, `--obsm_integrated`, `--uns_neighbors`, `--obsp_neighbor_distances`, `--obsp_neighbor_connectivities`, `--obs_cluster`, `--obsm_umap`.
* Switch to Viash 0.6.1.
* `filter/subset_h5mu`: Add `--modality` argument, export to VDSL3, add unit test.
* `dataflow/split_modalities`: Also output modality types in a separate csv.

## BUG FIXES

* `convert/from_bd_to_10x_molecular_barcode_tags`: Replaced UTF8 characters with ASCII. OpenJDK 17 or lower might throw the following exception when trying to read a UTF8 file: `java.nio.charset.MalformedInputException: Input length = 1`.
* `dataflow/concat`: Overriding sample name in .obs no longer raises `AttributeError`.
* `dataflow/concat`: Fix false positives when checking for conflicts in .obs and .var when using `--mode move`.

# openpipeline 0.5.0

Major redesign of the integration and multiomic workflows.
Current list of workflows:

* `ingestion/bd_rhapsody`: A generic pipeline for running BD Rhapsody WTA or Targeted mapping, with support for AbSeq, VDJ and/or SMK.
* `ingestion/cellranger_mapping`: A pipeline for running Cell Ranger mapping.
* `ingestion/demux`: A generic pipeline for running bcl2fastq, bcl-convert or Cell Ranger mkfastq.
* `multiomics/rna_singlesample`: Processing unimodal single-sample RNA transcriptomics data.
* `multiomics/rna_multisample`: Processing unimodal multi-sample RNA transcriptomics data.
* `multiomics/integration`: A pipeline for demultiplexing multimodal multi-sample RNA transcriptomics data.
* `multiomics/full_pipeline`: A pipeline to analyse multiple multiomics samples.

## BREAKING CHANGES

* Many components: Renamed `.var["gene_ids"]` and `.var["feature_types"]` to `.var["gene_id"]` and `.var["feature_type"]`.

## DEPRECATED

* `convert/from_10xh5_to_h5ad` and `convert/from_bdrhap_to_h5ad`: Removed h5ad based components.
* `mapping/bd_rhapsody_wta` and `workflows/ingestion/bd_rhapsody_wta`: Deprecated in favour of the more generic `mapping/bd_rhapsody` and `workflows/ingestion/bd_rhapsody` pipelines.
* `convert/from_csv_to_h5mu`: Disabled until it is needed again.
* `dataflow/concat`: Deprecated `"concat"` option for `--other_axis_mode`.

## NEW COMPONENTS

* `graph/bbknn`: Batch balanced KNN.
* `transform/scaling`: Scale data to unit variance and zero mean.
* `mapping/bd_rhapsody`: Added generic component for running the BD Rhapsody WTA or Targeted analysis, with support for AbSeq, VDJ and/or SMK.
* `integrate/harmony` and `integrate/harmonypy`: Run a Harmony integration analysis (R-based and Python-based, respectively).
* `integrate/scanorama`: Use Scanorama to integrate different experiments.
* `reference/make_reference`: Download a transcriptomics reference and preprocess it (adding ERCC spikeins and filtering with a regex).
* `reference/build_bdrhap_reference`: Compile a reference into a STAR index in the format expected by BD Rhapsody.

## NEW WORKFLOWS

* `workflows/ingestion/bd_rhapsody`: Added generic workflow for running the BD Rhapsody WTA or Targeted analysis, with support for AbSeq, VDJ and/or SMK.
* `workflows/multiomics/full_pipeline`: Implement pipeline for processing multiple multiomics samples.

## NEW FUNCTIONALITY

* `convert/from_bdrhap_to_h5mu`: Added support for being able to deal with WTA, Targeted, SMK, AbSeq and VDJ data.
* `dataflow/concat`: Added `"move"` option to `--other_axis_mode`, which allows merging `.obs` and `.var` by only keeping elements of the matrices which are the same in each of the samples, moving the conflicting values to `.varm` or `.obsm`.

## MAJOR CHANGES

* Multiple components: Update to anndata 0.8 with mudata 0.2.0. This means that the format of the `.h5mu` files has changed.
* `multiomics/rna_singlesample`: Move transformation counts into layers instead of overwriting `.X`.
* Updated to Viash 0.6.0.

## MINOR CHANGES

* `velocity/velocyto`: Allow configuring memory and parallelisation.
* `cluster/leiden`: Add `--obsp_connectivities` parameter to allow choosing the output slot.
* `workflows/multiomics/rna_singlesample`, `workflows/multiomics/rna_multisample` and `workflows/multiomics/integration`: Allow choosing the output paths.
* `neighbors/bbknn` and `neighbors/find_neighbors`: Add parameters for choosing the input/output slots.
* `dimred/pca` and `dimred/umap`: Add parameters for choosing the input/output slots.
* `dataflow/concat`: Optimize concat performance by adding multiprocessing and refactoring functions.
* `workflows/multimodal_integration`: Add `obs_covariates` argument to pipeline.

## BUG FIXES

* Several components: Revert using slim versions of containers because they do not provide the tools to run nextflow with trace capabilities.
* `dataflow/concat`: Fix an issue where joining boolean values caused `TypeError`.
* `workflows/multiomics/rna_multisample`, `workflows/multiomics/rna_singlesample` and `workflows/multiomics/integration`: Use nextflow trace reporting when running integration tests.

# openpipeline 0.4.1

## BUG FIXES

* `workflows/ingestion/bd_rhapsody_wta`: use ':' as a separator for multiple input files and fix integration tests.

## MINOR CHANGES

* Several components: pin mudata and scanpy dependencies so that anndata version <0.8.0 is used.

# openpipeline 0.4.0

## NEW FUNCTIONALITY

* `convert/from_bdrhap_to_h5mu`: Merge one or more BD Rhapsody outputs into an h5mu file.
* `dataflow/split_modalities`: Split the modalities from a single .h5mu multimodal sample into separate .h5mu files.
* `dataflow/concat`: Combine data from multiple samples together.

## MINOR CHANGES

* `mapping/bd_rhapsody_wta`: Update to BD Rhapsody 1.10.1.
* `mapping/bd_rhapsody_wta`: Add parameters for overriding the minimum RAM & cores. Add `--dryrun` parameter.
* Switch to Viash 0.5.14.
* `convert/from_bdrhap_to_h5mu`: Update to BD Rhapsody 1.10.1.
* `resources_test/bdrhap_5kjrt`: Add subsampled BD Rhapsody datasets to test pipeline with.
* `resources_test/bdrhap_ref_gencodev40_chr1`: Add subsampled reference to test BD Rhapsody pipeline with.
* `dataflow/merge`: Merge several unimodal .h5mu files into one multimodal .h5mu file.
* Updated several python docker images to slim version.
* `mapping/cellranger_count_split`: update container from ubuntu focal to ubuntu jammy.
* `download/sync_test_resources`: update AWS cli tools from 2.7.11 to 2.7.12 by updating docker image.
* `download/download_file`: now uses bash container instead of python.
* `mapping/bd_rhapsody_wta`: Use squashed docker image in which log4j issues are resolved.

## BUG FIXES

* `workflows/utils/WorkflowHelper.nf`: Renamed `utils.nf` to `WorkflowHelper.nf`.
* `workflows/utils/WorkflowHelper.nf`: Fix error message when required parameter is not specified.
* `workflows/utils/WorkflowHelper.nf`: Added helper functions:
  - `readConfig`: Read a Viash config from a yaml file.
  - `viashChannel`: Create a channel from the Viash config and the params object.
  - `helpMessage`: Print a help message and exit.
* `mapping/bd_rhapsody_wta`: Update picard to 2.27.3.

## DEPRECATED

* `convert/from_bdrhap_to_h5ad`: Deprecated in favour of `convert/from_bdrhap_to_h5mu`.
* `convert/from_10xh5_to_h5ad`: Deprecated in favour of `convert/from_10xh5_to_h5mu`.

# openpipeline 0.3.1

## NEW FUNCTIONALITY

* `bin/port_from_czbiohub_utilities.sh`: Added helper script to import components and pipelines from `czbiohub/utilities`.

Imported components from `czbiohub/utilities`:

* `demux/cellranger_mkfastq`: Demultiplex raw sequencing data.
* `mapping/cellranger_count`: Align fastq files using Cell Ranger count.
* `mapping/cellranger_count_split`: Split 10x Cell Ranger output directory into separate output fields.

Imported workflows from `czbiohub/utilities`:

* `workflows/1_ingestion/cellranger`: Use Cell Ranger to preprocess 10x data.
* `workflows/1_ingestion/cellranger_demux`: Use cellranger demux to demultiplex sequencing BCL output to FASTQ.
* `workflows/1_ingestion/cellranger_mapping`: Use cellranger count to align 10x fastq files to a reference.

## MINOR CHANGES

* Fix `interactive/run_cirrocumulus` script raising `NotImplementedError` caused by using `MuData.var_names_make_unique()` on each modality instead of on the whole `MuData` object.
* Fix `transform/normalize_total` and `interactive/run_cirrocumulus` component builds missing an hdf5 dependency.
* `interactive/run_cellxgene`: Updated container to ubuntu:focal because it contains python3.6 but cellxgene dropped python3.6 support.
* `mapping/bd_rhapsody_wta`: Set `--parallel` to true by default.
* `mapping/bd_rhapsody_wta`: Translate Bash script into Python.
* `download/sync_test_resources`: Add `--dryrun`, `--quiet`, and `--delete` arguments.
* `convert/from_h5mu_to_seurat`: Use `eddelbuettel/r2u:22.04` docker container in order to speed up builds by downloading precompiled R packages.
* `mapping/cellranger_count`: Use 5 GB for testing (to adhere to GitHub CI runner memory constraints).
* `convert/from_bdrhap_to_h5ad`: change test data to output from `mapping/bd_rhapsody_wta` after reducing the BD Rhapsody test data size.
* Various `config.vsh.yaml`s: Renamed `values:` to `choices:`.
* `download/download_file` and `transfer/publish`: Switch base container from `bash:5.1` to `python:3.10`.
* `mapping/bd_rhapsody_wta`: Make sure procps is installed.

## BUG FIXES

* `mapping/bd_rhapsody_wta`: Use a smaller test dataset to reduce test time and make sure that the GitHub Actions runners do not run out of disk space.
* `download/sync_test_resources`: Disable the use of the Amazon EC2 instance metadata service to make the script work on GitHub Actions runners.
* `convert/from_h5mu_to_seurat`: Fix unit test requiring Seurat by using native R functions to test the Seurat object instead.
* `mapping/cellranger_count` and `bcl_demux/cellranger_mkfastq`: cellranger uses the `--parameter=value` formatting instead of `--parameter value` to set command line arguments.
* `mapping/cellranger_count`: `--nosecondary` is no longer always applied.
* `mapping/bd_rhapsody_wta`: Added workaround for bug in Viash 0.5.12 where triple single quotes are incorrectly escaped (viash-io/viash#139).

## DEPRECATED

* `bcl_demux/cellranger_mkfastq`: Duplicate of `demux/cellranger_mkfastq`.

# openpipeline 0.3.0

* Add `tx_processing` pipeline with the following components:
  - `filter_with_counts`
  - `filter_with_scrublet`
  - `filter_with_hvg`
  - `do_filter`
  - `normalize_total`
  - `regress_out`
  - `log1p`
  - `pca`
  - `find_neighbors`
  - `leiden`
  - `umap`

# openpipeline 0.2.0

## NEW FUNCTIONALITY

* Added `from_10x_to_h5ad` and `download_10x_dataset` components.
## MINOR CHANGES

* Workflow `bd_rhapsody_wta`: Minor change to workflow to allow for easy processing of multiple samples with a tsv.
* Component `bd_rhapsody_wta`: Added more parameters, `--parallel` and `--timestamps`.
* Added `pbmc_1k_protein_v3` as a test resource.
* Translate `bd_rhapsody_extracth5ad` from R into Python script.
* `bd_rhapsody_wta`: Remove temporary directory after execution.
* `files/make_params`: Implement unit tests (PR #505).

# openpipeline 0.1.0

* Initial release containing only a `bd_rhapsody_wta` pipeline and corresponding components.