Build branch biobox/main with version main to biobox on branch main (7158daa)

Build pipeline: viash-hub.biobox.main-tb4cv

Source commit: 7158daa5f6

Source message: Fix bases2fastq component, update to latest practices (#190)

* wip updates

* refactor component

* assume bases2fastq follows semver

* fix version command

* add entry to changelog

* move to minor changes
CI
2025-09-01 11:04:56 +00:00
parent 9cc17eaa6f
commit 04a5851ff8
859 changed files with 311497 additions and 6746 deletions


@@ -1,445 +1,145 @@
# Contributing Guidelines
# Contributing guidelines
We encourage contributions from the community! This guide will help you get started with creating new components for the biobox repository.
We encourage contributions from the community. To contribute:
**Quick overview:** Fork → Develop → Test → Submit PR
1. **Fork the Repository**: Start by forking this repository to your account.
2. **Develop Your Component**: Create your Viash component, ensuring it aligns with our best practices (detailed below).
3. **Submit a Pull Request**: After testing your component, submit a pull request for review.
## Quick Start
## Procedure of adding a component
### Step 1: Find a component to contribute
* Find a tool to contribute to this repo.
* Check whether it is already in the [Project board](https://github.com/orgs/viash-hub/projects/1).
* Check whether there is a corresponding [Snakemake wrapper](https://github.com/snakemake/snakemake-wrappers/blob/master/bio) or [nf-core module](https://github.com/nf-core/modules/tree/master/modules/nf-core) which we can use as inspiration.
* Create an issue to show that you are working on this component.
### Step 2: Add config template
Change all occurrences of `xxx` to the name of the component.
Create a file at `src/xxx/config.vsh.yaml` with contents:
### Essential Config Template
```yaml
name: xxx
description: xxx
name: your_tool
namespace: category
description: Brief description of what the tool does
keywords: [tag1, tag2]
links:
  homepage: yyy
  documentation: yyy
  issue_tracker: yyy
  repository: yyy
references:
  doi: 12345/12345678.yz
license: MIT/Apache-2.0/GPL-3.0/...
  homepage: https://tool-homepage.com
  documentation: https://tool-docs.com
  repository: https://github.com/user/repo
references:
  doi: 10.1000/journal.12345
license: MIT/Apache-2.0/GPL-3.0
requirements:
  commands: [your-tool, dependency-tool]
authors:
  - __merge__: /src/_authors/your_name.yaml
    roles: [author, maintainer]
argument_groups:
  - name: Inputs
    arguments: <...>
  - name: Outputs
    arguments: <...>
  - name: Arguments
    arguments: <...>
    arguments: [...]
  - name: Outputs
    arguments: [...]
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - <...>
  - type: docker
    image: quay.io/biocontainers/tool:version--build_string
    setup:
      - type: docker
        run:
          - tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
```
### Step 3: Fill in the metadata
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
```yaml
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
  homepage: https://arriba.readthedocs.io/en/latest/
  documentation: https://arriba.readthedocs.io/en/latest/
  repository: https://github.com/suhrig/arriba
  issue_tracker: https://github.com/suhrig/arriba/issues
references:
  doi: 10.1101/gr.257246.119
  bibtex: |
    @article{
      ... a bibtex entry in case the doi is not available ...
    }
license: MIT
```
### Step 4: Find a suitable container
Google `biocontainer <name of component>` and find the container that is most suitable. Typically the link will be `https://quay.io/repository/biocontainers/xxx?tab=tags`.
If no such container is found, you can create a custom container in the next step.
### Step 5: Create help file
To help develop the component, we store the `--help` output of the tool in a file at `src/xxx/help.txt`.
````bash
# quote the delimiter so the backticks below are written literally
cat <<'EOF' > src/xxx/help.txt
```sh
xxx --help
```
EOF
docker run quay.io/biocontainers/xxx:tag xxx --help >> src/xxx/help.txt
````
Notes:
* This help file has no functional purpose, but it is useful for the developer to see the help output of the tool.
* Some tools might not have a `--help` argument but instead have a `-h` argument. For example, for `arriba`, the help message is obtained by running `arriba -h`:
```bash
docker run quay.io/biocontainers/arriba:2.4.0--h0033a41_2 arriba -h
```
### Step 6: Create or fetch test data
To help develop the component, it is useful to have some test data available. In most cases, we can use the test data from the Snakemake wrappers.
To make sure we can reproduce the test data in the future, we store the command to fetch the test data in a file at `src/xxx/test_data/script.sh`.
### Essential Commands
```bash
cat <<EOF > src/xxx/test_data/script.sh
# Create component structure
mkdir -p src/namespace/tool_name
touch src/namespace/tool_name/{script.sh,test.sh,config.vsh.yaml}
# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
# Generate help file
docker run container tool --help > src/namespace/tool_name/help.txt
# copy test data
cp -r /tmp/snakemake-wrappers/bio/xxx/test/* src/xxx/test_data
EOF
# Test your component
viash test src/namespace/tool_name/config.vsh.yaml
# Build for testing
viash build src/namespace/tool_name/config.vsh.yaml --setup cachedbuild
```
The test data should be suitable for testing this component. Ensure that the test data is small enough: ideally <1KB, preferably <10KB, if need be <100KB.
### Key Best Practices
### Step 7: Add arguments for the input files
- **Follow modern standards**: Use current coding patterns and component structure
- **Ensure reproducibility**: Pin versions and document dependencies clearly
- **Generate test data**: Create self-contained tests that don't rely on external files
- **Write clean code**: Use consistent naming and clear, maintainable scripts
By looking at the help file, we add the input arguments to the config file. Here is an example of the input arguments of an existing component.
For detailed implementation guidelines, check out our development guides:
For instance, in the [arriba help file](src/arriba/help.txt), we see the following:
## Development Guides
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
-o fusions.tsv [-O fusions.discarded.tsv] \
[OPTIONS]
### 🔧 [Component Development Guide](docs/COMPONENT_DEVELOPMENT.md)
How to create components: config templates, metadata, arguments, containers, help files, and Docker setup.
-x FILE File in SAM/BAM/CRAM format with main alignments as generated by STAR
(Aligned.out.sam). Arriba extracts candidate reads from this file.
### 📝 [Script Development Guide](docs/SCRIPT_DEVELOPMENT.md)
Writing good scripts: array-based commands, error handling, conditional parameters, boolean flags, and parameter patterns.
Based on this information, we can add the following input arguments to the config file.
### ✅ [Testing Guide](docs/TESTING.md)
Testing your components: self-contained tests, generating test data, output validation, and testing multiple scenarios.
```yaml
argument_groups:
  - name: Inputs
    arguments:
      - name: --bam
        alternatives: -x
        type: file
        description: |
          File in SAM/BAM/CRAM format with main alignments as generated by STAR
          (`Aligned.out.sam`). Arriba extracts candidate reads from this file.
        required: true
        example: Aligned.out.bam
```
### 🐳 [Docker Guide](docs/DOCKER_GUIDE.md)
Working with containers: choosing biocontainers, version pinning, detecting software versions, and container best practices.
Check the [documentation](https://viash.io/reference/config/functionality/arguments) for more information on the format of input arguments.
## Contribution Process
Several notes:
### Submitting Your Component
* Argument names should be formatted in `--snake_case`. This means arguments like `--foo-bar` should be formatted as `--foo_bar`, and short arguments like `-f` should receive a longer name like `--foo`.
1. **Test thoroughly**: Ensure your component passes all tests
```bash
viash test src/namespace/tool_name/config.vsh.yaml
```
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
2. **Add changelog entry**: Document your changes in `CHANGELOG.md` under the "Unreleased" section
* The description should be formatted in markdown.
3. **Review your changes**: Check your code for:
- Consistent naming and coding conventions
- Clear, maintainable code structure
- Proper error handling
- Robust edge case management
- Complete documentation and helpful comments
### Step 8: Add arguments for the output files
4. **Create a pull request**: Submit your changes.
- Include a clear description of the changes you've made
- Link to any relevant issues or discussions
- Review the changes critically before submitting the PR
By looking at the help file, we now also add output arguments to the config file.
### Review Process
For example, in the [arriba help file](src/arriba/help.txt), we see the following:
- All contributions go through code review
- Components must pass automated tests
- Docker containers must be properly versioned
- Documentation must be complete and accurate
## Getting Help
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
-o fusions.tsv [-O fusions.discarded.tsv] \
[OPTIONS]
### Resources
-o FILE Output file with fusions that have passed all filters.
- **[Viash Documentation](https://viash.io/)**
- **[GitHub Discussions](https://github.com/viash-io/biobox/discussions)**
- **[Issue Tracker](https://github.com/viash-io/biobox/issues)**
-O FILE Output file with fusions that were discarded due to filtering.
### Common Questions
Based on this information, we can add the following output arguments to the config file.
**Q: How do I find the right Docker container?**
A: Search for "biocontainer [tool_name]" or check [quay.io/biocontainers](https://quay.io/organization/biocontainers)
```yaml
argument_groups:
  - name: Outputs
    arguments:
      - name: --fusions
        alternatives: -o
        type: file
        direction: output
        description: |
          Output file with fusions that have passed all filters.
        required: true
        example: fusions.tsv
      - name: --fusions_discarded
        alternatives: -O
        type: file
        direction: output
        description: |
          Output file with fusions that were discarded due to filtering.
        required: false
        example: fusions.discarded.tsv
```
**Q: My component fails to build. What should I check?**
A: Verify the Docker image exists, check syntax in config.vsh.yaml, and ensure all required commands are available
Note:
**Q: How do I handle tools with complex argument patterns?**
A: Check existing similar components for patterns, or ask in GitHub Discussions
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
**Q: Can I create custom Docker containers?**
A: Yes, but biocontainers are preferred when available. See the [Docker Guide](docs/DOCKER_GUIDE.md) for details.
### Step 9: Add the remaining arguments
---
Finally, add all other arguments to the config file. There are a few exceptions:
* Arguments related to specifying CPU and memory requirements are handled separately and should not be added to the config file.
* Arguments that only print information, such as the version (`-v`, `--version`) or the help (`-h`, `--help`), should not be added to the config file.
* If the help lists defaults, do not set them as default values in the config; mention them in the description instead. Example: `description: <Explanation of parameter>. Default: 10.`
Note:
* Prefer using `boolean_true` over `boolean_false`. This avoids confusion when specifying values for this argument in a Nextflow workflow.
For example, consider the CLI option `--no-indels` for `cutadapt`. If the config for `cutadapt` specified an argument `no_indels` of type `boolean_false`,
the script of the component would have to pass a `--no-indels` argument to `cutadapt` when `par_no_indels` is set to `false`. This becomes problematic when setting a value for this argument using `fromState` in a Nextflow workflow: with `fromState: ["no_indels": true]`, the value that gets passed to the script is `true` and the `--no-indels` flag would *not* be added to the options for `cutadapt`. This is inconsistent with what one might expect when interpreting `["no_indels": true]`.
When using `boolean_true`, the reasoning becomes simpler because its value no longer represents the effect of the argument, but whether or not the flag is set.
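The difference can be sketched in Bash as follows. This is only an illustration: the helper functions `build_args_boolean_true` and `build_args_boolean_false` are hypothetical and are not part of Viash or of any component. They mimic how a script would translate the value of `par_no_indels` into options for the tool.

```bash
#!/bin/bash

# Hypothetical helper: argument declared as `boolean_true`.
# The flag is passed exactly when the value is "true".
build_args_boolean_true() {
  local par_no_indels="$1"
  local args=()
  [[ "$par_no_indels" == "true" ]] && args+=(--no-indels)
  printf '%s\n' "${args[*]}"
}

# Hypothetical helper: argument declared as `boolean_false`.
# The flag is passed when the value is "false", which is the confusing case.
build_args_boolean_false() {
  local par_no_indels="$1"
  local args=()
  [[ "$par_no_indels" == "false" ]] && args+=(--no-indels)
  printf '%s\n' "${args[*]}"
}

echo "boolean_true,  value true:  '$(build_args_boolean_true true)'"
echo "boolean_false, value true:  '$(build_args_boolean_false true)'"
```

With `boolean_true`, a value of `true` adds the flag, which matches the intuitive reading of `["no_indels": true]`.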
### Step 10: Add a Docker engine
To ensure reproducibility of components, we require that all components are run in a Docker container.
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:0.1.0--py_0
```
The container should have your tool installed, as well as `ps`.
If you didn't find a suitable container in the previous step, you can create a custom container. For example:
```yaml
engines:
  - type: docker
    image: python:3.10
    setup:
      - type: python
        packages: numpy
```
For more information on how to do this, see the [documentation](https://viash.io/guide/component/add-dependencies.html#steps-for-creating-a-custom-docker-platform).
Here is a list of base containers we can recommend:
* Bash: [`bash`](https://hub.docker.com/_/bash), [`ubuntu`](https://hub.docker.com/_/ubuntu)
* C#: [`ghcr.io/data-intuitive/dotnet-script`](https://github.com/data-intuitive/ghcr-dotnet-script/pkgs/container/dotnet-script)
* JavaScript: [`node`](https://hub.docker.com/_/node)
* Python: [`python`](https://hub.docker.com/_/python), [`nvcr.io/nvidia/pytorch`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
* R: [`eddelbuettel/r2u`](https://hub.docker.com/r/eddelbuettel/r2u), [`rocker/tidyverse`](https://hub.docker.com/r/rocker/tidyverse)
* Scala: [`sbtscala/scala-sbt`](https://hub.docker.com/r/sbtscala/scala-sbt)
### Step 11: Write a runner script
Next, create a Bash script at `src/xxx/script.sh` that runs the tool with the arguments defined in the config.
```bash
#!/bin/bash
## VIASH START
## VIASH END
# unset flags
[[ "$par_option" == "false" ]] && unset par_option
xxx \
--input "$par_input" \
--output "$par_output" \
${par_option:+--option}
```
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
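Because everything between the two markers is replaced at build time, the block can hold placeholder values while you develop the script. The following is a minimal sketch under that assumption: the file names are made up, `xxx` is the same placeholder tool name used above, and the command is printed rather than executed so that the sketch runs without the tool installed.

```bash
#!/bin/bash

## VIASH START
# Placeholder values for running the script directly during development.
# According to the description above, Viash replaces this block at build time.
par_input="input.txt"
par_output="output.txt"
par_option="false"
## VIASH END

# unset flags
[[ "$par_option" == "false" ]] && unset par_option

# Collect the command in an array so it can be inspected before running it.
cmd=(
  xxx
  --input "$par_input"
  --output "$par_output"
  ${par_option:+--option}
)

# Print the command instead of executing it (xxx is only a placeholder).
printf '%s\n' "${cmd[*]}"
```

Setting `par_option="true"` in the placeholder block adds `--option` to the printed command.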
As an example, this is what the Bash script for the `arriba` component looks like:
```bash
#!/bin/bash
## VIASH START
## VIASH END
# unset flags
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
arriba \
-x "$par_bam" \
-a "$par_genome" \
-g "$par_gene_annotation" \
-o "$par_fusions" \
${par_known_fusions:+-k "${par_known_fusions}"} \
${par_blacklist:+-b "${par_blacklist}"} \
# ...
${par_extra_information:+-X} \
${par_fill_gaps:+-I}
```
Notes:
* If your arguments can contain special characters (e.g. `$`), you can use the `@Q` operator of Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html) to quote the value so that it can be reused as input. Example: `-x ${par_bam@Q}`.
* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`
* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
* If you have a lot of boolean variables that you would like to unset when the value is `false`, you can avoid duplicate code by using the following syntax:
```bash
unset_if_false=(
  par_argument_1
  par_argument_2
  par_argument_3
  par_argument_4
)
# quote the expansions so that names are never word-split or globbed
for par in "${unset_if_false[@]}"; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset "$par"
done
```
This code is equivalent to:
```bash
[[ "$par_argument_1" == "false" ]] && unset par_argument_1
[[ "$par_argument_2" == "false" ]] && unset par_argument_2
[[ "$par_argument_3" == "false" ]] && unset par_argument_3
[[ "$par_argument_4" == "false" ]] && unset par_argument_4
```
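The separator substitution mentioned in the notes above can be sketched as follows. The filter names are hypothetical example values, and the second option (splitting into a Bash array) is an alternative that is not described in this guide; it may be handy for tools that expect one value per option.

```bash
#!/bin/bash

# Hypothetical value, as Viash would pass a `multiple: true` argument
# using its default `;` separator.
par_disable_filters="homologs;duplicates;mismatches"

# Option 1: swap the separator for tools that expect a comma-separated list.
disable_filters_csv=$(echo "$par_disable_filters" | tr ';' ',')

# Option 2: split into a Bash array for tools that expect repeated values.
IFS=';' read -r -a disable_filters <<< "$par_disable_filters"

echo "comma-separated: $disable_filters_csv"
echo "number of values: ${#disable_filters[@]}"
```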
### Step 12: Create test script
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
```yaml
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
```
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
```bash
#!/bin/bash
set -e
## VIASH START
## VIASH END
#############################################
# helper functions
assert_file_exists() {
  [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
  [ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
  [ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
  [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
  # negate grep so the function returns 0 when the pattern is absent;
  # otherwise `set -e` would abort the test on a passing check
  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
  grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
  # same negation as in assert_file_not_contains
  ! grep -q -E "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
echo "> Run $meta_name with test data"
"$meta_executable" \
  --input "$meta_resources_dir/test_data/reads_R1.fastq" \
  --output "output.txt" \
  --option
echo ">> Check if output exists"
assert_file_exists "output.txt"
echo ">> Check if output is not empty"
assert_file_not_empty "output.txt"
echo ">> Check if output is correct"
assert_file_contains "output.txt" "some expected output"
echo "> All tests succeeded!"
```
Notes:
* Always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.
* If possible, generate your own test data instead of copying it from an external resource.
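
Both notes can be combined in a short sketch. It assumes a made-up four-line FASTQ record as test data and a made-up log line with a timestamp standing in for non-deterministic tool output; neither comes from an existing component.

```bash
#!/bin/bash
set -e

# Work in a scratch directory that is removed when the script exits.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Generate a tiny FASTQ file instead of copying one from an external resource.
cat > "$tmpdir/reads_R1.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
EOF

# Stand-in for a tool log whose timestamp differs on every run.
echo "Finished processing 1 reads at $(date +%H:%M:%S)" > "$tmpdir/log.txt"

# A regular expression checks the stable parts and tolerates the timestamp.
grep -q -E "Finished processing [0-9]+ reads at [0-9]{2}:[0-9]{2}:[0-9]{2}" "$tmpdir/log.txt" \
  || { echo "Log does not match the expected pattern"; exit 1; }

echo "> Sketch succeeded"
```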
### Step 13: Create a `/var/software_versions.txt` file
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
For now, this is managed by creating a file `/var/software_versions.txt` in the `setup` section of the Docker engine.
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:0.1.0--py_0
    setup:
      - type: docker
        # note: /var/software_versions.txt should contain:
        # arriba: "2.4.0"
        run: |
          echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```
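
Instead of hardcoding the version, it may be possible to derive it from the tool's own output. The sketch below assumes a tool that prints `xxx version 0.1.0` on its first line; the function `fake_version_output` is a hypothetical stand-in for `xxx --version`, and real tools use varying formats, so the extraction step needs to be adapted per tool. A temporary file is used in place of `/var/software_versions.txt` so that the sketch runs anywhere.

```bash
#!/bin/bash

# Hypothetical stand-in for `xxx --version`.
fake_version_output() {
  echo "xxx version 0.1.0"
}

# Take the last whitespace-separated field of the first line as the version.
version=$(fake_version_output 2>&1 | head -n 1 | awk '{print $NF}')

# In the Docker setup this would be /var/software_versions.txt.
versions_file=$(mktemp)
echo "xxx: \"$version\"" > "$versions_file"

cat "$versions_file"
```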
Happy contributing!