Build branch biobox/main with version main to biobox on branch main (7158daa)

Build pipeline: viash-hub.biobox.main-tb4cv

Source commit: 7158daa5f6

Source message: Fix bases2fastq component, update to latest practices (#190)

* wip updates

* refactor component

* assume bases2fastq follows semver

* fix version command

* add entry to changelog

* move to minor changes
CI
2025-09-01 11:04:56 +00:00
parent 9cc17eaa6f
commit 04a5851ff8
859 changed files with 311497 additions and 6746 deletions


@@ -1,3 +1,94 @@
# Unreleased
<!-- Add new changes here before release -->
## BREAKING CHANGES
* `fq_subsample` has been removed after its functionality was previously copied to `fq/fq_subsample`. Please use the latter instead. (PR #182).
## NEW FUNCTIONALITY
* `fq`:
- `fq/fq_filter`: Filter FASTQ files based on record names or sequence patterns (PR #182).
- `fq/fq_generate`: Generate a random FASTQ file pair for testing and simulation purposes (PR #182).
* `bwa`: Added BWA support for single-end and paired-end read alignment (PR #183).
- `bwa/bwa_index`: Create BWA index files for reference genome alignment.
- `bwa/bwa_mem`: BWA-MEM algorithm for sequence alignment supporting single-end and paired-end reads.
- `bwa/bwa_aln`: BWA aln algorithm for aligning short sequence reads to a reference genome.
- `bwa/bwa_samse`: BWA samse - generate single-end alignment in SAM format from BWA aln SAI files.
- `bwa/bwa_sampe`: BWA sampe - generate paired-end alignment in SAM format from BWA aln SAI files.
* `bowtie2`: Add support for Bowtie2 alignment and indexing (PR #184).
- `bowtie2/bowtie2_build`: Build Bowtie2 index files from reference sequences.
- `bowtie2/bowtie2_align`: Align single-end and paired-end reads using Bowtie2.
- `bowtie2/bowtie2_inspect`: Extract information from Bowtie2 index files.
* `bedtools`: Major expansion with 32 new components providing comprehensive genomic interval analysis (PR #188):
- `bedtools/bedtools_annotate`: Annotate coverage based on overlaps with interval files
- `bedtools/bedtools_bedpetobam`: Convert BEDPE to BAM format
- `bedtools/bedtools_closest`: Find closest features between two interval files
- `bedtools/bedtools_cluster`: Cluster nearby intervals
- `bedtools/bedtools_complement`: Report intervals not covered by features
- `bedtools/bedtools_coverage`: Compute coverage of features
- `bedtools/bedtools_expand`: Expand blocked BED features
- `bedtools/bedtools_fisher`: Compute Fisher's exact test for overlaps
- `bedtools/bedtools_flank`: Create flanking intervals around features
- `bedtools/bedtools_igv`: Create IGV batch scripts for visualization
- `bedtools/bedtools_jaccard`: Compute Jaccard statistic between interval sets
- `bedtools/bedtools_makewindows`: Make windows across genome or intervals
- `bedtools/bedtools_map`: Map values from overlapping intervals
- `bedtools/bedtools_maskfasta`: Mask FASTA sequences using intervals
- `bedtools/bedtools_multicov`: Count coverage across multiple BAM files
- `bedtools/bedtools_multiinter`: Identify common intervals across multiple files
- `bedtools/bedtools_overlap`: Compute overlaps between paired-end reads and intervals
- `bedtools/bedtools_pairtobed`: Find overlaps between paired-end reads and intervals
- `bedtools/bedtools_pairtopair`: Find overlaps between paired-end read sets
- `bedtools/bedtools_random`: Generate random intervals
- `bedtools/bedtools_reldist`: Compute relative distances between features
- `bedtools/bedtools_sample`: Sample random subsets of intervals
- `bedtools/bedtools_shift`: Shift intervals by specified amounts
- `bedtools/bedtools_shuffle`: Shuffle intervals while preserving size
- `bedtools/bedtools_slop`: Extend intervals by specified amounts
- `bedtools/bedtools_spacing`: Report spacing between intervals
- `bedtools/bedtools_split`: Split BED12 features into individual intervals
- `bedtools/bedtools_subtract`: Remove overlapping features
- `bedtools/bedtools_summary`: Summarize interval statistics
- `bedtools/bedtools_tag`: Tag BAM alignments with overlapping intervals
- `bedtools/bedtools_unionbedg`: Combine multiple BEDGRAPH files
- `bedtools/bedtools_window`: Find overlapping features within specified windows
## MAJOR CHANGES
* `bedtools`: Enhanced 11 existing bedtools components with improved functionality and standardized interfaces (PR #188):
- `bedtools/bedtools_bamtobed`: Enhanced with additional output format options
- `bedtools/bedtools_bamtofastq`: Improved paired-end read handling
- `bedtools/bedtools_bed12tobed6`: Standardized parameter handling
- `bedtools/bedtools_bedtobam`: Enhanced genome file support
- `bedtools/bedtools_genomecov`: Added scale and split options
- `bedtools/bedtools_getfasta`: Improved FASTA extraction features
- `bedtools/bedtools_groupby`: Enhanced grouping and operation options
- `bedtools/bedtools_intersect`: Expanded intersection mode support
- `bedtools/bedtools_links`: Improved link generation functionality
- `bedtools/bedtools_merge`: Enhanced merging options and distance parameters
- `bedtools/bedtools_sort`: Standardized sorting options
## MINOR CHANGES
* `bases2fastq`: Updated component with comprehensive argument support and latest practices (PR #190).
## DOCUMENTATION
* Major restructuring of the documentation pages (PR #185):
- `CONTRIBUTING.md`: Streamlined guide with detailed sections moved to dedicated docs/ guides.
- `README.md`: Streamlined content to guide people towards what they need.
- `docs/COMPONENT_DEVELOPMENT.md`: New comprehensive guide covering component creation process.
- `docs/SCRIPT_DEVELOPMENT.md`: New detailed guide for script development best practices.
- `docs/TESTING.md`: New comprehensive testing guide.
- `docs/DOCKER_GUIDE.md`: New Docker and engine best practices guide.
* `.github/PULL_REQUEST_TEMPLATE.md`: Fixed repository references to point to correct biobox repository instead of base template (PR #185).
# biobox 0.3.2
## NEW FUNCTIONALITY


@@ -1,445 +1,145 @@
# Contributing Guidelines
# Contributing guidelines
We encourage contributions from the community! This guide will help you get started with creating new components for the biobox repository.
We encourage contributions from the community. To contribute:
**Quick overview:** Fork → Develop → Test → Submit PR
1. **Fork the Repository**: Start by forking this repository to your account.
2. **Develop Your Component**: Create your Viash component, ensuring it aligns with our best practices (detailed below).
3. **Submit a Pull Request**: After testing your component, submit a pull request for review.
## Quick Start
## Procedure of adding a component
### Step 1: Find a component to contribute
* Find a tool to contribute to this repo.
* Check whether it is already in the [Project board](https://github.com/orgs/viash-hub/projects/1).
* Check whether there is a corresponding [Snakemake wrapper](https://github.com/snakemake/snakemake-wrappers/blob/master/bio) or [nf-core module](https://github.com/nf-core/modules/tree/master/modules/nf-core) which we can use as inspiration.
* Create an issue to show that you are working on this component.
### Step 2: Add config template
Change all occurrences of `xxx` to the name of the component.
Create a file at `src/xxx/config.vsh.yaml` with contents:
### Essential Config Template
```yaml
name: xxx
description: xxx
name: your_tool
namespace: category
description: Brief description of what the tool does
keywords: [tag1, tag2]
links:
homepage: yyy
documentation: yyy
issue_tracker: yyy
repository: yyy
references:
doi: 12345/12345678.yz
license: MIT/Apache-2.0/GPL-3.0/...
homepage: https://tool-homepage.com
documentation: https://tool-docs.com
repository: https://github.com/user/repo
references:
doi: 10.1000/journal.12345
license: MIT/Apache-2.0/GPL-3.0
requirements:
commands: [your-tool, dependency-tool]
authors:
- __merge__: /src/_authors/your_name.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments: <...>
- name: Outputs
arguments: <...>
- name: Arguments
arguments: <...>
arguments: [...]
- name: Outputs
arguments: [...]
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- <...>
- type: docker
image: quay.io/biocontainers/tool:version--build_string
setup:
- type: docker
run:
- tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
```
### Step 3: Fill in the metadata
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
```yaml
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
```
### Step 4: Find a suitable container
Google `biocontainer <name of component>` and find the container that is most suitable. Typically the link will be `https://quay.io/repository/biocontainers/xxx?tab=tags`.
If no such container is found, you can create a custom container in the next step.
### Step 5: Create help file
To help develop the component, we store the `--help` output of the tool in a file at `src/xxx/help.txt`.
````bash
cat <<EOF > src/xxx/help.txt
```sh
xxx --help
```
EOF
docker run quay.io/biocontainers/xxx:tag xxx --help >> src/xxx/help.txt
````
Notes:
* This help file has no functional purpose, but it is useful for the developer to see the help output of the tool.
* Some tools might not have a `--help` argument but instead have a `-h` argument. For example, for `arriba`, the help message is obtained by running `arriba -h`:
```bash
docker run quay.io/biocontainers/arriba:2.4.0--h0033a41_2 arriba -h
```
### Step 6: Create or fetch test data
To help develop the component, it is useful to have some test data available. In most cases, we can use the test data from the Snakemake wrappers.
To make sure we can reproduce the test data in the future, we store the command to fetch the test data in a file at `src/xxx/test_data/script.sh`.
### Essential Commands
```bash
cat <<EOF > src/xxx/test_data/script.sh
# Create component structure
mkdir -p src/namespace/tool_name
touch src/namespace/tool_name/{script.sh,test.sh,config.vsh.yaml}
# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
# Generate help file
docker run container tool --help > src/namespace/tool_name/help.txt
# copy test data
cp -r /tmp/snakemake-wrappers/bio/xxx/test/* src/xxx/test_data
EOF
# Test your component
viash test src/namespace/tool_name/config.vsh.yaml
# Build for testing
viash build src/namespace/tool_name/config.vsh.yaml --setup cachedbuild
```
The test data should be suitable for testing this component. Ensure that the test data is small enough: ideally <1KB, preferably <10KB, if need be <100KB.
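The size guidance above can be checked with a small script. This is a minimal sketch, not part of the repository: the helper names `dir_size_bytes` and `size_verdict` are hypothetical, and it assumes 1KB means 1024 bytes.

```bash
#!/bin/bash
# Hypothetical helper: print the total size in bytes of all regular files below a directory.
dir_size_bytes() {
  find "$1" -type f -exec cat {} + | wc -c | tr -d ' '
}

# Hypothetical helper: classify a byte count against the thresholds mentioned above
# (assumption: 1KB = 1024 bytes).
size_verdict() {
  local bytes="$1"
  if   [ "$bytes" -lt 1024 ];   then echo "ideal"
  elif [ "$bytes" -lt 10240 ];  then echo "preferred"
  elif [ "$bytes" -lt 102400 ]; then echo "acceptable"
  else                               echo "too large"
  fi
}

# Demonstrate on a scratch directory holding one tiny BED file.
demo_dir="$(mktemp -d)"
printf 'chr1\t0\t100\n' > "$demo_dir/example.bed"
bytes="$(dir_size_bytes "$demo_dir")"
echo "$bytes bytes: $(size_verdict "$bytes")"
```

Pointing `dir_size_bytes` at a component's `test_data` directory would give a quick check before opening a pull request.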
### Key Best Practices
### Step 7: Add arguments for the input files
- **Follow modern standards**: Use current coding patterns and component structure
- **Ensure reproducibility**: Pin versions and document dependencies clearly
- **Generate test data**: Create self-contained tests that don't rely on external files
- **Write clean code**: Use consistent naming and clear, maintainable scripts
By looking at the help file, we add the input arguments to the config file. Here is an example of the input arguments of an existing component.
For detailed implementation guidelines, check out our development guides:
For instance, in the [arriba help file](src/arriba/help.txt), we see the following:
## Development Guides
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
-o fusions.tsv [-O fusions.discarded.tsv] \
[OPTIONS]
### 🔧 [Component Development Guide](docs/COMPONENT_DEVELOPMENT.md)
How to create components: config templates, metadata, arguments, containers, help files, and Docker setup.
-x FILE File in SAM/BAM/CRAM format with main alignments as generated by STAR
(Aligned.out.sam). Arriba extracts candidate reads from this file.
### 📝 [Script Development Guide](docs/SCRIPT_DEVELOPMENT.md)
Writing good scripts: array-based commands, error handling, conditional parameters, boolean flags, and parameter patterns.
Based on this information, we can add the following input arguments to the config file.
### ✅ [Testing Guide](docs/TESTING.md)
Testing your components: self-contained tests, generating test data, output validation, and testing multiple scenarios.
```yaml
argument_groups:
- name: Inputs
arguments:
- name: --bam
alternatives: -x
type: file
description: |
File in SAM/BAM/CRAM format with main alignments as generated by STAR
(`Aligned.out.sam`). Arriba extracts candidate reads from this file.
required: true
example: Aligned.out.bam
```
### 🐳 [Docker Guide](docs/DOCKER_GUIDE.md)
Working with containers: choosing biocontainers, version pinning, detecting software versions, and container best practices.
Check the [documentation](https://viash.io/reference/config/functionality/arguments) for more information on the format of input arguments.
## Contribution Process
Several notes:
### Submitting Your Component
* Argument names should be formatted in `--snake_case`. This means arguments like `--foo-bar` should be formatted as `--foo_bar`, and short arguments like `-f` should receive a longer name like `--foo`.
1. **Test thoroughly**: Ensure your component passes all tests
```bash
viash test src/namespace/tool_name/config.vsh.yaml
```
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
2. **Add changelog entry**: Document your changes in `CHANGELOG.md` under the "Unreleased" section
* The description should be formatted in markdown.
3. **Review your changes**: Check your code for:
- Consistent naming and coding conventions
- Clear, maintainable code structure
- Proper error handling
- Robust edge case management
- Complete documentation and helpful comments
### Step 8: Add arguments for the output files
4. **Create a pull request**: Submit your changes.
- Include a clear description of the changes you've made
- Link to any relevant issues or discussions
- Review the changes critically before submitting the PR
By looking at the help file, we now also add output arguments to the config file.
### Review Process
For example, in the [arriba help file](src/arriba/help.txt), we see the following:
- All contributions go through code review
- Components must pass automated tests
- Docker containers must be properly versioned
- Documentation must be complete and accurate
## Getting Help
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
-o fusions.tsv [-O fusions.discarded.tsv] \
[OPTIONS]
### Resources
-o FILE Output file with fusions that have passed all filters.
- **[Viash Documentation](https://viash.io/)**
- **[GitHub Discussions](https://github.com/viash-io/biobox/discussions)**
- **[Issue Tracker](https://github.com/viash-io/biobox/issues)**
-O FILE Output file with fusions that were discarded due to filtering.
### Common Questions
Based on this information, we can add the following output arguments to the config file.
**Q: How do I find the right Docker container?**
A: Search for "biocontainer [tool_name]" or check [quay.io/biocontainers](https://quay.io/organization/biocontainers)
```yaml
argument_groups:
- name: Outputs
arguments:
- name: --fusions
alternatives: -o
type: file
direction: output
description: |
Output file with fusions that have passed all filters.
required: true
example: fusions.tsv
- name: --fusions_discarded
alternatives: -O
type: file
direction: output
description: |
Output file with fusions that were discarded due to filtering.
required: false
example: fusions.discarded.tsv
```
**Q: My component fails to build. What should I check?**
A: Verify the Docker image exists, check syntax in config.vsh.yaml, and ensure all required commands are available
Note:
**Q: How do I handle tools with complex argument patterns?**
A: Check existing similar components for patterns, or ask in GitHub Discussions
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
**Q: Can I create custom Docker containers?**
A: Yes, but biocontainers are preferred when available. See the [Docker Guide](docs/DOCKER_GUIDE.md) for details.
### Step 9: Add arguments for the other arguments
---
Finally, add all other arguments to the config file. There are a few exceptions:
* Arguments related to specifying CPU and memory requirements are handled separately and should not be added to the config file.
* Arguments that only print information, such as the version (`-v`, `--version`) or the help (`-h`, `--help`), should not be added to the config file.
* If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`
Note:
* Prefer using `boolean_true` over `boolean_false`. This avoids confusion when specifying values for this argument in a Nextflow workflow.
For example, consider the CLI option `--no-indels` for `cutadapt`. If the config for `cutadapt` specified an argument `no_indels` of type `boolean_false`,
the script of the component would have to pass a `--no-indels` argument to `cutadapt` when `par_no_indels` is set to `false`. This becomes problematic when setting a value for this argument using `fromState` in a Nextflow workflow: with `fromState: ["no_indels": true]`, the value that gets passed to the script is `true` and the `--no-indels` flag would *not* be added to the options for `cutadapt`. This is inconsistent with what one might expect when interpreting `["no_indels": true]`.
When using `boolean_true`, the reasoning becomes simpler because its value no longer represents the effect of the argument, but whether or not the flag is set.
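The `boolean_true` reasoning can be sketched in Bash as follows. This is an illustration under assumptions rather than the actual `cutadapt` component script: `build_flags` is a hypothetical helper, and its argument stands in for the value Viash would place in `par_no_indels`.

```bash
#!/bin/bash
# Hypothetical helper: translate a boolean_true value into CLI flags.
# $1 stands in for the value of par_no_indels ("true", "false", or empty).
build_flags() {
  local par_no_indels="$1"
  # unset flags, as in the runner script pattern used elsewhere in this guide
  [[ "$par_no_indels" == "false" ]] && unset par_no_indels
  local args=()
  # With boolean_true, "true" simply means "the flag is set".
  args+=( ${par_no_indels:+--no-indels} )
  echo "${args[*]}"
}

echo "true  -> '$(build_flags true)'"
echo "false -> '$(build_flags false)'"
```

With this mapping, `["no_indels": true]` in a workflow reads the same way as the flag appearing on the command line.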
### Step 10: Add a Docker engine
To ensure reproducibility of components, we require that all components are run in a Docker container.
```yaml
engines:
- type: docker
image: quay.io/biocontainers/xxx:0.1.0--py_0
```
The container should have your tool installed, as well as `ps`.
If you didn't find a suitable container in the previous step, you can create a custom container. For example:
```yaml
engines:
- type: docker
image: python:3.10
setup:
- type: python
packages: numpy
```
For more information on how to do this, see the [documentation](https://viash.io/guide/component/add-dependencies.html#steps-for-creating-a-custom-docker-platform).
Here is a list of base containers we can recommend:
* Bash: [`bash`](https://hub.docker.com/_/bash), [`ubuntu`](https://hub.docker.com/_/ubuntu)
* C#: [`ghcr.io/data-intuitive/dotnet-script`](https://github.com/data-intuitive/ghcr-dotnet-script/pkgs/container/dotnet-script)
* JavaScript: [`node`](https://hub.docker.com/_/node)
* Python: [`python`](https://hub.docker.com/_/python), [`nvcr.io/nvidia/pytorch`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
* R: [`eddelbuettel/r2u`](https://hub.docker.com/r/eddelbuettel/r2u), [`rocker/tidyverse`](https://hub.docker.com/r/rocker/tidyverse)
* Scala: [`sbtscala/scala-sbt`](https://hub.docker.com/r/sbtscala/scala-sbt)
### Step 11: Write a runner script
Next, we need to write a runner script that runs the tool with the input arguments. Create a Bash script named `src/xxx/script.sh` which runs the tool with the input arguments.
```bash
#!/bin/bash
## VIASH START
## VIASH END
# unset flags
[[ "$par_option" == "false" ]] && unset par_option
xxx \
--input "$par_input" \
--output "$par_output" \
${par_option:+--option}
```
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
As an example, this is what the Bash script for the `arriba` component looks like:
```bash
#!/bin/bash
## VIASH START
## VIASH END
# unset flags
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
arriba \
-x "$par_bam" \
-a "$par_genome" \
-g "$par_gene_annotation" \
-o "$par_fusions" \
${par_known_fusions:+-k "${par_known_fusions}"} \
${par_blacklist:+-b "${par_blacklist}"} \
# ...
${par_extra_information:+-X} \
${par_fill_gaps:+-I}
```
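To see how the conditional `${par:+...}` expansions behave, the sketch below builds an argument list and prints it instead of calling `arriba`. The `par_*` values are made-up stand-ins for what Viash would inject, and collecting the arguments in an array named `cmd_args` is an illustrative choice, not the component's actual code.

```bash
#!/bin/bash
# Made-up stand-ins for values Viash would inject between VIASH START and VIASH END.
par_bam="Aligned out.bam"
par_known_fusions="known fusions.tsv"
par_blacklist=""   # optional argument that was not provided

# Collect the arguments in an array so values containing spaces stay intact.
cmd_args=(
  -x "$par_bam"
  ${par_known_fusions:+-k "$par_known_fusions"}
  ${par_blacklist:+-b "$par_blacklist"}
)

# Print one argument per line; an unset or empty optional value contributes nothing.
printf '[%s]\n' "${cmd_args[@]}"
```

Because `par_blacklist` is empty, neither `-b` nor an empty string ends up in the argument list.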
Notes:
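* If your arguments can contain special characters (e.g. `$`), you can use Bash's `@Q` parameter transformation (described under [Shell Parameter Expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)) to quote the value so that the string can be used as input. Example: `-x ${par_bam@Q}`.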
* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`
* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
* If you have a lot of boolean variables that you would like to unset when the value is `false`, you can avoid duplicate code by using the following syntax:
```bash
unset_if_false=(
par_argument_1
par_argument_2
par_argument_3
par_argument_4
)
for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
done
```
this code is equivalent to
```bash
[[ "$par_argument_1" == "false" ]] && unset par_argument_1
[[ "$par_argument_2" == "false" ]] && unset par_argument_2
[[ "$par_argument_3" == "false" ]] && unset par_argument_3
[[ "$par_argument_4" == "false" ]] && unset par_argument_4
```
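For the note on multiple-value separators, an alternative to `tr` is to split the `;`-separated value into a Bash array first. The sketch below assumes a made-up `par_input` value; the variable names `inputs` and `joined` are illustrative only.

```bash
#!/bin/bash
# Made-up stand-in for a Viash argument with `multiple: true` (values joined by ';').
par_input="a.bam;b.bam;c d.bam"

# Split on ';' into an array; file names containing spaces survive intact.
IFS=';' read -r -a inputs <<< "$par_input"

# Re-join with ',' for a tool that expects comma-separated values.
joined="$(IFS=','; echo "${inputs[*]}")"

echo "number of inputs: ${#inputs[@]}"
echo "comma-separated:  $joined"
```

The array form is also convenient when the tool expects each value as a separate argument, since `"${inputs[@]}"` can be passed directly.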
### Step 12: Create test script
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
```yaml
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
```
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
```bash
#!/bin/bash
set -e
## VIASH START
## VIASH END
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
! grep -q -E "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
echo "> Run $meta_name with test data"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads_R1.fastq" \
--output "output.txt" \
--option
echo ">> Check if output exists"
assert_file_exists "output.txt"
echo ">> Check if output is not empty"
assert_file_not_empty "output.txt"
echo ">> Check if output is correct"
assert_file_contains "output.txt" "some expected output"
echo "> All tests succeeded!"
```
Notes:
* Always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.
* If possible, generate your own test data instead of copying it from an external resource.
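Both notes can be combined in a test that writes its own input and checks the output with a regular expression. The following is a rough sketch: the FASTQ records are invented, and the record count written to `output.txt` is a stand-in for running the real component via `$meta_executable`.

```bash
#!/bin/bash
work_dir="$(mktemp -d)"

# Generate a tiny two-record FASTQ file instead of copying external data.
cat > "$work_dir/reads_R1.fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGGCCAATT
+
IIIIIIIIII
EOF

# Stand-in for the component run: count the records and write a small report.
n_records="$(grep -c '^@read' "$work_dir/reads_R1.fastq")"
echo "records: $n_records" > "$work_dir/output.txt"

# Check the output with a regular expression rather than an exact string.
if grep -q -E '^records: [0-9]+$' "$work_dir/output.txt"; then
  echo "output matches the expected pattern"
fi
```

Keeping the generated data this small also keeps the test well within the size limits suggested for test data.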
### Step 13: Create a `/var/software_versions.txt` file
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
For now, this is managed by creating a file `/var/software_versions.txt` in the `setup` section of the Docker engine.
```yaml
engines:
- type: docker
image: quay.io/biocontainers/xxx:0.1.0--py_0
setup:
- type: docker
# note: /var/software_versions.txt should contain:
# arriba: "2.4.0"
run: |
echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```
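Instead of hard-coding the version, the entry could be derived from the tool's own version output. The sketch below is an assumption-laden example: `version_output` is a made-up first line of `tool --version`, the `sed` pattern only fits output of that shape, and the result is written to a scratch file rather than `/var/software_versions.txt`.

```bash
#!/bin/bash
# Made-up example of the first line printed by `tool --version`.
version_output="tool version 1.2.3"

# Scratch location for the demonstration; the real file lives at /var/software_versions.txt.
versions_file="$(mktemp)"

# Keep only the token after "version " and wrap it in quotes, e.g. tool: "1.2.3".
echo "$version_output" | head -1 \
  | sed -E 's/.*version ([^ ]+).*/tool: "\1"/' > "$versions_file"

cat "$versions_file"
```

Each tool prints its version differently, so the pattern needs to be adapted and verified against the actual output inside the container.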
Happy contributing!

README.md

@@ -11,132 +11,114 @@ Issues](https://img.shields.io/github/issues/viash-hub/biobox.svg)](https://gith
[![Viash
version](https://img.shields.io/badge/Viash-v0.9.4-blue.svg)](https://viash.io)
A curated collection of high-quality, standalone bioinformatics
components built with [Viash](https://viash.io).
**A curated collection of high-quality, production-ready bioinformatics
components**
## Introduction
Built with [Viash](https://viash.io), biobox provides reliable,
containerized tools for genomics and bioinformatics workflows. Each
component is thoroughly tested, fully documented, and designed for
seamless integration into both standalone and Nextflow pipelines.
`biobox` offers a suite of reliable bioinformatics components, similar
to [nf-core/modules](https://github.com/nf-core/modules) and
[snakemake-wrappers/bio](https://github.com/snakemake/snakemake-wrappers/tree/master/bio),
but built using the [Viash](https://viash.io) framework.
## Why Choose biobox?
This approach emphasizes **reusability**, **reproducibility**, and
adherence to **best practices**. Key features of `biobox` components
include:
**Production Ready**: All components are containerized with pinned
versions and comprehensive testing
**Nextflow Native**: Drop-in compatibility with Nextflow workflows
**Complete Documentation**: Full parameter exposure with detailed
help and examples
**Quality Assured**: Unit tested with automated CI/CD validation
**Modern Standards**: Built with current best practices and
maintained dependencies
- **Standalone & Nextflow Ready:** Run components directly via the
command line or seamlessly integrate them into Nextflow workflows.
- **High Quality Standards:**
- Comprehensive documentation for components and parameters.
- Full exposure of underlying tool arguments.
- Containerized (Docker) for dependency management and
reproducibility.
- Unit tested for verified functionality.
## Featured Tools
## Example Usage
Our collection spans the complete bioinformatics pipeline:
Viash components in biobox can be run in various ways:
**Alignment & Mapping**: BWA, Bowtie2, STAR, Kallisto, Salmon
**Quality Control**: FastQC, Falco, MultiQC, Qualimap, NanoPlot
**Preprocessing**: Cutadapt, fastp, Trimgalore, UMI-tools
**Variant Calling**: BCFtools, LoFreq, SnpEff
**File Manipulation**: SAMtools, Bedtools, seqtk
**Assembly & Annotation**: BUSCO, AGAT, GFFread
**Single Cell**: CellRanger, BD Rhapsody
``` mermaid lang="mermaid"
flowchart TD
A[biobox v0.3.1] --> B(Viash Hub Launch)
A --> C(Viash CLI)
A --> D(Nextflow CLI)
A --> E(Seqera Cloud)
A --> F(As a dependency)
[View all components →](https://www.viash-hub.com/packages/biobox)
## Quick Start
You can run Viash components from biobox in several ways:
**🌐 Via Viash Hub Web UI**: Interactive interface with documentation
and examples
**⚡ As Standalone Executables**: Direct command-line execution
**🔄 Via Nextflow**: Local or cloud-based pipeline workflows
For detailed instructions on each method, visit the **[Viash Hub
documentation →](https://viash-hub.com/packages/biobox)** where each
component page shows exactly how to run it in different environments.
``` mermaid
flowchart LR
A[biobox Components] --> B[🌐 Web UI]
A --> C[⚡ Standalone]
A --> D[🔄 Nextflow Local]
A --> E[☁️ Nextflow Cloud]
style A fill:#7a4baa,color:#fff
style B fill:#e1f5fe,color:#000
style C fill:#e8f5e8,color:#000
style D fill:#fff3e0,color:#000
style E fill:#f3e5f5,color:#000
```
### 1. Via the Viash Hub Launch interface
You can run this component directly from the Viash Hub [Launch
interface](https://www.viash-hub.com/launch?package=biobox&version=v0.3.1&component=arriba&runner=Executable).
![](docs/viash-hub.png)
### 2. Via the Viash CLI
You can run this component directly from the command line using the
Viash CLI.
``` bash
viash run vsh://biobox@v0.3.1/arriba -- --help
viash run vsh://biobox@v0.3.1/arriba -- \
--bam path/to/input.bam \
--genome path/to/genome.fa \
--gene_annotation path/to/annotation.gtf \
--fusions path/to/output.txt
```
This will run the component with the specified input files and output
the results to the specified output file.
### 3. Via the Nextflow CLI or Seqera Cloud
You can run this component as a Nextflow pipeline.
``` bash
nextflow run https://packages.viash-hub.com/vsh/biobox \
-revision v0.3.1 \
-main-script target/nextflow/arriba/main.nf \
-latest -resume \
-profile docker \
--bam path/to/input.bam \
--genome path/to/genome.fa \
--gene_annotation path/to/annotation.gtf \
--publish_dir path/to/output
```
**Note:** Make sure that the [Nextflow
SCM](https://www.nextflow.io/docs/latest/git.html#git-configuration) is
set up properly. You can do this by adding the following lines to your
`~/.nextflow/scm` file:
``` groovy
providers.vsh.platform = 'gitlab'
providers.vsh.server = 'https://packages.viash-hub.com'
```
**Tip:** This will also work with Seqera Cloud or other
Nextflow-compatible platforms.
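The SCM settings from the note above can be added without duplicating them on repeated runs. This is a minimal sketch: `add_vsh_provider` is a hypothetical helper, and the demonstration writes to a scratch file instead of `~/.nextflow/scm`.

```bash
#!/bin/bash
# Hypothetical helper: append the Viash Hub provider settings to an SCM file, once.
add_vsh_provider() {
  local scm_file="$1"
  mkdir -p "$(dirname "$scm_file")"
  touch "$scm_file"
  # Skip if the provider is already configured.
  if grep -qF "providers.vsh.server" "$scm_file"; then
    return 0
  fi
  cat >> "$scm_file" <<'EOF'
providers.vsh.platform = 'gitlab'
providers.vsh.server = 'https://packages.viash-hub.com'
EOF
}

# Demonstrate on a scratch file; pass "$HOME/.nextflow/scm" to apply it for real.
demo_scm="$(mktemp -d)/scm"
add_vsh_provider "$demo_scm"
add_vsh_provider "$demo_scm"   # the second call changes nothing
cat "$demo_scm"
```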
### 4. As a dependency
In your Viash config file (`config.vsh.yaml`), you can add this
component as a dependency:
``` yaml
dependencies:
- name: arriba
repository: vsh://biobox@v0.3.1
```
**Tip:** See the [Viash
documentation](https://viash.io/guide/nextflow_vdsl3/create-a-pipeline.html#pipeline-as-a-component)
for more details on how to use Viash components as a dependency in your
own Nextflow workflows.
You can run components directly from Viash Hub's launch interface. See
[Viash Hub](https://www.viash-hub.com/packages/biobox) for more
information.
## Contributing
Contributions are welcome! We aim to build a comprehensive collection of
high-quality bioinformatics components. If you'd like to contribute,
please follow these general steps:
We welcome contributions! biobox thrives on community input to expand
our collection of high-quality bioinformatics components.
1. Find a component to contribute
2. Add config template
3. Fill in the metadata
4. Find a suitable container
5. Create help file
6. Create or fetch test data
7. Add arguments for the input files
8. Add arguments for the output files
9. Add arguments for the other arguments
10. Add a Docker engine
11. Write a runner script
12. Create test script
13. Create a `/var/software_versions.txt` file
### Quick Contribution Process
See the
[CONTRIBUTING](https://github.com/viash-hub/biobox/blob/main/CONTRIBUTING.md)
file for more details.
1. **Fork** the repository
2. **Create** your component following our guidelines
3. **Test** thoroughly with `viash test`
4. **Submit** a pull request
### What We're Looking For
- **Popular bioinformatics tools** missing from our collection
- **Improvements** to existing components
- **Bug fixes** and documentation enhancements
- **Best practice** implementations
### Getting Started
Check out our comprehensive guides:
- **[Contributing
Guidelines](https://github.com/viash-hub/biobox/blob/main/CONTRIBUTING.md)** -
Complete development guide
- **[Component Standards](docs/COMPONENT_DEVELOPMENT.md)** - Quality
requirements
- **[Testing Guide](docs/TESTING.md)** - Validation best practices
**New to Viash?** Start with our [beginner-friendly
issues](https://github.com/viash-hub/biobox/labels/good%20first%20issue)
or join our [community
discussions](https://github.com/viash-hub/biobox/discussions).
## Community & Support
- **Documentation**: [Viash Documentation](https://viash.io)
- **Discussions**: [GitHub
Discussions](https://github.com/viash-hub/biobox/discussions)
- **Issues**: [Bug Reports & Feature
Requests](https://github.com/viash-hub/biobox/issues)
------------------------------------------------------------------------
**Ready to streamline your bioinformatics workflows?** [Get started with
biobox today →](https://www.viash-hub.com/packages/biobox)


@@ -7,9 +7,15 @@ license <- paste0(package$links$repository, "/blob/main/LICENSE")
contributing <- paste0(package$links$repository, "/blob/main/CONTRIBUTING.md")
pkg <- package$name
ver <- if (!is.null(package$version)) package$version else "v0.3.1"
comp <- "arriba"
ver <- if (!is.null(package$version)) package$version else "v0.4.0"
comp <- "bowtie2_align"
# Count components
component_dirs <- list.dirs("src", recursive = FALSE, full.names = FALSE)
component_dirs <- component_dirs[!startsWith(component_dirs, "_")]
n_tools <- length(component_dirs)
```
# 🌱📦 `r pkg`
[![ViashHub](https://img.shields.io/badge/ViashHub-`r pkg`-7a4baa.svg)](https://www.viash-hub.com/packages/`r pkg`)
@@ -18,106 +24,98 @@ comp <- "arriba"
[![GitHub Issues](https://img.shields.io/github/issues/viash-hub/`r pkg`.svg)](`r package$links$issue_tracker`)
[![Viash version](https://img.shields.io/badge/Viash-v`r gsub("-", "--", package$viash_version)`-blue.svg)](https://viash.io)
`r package$summary`
**A curated collection of high-quality, production-ready bioinformatics components**
## Introduction
Built with [Viash](https://viash.io), `r pkg` provides reliable, containerized tools for genomics and bioinformatics workflows. Each component is thoroughly tested, fully documented, and designed for seamless integration into both standalone and Nextflow pipelines.
`r package$description`
## Why Choose `r pkg`?
## Example Usage
✅ **Production Ready**: All components are containerized with pinned versions and comprehensive testing
✅ **Nextflow Native**: Drop-in compatibility with Nextflow workflows
✅ **Complete Documentation**: Full parameter exposure with detailed help and examples
✅ **Quality Assured**: Unit tested with automated CI/CD validation
✅ **Modern Standards**: Built with current best practices and maintained dependencies
Viash components in `r pkg` can be run in various ways:
## Featured Tools
Our collection spans the complete bioinformatics pipeline:
**Alignment & Mapping**: BWA, Bowtie2, STAR, Kallisto, Salmon
**Quality Control**: FastQC, Falco, MultiQC, Qualimap, NanoPlot
**Preprocessing**: Cutadapt, fastp, Trimgalore, UMI-tools
**Variant Calling**: BCFtools, LoFreq, SnpEff
**File Manipulation**: SAMtools, Bedtools, seqtk
**Assembly & Annotation**: BUSCO, AGAT, GFFread
**Single Cell**: CellRanger, BD Rhapsody
[View all components →](https://www.viash-hub.com/packages/`r pkg`)
## Quick Start
You can run Viash components from `r pkg` in several ways:
**🌐 Via Viash Hub Web UI**: Interactive interface with documentation and examples
**⚡ As Standalone Executables**: Direct command-line execution
**🔄 Via Nextflow**: Local or cloud-based pipeline workflows
For detailed instructions on each method, visit the **[Viash Hub documentation →](https://viash-hub.com/packages/`r pkg`)** where each component page shows exactly how to run it in different environments.
```{r mmd, echo=FALSE, results='asis'}
cat(
"```mermaid\n",
"flowchart TD\n",
" A[", pkg, " ", ver, "] --> B(Viash Hub Launch)\n",
" A --> C(Viash CLI)\n",
" A --> D(Nextflow CLI)\n",
" A --> E(Seqera Cloud)\n",
" A --> F(As a dependency)\n",
"flowchart LR\n",
" A[", pkg, " Components] --> B[🌐 Web UI]\n",
" A --> C[⚡ Standalone]\n",
" A --> D[🔄 Nextflow Local]\n",
" A --> E[☁️ Nextflow Cloud]\n",
" \n",
" style A fill:#7a4baa,color:#fff\n",
" style B fill:#e1f5fe,color:#000\n",
" style C fill:#e8f5e8,color:#000\n",
" style D fill:#fff3e0,color:#000\n",
" style E fill:#f3e5f5,color:#000\n",
"```\n",
sep = ""
)
```
### 1. Via the Viash Hub Launch interface
You can run components directly from Viash Hub's launch interface. See [Viash Hub](https://www.viash-hub.com/packages/`r pkg`) for more information.
You can run this component directly from the Viash Hub [Launch interface](https://www.viash-hub.com/launch?package=`r pkg`&version=`r ver`&component=`r comp`&runner=Executable).
![](docs/viash-hub.png)
### 2. Via the Viash CLI
You can run this component directly from the command line using the Viash CLI.
```bash
viash run vsh://`r pkg`@`r ver`/`r comp` -- --help
viash run vsh://`r pkg`@`r ver`/`r comp` -- \
--bam path/to/input.bam \
--genome path/to/genome.fa \
--gene_annotation path/to/annotation.gtf \
--fusions path/to/output.txt
```
This will run the component with the specified input files and output the results to the specified output file.
### 3. Via the Nextflow CLI or Seqera Cloud
You can run this component as a Nextflow pipeline.
```bash
nextflow run https://packages.viash-hub.com/vsh/`r pkg` \
-revision `r ver` \
-main-script target/nextflow/`r comp`/main.nf \
-latest -resume \
-profile docker \
--bam path/to/input.bam \
--genome path/to/genome.fa \
--gene_annotation path/to/annotation.gtf \
--publish_dir path/to/output
```
**Note:** Make sure that the [Nextflow SCM](https://www.nextflow.io/docs/latest/git.html#git-configuration) is set up properly. You can do this by adding the following lines to your `~/.nextflow/scm` file:
```groovy
providers.vsh.platform = 'gitlab'
providers.vsh.server = 'https://packages.viash-hub.com'
```
**Tip:** This will also work with Seqera Cloud or other Nextflow-compatible platforms.
### 4. As a dependency
In your Viash config file (`config.vsh.yaml`), you can add this component as a dependency:
```yaml
dependencies:
  - name: `r comp`
    repository: vsh://`r pkg`@`r ver`
```
**Tip:** See the [Viash documentation](https://viash.io/guide/nextflow_vdsl3/create-a-pipeline.html#pipeline-as-a-component) for more details on how to use Viash components as a dependency in your own Nextflow workflows.
## Contributing
Contributions are welcome! We aim to build a comprehensive collection of high-quality bioinformatics components. If you'd like to contribute, please follow these general steps:
We welcome contributions! `r pkg` thrives on community input to expand our collection of high-quality bioinformatics components.
### Quick Contribution Process
```{r echo=FALSE}
lines <- readr::read_lines("CONTRIBUTING.md")
1. **Fork** the repository
2. **Create** your component following our guidelines
3. **Test** thoroughly with `viash test`
4. **Submit** a pull request
index_start <- grep("^### Step [0-9]*:", lines)
### What We're Looking For
index_end <- c(index_start[-1] - 1, length(lines))
- **Popular bioinformatics tools** missing from our collection
- **Improvements** to existing components
- **Bug fixes** and documentation enhancements
- **Best practice** implementations
name <- gsub("^### Step [0-9]*: *", "", lines[index_start])
### Getting Started
knitr::asis_output(
paste(paste0(" 1. ", name, "\n"), collapse = "")
)
```
Check out our comprehensive guides:
See the [CONTRIBUTING](`r contributing`) file for more details.
- **[Contributing Guidelines](`r contributing`)** - Complete development guide
- **[Component Standards](docs/COMPONENT_DEVELOPMENT.md)** - Quality requirements
- **[Testing Guide](docs/TESTING.md)** - Validation best practices
**New to Viash?** Start with our [beginner-friendly issues](https://github.com/viash-hub/biobox/labels/good%20first%20issue) or join our [community discussions](https://github.com/viash-hub/biobox/discussions).
## Community & Support
- **Documentation**: [Viash Documentation](https://viash.io)
- **Discussions**: [GitHub Discussions](https://github.com/viash-hub/biobox/discussions)
- **Issues**: [Bug Reports & Feature Requests](https://github.com/viash-hub/biobox/issues)
---
**Ready to streamline your bioinformatics workflows?** [Get started with `r pkg` today →](https://www.viash-hub.com/packages/`r pkg`)


@@ -17,34 +17,33 @@ keywords: [bioinformatics, modules, sequencing]
links:
issue_tracker: https://github.com/viash-hub/biobox/issues
repository: https://github.com/viash-hub/biobox
viash_version: 0.9.4
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
roles: [author, maintainer]
- __merge__: /src/_authors/angela_o_pisco.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/dries_schaumont.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/jakub_majercik.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/kai_waldrant.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/leila_paquay.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/toni_verbeiren.yaml
roles: [ author ]
roles: [author]
- __merge__: /src/_authors/weiwei_schultz.yaml
roles: [ author ]
roles: [author]
config_mods: |
.requirements.commands := ['ps']
version: main
organization: vsh


@@ -0,0 +1,268 @@
# Component Development Guide
This guide provides detailed step-by-step instructions for creating a new component in biobox.
## Table of Contents
- [Initial Setup](#initial-setup)
- [Configuration](#configuration)
- [Arguments](#arguments)
- [Implementation](#implementation)
- [Testing](#testing)
- [Documentation](#documentation)
## Initial Setup
### Step 1: Find a component to contribute
* Find a tool to contribute to this repo.
* Check whether it is already in the [Project board](https://github.com/orgs/viash-hub/projects/1).
* Check whether there is a corresponding [Snakemake wrapper](https://github.com/snakemake/snakemake-wrappers/blob/master/bio) or [nf-core module](https://github.com/nf-core/modules/tree/master/modules/nf-core) which we can use as inspiration.
* Create an issue to show that you are working on this component.
### Step 2: Find a suitable container
Google `biocontainer <name of component>` and find the container that is most suitable. Typically the link will be `https://quay.io/repository/biocontainers/xxx?tab=tags`.
If no such container is found, you can create a custom container in a later step.
### Step 3: Create help file
To help develop the component, we store the `--help` output of the tool in a file at `src/xxx/help.txt`.
```bash
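# Each backtick is escaped so the unquoted heredoc writes a literal Markdown fence
cat <<EOF > src/xxx/help.txt
\`\`\`sh
xxx --help
\`\`\`
EOF
# Append the tool's help output below the header
docker run quay.io/biocontainers/xxx:tag xxx --help >> src/xxx/help.txt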
```
**Notes:**
* This help file has no functional purpose, but it is useful for the developer to see the help output of the tool.
* Some tools might not have a `--help` argument but instead have a `-h` argument.
## Configuration
### Metadata Setup
Fill in the relevant metadata fields in the config:
```yaml
name: bowtie2_build
namespace: bowtie2
description: |
  Build Bowtie2 index files from reference sequences.
keywords: [Alignment, Indexing]
links:
  homepage: https://bowtie-bio.sourceforge.net/bowtie2/index.shtml
  documentation: https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
  repository: https://github.com/BenLangmead/bowtie2
references:
  doi: 10.1038/nmeth.1923
license: GPL-3.0
requirements:
  commands: [bowtie2-build]
authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author, maintainer]
```
### Requirements Specification
The `requirements` section documents the dependencies needed by your component:
```yaml
requirements:
  commands: [bowtie2-build, bowtie2]
```
**Why specify commands:**
- Documents which executables the component expects
- Enables validation that the Docker container has required tools
- Helps users understand dependencies
- Facilitates automated testing and CI/CD
## Arguments
### Input Arguments
By looking at the help file, add input arguments to the config file:
```yaml
argument_groups:
  - name: Inputs
    arguments:
      - name: --bam
        alternatives: -x
        type: file
        description: |
          File in SAM/BAM/CRAM format with main alignments as generated by STAR
          (`Aligned.out.sam`). Arriba extracts candidate reads from this file.
        required: true
        example: Aligned.out.bam
```
**Key principles:**
* Argument names should be formatted in `--snake_case`
* Input arguments can have `multiple: true` to allow multiple files
* **Descriptions must be formatted in markdown** - they will be used downstream for rendering documentation
* You can make minor changes to the formatting of arguments to improve clarity and better utilize markdown structure
* Use markdown features like code blocks, lists, emphasis, and links to enhance readability
### Output Arguments
Add output arguments based on the tool's help:
```yaml
argument_groups:
  - name: Outputs
    arguments:
      - name: --fusions
        alternatives: -o
        type: file
        direction: output
        description: |
          Output file with fusions that have passed all filters.
        required: true
        example: fusions.tsv
```
**Note:** Preferably, outputs should be files rather than directories.
### Other Arguments
Add all other arguments with these exceptions:
* Arguments related to CPU and memory requirements are handled separately
* Version (`-v`, `--version`) or help (`-h`, `--help`) arguments should be excluded
* If the help file lists defaults, add them to description rather than as defaults
**Boolean handling:**
* Prefer using `boolean_true` over `boolean_false` to avoid confusion in Nextflow workflows
### Description Formatting Guidelines
Argument descriptions should always be written in **markdown format** as they are used downstream for documentation rendering. Here are best practices:
**Good markdown formatting examples:**
```yaml
description: |
  Input FASTQ file containing reads. Supports compressed files (`.gz`, `.bz2`).

  **Supported formats:**
  - FASTQ (`.fastq`, `.fq`)
  - Compressed FASTQ (`.fastq.gz`, `.fq.gz`)

  See the [FASTQ format specification](https://en.wikipedia.org/wiki/FASTQ_format) for details.
```
```yaml
description: |
  Maximum number of mismatches allowed during alignment.

  **Default behavior:**
  - For reads ≤50bp: 2 mismatches
  - For reads >50bp: 3 mismatches

  Set to `0` for exact matches only.
```
**Formatting improvements you can make:**
- Add code formatting for file extensions, parameters, and values
- Use lists and bullet points for multiple options
- Add emphasis with **bold** or *italic* text
- Include links to external documentation
- Structure complex descriptions with headers
- Use code blocks for examples
**Original tool help vs. improved description:**
```
# Original: "Input file in BAM format"
# Improved:
description: |
  Input file in BAM format containing aligned sequences.

  The file must be coordinate-sorted and indexed. Use `samtools sort`
  and `samtools index` if needed.
```
## Meta Variables
**Important:** Never add `threads`, `cores`, `cpus`, or `memory` as regular parameters. Instead, use Viash's built-in meta variables.
### Available Meta Variables
Viash provides several meta variables that are automatically available in your scripts:
- **`meta_cpus`** (integer): Maximum number of logical CPUs the component can use
- **`meta_memory_*`** (long): Maximum memory allocation in various units:
  - `meta_memory_b`, `meta_memory_kb`, `meta_memory_mb`
  - `meta_memory_gb`, `meta_memory_tb`, `meta_memory_pb`
  - `meta_memory_kib`, `meta_memory_mib`, `meta_memory_gib`, `meta_memory_tib`, `meta_memory_pib`
- **`meta_temp_dir`** (string): Temporary directory for the component
- **`meta_resources_dir`** (string): Path to component resources
- **`meta_name`** (string): Component name (useful for logging)
- **`meta_executable`** (string): Path to the wrapped executable
- **`meta_config`** (string): Path to the processed config YAML
### Usage Example
```bash
# Use meta_cpus instead of a threads parameter (falls back to 1 when unset)
./tool --threads "${meta_cpus:-1}" --input "$par_input" --output "$par_output"
# Use meta_memory_gb for memory-intensive tools (falls back to 8 when unset)
./tool --memory "${meta_memory_gb:-8}G" --input "$par_input" --output "$par_output"
```
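As a rough illustration of how the `${meta_cpus:-1}` fallback behaves, the sketch below evaluates the expansion once with the variable unset and once with it set. The `threads_unset` and `threads_set` names are made up for this example, and the value `8` stands in for what Viash would normally inject.

```bash
# Simulate a run where no CPU limit was provided
unset meta_cpus
threads_unset="${meta_cpus:-1}"   # variable is unset, so the fallback 1 is used

# Simulate a run with a CPU limit (assumed value for illustration)
meta_cpus=8
threads_set="${meta_cpus:-1}"     # variable is set, so its value is used

echo "unset: $threads_unset, set: $threads_set"
```

The fallback only affects the value passed to the tool; it does not change how many CPUs the runner actually allocates.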
### Setting Meta Values
```bash
# When running with viash
viash run config.vsh.yaml --cpus 8 --memory 16GB -- --input file.txt
# When using built executables
./my_tool ---cpus 8 ---memory 16GB --input file.txt
```
For more details, see the [Viash Variables Documentation](https://viash.io/guide/component/variables.html).
## Implementation
See [Script Development Guide](SCRIPT_DEVELOPMENT.md) for detailed script writing guidelines.
## Testing
See [Testing Guide](TESTING.md) for comprehensive testing practices.
## Documentation
### Version Documentation
Add version detection to the Docker engine setup:
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:2.5.4--he96a11b_6
    setup:
      - type: docker
        run:
          # Quoted because the command contains ": ", which YAML would otherwise parse as a mapping
          - "xxx --version 2>&1 | head -1 | sed 's/.*version /xxx: /' > /var/software_versions.txt"
```
**Common version extraction patterns:**
```bash
# For tools that output "Tool version X.Y.Z"
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
# For tools that output just the version number
echo "tool: $(tool --version 2>&1 | head -1)" > /var/software_versions.txt
# For tools with complex version output
tool --version 2>&1 | grep -E "^[0-9]" | head -1 | sed 's/^/tool: /' > /var/software_versions.txt
```

310
docs/DOCKER_GUIDE.md Normal file

@@ -0,0 +1,310 @@
# Docker and Engine Best Practices
This guide covers best practices for setting up Docker engines and managing dependencies in biobox components.
## Table of Contents
- [Preferred Approach: Biocontainers](#preferred-approach-biocontainers)
- [Finding Biocontainers](#finding-biocontainers)
- [Version Detection](#version-detection)
- [Docker Run Syntax](#docker-run-syntax)
- [Custom Containers](#custom-containers)
- [Recommended Base Containers](#recommended-base-containers)
- [Multi-tool Containers](#multi-tool-containers)
- [Container Optimization](#container-optimization)
- [Testing Docker Setup](#testing-docker-setup)
## Preferred Approach: Biocontainers
### Basic Setup
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/bowtie2:2.5.4--he96a11b_6
    setup:
      - type: docker
        run:
          # Quoted because the command contains ": ", which YAML would otherwise parse as a mapping
          - "bowtie2 --version 2>&1 | head -1 | sed 's/.*version /bowtie2: /' > /var/software_versions.txt"
```
### Key Requirements
1. **Use specific versions**: Always pin to specific versions with build strings
2. **Include version detection**: Add setup commands to create `/var/software_versions.txt`
3. **Verify command availability**: Ensure the container has the required commands from `requirements.commands`
## Finding Biocontainers
### Search Strategy
1. **Google search**: `biocontainer <tool_name>`
2. **Direct URL**: `https://quay.io/repository/biocontainers/<tool_name>?tab=tags`
3. **Check version compatibility**: Choose the most recent stable version
4. **Verify build string**: Include the complete version tag with build string
### Version Selection
```yaml
# Good: Specific version with build string
image: quay.io/biocontainers/samtools:1.17--hd87286a_2
# Bad: Latest or incomplete version
image: quay.io/biocontainers/samtools:latest
image: quay.io/biocontainers/samtools:1.17
```
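One way to catch unpinned tags early is a small shell check. The sketch below is an assumption-laden illustration: `is_pinned_image` is a hypothetical helper (not part of Viash or biobox), and its regular expression only covers the `<version>--<build string>` tag shape shown above, for image names without a registry port.

```bash
# Hypothetical helper: succeeds only when the tag looks like <version>--<build string>
is_pinned_image() {
  local image="$1"
  local re='^[^:]+:[0-9][^:]*--[A-Za-z0-9_.]+$'
  [[ "$image" =~ $re ]]
}

for image in \
  "quay.io/biocontainers/samtools:1.17--hd87286a_2" \
  "quay.io/biocontainers/samtools:latest" \
  "quay.io/biocontainers/samtools:1.17"; do
  if is_pinned_image "$image"; then
    echo "pinned:     $image"
  else
    echo "not pinned: $image"
  fi
done
```

Only the first image passes the check; `latest` has no version number and `1.17` lacks the build string.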
## Version Detection
### Common Patterns
```bash
# Pattern 1: Tool outputs "Tool version X.Y.Z"
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
# Pattern 2: Tool outputs just version number
echo "tool: $(tool --version 2>&1 | head -1)" > /var/software_versions.txt
# Pattern 3: Complex version output, extract numeric part
tool --version 2>&1 | grep -E "^[0-9]" | head -1 | sed 's/^/tool: /' > /var/software_versions.txt
# Pattern 4: Version in specific format
tool --version 2>&1 | awk '{print "tool: " $NF}' > /var/software_versions.txt
```
### Real Examples
```bash
# bowtie2
bowtie2 --version 2>&1 | head -1 | sed 's/.*version /bowtie2: /' > /var/software_versions.txt
# samtools
samtools --version 2>&1 | head -1 | sed 's/samtools /samtools: /' > /var/software_versions.txt
# fastqc
fastqc --version 2>&1 | sed 's/FastQC v/fastqc: /' > /var/software_versions.txt
```
### Testing Version Detection
Always test your version detection command:
```bash
# Test in the container
docker run quay.io/biocontainers/tool:version bash -c "
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /'
"
```
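The `sed` expressions can also be checked without pulling a container by feeding them a sample line. In the sketch below the three `*_line` strings are assumed examples of what the first line of each tool's `--version` output looks like; the real output may differ between releases, so the container test above remains the authoritative check.

```bash
# Assumed sample output lines (for illustration only)
bowtie2_line="/usr/local/bin/bowtie2-align-s version 2.5.4"
samtools_line="samtools 1.17"
fastqc_line="FastQC v0.12.1"

# Apply the same extraction patterns used in the examples above
bowtie2_entry=$(printf '%s\n' "$bowtie2_line" | head -1 | sed 's/.*version /bowtie2: /')
samtools_entry=$(printf '%s\n' "$samtools_line" | head -1 | sed 's/samtools /samtools: /')
fastqc_entry=$(printf '%s\n' "$fastqc_line" | sed 's/FastQC v/fastqc: /')

printf '%s\n' "$bowtie2_entry" "$samtools_entry" "$fastqc_entry"
```

Each resulting line has the `name: version` shape expected in `/var/software_versions.txt`.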
## Docker Run Syntax
### List vs Multiline Strings
**Preferred: List format**
```yaml
run:
  # Single commands
  - command1 arg1 arg2
  - command2 arg1 arg2
  # Chained commands
  - command1 && command2 && command3
```
**Alternative: Multiline strings (for complex commands)**
```yaml
run: |
  command1 arg1 arg2 && \
  command2 arg1 arg2 && \
  command3 arg1 arg2
```
**Important:** Comments inside multiline strings (`run: |`) become Dockerfile `RUN` commands and will break the build. Use comments before the `run:` key or use the list format.
## Custom Containers
### When to Use Custom Containers
Use custom containers when:
- No suitable biocontainer exists
- You need to install additional dependencies
- You need a specific base environment (R, Python, etc.)
### Python-based Tools
```yaml
engines:
  - type: docker
    image: python:3.10-slim
    setup:
      - type: python
        packages:
          - numpy~=x.x.x
          - pandas~=x.x.x
          - scipy~=x.x.x
```
### R-based Tools
```yaml
engines:
  - type: docker
    image: rocker/r2u:24.04
    setup:
      - type: r
        cran: [devtools, BiocManager]
        bioc: [Biostrings, GenomicRanges]
```
### Compilation from Source
```yaml
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: apt
        packages: [build-essential, cmake, git, wget]
      - type: docker
        run:
          - wget https://github.com/user/tool/archive/v1.0.tar.gz && tar -xzf v1.0.tar.gz
          - cd tool-1.0 && make && make install
          # Quoted because the command contains ": ", which YAML would otherwise parse as a mapping
          - 'echo "tool: 1.0" > /var/software_versions.txt'
```
## Recommended Base Containers
### General Purpose
- **Ubuntu**: `ubuntu:22.04` - Good for compilation and apt packages
- **Alpine**: `alpine:3.20` - Minimal size, apk packages (pin a release rather than `latest`)
- **Debian**: `debian:bookworm-slim` - Stable, well-supported
### Language-Specific
#### Python
```yaml
# Basic Python
image: python:3.10-slim
# With scientific packages
image: python:3.10
# GPU-enabled
image: nvcr.io/nvidia/pytorch:23.08-py3
```
#### R
```yaml
# Fast package installation
image: rocker/r2u:24.04
# Tidyverse included
image: rocker/tidyverse:4.3.0
# Bioconductor base
image: bioconductor/bioconductor_docker:RELEASE_3_17
```
#### Node.js
```yaml
# LTS version
image: node:18-slim
# Alpine variant
image: node:18-alpine
```
#### Other Languages
```yaml
# Java
image: openjdk:11-jre-slim
# Go
image: golang:1.20-alpine
# Rust
image: rust:1.70-slim
# Ruby
image: ruby:3.1-slim
```
## Multi-tool Containers
### Installing Multiple Tools
```yaml
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: apt
        packages: [wget, curl, build-essential]
      - type: docker
        run:
          # Install tool 1
          - wget https://tool1.com/download && install_tool1
          # Install tool 2
          - wget https://tool2.com/download && install_tool2
          # Create version file (quoted because the commands contain ": ")
          - 'echo "tool1: $(tool1 --version)" > /var/software_versions.txt'
          - 'echo "tool2: $(tool2 --version)" >> /var/software_versions.txt'
```
## Container Optimization
### Layer Efficiency
```yaml
# Good: Combine related commands
setup:
  - type: docker
    run: |
      apt-get update && \
      apt-get install -y wget curl && \
      wget https://tool.com/download && \
      install_tool && \
      apt-get clean && \
      rm -rf /var/lib/apt/lists/*
# Bad: Separate layers for each command
setup:
  - type: apt
    packages: [wget, curl]
  - type: docker
    run: wget https://tool.com/download
  - type: docker
    run: install_tool
  - type: docker
    run: apt-get clean
```
## Testing Docker Setup
### Viash Docker Debugging
```bash
# Inspect the generated Dockerfile
viash run config.vsh.yaml -- ---dockerfile
# Build with cached layers (faster)
viash run config.vsh.yaml -- ---setup cachedbuild ---verbose
# Build from scratch (clean build)
viash run config.vsh.yaml -- ---setup build ---verbose
# Enter interactive debugging session
viash run config.vsh.yaml -- ---debug
# Check installed tools (inside container)
which tool
tool --version
# Verify version file
cat /var/software_versions.txt
```
### Common Issues
1. **Command not found**: Tool not in PATH or not installed
2. **Version detection fails**: Command syntax varies between tools
3. **Permission issues**: Tools installed in wrong location
4. **Missing dependencies**: Tool requires additional libraries

434
docs/SCRIPT_DEVELOPMENT.md Normal file

@@ -0,0 +1,434 @@
# Script Development Guide
This guide covers best practices for writing runner scripts in biobox components.
## Table of Contents
- [Script Structure and Template](#script-structure-and-template)
- [Key Principles](#key-principles)
- [Real-World Example](#real-world-example)
- [Advanced Patterns](#advanced-patterns)
- [Common Pitfalls](#common-pitfalls)
- [Testing Your Script](#testing-your-script)
## Script Structure and Template
All Viash component scripts follow a standard structure with best practices for error handling and parameter management.
### Basic Template
```bash
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_option1" == "false" ]] && unset par_option1
[[ "$par_option2" == "false" ]] && unset par_option2
# Build command arguments array
cmd_args=(
  --input "$par_input"
  --output "$par_output"
  ${par_option1:+--option1}
  ${par_option2:+--option2}
  ${meta_cpus:+--threads "$meta_cpus"}
  ${meta_memory_gb:+--memory "${meta_memory_gb}G"}
)
# Execute command
xxx "${cmd_args[@]}"
```
### Understanding the Viash Code Block
The `## VIASH START` and `## VIASH END` comments mark a special placeholder block where Viash injects runtime parameters and metadata when the component is executed.
**At runtime**, Viash replaces this placeholder with:
- `par_*` variables containing argument values (e.g., `par_input`, `par_output`)
- `meta_*` variables containing runtime metadata (e.g., `meta_name`, `meta_cpus`, `meta_temp_dir`)
**For debugging**, you can put example code between these markers to test your script locally:
```bash
## VIASH START
par_input="test_input.txt"
par_output="test_output.txt"
par_verbose="true"
meta_cpus="4"
meta_memory_gb="8"
meta_temp_dir="/tmp"
## VIASH END
```
This allows you to run your script directly with `bash script.sh` during development.
## Code Style Guidelines
### Indentation
**Use 2-space indentation consistently throughout your scripts:**
```bash
# Correct - 2 spaces
unset_if_false=(
  par_verbose
  par_quiet
  par_force
)
for par in "${unset_if_false[@]}"; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset $par
done
cmd_args=(
  --input "$par_input"
  --output "$par_output"
  ${par_verbose:+--verbose}
)
```
```bash
# Incorrect - 4 spaces or tabs
unset_if_false=(
    par_verbose
    par_quiet
    par_force
)
for par in "${unset_if_false[@]}"; do
    test_val="${!par}"
    [[ "$test_val" == "false" ]] && unset $par
done
```
**Why 2 spaces:**
- Consistent with other biobox components
- Better readability in terminal and code editors
- Reduces line width for complex nested structures
- Standard practice in many shell script projects
## Key Principles
### 1. Error Handling
Always use `set -eo pipefail`:
- `set -e`: Exit immediately if a command exits with a non-zero status
- `set -o pipefail`: Exit if any command in a pipeline fails
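The effect of `pipefail` can be seen in a small experiment. The sketch below runs the same failing pipeline in two child shells and records the exit status of each; the `status_default` and `status_pipefail` names are chosen for this illustration only.

```bash
# Default behaviour: a pipeline reports the status of its last command only,
# so the failure of `false` is hidden by the succeeding `true`
status_default=0
bash -c 'false | true' || status_default=$?

# With pipefail: the failing first command makes the whole pipeline fail
status_pipefail=0
bash -c 'set -o pipefail; false | true' || status_pipefail=$?

echo "default: $status_default, pipefail: $status_pipefail"
```

Combined with `set -e`, the non-zero pipeline status is what stops a script when, for example, the tool on the left side of a `| gzip` fails.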
### 2. Array-Based Arguments
**Preferred approach:**
```bash
cmd_args=(
  --input "$par_input"
  --output "$par_output"
  ${par_option:+--option "$par_option"}
)
xxx "${cmd_args[@]}"
```
**Avoid repetitive appending:**
```bash
# Don't do this
cmd_args+=("--input")
cmd_args+=("$par_input")
cmd_args+=("--output")
cmd_args+=("$par_output")
```
### 3. Conditional Parameter Inclusion
Use Bash parameter expansion for optional parameters:
```bash
# Include parameter only if variable is set and not empty
${meta_cpus:+--threads "$meta_cpus"}
# Include flag only if boolean is true (after unsetting false values)
${par_verbose:+--verbose}
```
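A minimal sketch of how these expansions behave inside an argument array, using made-up values in place of what Viash would inject: a set variable contributes its flag and value as separate array elements, an unset variable contributes nothing, and a quoted value containing a space stays a single element.

```bash
# Hypothetical values for illustration; Viash normally injects these
par_input="reads 1.fastq"
meta_cpus=4
unset par_verbose   # as if the flag was "false" and has been unset

cmd_args=(
  --input "$par_input"
  ${meta_cpus:+--threads "$meta_cpus"}
  ${par_verbose:+--verbose}
)

# Print one array element per line to make the word boundaries visible
printf '<%s>\n' "${cmd_args[@]}"
echo "number of arguments: ${#cmd_args[@]}"
```

The array ends up with four elements: `--input`, the file name, `--threads` and `4`.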
### 4. Boolean Handling
Unset boolean parameters that are "false":
```bash
# Single parameter
[[ "$par_verbose" == "false" ]] && unset par_verbose
# For multiple parameters, you can use either approach:
# Option 1: Individual approach (recommended for 1-4 parameters)
[[ "$par_verbose" == "false" ]] && unset par_verbose
[[ "$par_quiet" == "false" ]] && unset par_quiet
[[ "$par_force" == "false" ]] && unset par_force
[[ "$par_recursive" == "false" ]] && unset par_recursive
# Option 2: Loop approach (recommended for 5+ parameters)
unset_if_false=(
  par_verbose
  par_quiet
  par_force
  par_recursive
  par_follow_symlinks
  par_ignore_case
  par_preserve_permissions
)
for par in "${unset_if_false[@]}"; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset $par
done
```
**When to use which approach:**
- **Individual approach**: Recommended for 1-4 boolean parameters, clearer and more direct
- **Loop approach**: Recommended for many parameters (5+), reduces code duplication
The individual approach is preferred for fewer parameters because:
- Each parameter is explicit and easy to find
- No variable indirection complexity (`${!par}`)
- Simple to add/remove individual parameters
- More readable at a glance
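The unset step matters because `${par_x:+...}` treats any non-empty string, including the literal `false`, as set. The sketch below shows the difference before and after unsetting, with hypothetical flag values; it uses an explicit `if` instead of `&&` purely for readability.

```bash
# Hypothetical flag values as Viash would pass them
par_quiet="false"
par_force="false"
par_verbose="true"

# Before unsetting, the string "false" still triggers the flag
before=( ${par_quiet:+--quiet} )

# Unset every boolean whose value is "false" (indirect expansion via ${!par})
for par in par_quiet par_force par_verbose; do
  if [[ "${!par}" == "false" ]]; then
    unset "$par"
  fi
done

# After unsetting, only the flags that are really enabled remain
flags=( ${par_verbose:+--verbose} ${par_quiet:+--quiet} ${par_force:+--force} )
echo "before: ${before[*]}"
echo "after:  ${flags[*]}"
```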
### 5. Meta Variables Usage
**Important:** Never use `par_threads`, `par_cores`, `par_cpus`, or `par_memory` parameters. Use Viash's built-in meta variables instead.
**Available meta variables:**
- `meta_cpus`: Number of CPU cores available
- `meta_memory_*`: Memory limits in various units (b, kb, mb, gb, tb, pb, kib, mib, gib, tib, pib)
- `meta_temp_dir`: Temporary directory for the component
- `meta_resources_dir`: Path to component resources
**Examples:**
```bash
# CPU cores: pass the flag only when meta_cpus is set
${meta_cpus:+--threads "$meta_cpus"}
# CPU cores with a fallback of 1 when meta_cpus is unset
--cores "${meta_cpus:-1}"
# Memory: pass the flag only when meta_memory_gb is set
${meta_memory_gb:+--memory "${meta_memory_gb}G"}
# Memory with a fallback of 1024 MB when meta_memory_mb is unset
--max-memory "${meta_memory_mb:-1024}M"
# Temporary directory with a fallback of /tmp
--tmp-dir "${meta_temp_dir:-/tmp}"
```
**Why use meta variables:**
- Integrates seamlessly with workflow systems like Nextflow
- Automatically managed by Viash runtime
- Consistent across all components
- Prevents parameter duplication and conflicts
For complete details, see [Viash Variables Documentation](https://viash.io/guide/component/variables.html).
### 6. Proper Quoting
Always quote variables that might contain spaces or special characters:
```bash
# Correct
--input "$par_input"
--output "$par_output"
# For special characters, use @Q expansion
--pattern "${par_pattern@Q}"
```
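For context on what `@Q` does: it expands to a shell-quoted form of the value, which is mainly useful when the string will be parsed again by a shell (for example in `bash -c` or `eval`). The sketch below assumes Bash 4.4 or newer, where `@Q` is available, and uses a made-up pattern value.

```bash
# Hypothetical value containing a single quote and a glob character
par_pattern="it's a*pattern"

# Shell-quoted form of the value
quoted="${par_pattern@Q}"
echo "quoted form: $quoted"

# Re-parsing the quoted form in a child shell recovers the original value
restored=$(bash -c "printf '%s' $quoted")
echo "restored:    $restored"
```

When a value is passed straight to a tool as an array element, plain double quoting (`"$par_pattern"`) already preserves it unchanged.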
### 7. Multiple Parameter Values
When using arguments with `multiple: true` in your Viash configuration, values are passed as semicolon-separated strings that need to be split into bash arrays.
#### In script.sh - Converting to Arrays
```bash
# Convert semicolon-separated values to bash array
IFS=';' read -ra files_array <<< "$par_files"
# Example: Use in command arguments
cmd_args=(
  -i "$par_input"
  -files "${files_array[@]}"
  -o "$par_output"
)
# Execute command
bedtools annotate "${cmd_args[@]}"
```
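The splitting step can be tried in isolation. In this sketch the `par_files` value is a made-up example of the semicolon-separated string; note that a path containing spaces stays intact as one array element.

```bash
# Hypothetical value in the semicolon-separated form described above
par_files="db1.bed;my data/db 2.bed;db3.bed"

# Split on ';' only; IFS is changed just for this single read command
IFS=';' read -ra files_array <<< "$par_files"

echo "number of files: ${#files_array[@]}"
for f in "${files_array[@]}"; do
  echo "file: $f"
done
```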
#### In test.sh - Passing Multiple Values
When testing components with `multiple: true` parameters, you can use either format:
```bash
# Method 1: Repeated flags (recommended for readability)
"$meta_executable" \
  --input "$meta_temp_dir/query.bed" \
  --files "$meta_temp_dir/db1.bed" \
  --files "$meta_temp_dir/db2.bed" \
  --output "$meta_temp_dir/result.bed"
# Method 2: Semicolon-separated values
"$meta_executable" \
  --input "$meta_temp_dir/query.bed" \
  --files "$meta_temp_dir/db1.bed;$meta_temp_dir/db2.bed" \
  --output "$meta_temp_dir/result.bed"
```
Both methods work identically - Viash automatically converts repeated flags to semicolon-separated strings internally.
#### Complete Example
```bash
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# Convert semicolon-separated files to array
IFS=';' read -ra files_array <<< "$par_files"
# Convert semicolon-separated names to array if provided
if [[ -n "${par_names}" ]]; then
IFS=';' read -ra names_array <<< "$par_names"
fi
# Build command arguments array
cmd_args=(
-i "$par_input"
${par_names:+-names "${names_array[@]}"}
-files "${files_array[@]}"
)
# Execute command
bedtools annotate "${cmd_args[@]}" > "$par_output"
```
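An alternative sketch, not taken from any biobox component: optional multi-value flags can also be appended with `+=` inside an `if` block, which keeps array expansion out of the `${var:+...}` form. The parameter values below are assumed examples, and the command is printed rather than executed.

```bash
#!/bin/bash
# Assumed example values; in a component these come from Viash.
par_input="query.bed"
par_files="db1.bed;db2.bed"
par_names="first;second"

IFS=';' read -ra files_array <<< "$par_files"

cmd_args=(-i "$par_input")

# Append the optional flag and its values only when par_names is non-empty
if [[ -n "${par_names:-}" ]]; then
  IFS=';' read -ra names_array <<< "$par_names"
  cmd_args+=(-names "${names_array[@]}")
fi

cmd_args+=(-files "${files_array[@]}")

# Show the final argument list instead of running the tool
printf '%s\n' "${cmd_args[@]}"
```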
## Real-World Example
Here's an example from the bowtie2_build component:
```bash
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_large_index" == "false" ]] && unset par_large_index
[[ "$par_noauto" == "false" ]] && unset par_noauto
[[ "$par_packed" == "false" ]] && unset par_packed
# Create output directory
mkdir -p "$par_output"
# Determine index basename
if [ -n "$par_index_name" ]; then
index_basename="$par_index_name"
else
index_basename=$(basename "$par_input" .fasta)
fi
# Build command arguments
cmd_args=(
${par_fasta:+-f}
${par_cmdline:+-c}
${par_large_index:+--large-index}
${par_noauto:+-a}
${par_packed:+-p}
${par_bmax:+--bmax "$par_bmax"}
${par_offrate:+-o "$par_offrate"}
"$par_input"
"$par_output/$index_basename"
)
# Execute bowtie2-build
bowtie2-build "${cmd_args[@]}"
```
## Advanced Patterns
### Multiple Input Handling
If your tool accepts multiple inputs with custom separators:
```bash
# Convert Viash's semicolon separator to comma
par_disable_filters="${par_disable_filters//;/,}"
cmd_args=(
--disable-filters "$par_disable_filters"
)
```
### Complex File Handling
```bash
# Ensure output directory exists
mkdir -p "$(dirname "$par_output")"
# Resolve relative paths to absolute paths
input_path=$(realpath "$par_input")
output_path=$(realpath "$par_output")
```
### Resource Management
```bash
# Use available resources
cmd_args=(
${meta_cpus:+--threads "$meta_cpus"}
${meta_memory_mb:+--memory "${meta_memory_mb}M"}
)
```
## Common Pitfalls
### 1. Unquoted Variables
```bash
# Wrong - can break with spaces
cmd_args=(--input $par_input)
# Correct
cmd_args=(--input "$par_input")
```
### 2. Improper Boolean Handling
```bash
# Wrong - will include false booleans
cmd_args=(${par_verbose:+--verbose})
# Correct - unset false values first
[[ "$par_verbose" == "false" ]] && unset par_verbose
cmd_args=(${par_verbose:+--verbose})
```
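The effect is easy to reproduce. This sketch assumes `par_verbose` arrives as the literal string `false` and counts the arguments produced before and after the `unset`.

```bash
#!/bin/bash
par_verbose="false"   # assumed value of a boolean argument that is switched off

before=(${par_verbose:+--verbose})     # "false" is a non-empty string, so the flag is emitted
[[ "$par_verbose" == "false" ]] && unset par_verbose
after=(${par_verbose:+--verbose})      # variable is now unset, so nothing is emitted

echo "before: ${#before[@]} argument(s), after: ${#after[@]} argument(s)"
```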
### 3. Array Expansion
```bash
# Wrong - expands only the first array element
tool $cmd_args
# Correct - expands array elements
tool "${cmd_args[@]}"
```
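A short sketch of the difference, using a hypothetical `count_args` helper and assumed argument values:

```bash
#!/bin/bash
# count_args is a hypothetical helper that prints how many arguments it received.
count_args() { echo "$#"; }

cmd_args=(--input "my sample.fastq" --verbose)

first_only=$(count_args $cmd_args)         # $cmd_args means ${cmd_args[0]}: 1 argument
all_args=$(count_args "${cmd_args[@]}")    # every element, spaces preserved: 3 arguments

echo "\$cmd_args: $first_only argument(s), \"\${cmd_args[@]}\": $all_args argument(s)"
```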
## Testing Your Script
Always test your script with:
- Empty/missing optional parameters
- Parameters with spaces
- Boolean true/false values
- Edge cases specific to your tool
See the [Testing Guide](docs/TESTING.md) for detailed testing best practices.

docs/TESTING.md Normal file

@@ -0,0 +1,536 @@
# Testing Guide
This guide covers best practices for writing comprehensive test scripts for biobox components.
> **📌 Important:** All new test scripts should use the **centralized test helpers** located at `src/_utils/test_helpers.sh`. This eliminates code duplication and ensures consistency across all components.
## Table of Contents
- [Core Principles](#core-principles)
- [Test Script Structure](#test-script-structure)
- [Centralized Test Helpers](#centralized-test-helpers)
- [Test Scenarios](#test-scenarios)
- [Best Practices](#best-practices)
- [Viash Testing Features](#viash-testing-features)
- [Static Test Data](#static-test-data)
## Core Principles
### 1. Generate Test Data in Scripts
**Preferred approach:** Generate test data within the test script using the centralized helper functions.
```bash
# Generate test data using centralized helpers
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
create_test_fastq "$meta_temp_dir/reads.fastq" 10 35
```
**Avoid:**
- Storing static test files in the repository
- Fetching test data from external sources
- Large test datasets
### 2. Self-Contained Tests
Tests should be completely self-contained and not depend on external resources:
```yaml
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
```
Only add static test files if absolutely necessary:
```yaml
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
- type: file
path: test_data # Only if data generation is impractical
```
## Test Script Structure
### Configuration Setup
Add the test helpers as a resource in your component configuration:
```yaml
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
```
### Basic Test Template
```bash
#!/bin/bash
## VIASH START
## VIASH END
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# Test execution with centralized functions
#############################################
log "Starting tests for $meta_name"
# --- Test Case 1: Basic functionality ---
log "Starting TEST 1: Basic functionality"
# Create and validate test data
test_data_dir="$meta_temp_dir/test_data"
mkdir -p "$test_data_dir"
create_test_fasta "$test_data_dir/input.fasta" 3 50
check_file_exists "$test_data_dir/input.fasta" "input FASTA file"
log "Executing $meta_name with basic parameters..."
"$meta_executable" \
--input "$test_data_dir/input.fasta" \
--output "$meta_temp_dir/test1"
log "Validating TEST 1 outputs..."
check_dir_exists "$meta_temp_dir/test1" "output directory"
check_file_exists "$meta_temp_dir/test1/result.txt" "result file"
check_file_not_empty "$meta_temp_dir/test1/result.txt" "result file"
log "✅ TEST 1 completed successfully"
# --- Test Case 2: Advanced parameters ---
log "Starting TEST 2: Advanced parameters"
# Create different test data
create_test_fastq "$test_data_dir/input.fastq" 10 35
check_file_exists "$test_data_dir/input.fastq" "input FASTQ file"
log "Executing $meta_name with advanced parameters..."
"$meta_executable" \
--input "$test_data_dir/input.fastq" \
--output "$meta_temp_dir/test2" \
--threads 2 \
--verbose
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/test2/advanced_result.txt" "advanced result file"
check_file_contains "$meta_temp_dir/test2/advanced_result.txt" "expected_pattern" "advanced result file"
log "✅ TEST 2 completed successfully"
print_test_summary "All tests"
```
## Centralized Test Helpers
The centralized test helpers in `src/_utils/test_helpers.sh` provide logging, file and content validation, and test data generation functions shared by all biobox components.
### Available Functions
#### Logging Functions
- `log "message"` - Log with timestamp
- `log_warn "message"` - Warning message
- `log_error "message"` - Error message
#### File/Directory Validation
- `check_file_exists path "description"` - Verify file exists
- `check_dir_exists path "description"` - Verify directory exists
- `check_file_not_exists path "description"` - Verify file doesn't exist
- `check_dir_not_exists path "description"` - Verify directory doesn't exist
- `check_file_empty path "description"` - Verify file is empty
- `check_file_not_empty path "description"` - Verify file is not empty
#### Content Validation
- `check_file_contains path "text" "description"` - Verify file contains text
- `check_file_not_contains path "text" "description"` - Verify file doesn't contain text
- `check_file_matches_regex path "pattern" "description"` - Verify file matches regex
- `check_file_line_count path count "description"` - Verify line count
#### Test Data Generation
- `create_test_fasta path [num_seqs] [seq_length]` - Generate FASTA file
- `create_test_fastq path [num_reads] [read_length]` - Generate FASTQ file
- `create_test_gtf path [num_genes]` - Generate GTF file
- `create_test_gff path [num_features]` - Generate GFF file
- `create_test_bed path [num_intervals]` - Generate BED file
- `create_test_csv path [num_rows]` - Generate CSV file
- `create_test_tsv path [num_rows]` - Generate TSV file
#### Utility Functions
- `setup_test_env` - Initialize test environment with strict error handling
- `print_test_summary "test_name"` - Print completion message
### Usage Example
```bash
#!/bin/bash
## VIASH START
## VIASH END
# Source centralized helpers
source "$meta_resources_dir/test_helpers.sh"
setup_test_env
log "Starting tests for $meta_name"
# Generate test data
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
check_file_exists "$meta_temp_dir/input.fasta" "input FASTA file"
# Run component
"$meta_executable" \
--input "$meta_temp_dir/input.fasta" \
--output "$meta_temp_dir/output.txt"
# Validate output
check_file_exists "$meta_temp_dir/output.txt" "result file"
check_file_contains "$meta_temp_dir/output.txt" "expected_pattern" "result file"
print_test_summary "Basic functionality test"
```
## Test Scenarios
### 1. Basic Functionality
Test the component with minimal, essential parameters:
```bash
log "Starting TEST 1: Basic functionality"
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
"$meta_executable" \
--input "$meta_temp_dir/input.fasta" \
--output "$meta_temp_dir/output.txt"
check_file_exists "$meta_temp_dir/output.txt" "output file"
check_file_not_empty "$meta_temp_dir/output.txt" "output file"
log "✅ TEST 1 completed successfully"
```
### 2. Multiple Input Files
Test with multiple input files or complex input scenarios:
```bash
log "Starting TEST 2: Multiple input files"
create_test_fasta "$meta_temp_dir/input1.fasta" 2 30
create_test_fasta "$meta_temp_dir/input2.fasta" 2 30
"$meta_executable" \
--input "$meta_temp_dir/input1.fasta;$meta_temp_dir/input2.fasta" \
--output "$meta_temp_dir/output.txt"
check_file_exists "$meta_temp_dir/output.txt" "merged output file"
log "✅ TEST 2 completed successfully"
```
### 3. Optional Parameters
Test with optional parameters and advanced features:
```bash
log "Starting TEST 3: Optional parameters"
create_test_fastq "$meta_temp_dir/input.fastq" 10 35
"$meta_executable" \
--input "$meta_temp_dir/input.fastq" \
--output "$meta_temp_dir/output.txt" \
--threads 2 \
--verbose
check_file_exists "$meta_temp_dir/output.txt" "output file with options"
check_file_contains "$meta_temp_dir/output.txt" "verbose" "verbose output"
log "✅ TEST 3 completed successfully"
```
### 4. Edge Cases
Test with edge cases like empty files or unusual inputs:
```bash
log "Starting TEST 4: Edge case - empty input"
# Create empty input file
touch "$meta_temp_dir/empty.fasta"
# Test should handle empty input gracefully
if "$meta_executable" \
--input "$meta_temp_dir/empty.fasta" \
--output "$meta_temp_dir/output.txt" 2>/dev/null; then
log_warn "Component succeeded with empty input - checking output"
check_file_exists "$meta_temp_dir/output.txt" "output file for empty input"
else
log "Expected behavior: Component properly rejected empty input"
fi
log "✅ TEST 4 completed successfully"
```
### 5. Error Handling
Test proper error handling for invalid inputs:
```bash
log "Starting TEST 5: Error handling"
# Test with non-existent input file
if "$meta_executable" \
--input "/non/existent/file.txt" \
--output "$meta_temp_dir/output.txt" 2>/dev/null; then
log_error "Component should have failed with non-existent input"
exit 1
else
log "✅ Component properly handled non-existent input file"
fi
log "✅ TEST 5 completed successfully"
```
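When several failure cases are tested, the `if ... then exit 1 else ...` pattern can be wrapped once. `expect_failure` below is a hypothetical helper, not part of `test_helpers.sh`, and `cat` stands in for `"$meta_executable"` so the sketch runs on its own.

```bash
#!/bin/bash
# expect_failure is a hypothetical helper, not part of test_helpers.sh.
# Usage: expect_failure "description" command [args...]
expect_failure() {
  local description="$1"
  shift
  # Run the command silently; success here is the unexpected outcome
  if "$@" >/dev/null 2>&1; then
    echo "[ERROR] $description: command unexpectedly succeeded" >&2
    return 1
  fi
  echo "[TEST] $description: command failed as expected"
}

# Stand-in for the component: cat fails on a missing file
expect_failure "non-existent input file" cat /non/existent/file.txt
```

Because the command runs inside an `if` condition, its non-zero exit status does not trigger `set -e`.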
## Best Practices
### 1. Use Centralized Test Helpers
Always use the centralized test helpers instead of redefining helper functions in each test script:
```bash
# ✅ Recommended: Use centralized helpers
source "$meta_resources_dir/test_helpers.sh"
setup_test_env
# ❌ NOT recommended: Defining functions individually
set -euo pipefail
log() { echo "$(date '+%Y-%m-%d %H:%M:%S') [TEST] $*"; }
```
### 2. Strict Error Handling
The centralized helpers automatically provide strict error handling via `setup_test_env`:
```bash
# Automatically enabled by setup_test_env:
set -euo pipefail # Exit on errors, undefined variables, pipe failures
export LC_ALL=C # Consistent locale for reproducible results
```
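To see what `pipefail` changes, compare the exit status of the same pipeline with the option off and on. This is a standalone sketch; `errexit` is switched off so both exit codes can be inspected.

```bash
#!/bin/bash
set +e   # keep errexit off so the demo can inspect exit codes

set +o pipefail
false | true
without_pipefail=$?   # status of the last command in the pipeline

set -o pipefail
false | true
with_pipefail=$?      # status of the rightmost failing command

echo "without pipefail: $without_pipefail, with pipefail: $with_pipefail"
```

Without `pipefail`, a failing tool whose output is piped into `grep` or `head` would go unnoticed by the test.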
### 3. Descriptive Validation
Use descriptive validation functions with meaningful descriptions:
```bash
# ✅ Good: Descriptive validation
check_file_exists "$output_file" "filtered feature matrix"
check_file_not_exists "$bam_file" "BAM file (should be disabled by default)"
check_file_contains "$result_file" "expected_pattern" "analysis results"
# ❌ Less helpful: Basic validation without context
check_file_exists "$output_file"
```
### 4. Organized Structure
Use `$meta_temp_dir` and create an organized test structure:
```bash
# Create organized test structure
test_data_dir="$meta_temp_dir/test_data"
test_output_dir="$meta_temp_dir/test_output"
mkdir -p "$test_data_dir" "$test_output_dir"
create_test_fasta "$test_data_dir/input.fasta" 3 50
```
### 5. Clear Test Output
Use consistent logging with clear test boundaries:
```bash
log "Starting TEST 1: Basic functionality"
log "Executing $meta_name with basic parameters..."
log "Validating TEST 1 outputs..."
log "✅ TEST 1 completed successfully"
# Final summary
print_test_summary "All tests"
```
### 6. Comprehensive Content Validation
Don't just check that files exist - validate their content:
```bash
# Check existence and content
check_file_exists "$meta_temp_dir/output.txt" "analysis results"
check_file_not_empty "$meta_temp_dir/output.txt" "analysis results"
check_file_contains "$meta_temp_dir/output.txt" "Number of sequences" "result summary"
check_file_line_count "$meta_temp_dir/output.txt" 10 "expected number of results"
```
### 7. Multiple Test Scenarios
Cover basic functionality, advanced options, and edge cases:
```bash
# Test 1: Basic functionality
log "Starting TEST 1: Basic functionality"
# ... test implementation ...
log "✅ TEST 1 completed successfully"
# Test 2: Advanced options
log "Starting TEST 2: Advanced options"
# ... test implementation ...
log "✅ TEST 2 completed successfully"
# Test 3: Edge cases
log "Starting TEST 3: Edge case handling"
# ... test implementation ...
log "✅ TEST 3 completed successfully"
print_test_summary "All tests"
```
## Viash Testing Features
### Running Tests
```bash
# Test a single component
viash test config.vsh.yaml
# Test with specific resources
viash test config.vsh.yaml --cpus 4 --memory 8GB
# Test with specific setup strategy
viash test config.vsh.yaml --setup build --verbose
# Keep temporary files for debugging
viash test config.vsh.yaml --keep true
# Test all components in parallel
viash ns test --parallel
# Test components matching a query (regex on namespace/name)
viash ns test -q alignment --parallel
```
### Test Execution Flow
When running `viash test`, Viash automatically:
1. **Creates temporary directory** (available as `$meta_temp_dir`)
2. **Builds the main executable**
3. **Builds/pulls Docker image** (if using Docker engine)
4. **Iterates over all test scripts** in `test_resources`
5. **Builds each test into executable** and runs it
6. **Cleans up** temporary files (unless `--keep true`)
7. **Returns exit code 0** if all tests succeed
### Meta Variables in Tests
Your test scripts automatically have access to important meta variables:
- `$meta_executable` - Path to the built component executable
- `$meta_temp_dir` - Temporary directory for test files (automatically cleaned up)
- `$meta_name` - Component name for logging
- `$meta_resources_dir` - Path to test resources
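For running a test script directly while debugging, placeholder values can be placed between the `## VIASH START` and `## VIASH END` markers. This is a sketch that assumes, as described in the Viash documentation, that Viash replaces this block when it builds the test; the component name and paths are hypothetical.

```bash
#!/bin/bash
## VIASH START
# Placeholder values for local debugging only (hypothetical name and paths)
meta_name="my_tool"
meta_executable="target/executable/my_tool/my_tool"
meta_resources_dir="src/_utils"
meta_temp_dir="$(mktemp -d)"
## VIASH END

echo "Testing $meta_name using temporary directory $meta_temp_dir"
```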
### Multiple Test Scripts
You can add multiple test scripts to cover different scenarios:
```yaml
test_resources:
- type: bash_script
path: test_basic.sh
- type: bash_script
path: test_edge_cases.sh
- type: bash_script
path: test_large_data.sh
- type: file
path: /src/_utils/test_helpers.sh
```
### Advanced Testing Options
```bash
# Test with different container setup strategies
viash test config.vsh.yaml --setup cachedbuild # Use cached layers (faster)
viash test config.vsh.yaml --setup build # Clean build from scratch
viash test config.vsh.yaml --setup alwaysbuild # Always rebuild container
# Test with configuration modifications
viash test config.vsh.yaml -c '.engines[0].image := "ubuntu:22.04"'
# Test with debug mode for troubleshooting
viash test config.vsh.yaml --keep true --verbose
```
For more details, see the [Viash Unit Testing Documentation](https://viash.io/guide/component/unit-testing.html).
## Static Test Data
### When to Use Static Test Data
Only use static test files when:
- The tool requires very specific, complex file formats that are difficult to generate
- Generating equivalent test data is impractical or overly complex
- You need real-world data to validate complex algorithms
- Test data is very small (<1KB preferred, <10KB maximum)
### Guidelines for Static Test Data
If you must use static test data:
1. **Keep files small** - Prefer <1KB, maximum <10KB
2. **Document the source** - How was it created?
3. **Use minimal examples** - Strip down to essential features
4. **Consider alternatives** - Can you generate equivalent data?
```bash
# test_data/README.md
# Test data for complex_tool component
# Source: https://github.com/example/dataset
# Generated with: tool --export-sample --format minimal
# Date: 2025-01-01
# Size: 847 bytes
# Purpose: Tests complex file format parsing
```
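The size guideline can be checked mechanically. `check_static_data_size` below is a hypothetical helper, not part of `test_helpers.sh`; it treats 10240 bytes as the limit and is demonstrated on throwaway files.

```bash
#!/bin/bash
# check_static_data_size is a hypothetical helper, not part of test_helpers.sh.
# Usage: check_static_data_size "/path/to/test_data" [max_bytes]
check_static_data_size() {
  local dir="$1"
  local max_bytes="${2:-10240}"   # 10 KB, the maximum suggested in the guidelines
  local file size status=0
  # Read NUL-delimited paths so file names with spaces are handled
  while IFS= read -r -d '' file; do
    size=$(wc -c < "$file")
    if (( size > max_bytes )); then
      echo "Too large ($((size)) bytes): $file" >&2
      status=1
    fi
  done < <(find "$dir" -type f -print0)
  return "$status"
}

# Demo with throwaway files: one small, one over the limit
demo_dir=$(mktemp -d)
printf 'chr1\t100\t200\n' > "$demo_dir/small.bed"
head -c 20000 /dev/zero > "$demo_dir/large.bin"
check_static_data_size "$demo_dir" || echo "static test data exceeds the size limit"
rm -rf "$demo_dir"
```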
### Referencing Static Test Data
```yaml
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
- type: file
path: test_data
```
```bash
# In your test script
static_data="$meta_resources_dir/test_data/sample.complex"
check_file_exists "$static_data" "static test data"
"$meta_executable" --input "$static_data" --output "$meta_temp_dir/output.txt"
```

src/_utils/test_helpers.sh Normal file

@@ -0,0 +1,410 @@
#!/bin/bash
# Test Helper Functions for Biobox Components
#
# This file provides standardized helper functions for component testing.
# Source this file in your test scripts with:
# source "$meta_resources_dir/test_helpers.sh"
#
# Usage examples:
# log "Starting test execution"
# check_file_exists "$output" "result file"
# check_file_not_exists "$bam_file" "BAM file (disabled by default)"
# create_test_fasta "$temp_dir/input.fasta" 3 50
#
#############################################
# Logging Functions
#############################################
# Log messages with timestamps and consistent formatting
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') [TEST] $*"
}
# Log informational messages (alias for log)
log_info() {
log "$*"
}
# Log warning messages
log_warn() {
echo "$(date '+%Y-%m-%d %H:%M:%S') [WARN] $*"
}
# Log error messages
log_error() {
echo "$(date '+%Y-%m-%d %H:%M:%S') [ERROR] $*" >&2
}
#############################################
# File and Directory Validation Functions
#############################################
# Check if a file exists with descriptive logging
# Usage: check_file_exists "/path/to/file" "optional description"
check_file_exists() {
local file_path="$1"
local description="${2:-File}"
if [[ -f "$file_path" ]]; then
log "✓ Found $description: $file_path"
return 0
else
log_error "$description does not exist: $file_path"
exit 1
fi
}
# Check if a directory exists with descriptive logging
# Usage: check_dir_exists "/path/to/dir" "optional description"
check_dir_exists() {
local dir_path="$1"
local description="${2:-Directory}"
if [[ -d "$dir_path" ]]; then
log "✓ Found $description: $dir_path"
return 0
else
log_error "$description does not exist: $dir_path"
exit 1
fi
}
# Check if a file does NOT exist (useful for testing disabled features)
# Usage: check_file_not_exists "/path/to/file" "optional description"
check_file_not_exists() {
local file_path="$1"
local description="${2:-File}"
if [[ ! -f "$file_path" ]]; then
log "✓ Confirmed $description does not exist (as expected): $file_path"
return 0
else
log_error "$description exists but shouldn't: $file_path"
exit 1
fi
}
# Check if a directory does NOT exist (useful for testing disabled features)
# Usage: check_dir_not_exists "/path/to/dir" "optional description"
check_dir_not_exists() {
local dir_path="$1"
local description="${2:-Directory}"
if [[ ! -d "$dir_path" ]]; then
log "✓ Confirmed $description does not exist (as expected): $dir_path"
return 0
else
log_error "$description exists but shouldn't: $dir_path"
exit 1
fi
}
# Check if a file is not empty
# Usage: check_file_not_empty "/path/to/file" "optional description"
check_file_not_empty() {
local file_path="$1"
local description="${2:-File}"
if [[ -s "$file_path" ]]; then
log "$description is not empty: $file_path"
return 0
else
log_error "$description is empty but shouldn't be: $file_path"
exit 1
fi
}
# Check if a file is empty
# Usage: check_file_empty "/path/to/file" "optional description"
check_file_empty() {
local file_path="$1"
local description="${2:-File}"
if [[ ! -s "$file_path" ]]; then
log "$description is empty (as expected): $file_path"
return 0
else
log_error "$description is not empty but should be: $file_path"
exit 1
fi
}
#############################################
# Content Validation Functions
#############################################
# Check if a file contains specific text
# Usage: check_file_contains "/path/to/file" "search_text" "optional description"
check_file_contains() {
local file_path="$1"
local search_text="$2"
local description="${3:-File}"
if grep -qF -- "$search_text" "$file_path" 2>/dev/null; then
log "$description contains expected text '$search_text': $file_path"
return 0
else
log_error "$description does not contain '$search_text': $file_path"
exit 1
fi
}
# Check if a file does NOT contain specific text
# Usage: check_file_not_contains "/path/to/file" "search_text" "optional description"
check_file_not_contains() {
local file_path="$1"
local search_text="$2"
local description="${3:-File}"
if ! grep -qF -- "$search_text" "$file_path" 2>/dev/null; then
log "$description does not contain '$search_text' (as expected): $file_path"
return 0
else
log_error "$description contains '$search_text' but shouldn't: $file_path"
exit 1
fi
}
# Check if a file matches a regex pattern
# Usage: check_file_matches_regex "/path/to/file" "regex_pattern" "optional description"
check_file_matches_regex() {
local file_path="$1"
local regex_pattern="$2"
local description="${3:-File}"
if grep -qE "$regex_pattern" "$file_path" 2>/dev/null; then
log "$description matches expected pattern '$regex_pattern': $file_path"
return 0
else
log_error "$description does not match pattern '$regex_pattern': $file_path"
exit 1
fi
}
# Check if a file has the expected number of lines
# Usage: check_file_line_count "/path/to/file" expected_count "optional description"
check_file_line_count() {
local file_path="$1"
local expected_count="$2"
local description="${3:-File}"
local actual_count=$(wc -l < "$file_path" 2>/dev/null || echo "0")
if [[ "$actual_count" -eq "$expected_count" ]]; then
log "$description has expected line count ($expected_count): $file_path"
return 0
else
log_error "$description has $actual_count lines, expected $expected_count: $file_path"
exit 1
fi
}
#############################################
# Test Data Generation Functions
#############################################
# Create a test FASTA file with specified sequences
# Usage: create_test_fasta "/path/to/output.fasta" [num_sequences] [sequence_length]
create_test_fasta() {
local file_path="$1"
local num_seqs="${2:-2}"
local seq_length="${3:-64}"
log "Creating test FASTA file with $num_seqs sequences of length $seq_length: $file_path"
> "$file_path" # Create empty file
for i in $(seq 1 "$num_seqs"); do
echo ">seq$i" >> "$file_path"
# Generate a deterministic sequence: the repeating pattern ATCG truncated to seq_length
head -c "$seq_length" /dev/zero | tr '\0' 'A' | sed 's/A/ATCG/g' | head -c "$seq_length" >> "$file_path"
echo >> "$file_path"
done
log "✓ Created test FASTA file: $file_path"
}
# Create a test FASTQ file with specified reads
# Usage: create_test_fastq "/path/to/output.fastq" [num_reads] [read_length]
create_test_fastq() {
local file_path="$1"
local num_reads="${2:-4}"
local read_length="${3:-35}"
log "Creating test FASTQ file with $num_reads reads of length $read_length: $file_path"
> "$file_path" # Create empty file
for i in $(seq 1 "$num_reads"); do
echo "@read$i" >> "$file_path"
# Generate random DNA sequence of exact length using bash
local seq_line=""
for j in $(seq 1 "$read_length"); do
case $((RANDOM % 4)) in
0) seq_line+="A";;
1) seq_line+="T";;
2) seq_line+="C";;
3) seq_line+="G";;
esac
done
echo "$seq_line" >> "$file_path"
echo "+" >> "$file_path"
# Generate quality scores (all 'I', ASCII 73, i.e. Phred quality 40 in Phred+33 encoding)
printf "%*s\n" "$read_length" "" | tr ' ' 'I' >> "$file_path"
done
log "✓ Created test FASTQ file: $file_path"
}
# Create a test GTF file with basic gene annotations
# Usage: create_test_gtf "/path/to/output.gtf" [num_genes]
create_test_gtf() {
local file_path="$1"
local num_genes="${2:-3}"
log "Creating test GTF file with $num_genes genes: $file_path"
> "$file_path" # Create empty file
for i in $(seq 1 "$num_genes"); do
local start=$((1000 * i))
local end=$((start + 999))
local chr="chr$((i % 22 + 1))"
echo -e "${chr}\ttest\tgene\t${start}\t${end}\t.\t+\t.\tgene_id \"gene$i\"; gene_name \"GENE$i\"" >> "$file_path"
echo -e "${chr}\ttest\ttranscript\t${start}\t${end}\t.\t+\t.\tgene_id \"gene$i\"; transcript_id \"transcript${i}\"; gene_name \"GENE$i\"" >> "$file_path"
echo -e "${chr}\ttest\texon\t${start}\t$((start + 499))\t.\t+\t.\tgene_id \"gene$i\"; transcript_id \"transcript${i}\"; exon_number \"1\"" >> "$file_path"
echo -e "${chr}\ttest\texon\t$((start + 500))\t${end}\t.\t+\t.\tgene_id \"gene$i\"; transcript_id \"transcript${i}\"; exon_number \"2\"" >> "$file_path"
done
log "✓ Created test GTF file: $file_path"
}
# Create a test GFF file with basic feature annotations
# Usage: create_test_gff "/path/to/output.gff" [num_features]
create_test_gff() {
local file_path="$1"
local num_features="${2:-3}"
log "Creating test GFF file with $num_features features: $file_path"
echo "##gff-version 3" > "$file_path"
for i in $(seq 1 "$num_features"); do
local start=$((1000 * i))
local end=$((start + 999))
local chr="chr$((i % 22 + 1))"
echo -e "${chr}\ttest\tgene\t${start}\t${end}\t.\t+\t.\tID=gene$i;Name=GENE$i" >> "$file_path"
done
log "✓ Created test GFF file: $file_path"
}
# Create a test BED file with genomic intervals
# Usage: create_test_bed "/path/to/output.bed" [num_intervals]
create_test_bed() {
local file_path="$1"
local num_intervals="${2:-3}"
log "Creating test BED file with $num_intervals intervals: $file_path"
> "$file_path" # Create empty file
for i in $(seq 1 "$num_intervals"); do
local start=$((1000 * i))
local end=$((start + 999))
local chr="chr$((i % 22 + 1))"
echo -e "${chr}\t${start}\t${end}\tregion$i\t0\t+" >> "$file_path"
done
log "✓ Created test BED file: $file_path"
}
# Create a simple test CSV file
# Usage: create_test_csv "/path/to/output.csv" [num_rows]
create_test_csv() {
local file_path="$1"
local num_rows="${2:-5}"
log "Creating test CSV file with $num_rows rows: $file_path"
echo "id,name,value,category" > "$file_path"
for i in $(seq 1 "$num_rows"); do
echo "row$i,name$i,$((i * 10)),category$((i % 3 + 1))" >> "$file_path"
done
log "✓ Created test CSV file: $file_path"
}
# Create a simple test TSV file
# Usage: create_test_tsv "/path/to/output.tsv" [num_rows]
create_test_tsv() {
local file_path="$1"
local num_rows="${2:-5}"
log "Creating test TSV file with $num_rows rows: $file_path"
echo -e "id\tname\tvalue\tcategory" > "$file_path"
for i in $(seq 1 "$num_rows"); do
echo -e "row$i\tname$i\t$((i * 10))\tcategory$((i % 3 + 1))" >> "$file_path"
done
log "✓ Created test TSV file: $file_path"
}
#############################################
# Utility Functions
#############################################
# Setup test environment with recommended settings
setup_test_env() {
# Enable strict error handling
set -euo pipefail
# Set up consistent locale for reproducible results
export LC_ALL=C
log "Test environment initialized with strict error handling"
log "Using temporary directory: ${meta_temp_dir:-$PWD}"
}
# Print test summary
print_test_summary() {
local test_name="${1:-Test}"
log "🎉 $test_name completed successfully!"
}
#############################################
# Example Usage
#############################################
# Example function showing how to use the helpers
example_test_usage() {
log "=== Example Test Usage ==="
# Setup
setup_test_env
# Create test data
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
# Validate test data
check_file_exists "$meta_temp_dir/input.fasta" "input FASTA file"
check_file_not_empty "$meta_temp_dir/input.fasta" "input FASTA file"
check_file_line_count "$meta_temp_dir/input.fasta" 6 # 3 sequences = 6 lines
# Example tool execution (commented out)
# "$meta_executable" --input "$meta_temp_dir/input.fasta" --output "$meta_temp_dir/output"
# Validate outputs (examples)
# check_file_exists "$meta_temp_dir/output.txt" "result file"
# check_file_contains "$meta_temp_dir/output.txt" "expected_pattern" "result file"
print_test_summary "Example test"
}


@@ -3,8 +3,10 @@ description: |
Bases2Fastq demultiplexes sequencing data generated by Element Biosciences instruments and converts base calls into FASTQ files.
keywords: ["demultiplex", "fastq", "demux", "Element Biosciences"]
links:
homepage: https://www.elembio.com/
documentation: https://docs.elembio.io/docs/bases2fastq/introduction/
license: Proprietairy
repository: https://github.com/Illumina/bases2fastq
license: Proprietary
requirements:
commands: [bases2fastq]
authors:
@@ -158,19 +160,51 @@ argument_groups:
type: boolean_true
description: |
Split FASTQ files by lane.
- name: --strict
- name: "--skip_qc_report"
type: boolean_true
description: |
In strict mode any invalid or missing input file will terminate execution
(overrides no_error_on_invalid and sets --error_on_missing)
Do not generate HTML QC report.
- name: "--skip_multi_qc"
type: boolean_true
description: |
Do not generate MultiQC HTML report.
- name: "--settings"
type: string
multiple: true
description: |
Run manifest settings override. This option may be specified multiple times.
# --help, -h Display this usage statement
# --input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY
# --num-threads, -p NUMBER Number of threads (default 1)
# --output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY
# --settings SELECTION Run manifest settings override. This option may be specified multiple times.
# --version, -v Display bases2fastq version
# --skip-qc-report SELECTION Do not generate HTML QC report.
# Cyto-fastq specific arguments
- name: "Cyto-fastq Arguments"
arguments:
- name: "--batch"
type: string
description: |
Restrict cyto-fastq generation to batch(es) that match comma delimited list (e.g. --batch B01,B02,B03).
- name: "--cyto_fastq_mask"
type: string
multiple: true
description: |
Cycle mask for cyto fastq generation. This flag can be specified multiple times.
- name: "--panel"
type: file
description: |
Local or remote path to panel JSON
- name: "--per_target_fastq"
type: boolean_true
description: |
Create per-target fastq for each cell assignment target site in each DISS batch according to FastqMasks in TargetCellAssignmentManifest.
- name: "--tca_manifest"
type: file
description: |
Location of TargetCellAssignmentManifest to use instead of default csv found in analysis directory
# Arguments not included as per contributing guidelines:
# --help, -h Display this usage statement (handled by viash)
# --input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY (not needed for biobox)
# --num-threads, -p NUMBER Number of threads (use meta_cpus instead)
# --output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY (not needed for biobox)
# --version, -v Display bases2fastq version (handled by viash)
resources:
- type: bash_script
@@ -179,21 +213,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: elembio/bases2fastq:2.1.0
image: elembio/bases2fastq:2.2
setup:
- type: apt
packages:
- procps
- tree
- type: docker
run: |
echo "bases2fastq: $(bases2fastq --version | cut -d' ' -f3)" > /var/software_versions.txt
test_setup:
- type: apt
packages: curl
bases2fastq --version 2>&1 | head -1 | sed 's/.*version \([0-9\\.]*\).*/bases2fastq: \1/' > /var/software_versions.txt
runners:
- type: executable
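
The version-capture step in the setup block can be checked in isolation. The following is a minimal sketch, assuming the first line printed by `bases2fastq --version` looks like `bases2fastq version 2.2.0`; the sample string and the `extract_version` helper are hypothetical and only mirror the `head`/`sed` pipeline used above.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper: mirrors the head/sed pipeline from the docker setup step,
# but reads the version banner from stdin so it can run without the tool.
extract_version() {
  head -1 | sed 's/.*version \([0-9.]*\).*/bases2fastq: \1/'
}

# Assumed sample banner; the real output of the tool may differ.
sample_output="bases2fastq version 2.2.0"

version_line="$(printf '%s\n' "$sample_output" | extract_version)"
echo "$version_line"
```

If the real banner has a different shape, only the `sed` pattern needs to change; the rest of the pipeline stays the same.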


@@ -1,40 +1,51 @@
```
docker run --rm docker.io/elembio/bases2fastq:2.2 bases2fastq -h
```
Usage: bases2fastq [OPTIONS] ANALYSIS_DIRECTORY OUTPUT_DIRECTORY
positional arguments:
ANALYSIS_DIRECTORY Location of analysis directory
OUTPUT_DIRECTORY Location to save output
ANALYSIS_DIRECTORY Location of analysis directory
OUTPUT_DIRECTORY Location to save output
optional arguments:
--chemistry-version VERSION Run parameters override, chemistry version.
--demux-only, -d Generate demux files and indexing stats without generating FASTQ
--detect-adapters Detect adapters sequences, overriding any sequences present in run manifest.
--error-on-missing Terminate execution for a missing file (by default, missing files are skipped and execution continues). Also set by --strict.
--exclude-tile, -e SELECTION Regex matching tile names to exclude. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
--filter-mask MASK Run parameters override, custom pass filter mask.
--flowcell-id FLOWCELL_ID Run parameters override, flowcell ID.
--force-index-orientation Do not attempt to find orientation for I1/I2 reads (reverse complement). Use orientation given in run manifest.
--group-fastq Group all FASTQ/stats/metrics for a project are in the project folder (default false)
--help, -h Display this usage statement
--i1-cycles NUM_CYCLES Run parameters override, I1 cycles.
--i2-cycles NUM_CYCLES Run parameters override, I2 cycles.
--include-tile, -i SELECTION Regex matching tile names to include. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
--input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY
--kit-configuration KIT_CONFIG Run parameters override, kit configuration.
--legacy-fastq Legacy naming for FASTQ files (e.g. SampleName_S1_L001_R1_001.fastq.gz)
--log-level, -l LEVEL Severity level for logging. i.e. DEBUG, INFO, WARNING, ERROR (default INFO)
--no-error-on-invalid Skip invalid files and continue execution (by default, execution is terminated for an invalid file). Overridden by --strict options.
--no-projects Disable project directories (default false)
--num-threads, -p NUMBER Number of threads (default 1)
--num-unassigned NUMBER Max Number of unassigned sequences to report. Must be <= 1000 (default 30)
--output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY
--preparation-workflow WORKFLOW Run parameters override, preparation workflow.
--qc-only Quickly generate run stats for single tile without generating FASTQ. Use --include-tile/--exclude-tile to define custom tile set.
--r1-cycles NUM_CYCLES Run parameters override, R1 cycles.
--r2-cycles NUM_CYCLES Run parameters override, R2 cycles.
--run-manifest, -r PATH Location of run manifest to use instead of default RunManifest.csv found in analysis directory
--settings SELECTION Run manifest settings override. This option may be specified multiple times.
--skip-qc-report SELECTION Do not generate HTML QC report.
--split-lanes Split FASTQ files by lane
--strict, -s In strict mode any invalid or missing input file will terminate execution (overrides no-error-on-invalid and sets --error-on-missing)
--version, -v Display bases2fastq version
--chemistry-version VERSION Run parameters override, chemistry version.
--demux-only, -d Generate demux files and indexing stats without generating FASTQ
--detect-adapters Detect adapters sequences, overriding any sequences present in run manifest.
--error-on-missing Terminate execution for a missing file (by default, missing files are skipped and execution continues).
--exclude-tile, -e SELECTION Regex matching tile names to exclude. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
--filter-mask MASK Run parameters override, custom pass filter mask.
--flowcell-id FLOWCELL_ID Run parameters override, flowcell ID.
--force-index-orientation Do not attempt to find orientation for I1/I2 reads (reverse complement). Use orientation given in run manifest.
--group-fastq Group all FASTQ/stats/metrics for a project are in the project folder (default false)
--help, -h Display this usage statement
--i1-cycles NUM_CYCLES Run parameters override, I1 cycles.
--i2-cycles NUM_CYCLES Run parameters override, I2 cycles.
--include-tile, -i SELECTION Regex matching tile names to include. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
--input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY
--kit-configuration KIT_CONFIG Run parameters override, kit configuration.
--legacy-fastq Legacy naming for FASTQ files (e.g. SampleName_S1_L001_R1_001.fastq.gz)
--log-level, -l LEVEL Severity level for logging. i.e. DEBUG, INFO, WARNING, ERROR (default INFO)
--no-error-on-invalid Skip invalid files and continue execution (by default, execution is terminated for an invalid file).
--no-projects Disable project directories (default false)
--num-threads, -p NUMBER Number of threads (default 1)
--num-unassigned NUMBER Max Number of unassigned sequences to report. (default 30)
--output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY
--preparation-workflow WORKFLOW Run parameters override, preparation workflow.
--qc-only Quickly generate run stats for single tile without generating FASTQ. Use --include-tile/--exclude-tile to define custom tile set.
--r1-cycles NUM_CYCLES Run parameters override, R1 cycles.
--r2-cycles NUM_CYCLES Run parameters override, R2 cycles.
--run-manifest, -r PATH Location of run manifest to use instead of default RunManifest.csv found in analysis directory
--settings SELECTION Run manifest settings override. This option may be specified multiple times.
--skip-multi-qc Do not generate MultiQC HTML report.
--skip-qc-report Do not generate HTML QC report.
--split-lanes Split FASTQ files by lane
--version, -v Display bases2fastq version
cyto-fastq optional arguments:
--batch BATCH Restrict cyto-fastq generation to batch(es) that match comma delimited list (e.g. --batch B01,B02,B03).
--cyto-fastq-mask MASK Cycle mask for cyto fastq generation. This flag can be specified multiple times.
--panel PANEL Local or remote path to panel JSON
--per-target-fastq Create per-target fastq for each cell assignment target site in each DISS batch according to FastqMasks in TargetCellAssignmentManifest.
--tca-manifest PATH Location of TargetCellAssignmentManifest to use instead of default csv found in analysis directory
--well, -v Restrict cyto-fastq generation to well location(s) that match comma delimited list (e.g. --well A1,A2,B2)


@@ -8,92 +8,122 @@ set -eo pipefail
# Unset parameters
unset_if_false=(
par_demux_only
par_detect_adapters
par_error_on_missing
par_group_fastq
par_legacy_fastq
par_no_error_on_invalid
par_no_projects
par_qc_only
par_split_lanes
par_skip_qc_report
par_strict
par_force_index_orientation
par_demux_only
par_detect_adapters
par_error_on_missing
par_group_fastq
par_legacy_fastq
par_no_error_on_invalid
par_no_projects
par_qc_only
par_split_lanes
par_skip_qc_report
par_skip_multi_qc
par_force_index_orientation
par_per_target_fastq
)
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
done
# NOTE: --preparation-workflow is bugged in bases2fastq
args=(
${par_demux_only:+--demux-only}
${par_detect_adapters:+--detect-adapters}
${par_error_on_missing:+--error-on-missing}
${par_group_fastq:+--group-fastq}
${par_legacy_fastq:+--legacy-fastq}
${par_no_error_on_invalid:+--no-error-on-invalid}
${par_no_projects:+--no-projects}
${par_split_lanes:+--split-lanes}
${par_strict:+--strict}
${par_force_index_orientation:+--force-index-orientation}
${par_chemistry_version:+--chemistry-version "$par_chemistry_version"}
${par_filter_mask:+--filter-mask "$par_filter_mask"}
${par_flowcell_id:+--flowcell-id "$par_flowcell_id"}
${par_i1_cycles:+--i1-cycles "$par_i1_cycles"}
${par_i2_cycles:+--i2-cycles "$par_i2_cycles"}
${par_r1_cycles:+--r1-cycles "$par_r1_cycles"}
${par_r2_cycles:+--r2-cycles "$par_r2_cycles"}
${par_kit_configuration:+--kit-configuration "$par_kit_configuration"}
${par_log_level:+--log-level "$par_log_level"}
${par_num_unassigned:+--num-unassigned "$par_num_unassigned"}
${par_preparation_workflow:+--preparation-workflow "$par_preparation_workflow"}
${meta_cpus:+--num-threads "$meta_cpus"}
${par_run_manifest:+--run-manifest "$par_run_manifest"}
)
# Create arrays for inputs that contain multiple arguments
IFS=";" read -ra exclude_tile <<< "$par_exclude_tile"
IFS=";" read -ra include_tile <<< "$par_include_tile"
if [ -z "$par_report" ]; then
args+=( --skip-qc-report )
fi
for arg_value in "${exclude_tile[@]}"; do
args+=( "--exclude-tile" "$arg_value" )
done
for arg_value in "${include_tile[@]}"; do
args+=( "--include-tile" "$arg_value" )
done
IFS=";" read -ra settings <<< "$par_settings"
IFS=";" read -ra cyto_fastq_mask <<< "$par_cyto_fastq_mask"
echo "> Creating temporary directory."
# create temporary directory and clean up on exit
TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXX")
echo "> Created $TMPDIR"
function clean_up {
[[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
[[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
}
trap clean_up EXIT
# NOTE: --preparation-workflow is bugged in bases2fastq
args=(
${par_demux_only:+--demux-only}
${par_detect_adapters:+--detect-adapters}
${par_error_on_missing:+--error-on-missing}
${par_group_fastq:+--group-fastq}
${par_legacy_fastq:+--legacy-fastq}
${par_no_error_on_invalid:+--no-error-on-invalid}
${par_no_projects:+--no-projects}
${par_split_lanes:+--split-lanes}
${par_force_index_orientation:+--force-index-orientation}
${par_skip_qc_report:+--skip-qc-report}
${par_skip_multi_qc:+--skip-multi-qc}
${par_per_target_fastq:+--per-target-fastq}
${par_chemistry_version:+--chemistry-version "$par_chemistry_version"}
${par_filter_mask:+--filter-mask "$par_filter_mask"}
${par_flowcell_id:+--flowcell-id "$par_flowcell_id"}
${par_i1_cycles:+--i1-cycles "$par_i1_cycles"}
${par_i2_cycles:+--i2-cycles "$par_i2_cycles"}
${par_r1_cycles:+--r1-cycles "$par_r1_cycles"}
${par_r2_cycles:+--r2-cycles "$par_r2_cycles"}
${par_kit_configuration:+--kit-configuration "$par_kit_configuration"}
${par_log_level:+--log-level "$par_log_level"}
${par_num_unassigned:+--num-unassigned "$par_num_unassigned"}
${par_preparation_workflow:+--preparation-workflow "$par_preparation_workflow"}
${par_batch:+--batch "$par_batch"}
${par_panel:+--panel "$par_panel"}
${par_tca_manifest:+--tca-manifest "$par_tca_manifest"}
${meta_cpus:+--num-threads "$meta_cpus"}
${par_run_manifest:+--run-manifest "$par_run_manifest"}
)
if [ -z "$par_report" ]; then
args+=( --skip-qc-report )
fi
for arg_value in "${exclude_tile[@]}"; do
args+=( "--exclude-tile" "$arg_value" )
done
for arg_value in "${include_tile[@]}"; do
args+=( "--include-tile" "$arg_value" )
done
for arg_value in "${settings[@]}"; do
args+=( "--settings" "$arg_value" )
done
for arg_value in "${cyto_fastq_mask[@]}"; do
args+=( "--cyto-fastq-mask" "$arg_value" )
done
args+=( "$par_analysis_directory" "$TMPDIR")
echo "> Running bases2fastq with arguments: ${args[*]}"
bases2fastq "${args[@]}"
echo "> Done running bases2fastq"
echo "> Output folder:"
tree "$TMPDIR"
echo "> Moving FASTQ files into final output directory"
mkdir -p "$par_output_directory/"
mv "$TMPDIR"/Samples/* --target-directory="$par_output_directory"
if [ ! -z "$par_report" ]; then
echo "> Moving HTML report to the output ($par_report)"
mv "$TMPDIR"/*.html "$par_report"
# Find HTML files in TMPDIR
html_files=("$TMPDIR"/*.html)
if [ -f "${html_files[0]}" ]; then
# If there's only one HTML file, move it to the specified report path
if [ ${#html_files[@]} -eq 1 ]; then
mv "${html_files[0]}" "$par_report"
else
# Multiple HTML files - find the main QC report and move it to the specified path
# bases2fastq generates both QC report and MultiQC report
for html_file in "${html_files[@]}"; do
# The main QC report is usually not named multiqc_report.html
if [[ ! "$(basename "$html_file")" =~ ^multiqc.*\.html$ ]]; then
mv "$html_file" "$par_report"
break
fi
done
fi
fi
else
echo " > Leaving reports alone"
fi
@@ -106,6 +136,3 @@ if [ ! -z "$par_logs" ]; then
else
echo "> Not moving logs"
fi
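
The argument-building pattern used in this script (unset `false` flags, expand with `${var:+...}`, split multi-value parameters on `;`) can be sketched on its own. This is an illustration under assumptions, not the component's script: the parameter values below are hypothetical stand-ins for what viash would export, and `settingA`/`settingB` are placeholder values.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical parameter values, standing in for what viash would export.
par_demux_only="false"
par_split_lanes="true"
par_log_level="DEBUG"
par_flowcell_id=""                 # empty: no flag should be emitted
par_settings="settingA;settingB"   # multi-value parameter, ';'-separated

# Boolean flags arrive as "true"/"false"; unset the "false" ones
# so that ${var:+...} expands to nothing for them.
for par in par_demux_only par_split_lanes; do
  if [[ "${!par}" == "false" ]]; then
    unset "$par"
  fi
done

# ${var:+words} expands to the words only if var is set and non-empty.
args=(
  ${par_demux_only:+--demux-only}
  ${par_split_lanes:+--split-lanes}
  ${par_log_level:+--log-level "$par_log_level"}
  ${par_flowcell_id:+--flowcell-id "$par_flowcell_id"}
)

# Split the ';'-separated values and emit the flag once per value.
IFS=";" read -ra settings <<< "$par_settings"
for value in "${settings[@]}"; do
  args+=( --settings "$value" )
done

# Print one argument per line; a real script would run: tool "${args[@]}"
printf '%s\n' "${args[@]}"
```

Quoting `"${args[@]}"` at the call site keeps each array element as a single argument, which matters once a value contains spaces or glob characters.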


@@ -2,25 +2,8 @@
set -eou pipefail
# Helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_exists() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_directory_exists() {
[ -d "$1" ] || { echo "Directory '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Example output
# Note that the format of the fastq file names and organization into subfolders
@@ -87,16 +70,21 @@ function clean_up {
}
trap clean_up EXIT
log_info "Downloading and extracting test data"
# Unpack test input files
log_info "Downloading test data from Element Biosciences"
TAR_DIR="$TMPDIR/tar"
mkdir -p "$TAR_DIR"
curl http://element-public-data.s3.amazonaws.com/bases2fastq-share/bases2fastq-v2/20230404-bases2fastq-sim-151-151-9-9.tar.gz \
-o "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz"
wget http://element-public-data.s3.amazonaws.com/bases2fastq-share/bases2fastq-v2/20230404-bases2fastq-sim-151-151-9-9.tar.gz \
-O "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz"
log_info "Extracting test data"
BCL_DIR="$TMPDIR/bcl"
mkdir "$BCL_DIR"
tar -xvf "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz" -C "$BCL_DIR"
tar -xzf "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz" -C "$BCL_DIR"
log_info "Running test 1 with multiple options"
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
expected_out_dir="$TMPDIR/test1/out"
expected_report="$TMPDIR/report.html"
@@ -123,13 +111,13 @@ expected_logs="$TMPDIR/logs"
--log_level DEBUG \
--no_projects \
--num_unassigned 30 \
--strict \
--run_manifest "$BCL_DIR/20230404-bases2fastq-sim-151-151-9-9/RunManifest.csv"
assert_directory_exists "$expected_out_dir"
assert_directory_exists "$expected_logs"
assert_file_exists "$expected_report"
assert_file_not_empty "$expected_report"
log_info "Validating test 1 outputs"
check_dir_exists "$expected_out_dir" "Output directory"
check_dir_exists "$expected_logs" "Logs directory"
check_file_exists "$expected_report" "HTML report"
check_file_not_empty "$expected_report" "HTML report (should contain data)"
expected_samples=(
Undetermined_S0
@@ -140,15 +128,17 @@ expected_samples=(
sample_4_S5
)
log_info "Checking FASTQ files for all samples and lanes"
for sample in "${expected_samples[@]}"; do
for lane in "L001" "L002"; do
for orientation in "R1" "R2"; do
assert_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}_001.fastq.gz"
check_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}_001.fastq.gz" "FASTQ file for ${sample}_${lane}_${orientation}"
done
done
done
popd > /dev/null
log_info "Running test 3 with basic options"
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
expected_out_dir="$TMPDIR/test3/out"
"$meta_executable" \
@@ -162,23 +152,26 @@ expected_samples=(
sample_3
sample_4
)
tree "$expected_out_dir"
log_info "Inspecting output directory structure:"
find "$expected_out_dir" -name "*.fastq.gz" | head -10
log_info "Checking sample FASTQ files"
for sample in "${expected_samples[@]}"; do
for orientation in "R1" "R2"; do
assert_file_exists "$expected_out_dir/DefaultProject/${sample}/${sample}_${orientation}.fastq.gz"
check_file_exists "$expected_out_dir/DefaultProject/${sample}/${sample}_${orientation}.fastq.gz" "Sample ${sample} ${orientation} FASTQ file"
done
done
assert_file_exists "$expected_out_dir/Unassigned/Unassigned_R1.fastq.gz"
assert_file_exists "$expected_out_dir/Unassigned/Unassigned_R2.fastq.gz"
check_file_exists "$expected_out_dir/Unassigned/Unassigned_R1.fastq.gz" "Unassigned R1 FASTQ file"
check_file_exists "$expected_out_dir/Unassigned/Unassigned_R2.fastq.gz" "Unassigned R2 FASTQ file"
popd > /dev/null
log_info "Running test 4 with split lanes option"
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
expected_out_dir="$TMPDIR/test4/out"
"$meta_executable" \
--analysis_directory "$BCL_DIR/20230404-bases2fastq-sim-151-151-9-9" \
--output_directory "$expected_out_dir" \
--split_lanes
--analysis_directory "$BCL_DIR/20230404-bases2fastq-sim-151-151-9-9" \
--output_directory "$expected_out_dir" \
--split_lanes
expected_samples=(
"Unassigned/Unassigned"
@@ -188,13 +181,17 @@ expected_samples=(
DefaultProject/sample_3/sample_3
DefaultProject/sample_4/sample_4
)
tree "$expected_out_dir"
log_info "Inspecting split lanes output directory:"
find "$expected_out_dir" -name "*.fastq.gz" | head -10
log_info "Checking split lane FASTQ files"
for sample in "${expected_samples[@]}"; do
for lane in "L1" "L2"; do
for orientation in "R1" "R2"; do
assert_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}.fastq.gz"
check_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}.fastq.gz" "Split lane FASTQ file ${sample}_${lane}_${orientation}"
done
done
done
popd > /dev/null
log_info "All tests completed successfully"
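
The contents of `test_helpers.sh` are not part of this diff. As a rough sketch of what helpers with the calling convention seen above (`check_file_exists PATH LABEL`, `log_info MESSAGE`) could look like, assuming nothing about the real implementation:

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical stand-ins; the real test_helpers.sh may behave differently
# (for example, it may exit instead of returning a non-zero status).
log_info() {
  echo "[INFO] $*"
}

check_file_exists() {
  local path="$1" label="$2"
  if [[ -f "$path" ]]; then
    log_info "OK: $label exists ($path)"
  else
    echo "[ERROR] $label is missing: $path" >&2
    return 1
  fi
}

# Usage: create a temporary file and check it.
demo_file="$(mktemp)"
check_file_exists "$demo_file" "demo file"
rm -f "$demo_file"
```

Passing a human-readable label as the second argument keeps failure messages meaningful when the same check runs inside nested loops over samples and lanes.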


@@ -0,0 +1,140 @@
name: bedtools_annotate
namespace: bedtools
description: |
Annotates the depth and breadth of coverage of features from multiple files.
This tool analyzes how intervals in the input file are covered by features
from one or more annotation files. It reports either the fraction of each
interval covered, the count of overlapping features, or both metrics.
**Default behavior:** Reports fraction of each input interval covered by features
**Multiple files:** Can process multiple annotation files simultaneously
**Strand options:** Supports same-strand, opposite-strand, or strand-agnostic analysis
keywords: [Annotate, Coverage, Overlap, BED, GFF, VCF]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/annotate.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: [-i]
type: file
description: |
Input file in BED, GFF, or VCF format to be annotated.
Each interval in this file will be analyzed for coverage by
features from the annotation files.
required: true
example: intervals.bed
- name: --files
type: file
multiple: true
description: |
One or more annotation files for coverage analysis.
**Format:** BED, GFF, or VCF files containing features to analyze
**Multiple files:** Use a semicolon-separated list or multiple --files flags
**Processing:** Each file analyzed separately with results in columns
required: true
example: ["annotations1.bed", "annotations2.bed"]
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with annotation results.
Contains input intervals with additional columns showing coverage
statistics from each annotation file.
required: true
example: annotated_intervals.bed
- name: Options
arguments:
- name: --names
type: string
multiple: true
description: |
Descriptive names for each annotation file.
**Usage:** One name per file in same order as --files
**Header:** Names appear in output header line
**Format:** Semicolon-separated list or multiple --names flags
example: ["ChIP-seq_peaks", "DNA_methylation"]
- name: --counts
type: boolean_true
description: |
Report count of overlapping features instead of coverage fraction.
**Default output:** Fraction of input interval covered (0.0-1.0)
**With --counts:** Integer count of overlapping features
**Use case:** When feature count is more relevant than coverage area
- name: --both
type: boolean_true
description: |
Report both feature counts and coverage fractions.
**Output format:** Count followed by fraction for each annotation file
**Columns:** Doubles the number of result columns
**Use case:** Comprehensive analysis requiring both metrics
- name: --strand
alternatives: [-s]
type: boolean_true
description: |
Require same strandedness for overlap detection.
Only count overlaps between features on the same strand.
Features on opposite strands are ignored.
**Default:** Strand-agnostic analysis
- name: --different_strand
alternatives: [-S]
type: boolean_true
description: |
Require different strandedness for overlap detection.
Only count overlaps between features on opposite strands.
Features on the same strand are ignored.
**Default:** Strand-agnostic analysis
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,29 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools annotate -h
```
Tool: bedtools annotate (aka annotateBed)
Version: v2.31.1
Summary: Annotates the depth & breadth of coverage of features from mult. files
on the intervals in -i.
Usage: bedtools annotate [OPTIONS] -i <bed/gff/vcf> -files FILE1 FILE2..FILEn
Options:
-names A list of names (one / file) to describe each file in -i.
These names will be printed as a header line.
-counts Report the count of features in each file that overlap -i.
- Default is to report the fraction of -i covered by each file.
-both Report the counts followed by the % coverage.
- Default is to report the fraction of -i covered by each file.
-s Require same strandedness. That is, only counts overlaps
on the _same_ strand.
- By default, overlaps are counted without respect to strand.
-S Require different strandedness. That is, only count overlaps
on the _opposite_ strand.
- By default, overlaps are counted without respect to strand.


@@ -0,0 +1,34 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_counts" == "false" ]] && unset par_counts
[[ "$par_both" == "false" ]] && unset par_both
[[ "$par_strand" == "false" ]] && unset par_strand
[[ "$par_different_strand" == "false" ]] && unset par_different_strand
# Convert semicolon-separated files to array
IFS=';' read -ra files_array <<< "$par_files"
# Convert semicolon-separated names to array if provided
if [[ -n "${par_names}" ]]; then
IFS=';' read -ra names_array <<< "$par_names"
fi
# Build command arguments array
cmd_args=(
-i "$par_input"
${par_names:+-names "${names_array[@]}"}
${par_counts:+-counts}
${par_both:+-both}
${par_strand:+-s}
${par_different_strand:+-S}
-files "${files_array[@]}"
)
# Execute bedtools annotate
bedtools annotate "${cmd_args[@]}" > "$par_output"
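
The component description asks for one `--names` entry per `--files` entry, but the script above does not enforce it. A minimal sketch of such a guard, assuming both parameters arrive as semicolon-separated strings; the `same_length` helper and the sample values are hypothetical and not part of the component:

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper: succeeds only if both ';'-separated lists
# contain the same number of entries.
same_length() {
  local -a left right
  IFS=';' read -ra left <<< "$1"
  IFS=';' read -ra right <<< "$2"
  [[ "${#left[@]}" -eq "${#right[@]}" ]]
}

# Sample values in the same shape viash would pass them.
par_files="annotation1.bed;annotation2.bed"
par_names="ChIP_peaks;DNA_meth"

if same_length "$par_files" "$par_names"; then
  echo "names match files"
else
  echo "Expected one --names entry per --files entry" >&2
  exit 1
fi
```

Failing early with a clear message is usually easier to debug than a header line that silently ends up with the wrong number of columns.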


@@ -0,0 +1,113 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_annotate"
# Create test data
log "Creating test data..."
# Create input intervals file
cat > "$meta_temp_dir/intervals.bed" << 'EOF'
chr1 100 200 interval1 100 +
chr1 300 400 interval2 200 +
chr2 150 250 interval3 300 -
chr2 500 600 interval4 400 -
EOF
# Create first annotation file (overlaps with intervals 1, 2, and 3)
cat > "$meta_temp_dir/annotation1.bed" << 'EOF'
chr1 120 180 feature1 500 +
chr1 350 450 feature2 600 +
chr2 140 260 feature3 700 -
EOF
# Create second annotation file (overlaps with intervals 2 and 4)
cat > "$meta_temp_dir/annotation2.bed" << 'EOF'
chr1 320 380 feature4 800 +
chr1 390 420 feature5 900 +
chr2 520 580 feature6 1000 -
EOF
# Test 1: Basic annotation with coverage fractions
log "Starting TEST 1: Basic annotation with coverage fractions"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--files "$meta_temp_dir/annotation1.bed;$meta_temp_dir/annotation2.bed" \
--output "$meta_temp_dir/output1.bed"
check_file_exists "$meta_temp_dir/output1.bed" "basic annotation output"
check_file_not_empty "$meta_temp_dir/output1.bed" "basic annotation output"
check_file_line_count "$meta_temp_dir/output1.bed" 4 "basic annotation line count"
# Check that fractions are present (should contain decimal numbers)
check_file_contains "$meta_temp_dir/output1.bed" "0." "coverage fractions"
log "✅ TEST 1 completed successfully"
# Test 2: Annotation with feature counts
log "Starting TEST 2: Annotation with feature counts"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--files "$meta_temp_dir/annotation1.bed;$meta_temp_dir/annotation2.bed" \
--output "$meta_temp_dir/output2.bed" \
--counts
check_file_exists "$meta_temp_dir/output2.bed" "count annotation output"
check_file_not_empty "$meta_temp_dir/output2.bed" "count annotation output"
# Check that counts are present (note: "1" also matches "chr1", so this is only a smoke check)
check_file_contains "$meta_temp_dir/output2.bed" "1" "feature counts"
log "✅ TEST 2 completed successfully"
# Test 3: Annotation with both counts and fractions
log "Starting TEST 3: Annotation with both counts and fractions"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--files "$meta_temp_dir/annotation1.bed" \
--output "$meta_temp_dir/output3.bed" \
--both
check_file_exists "$meta_temp_dir/output3.bed" "both metrics output"
check_file_not_empty "$meta_temp_dir/output3.bed" "both metrics output"
# Check that both counts and fractions are present
check_file_contains "$meta_temp_dir/output3.bed" "1" "feature counts in both output"
check_file_contains "$meta_temp_dir/output3.bed" "0." "coverage fractions in both output"
log "✅ TEST 3 completed successfully"
# Test 4: Annotation with custom names
log "Starting TEST 4: Annotation with custom names"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--files "$meta_temp_dir/annotation1.bed;$meta_temp_dir/annotation2.bed" \
--names "ChIP_peaks;DNA_meth" \
--output "$meta_temp_dir/output4.bed"
check_file_exists "$meta_temp_dir/output4.bed" "named annotation output"
check_file_not_empty "$meta_temp_dir/output4.bed" "named annotation output"
# With custom names, bedtools prints them as a header line; this test only verifies that output was produced
log "✅ TEST 4 completed successfully"
# Test 5: Strand-specific annotation (same strand)
log "Starting TEST 5: Strand-specific annotation (same strand)"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--files "$meta_temp_dir/annotation1.bed" \
--output "$meta_temp_dir/output5.bed" \
--strand
check_file_exists "$meta_temp_dir/output5.bed" "strand-specific annotation output"
check_file_not_empty "$meta_temp_dir/output5.bed" "strand-specific annotation output"
log "✅ TEST 5 completed successfully"
log "All tests completed successfully!"


@@ -1,12 +1,15 @@
name: bedtools_bamtobed
namespace: bedtools
description: Converts BAM alignments to BED6 or BEDPE format.
description: |
Converts BAM alignments to BED6 or BEDPE format.
This tool converts alignments in BAM format to either BED6 or BEDPE format,
allowing for flexible downstream analysis of genomic intervals.
keywords: [Converts, BAM, BED, BED6, BEDPE]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/#
issue_tracker: https://github.com/arq5x/bedtools2/issues
homepage: https://bedtools.readthedocs.io/en/latest/
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
@@ -14,85 +17,129 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
alternatives: [-i]
type: file
description: Input BAM file.
description: |
Input BAM file containing aligned sequences.
**Requirements:**
- Must be in SAM/BAM format
- For paired-end BEDPE output (`--bedpe`), must be grouped or sorted by query name
required: true
example: input.bam
- name: Outputs
arguments:
- name: --output
alternatives: -o
alternatives: [-o]
required: true
type: file
direction: output
description: Output BED file.
description: |
Output file in BED or BEDPE format.
**Output formats:**
- Default: BED6 format (6 columns)
- With `--bedpe`: BEDPE format for paired-end data
- With `--bed12`: BED12 format with blocked intervals
example: output.bed
- name: Options
arguments:
- name: --bedpe
type: boolean_true
description: |
Write BEDPE format. Requires BAM to be grouped or sorted by query.
Write BEDPE format for paired-end data.
**Requirements:**
- BAM must be grouped or sorted by query name
- Produces paired-end BED format with mate information
- name: --mate1
type: boolean_true
description: |
When writing BEDPE (-bedpe) format, always report mate one as the first BEDPE "block".
When writing BEDPE format (`--bedpe`), always report mate one as the first BEDPE block.
Ensures consistent ordering of paired-end reads in output.
- name: --bed12
type: boolean_true
description: |
Write "blocked" BED format (aka "BED12"). Forces -split.
See http://genome-test.cse.ucsc.edu/FAQ/FAQformat#format1
Write blocked BED format (BED12 format).
**Features:**
- Creates 12-column BED format with block information
- Automatically forces `--split` option
- Useful for representing spliced alignments
See [BED12 format specification](http://genome-test.cse.ucsc.edu/FAQ/FAQformat#format1) for details.
- name: --split
type: boolean_true
description: |
Report "split" BAM alignments as separate BED entries.
Splits only on N CIGAR operations.
Report split BAM alignments as separate BED entries.
**Behavior:**
- Splits only on **N** CIGAR operations (introns/gaps)
- Each split becomes a separate BED interval
- Useful for RNA-seq data with spliced alignments
- name: --splitD
type: boolean_true
description: |
Split alignments based on N and D CIGAR operators.
Forces -split.
Split alignments based on both **N** and **D** CIGAR operators.
**Features:**
- Splits on N (gaps/introns) and D (deletions) operations
- Automatically forces `--split` option
- More aggressive splitting than `--split` alone
- name: --edit_distance
alternatives: -ed
alternatives: [-ed]
type: boolean_true
description: |
Use BAM edit distance (NM tag) for BED score.
- Default for BED is to use mapping quality.
- Default for BEDPE is to use the minimum of
the two mapping qualities for the pair.
- When -ed is used with -bedpe, the total edit
distance from the two mates is reported.
Use BAM edit distance (NM tag) for BED score instead of mapping quality.
**Scoring behavior:**
- **Default BED**: Uses mapping quality as score
- **Default BEDPE**: Uses minimum of two mapping qualities
- **With `--edit_distance` + `--bedpe`**: Reports total edit distance from both mates
- name: --tag
type: string
description: |
Use other NUMERIC BAM alignment tag for BED score.
Default for BED is to use mapping quality. Disallowed with BEDPE output.
Use other numeric BAM alignment tag for BED score.
**Usage:**
- Specify any numeric BAM tag (e.g., `SM`, `AS`, `XS`)
- Replaces default mapping quality scoring
- **Not allowed** with BEDPE output format
example: "SM"
- name: --color
type: string
description: |
An R,G,B string for the color used with BED12 format.
Default is (255,0,0).
example: "250,250,250"
RGB color string for BED12 format visualization.
**Format:** R,G,B values (0-255 each)
**Default:** `255,0,0` (red)
example: "255,0,0"
- name: --cigar
type: boolean_true
description: |
Add the CIGAR string to the BED entry as a 7th column.
Add the CIGAR string as a 7th column in BED output.
Useful for preserving alignment information in BED format.
resources:
- type: bash_script
@@ -101,17 +148,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- path: test_data
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable
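
A minimal sketch of what the new version-capture pipeline writes, assuming `bedtools --version` prints a line of the form `bedtools v2.31.1`. The `fake_bedtools_version` function is a hypothetical stand-in so the snippet runs without bedtools installed.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for `bedtools --version`; the output format is an assumption.
fake_bedtools_version() {
  echo "bedtools v2.31.1"
}

# Same filter as in the new setup step: keep the first line and
# rewrite "bedtools v<version>" into the "bedtools: <version>" form.
version_line="$(fake_bedtools_version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /')"

echo "$version_line"
```

Under that assumption the recorded entry is `bedtools: 2.31.1`, without the quotes and the leading `v` that the previous `sed -n 's/^bedtools //p'` variant kept.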

View File

@@ -1,9 +1,9 @@
```bash
bedtools bamtobed
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bamtobed -h
```
Tool: bedtools bamtobed (aka bamToBed)
Version: v2.30.0
Version: v2.31.1
Summary: Converts BAM alignments to BED6 or BEDPE format.
Usage: bedtools bamtobed [OPTIONS] -i <bam>
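
The coordinate shift behind this conversion can be sketched as follows. This is an illustration, not bedtools' implementation: `sam_to_bed_interval` is a hypothetical helper that turns a 1-based SAM `POS` plus a CIGAR string into a 0-based, half-open BED interval, which is consistent with the expected test output elsewhere in this commit (`POS` 129 with `100M` becomes `128`..`228`).

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert a 1-based SAM POS and a CIGAR string
# into a 0-based, half-open BED start/end pair (tab-separated).
sam_to_bed_interval() {
  local pos="$1" cigar="$2" ref_len=0
  local re='^([0-9]+)([MIDNSHP=X])(.*)$'
  # Walk the CIGAR and sum the operations that consume reference bases.
  while [[ "$cigar" =~ $re ]]; do
    case "${BASH_REMATCH[2]}" in
      M|D|N|=|X) ref_len=$(( ref_len + ${BASH_REMATCH[1]} )) ;;
    esac
    cigar="${BASH_REMATCH[3]}"
  done
  printf '%d\t%d\n' "$(( pos - 1 ))" "$(( pos - 1 + ref_len ))"
}

# The two mates from the test SAM record: POS 129 and POS 429, both 100M.
sam_to_bed_interval 129 100M
sam_to_bed_interval 429 100M
```

Insertions, soft clips, hard clips and padding do not advance the reference position, so they are skipped when summing the interval length.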

View File

@@ -5,35 +5,36 @@
set -eo pipefail
# Unset parameters
unset_if_false=(
# unset flags
unset_if_false=(
par_bedpe
par_mate1
par_bed12
par_split
par_splitD
par_edit_distance
par_tag
par_color
par_cigar
)
for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
# Execute bedtools sort with the provided arguments
bedtools bamtobed \
${par_bedpe:+-bedpe} \
${par_mate1:+-mate1} \
${par_bed12:+-bed12} \
${par_split:+-split} \
${par_splitD:+-splitD} \
${par_edit_distance:+-ed} \
${par_tag:+-tag "$par_tag"} \
${par_cigar:+-cigar} \
${par_color:+-color "$par_color"} \
-i "$par_input" \
> "$par_output"
# Build command arguments array
cmd_args=(
-i "$par_input"
${par_bedpe:+-bedpe}
${par_mate1:+-mate1}
${par_bed12:+-bed12}
${par_split:+-split}
${par_splitD:+-splitD}
${par_edit_distance:+-ed}
${par_tag:+-tag "$par_tag"}
${par_color:+-color "$par_color"}
${par_cigar:+-cigar}
)
# Execute bedtools bamtobed
bedtools bamtobed "${cmd_args[@]}" > "$par_output"
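
The argument-array pattern used in the new script can be tried in isolation. The sketch below assumes Viash passes unset boolean flags as the string `false`; the `par_*` values are made up for illustration, and the final `printf` stands in for the real `bedtools bamtobed` call.

```shell
#!/usr/bin/env bash

# Illustrative values; in the component these are injected by Viash.
par_input="reads.bam"
par_bed12="true"
par_split="false"
par_color="250,250,250"

# A flag holding "false" is still a non-empty string, so it must be unset
# before ${var:+...} can be used to drop it from the command line.
for par in par_bed12 par_split; do
  if [[ "${!par}" == "false" ]]; then
    unset "$par"
  fi
done

# ${var:+word} expands to word only when var is set and non-empty.
cmd_args=(
  -i "$par_input"
  ${par_bed12:+-bed12}
  ${par_split:+-split}
  ${par_color:+-color "$par_color"}
)

# Print one argument per line instead of running bedtools.
printf '%s\n' "${cmd_args[@]}"
```

Collecting the arguments in an array keeps each value as a single word when it is expanded with `"${cmd_args[@]}"`, which is harder to guarantee with backslash-continued command lines.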

View File

@@ -1,183 +1,133 @@
#!/bin/bash
# exit on error
set -eo pipefail
## VIASH START
## VIASH END
# directory of the bam file
test_data="$meta_resources_dir/test_data"
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
diff -a "$2" "$1" \
|| (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################
echo "Creating Test Data..."
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
log "Starting tests for $meta_name"
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Create a test SAM file with proper format (based on original test data)
log "Creating test SAM data..."
cat > "$test_dir/test.sam" << 'EOF'
@SQ SN:chr2:172936693-172938111 LN:1418
@PG ID:bwa PN:bwa VN:0.7.17-r1188
my_read/1 99 chr2:172936693-172938111 129 60 100M = 429 400 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 SM:i:85
my_read/2 147 chr2:172936693-172938111 429 60 100M = 129 -400 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 SM:i:85
EOF
# Convert SAM to BAM using samtools (if available in container) or use the SAM directly
log "Converting SAM to BAM..."
if command -v samtools >/dev/null 2>&1; then
samtools view -bS "$test_dir/test.sam" > "$test_dir/test.bam"
input_file="$test_dir/test.bam"
else
# bedtools can handle SAM files directly
input_file="$test_dir/test.sam"
log "Using SAM file directly (samtools not available)"
fi
# --- Test Case 1: Basic BAM to BED conversion ---
log "Starting TEST 1: Basic BAM to BED conversion"
log "Executing $meta_name with basic parameters..."
"$meta_executable" \
--input "$input_file" \
--output "$meta_temp_dir/output1.bed"
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.bed" "output BED file"
check_file_not_empty "$meta_temp_dir/output1.bed" "output BED file"
# Check that BED file has correct number of columns (6 for BED6)
line_count=$(wc -l < "$meta_temp_dir/output1.bed")
log "Output contains $line_count lines"
[ "$line_count" -gt 0 ] || { log_error "Output file is empty"; exit 1; }
# Check that each line has 6 columns (BED6 format)
awk 'NF != 6 { exit 1 }' "$meta_temp_dir/output1.bed" || {
log_error "Output is not in BED6 format (expected 6 columns per line)"
exit 1
}
trap clean_up EXIT
# Generate expected files for comparison
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\n" > "$TMPDIR/expected.bed"
printf "chr2:172936693-172938111\t128\t228\tchr2:172936693-172938111\t428\t528\tmy_read\t60\t+\t-\n" > "$TMPDIR/expected.bedpe"
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t128\t228\t255,0,0\t1\t100\t0\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t428\t528\t255,0,0\t1\t100\t0\n" > "$TMPDIR/expected.bed12"
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t0\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t0\t-\n" > "$TMPDIR/expected_ed.bed"
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t128\t228\t250,250,250\t1\t100\t0\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t428\t528\t250,250,250\t1\t100\t0\n" > "$TMPDIR/expected_color.bed12"
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t100M\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t100M\n" > "$TMPDIR/expected_cigar.bed"
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t85\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t85\t-\n" > "$TMPDIR/expected_tag.bed"
log "✅ TEST 1 completed successfully"
# --- Test Case 2: BEDPE format ---
log "Starting TEST 2: BEDPE format conversion"
# Test 1:
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
echo "> Run bedtools bamtobed on BAM file"
log "Executing $meta_name with --bedpe flag..."
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output.bed" \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected.bed"
echo "- test1 succeeded -"
popd > /dev/null
# Test 2:
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
echo "> Run bedtools bamtobed on BAM file with -bedpe"
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output.bedpe" \
--input "$input_file" \
--output "$meta_temp_dir/output2.bedpe" \
--bedpe
# checks
assert_file_exists "output.bedpe"
assert_file_not_empty "output.bedpe"
assert_identical_content "output.bedpe" "../expected.bedpe"
echo "- test2 succeeded -"
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.bedpe" "output BEDPE file"
check_file_not_empty "$meta_temp_dir/output2.bedpe" "output BEDPE file"
popd > /dev/null
# Check that BEDPE file has correct number of columns (10 for BEDPE)
awk 'NF != 10 { exit 1 }' "$meta_temp_dir/output2.bedpe" || {
log_error "Output is not in BEDPE format (expected 10 columns per line)"
exit 1
}
# Test 3:
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
log "✅ TEST 2 completed successfully"
echo "> Run bedtools bamtobed on BAM file with -bed12"
# --- Test Case 3: BED12 format ---
log "Starting TEST 3: BED12 format conversion"
log "Executing $meta_name with --bed12 flag..."
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output.bed12" \
--input "$input_file" \
--output "$meta_temp_dir/output3.bed12" \
--bed12
# checks
assert_file_exists "output.bed12"
assert_file_not_empty "output.bed12"
assert_identical_content "output.bed12" "../expected.bed12"
echo "- test3 succeeded -"
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3.bed12" "output BED12 file"
check_file_not_empty "$meta_temp_dir/output3.bed12" "output BED12 file"
popd > /dev/null
# Check that BED12 file has correct number of columns (12 for BED12)
awk 'NF != 12 { exit 1 }' "$meta_temp_dir/output3.bed12" || {
log_error "Output is not in BED12 format (expected 12 columns per line)"
exit 1
}
# Test 4:
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
log "✅ TEST 3 completed successfully"
echo "> Run bedtools bamtobed on BAM file with -ed"
# --- Test Case 4: CIGAR addition ---
log "Starting TEST 4: CIGAR string addition"
log "Executing $meta_name with --cigar flag..."
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output_ed.bed" \
--edit_distance
# checks
assert_file_exists "output_ed.bed"
assert_file_not_empty "output_ed.bed"
assert_identical_content "output_ed.bed" "../expected_ed.bed"
echo "- test4 succeeded -"
popd > /dev/null
# Test 5:
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
echo "> Run bedtools bamtobed on BAM file with -color"
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output_color.bed12" \
--bed12 \
--color "250,250,250" \
# checks
assert_file_exists "output_color.bed12"
assert_file_not_empty "output_color.bed12"
assert_identical_content "output_color.bed12" "../expected_color.bed12"
echo "- test5 succeeded -"
popd > /dev/null
# Test 6:
mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null
echo "> Run bedtools bamtobed on BAM file with -cigar"
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output_cigar.bed" \
--input "$input_file" \
--output "$meta_temp_dir/output4.bed" \
--cigar
# checks
assert_file_exists "output_cigar.bed"
assert_file_not_empty "output_cigar.bed"
assert_identical_content "output_cigar.bed" "../expected_cigar.bed"
echo "- test6 succeeded -"
log "Validating TEST 4 outputs..."
check_file_exists "$meta_temp_dir/output4.bed" "output BED file with CIGAR"
check_file_not_empty "$meta_temp_dir/output4.bed" "output BED file with CIGAR"
popd > /dev/null
# Check that BED file has correct number of columns (7 for BED6 + CIGAR)
awk 'NF != 7 { exit 1 }' "$meta_temp_dir/output4.bed" || {
log_error "Output is not in BED6+CIGAR format (expected 7 columns per line)"
exit 1
}
# Test 7:
mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null
# Check that the 7th column contains CIGAR strings
check_file_contains "$meta_temp_dir/output4.bed" "100M" "BED file with CIGAR strings"
echo "> Run bedtools bamtobed on BAM file with -tag"
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output_tag.bed" \
--tag "XT"
log "✅ TEST 4 completed successfully"
# checks
assert_file_exists "output_tag.bed"
assert_file_not_empty "output_tag.bed"
assert_identical_content "output_tag.bed" "../expected_tag.bed"
echo "- test7 succeeded -"
popd > /dev/null
# Test 8:
mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null
echo "> Run bedtools bamtobed on BAM file with other options"
"$meta_executable" \
--input "$test_data/example.bam" \
--output "output.bed" \
--bedpe \
--mate1 \
--split \
--splitD \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected.bedpe"
echo "- test8 succeeded -"
popd > /dev/null
echo "---- All tests succeeded! ----"
exit 0
print_test_summary "All tests completed successfully"
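
The per-format column checks in the new test repeat the same `awk` idiom. A possible way to factor it out is sketched below; `assert_column_count` is a hypothetical helper and is not claimed to exist in `test_helpers.sh`. Unlike the test above, it splits on tabs only, which is an assumption about how strict the check should be.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if every line of a tab-separated file
# has exactly the expected number of columns. An empty file also passes,
# so a separate non-empty check is still needed.
assert_column_count() {
  local file="$1" expected="$2"
  awk -F'\t' -v n="$expected" 'NF != n { exit 1 }' "$file"
}

# Illustrative BED6 records shaped like the expected bamtobed output.
demo_bed="$(mktemp)"
trap 'rm -f "$demo_bed"' EXIT
printf 'chr2\t128\t228\tmy_read/1\t60\t+\n' > "$demo_bed"
printf 'chr2\t428\t528\tmy_read/2\t60\t-\n' >> "$demo_bed"

if assert_column_count "$demo_bed" 6; then
  echo "BED6 column check passed"
else
  echo "unexpected column count" >&2
fi
```

With such a helper, the BEDPE, BED12 and BED6+CIGAR cases would differ only in the expected count (10, 12 and 7).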

View File

@@ -1,3 +0,0 @@
@SQ SN:chr2:172936693-172938111 LN:1418
my_read 99 chr2:172936693-172938111 129 60 100M = 429 400 CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
my_read 147 chr2:172936693-172938111 429 60 100M = 129 -400 TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100

View File

@@ -1,13 +1,15 @@
name: bedtools_bamtofastq
namespace: bedtools
description: |
Conversion tool for extracting FASTQ records from sequence alignments in BAM format.
keywords: [Conversion ,BAM, FASTQ]
Convert BAM alignments to FASTQ files.
This tool extracts FASTQ records from sequence alignments in BAM format,
supporting both single-end and paired-end data extraction.
keywords: [Conversion, BAM, FASTQ]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/#
issue_tracker: https://github.com/arq5x/bedtools2/issues
homepage: https://bedtools.readthedocs.io/en/latest/
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
@@ -15,40 +17,62 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
alternatives: [-i]
type: file
description: Input BAM file to be converted to FASTQ.
description: |
Input BAM file to be converted to FASTQ.
**Requirements:**
- Must be in BAM format
- For paired-end output, should be sorted by query name
required: true
example: input.bam
- name: Outputs
arguments:
- name: --fastq
alternatives: -fq
alternatives: [-fq]
direction: output
type: file
description: Output FASTQ file.
description: |
Output FASTQ file for single-end data or first mate in paired-end data.
**Output format:** Standard FASTQ format with sequence and quality scores
required: true
example: output.fastq
- name: --fastq2
alternatives: -fq2
alternatives: [-fq2]
type: file
direction: output
description: |
FASTQ for second end. Used if BAM contains paired-end data.
BAM should be sorted by query name is creating paired FASTQ.
Output FASTQ file for second mate in paired-end data.
**Usage:**
- Required only for paired-end BAM files
- BAM should be sorted by query name for proper pairing
- If omitted, only first mates or single-end reads are extracted
example: output_R2.fastq
- name: Options
arguments:
- name: --tags
type: boolean_true
description: |
Create FASTQ based on the mate info in the BAM R2 and Q2 tags.
Create FASTQ based on mate information in BAM R2 and Q2 tags.
**Usage:**
- Uses R2 tag for second mate sequence
- Uses Q2 tag for second mate quality scores
- Alternative to requiring a query-name-sorted paired BAM

resources:
- type: bash_script
@@ -57,17 +81,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- path: test_data
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable

View File

@@ -1,9 +1,9 @@
```bash
bedtools bamtofastq
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bamtofastq -h
```
Tool: bedtools bamtofastq (aka bamToFastq)
Version: v2.30.0
Version: v2.31.1
Summary: Convert BAM alignments to FASTQ files.
Usage: bamToFastq [OPTIONS] -i <BAM> -fq <FQ>

View File

@@ -3,17 +3,18 @@
## VIASH START
## VIASH END
# Exit on error
set -eo pipefail
# Unset parameters
# Unset false boolean parameters
[[ "$par_tags" == "false" ]] && unset par_tags
# Execute bedtools bamtofastq with the provided arguments
bedtools bamtofastq \
${par_tags:+-tags} \
${par_fastq2:+-fq2 "$par_fastq2"} \
-i "$par_input" \
-fq "$par_fastq"
# Build command arguments array
cmd_args=(
-i "$par_input"
-fq "$par_fastq"
${par_fastq2:+-fq2 "$par_fastq2"}
${par_tags:+-tags}
)
# Execute bedtools bamtofastq
bedtools bamtofastq "${cmd_args[@]}"

View File

@@ -1,84 +1,92 @@
#!/bin/bash
# exit on error
set -eo pipefail
## VIASH START
## VIASH END
test_data="$meta_resources_dir/test_data"
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
diff -a "$2" "$1" \
|| (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################
# Test 1: normal conversion
mkdir test1
cd test1
log "Starting tests for $meta_name"
echo "> Run bedtools bamtofastq on BAM file"
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Create a test SAM file with sequence and quality strings for FASTQ extraction
log "Creating test SAM data..."
cat > "$test_dir/test.sam" << 'EOF'
@SQ SN:chr1 LN:1000
@PG ID:bwa PN:bwa VN:0.7.17
read1 0 chr1 100 60 50M * 0 0 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
read2 0 chr1 200 60 50M * 0 0 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
EOF
# --- Test Case 1: Basic BAM to FASTQ conversion (single-end) ---
log "Starting TEST 1: Basic BAM to FASTQ conversion"
log "Executing $meta_name with single-end BAM..."
"$meta_executable" \
--input "$test_data/example.bam" \
--fastq "output.fastq"
--input "$test_dir/test.sam" \
--fastq "$meta_temp_dir/output1.fastq"
# checks
assert_file_exists "output.fastq"
assert_file_not_empty "output.fastq"
assert_identical_content "output.fastq" "$test_data/expected.fastq"
echo "- test1 succeeded -"
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.fastq" "output FASTQ file"
check_file_not_empty "$meta_temp_dir/output1.fastq" "output FASTQ file"
cd ..
# Check FASTQ format (should have 4 lines per read: header, sequence, +, quality)
total_lines=$(wc -l < "$meta_temp_dir/output1.fastq")
log "Output FASTQ contains $total_lines lines"
[ $((total_lines % 4)) -eq 0 ] || { log_error "FASTQ format error: line count not divisible by 4"; exit 1; }
# Test 2: with tags
mkdir test2
cd test2
# Check that FASTQ contains expected patterns
check_file_contains "$meta_temp_dir/output1.fastq" "@read1" "FASTQ headers"
check_file_contains "$meta_temp_dir/output1.fastq" "AAAAAAAA" "sequence content"
check_file_contains "$meta_temp_dir/output1.fastq" "IIIIIIII" "quality scores"
echo "> Run bedtools bamtofastq on BAM file with tags"
log "✅ TEST 1 completed successfully"
# --- Test Case 2: Test --tags option ---
log "Starting TEST 2: BAM to FASTQ with --tags option"
# For the tags test, we'll just verify the command runs without error
# since creating BAM with R2/Q2 tags would be complex
log "Executing $meta_name with --tags flag..."
"$meta_executable" \
--input "$test_data/example.bam" \
--fastq "output.fastq" \
--input "$test_dir/test.sam" \
--fastq "$meta_temp_dir/output2.fastq" \
--tags
# checks
assert_file_exists "output.fastq"
assert_file_not_empty "output.fastq"
assert_identical_content "output.fastq" "$test_data/expected.fastq"
echo "- test2 succeeded -"
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.fastq" "output FASTQ file with tags"
cd ..
log "✅ TEST 2 completed successfully"
# Test 3: with option fq2
mkdir test3
cd test3
# --- Test Case 3: Test with secondary output (without actual paired data) ---
log "Starting TEST 3: Test secondary output parameter"
echo "> Run bedtools bamtofastq on BAM file with output_fq2"
# Test that the fastq2 parameter is accepted (even if no paired reads are present)
log "Executing $meta_name with --fastq2 parameter..."
"$meta_executable" \
--input "$test_data/example.bam" \
--fastq "output1.fastq" \
--fastq2 "output2.fastq"
--input "$test_dir/test.sam" \
--fastq "$meta_temp_dir/output3_R1.fastq" \
--fastq2 "$meta_temp_dir/output3_R2.fastq"
# checks
assert_file_exists "output1.fastq"
assert_file_not_empty "output1.fastq"
assert_identical_content "output1.fastq" "$test_data/expected_1.fastq"
assert_file_exists "output2.fastq"
assert_file_not_empty "output2.fastq"
assert_identical_content "output2.fastq" "$test_data/expected_2.fastq"
echo "- test3 succeeded -"
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3_R1.fastq" "primary FASTQ file"
check_file_not_empty "$meta_temp_dir/output3_R1.fastq" "primary FASTQ file"
cd ..
# The R2 file may be empty since we don't have paired reads, but should exist
check_file_exists "$meta_temp_dir/output3_R2.fastq" "secondary FASTQ file"
echo "All tests succeeded"
exit 0
log "✅ TEST 3 completed successfully"
print_test_summary "All tests completed successfully"
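
The structural FASTQ check in TEST 1 relies on each record spanning four lines (header, sequence, `+`, quality). A small sketch of that check as a reusable function follows; `fastq_record_count` is a hypothetical name and the sample reads are made up.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the number of FASTQ records in a file,
# or fail if the line count is not a multiple of four.
fastq_record_count() {
  local file="$1" total
  total=$(wc -l < "$file")
  if (( total % 4 != 0 )); then
    echo "line count $total is not a multiple of 4" >&2
    return 1
  fi
  echo $(( total / 4 ))
}

# Two illustrative single-end records.
demo_fastq="$(mktemp)"
trap 'rm -f "$demo_fastq"' EXIT
printf '@read1\nAAAA\n+\nIIII\n@read2\nTTTT\n+\nJJJJ\n' > "$demo_fastq"

fastq_record_count "$demo_fastq"
```

This only validates the line structure. It does not confirm that sequence and quality strings have equal length, which the content checks in the test cover only partially.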

View File

@@ -1,3 +0,0 @@
@SQ SN:chr2:172936693-172938111 LN:1418
my_read 99 chr2:172936693-172938111 129 60 100M = 429 400 CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
my_read 147 chr2:172936693-172938111 429 60 100M = 129 -400 TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100

View File

@@ -1,16 +0,0 @@
@my_read
CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@my_read
CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@my_read
CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@my_read
CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

View File

@@ -1,4 +0,0 @@
@my_read/1
CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

View File

@@ -1,4 +0,0 @@
@my_read/2
CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

View File

@@ -1,13 +0,0 @@
#!/bin/bash
# create sam file
printf "@SQ\tSN:chr2:172936693-172938111\tLN:1418\n" > example.sam
printf "my_read\t99\tchr2:172936693-172938111\t129\t60\t100M\t=\t429\t400\tCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT\t*\tXT:A:U\tNM:i:0\tSM:i:37\tAM:i:37\tX0:i:1\tX1:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tMD:Z:100\n" >> example.sam
printf "my_read\t147\tchr2:172936693-172938111\t429\t60\t100M\t=\t129\t-400\tTCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG\t*\tXT:A:U\tNM:i:0\tSM:i:37\tAM:i:37\tX0:i:1\tX1:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tMD:Z:100\n" >> example.sam
# create bam file
# samtools view -b example.sam > example.bam
# create fastq files
# bedtools bamtofastq -i example.bam -fq expected.fastq
# bedtools bamtofastq -i example.bam -fq expected_1.fastq -fq2 expected_2.fastq

View File

@@ -7,8 +7,7 @@ keywords: [Converts, BED12, BED6]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bed12tobed6.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/#
issue_tracker: https://github.com/arq5x/bedtools2/issues
homepage: https://bedtools.readthedocs.io/en/latest/
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
@@ -16,33 +15,52 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
alternatives: [-i]
type: file
description: Input BED12 file.
description: |
Input BED12 file containing blocked features.
**Requirements:**
- Must be in BED12 format (12 columns)
- Should contain blocked features (e.g., genes with exons)
- Blocks are defined by columns 10-12 (blockCount, blockSizes, blockStarts)
required: true
example: input.bed12
- name: Outputs
arguments:
- name: --output
alternatives: -o
alternatives: [-o]
type: file
direction: output
description: Output BED6 file to be written.
description: |
Output BED6 file containing discrete features.
**Output format:**
- Each block from input BED12 becomes a separate BED6 entry
- Maintains chromosome, strand, and name information
- Coordinates are adjusted to represent individual blocks
example: output.bed6
- name: Options
arguments:
- name: --n_score
alternatives: -n
alternatives: [-n]
type: boolean_true
description: |
Force the score to be the (1-based) block number from the BED12.
Force the score to be the 1-based block number from the BED12.
**Default behavior:** Preserves original score from BED12
**With --n_score:** Sets score to block number (1, 2, 3, etc.)
resources:
- type: bash_script
@@ -51,16 +69,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable
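
The block-splitting behaviour described in the `--output` documentation can be illustrated with a short `awk` sketch. This is an approximation for plus-strand features under the BED12 column layout (blockCount, blockSizes, blockStarts in columns 10 to 12), not the bedtools implementation; `bed12_to_bed6_sketch` and the sample record are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical sketch: emit one BED6 line per BED12 block.
# Block starts are relative to chromStart (column 2).
bed12_to_bed6_sketch() {
  awk -F'\t' -v OFS='\t' '{
    split($11, sizes, ",")
    split($12, starts, ",")
    for (i = 1; i <= $10; i++) {
      s = $2 + starts[i]
      print $1, s, s + sizes[i], $4, $5, $6
    }
  }' "$1"
}

# One illustrative feature with three non-overlapping blocks.
demo_bed12="$(mktemp)"
trap 'rm -f "$demo_bed12"' EXIT
printf 'chr1\t100\t600\tgene1\t1000\t+\t100\t600\t255,0,0\t3\t100,100,100\t0,200,400\n' > "$demo_bed12"

bed12_to_bed6_sketch "$demo_bed12"
```

The single input record yields three BED6 lines covering 100..200, 300..400 and 500..600, each keeping the original name, score and strand. Replacing `$5` with `i` would mimic the block-number score that `--n_score` requests for plus-strand features.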

View File

@@ -1,9 +1,9 @@
```
bedtools bed12tobed6 -h
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bed12tobed6 -h
```
Tool: bedtools bed12tobed6 (aka bed12ToBed6)
Version: v2.30.0
Version: v2.31.1
Summary: Splits BED12 features into discrete BED6 features.
Usage: bedtools bed12tobed6 [OPTIONS] -i <bed12>

View File

@@ -5,11 +5,14 @@
set -eo pipefail
# Unset parameters
# Unset false boolean parameters
[[ "$par_n_score" == "false" ]] && unset par_n_score
# Execute bedtools bed12tobed6 conversion
bedtools bed12tobed6 \
${par_n_score:+-n} \
-i "$par_input" \
> "$par_output"
# Build command arguments array
cmd_args=(
-i "$par_input"
${par_n_score:+-n}
)
# Execute bedtools bed12tobed6
bedtools bed12tobed6 "${cmd_args[@]}" > "$par_output"

View File

@@ -1,85 +1,119 @@
#!/bin/bash
# exit on error
set -eo pipefail
## VIASH START
## VIASH END
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
diff -a "$2" "$1" \
|| (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################
# Create directories for tests
echo "Creating Test Data..."
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
log "Starting tests for $meta_name"
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Create a test BED12 file
log "Creating test BED12 data..."
cat > "$test_dir/test.bed12" << 'EOF'
chr1 100 600 gene1 1000 + 100 600 255,0,0 3 100,150,200 0,200,300
chr2 200 800 gene2 800 - 200 800 0,255,0 2 200,250 0,350
chr3 300 500 gene3 500 . 300 500 0,0,255 1 200 0
EOF
# --- Test Case 1: Basic BED12 to BED6 conversion ---
log "Starting TEST 1: Basic BED12 to BED6 conversion"
log "Executing $meta_name with basic parameters..."
"$meta_executable" \
--input "$test_dir/test.bed12" \
--output "$meta_temp_dir/output1.bed6"
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.bed6" "output BED6 file"
check_file_not_empty "$meta_temp_dir/output1.bed6" "output BED6 file"
# Check that BED6 file has correct number of columns (6 columns)
awk 'NF != 6 { exit 1 }' "$meta_temp_dir/output1.bed6" || {
log_error "Output is not in BED6 format (expected 6 columns per line)"
exit 1
}
trap clean_up EXIT
# Create example BED12 file
cat <<EOF > "$TMPDIR/example.bed12"
chr21 10079666 10120808 uc002yiv.1 0 - 10081686 1 0 1 2 0 6 0 8 0 4 528,91,101,215, 0,1930,39750,40927,
chr21 10080031 10081687 uc002yiw.1 0 - 10080031 1 0 0 8 0 0 3 1 0 2 200,91, 0,1565,
chr21 10081660 10120796 uc002yix.2 0 - 10081660 1 0 0 8 1 6 6 0 0 3 27,101,223, 0,37756,38913,
EOF
# Check that we have more BED6 entries than BED12 entries (due to block splitting)
bed12_lines=$(wc -l < "$test_dir/test.bed12")
bed6_lines=$(wc -l < "$meta_temp_dir/output1.bed6")
log "Input BED12: $bed12_lines lines, Output BED6: $bed6_lines lines"
# Expected output bed6 file
cat <<EOF > "$TMPDIR/expected.bed6"
chr21 10079666 10120808 uc002yiv.1 0 -
chr21 10080031 10081687 uc002yiw.1 0 -
chr21 10081660 10120796 uc002yix.2 0 -
EOF
# Expected output bed6 file with -n option
cat <<EOF > "$TMPDIR/expected_n.bed6"
chr21 10079666 10120808 uc002yiv.1 1 -
chr21 10080031 10081687 uc002yiw.1 1 -
chr21 10081660 10120796 uc002yix.2 1 -
EOF
[ "$bed6_lines" -gt "$bed12_lines" ] || {
log_error "Expected more BED6 lines than BED12 lines due to block splitting"
exit 1
}
# Test 1: Default conversion BED12 to BED6
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
# Check that gene names are preserved
check_file_contains "$meta_temp_dir/output1.bed6" "gene1" "gene names from BED12"
check_file_contains "$meta_temp_dir/output1.bed6" "gene2" "gene names from BED12"
echo "> Run bedtools_bed12tobed6 on BED12 file"
log "✅ TEST 1 completed successfully"
# --- Test Case 2: BED12 to BED6 with --n_score option ---
log "Starting TEST 2: BED12 to BED6 with block numbering"
log "Executing $meta_name with --n_score flag..."
"$meta_executable" \
--input "../example.bed12" \
--output "output.bed6"
# checks
assert_file_exists "output.bed6"
assert_file_not_empty "output.bed6"
assert_identical_content "output.bed6" "../expected.bed6"
echo "- test1 succeeded -"
popd > /dev/null
# Test 2: Conversion BED12 to BED6 with -n option
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
echo "> Run bedtools_bed12tobed6 on BED12 file with -n option"
"$meta_executable" \
--input "../example.bed12" \
--output "output.bed6" \
--input "$test_dir/test.bed12" \
--output "$meta_temp_dir/output2.bed6" \
--n_score
# checks
assert_file_exists "output.bed6"
assert_file_not_empty "output.bed6"
assert_identical_content "output.bed6" "../expected_n.bed6"
echo "- test2 succeeded -"
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.bed6" "output BED6 file with block numbers"
check_file_not_empty "$meta_temp_dir/output2.bed6" "output BED6 file with block numbers"
popd > /dev/null
# Check that BED6 file has correct number of columns
awk 'NF != 6 { exit 1 }' "$meta_temp_dir/output2.bed6" || {
log_error "Output is not in BED6 format (expected 6 columns per line)"
exit 1
}
echo "---- All tests succeeded! ----"
exit 0
# Check that scores are block numbers (should contain "1", "2", "3" for gene1 with 3 blocks)
check_file_contains "$meta_temp_dir/output2.bed6" $'\t1\t' "block number 1 in score column"
check_file_contains "$meta_temp_dir/output2.bed6" $'\t2\t' "block number 2 in score column"
check_file_contains "$meta_temp_dir/output2.bed6" $'\t3\t' "block number 3 in score column"
log "✅ TEST 2 completed successfully"
# --- Test Case 3: Test with single-block BED12 ---
log "Starting TEST 3: Single-block BED12 conversion"
# Create a simple single-block BED12 (should produce single BED6)
cat > "$test_dir/single_block.bed12" << 'EOF'
chrX 1000 2000 single_gene 900 + 1000 2000 128,128,128 1 1000 0
EOF
log "Executing $meta_name with single-block BED12..."
"$meta_executable" \
--input "$test_dir/single_block.bed12" \
--output "$meta_temp_dir/output3.bed6"
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3.bed6" "single-block BED6 output"
check_file_not_empty "$meta_temp_dir/output3.bed6" "single-block BED6 output"
# Should have exactly one line (single block)
single_lines=$(wc -l < "$meta_temp_dir/output3.bed6")
[ "$single_lines" -eq 1 ] || {
log_error "Expected exactly 1 line for single-block BED12, got $single_lines"
exit 1
}
# Check that it contains the expected gene name
check_file_contains "$meta_temp_dir/output3.bed6" "single_gene" "single gene name"
log "✅ TEST 3 completed successfully"
print_test_summary "All tests completed successfully"
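The line-count check in this test only asserts that the BED6 output has more lines than the BED12 input. A stricter expectation, sketched below, is one output line per block, i.e. the sum of the blockCount column (column 10) over the input. The helper name `expected_bed6_lines` is hypothetical and not part of `test_helpers.sh`; the two records are illustrative.

```bash
# Hypothetical helper: sum blockCount (column 10) over all BED12 records.
expected_bed6_lines() {
  awk -F'\t' '{ n += $10 } END { print n + 0 }' "$1"
}

# Illustrative input: two tab-separated BED12 records with 2 and 3 blocks.
tmp_bed12="$(mktemp)"
printf 'chr1\t1000\t3000\tgene1\t100\t+\t1000\t3000\t255,0,0\t2\t500,500\t0,1500\n' > "$tmp_bed12"
printf 'chr2\t2000\t5000\tgene2\t200\t-\t2000\t5000\t0,255,0\t3\t400,300,400\t0,1500,2600\n' >> "$tmp_bed12"

expected="$(expected_bed6_lines "$tmp_bed12")"
echo "expected BED6 lines: $expected"
```

In the test itself, the result could be compared against `$bed6_lines` with `-eq` instead of `-gt`.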

@@ -0,0 +1,108 @@
name: bedtools_bedpetobam
namespace: bedtools
description: |
Convert BEDPE (paired-end BED) intervals to BAM format.
This tool converts genomic paired-end interval data into BAM alignment format,
where each BEDPE record becomes a pair of BAM alignment records representing
the paired-end reads.
keywords: [genomics, intervals, format conversion, BAM, BEDPE, paired-end]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bedpetobam.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: [-i]
type: file
description: |
Input file in BEDPE format.
**BEDPE format:** Tab-delimited with 10 columns:
chrom1, start1, end1, chrom2, start2, end2, name, score, strand1, strand2
**Requirements:** Represents paired-end genomic intervals
**Coordinate system:** 0-based coordinates
required: true
example: intervals.bedpe
- name: --genome
alternatives: [-g]
type: file
description: |
Genome file defining chromosome names and sizes.
**Format:** Tab-delimited file with chromosome name and size
**Example line:** chr1 249250621
**Purpose:** Required for BAM header creation
required: true
example: genome.txt
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output BAM file.
Contains converted BEDPE intervals as paired BAM alignment records
suitable for visualization and downstream analysis of paired-end data.
required: true
example: intervals.bam
- name: BAM Options
arguments:
- name: --mapq
type: integer
description: |
Set the mapping quality for BAM records.
**Range:** 0-255 (typical values)
**Default:** 255 (maximum quality)
**Purpose:** MAPQ field in BAM format
default: 255
example: 60
- name: --ubam
type: boolean_true
description: |
Write uncompressed BAM output.
**Default:** Compressed BAM output
**Use case:** When compression is not needed or causes issues
**File size:** Significantly larger output files
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
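The `--input` description in this config lists ten tab-delimited BEDPE columns. A minimal sketch of a pre-flight check for that layout follows. The function name `check_bedpe_columns` is an assumption for illustration and is not part of the component; it only checks the field count, not coordinate validity.

```bash
# Hypothetical pre-flight check: every line must have at least
# 10 tab-separated fields (extra trailing columns are tolerated).
check_bedpe_columns() {
  awk -F'\t' 'NF < 10 { bad = 1 } END { exit bad + 0 }' "$1"
}

# Illustrative files: one valid BEDPE record, one BED6 record.
good_bedpe="$(mktemp)"
bad_bedpe="$(mktemp)"
printf 'chr1\t100\t200\tchr1\t300\t400\tpair1\t100\t+\t+\n' > "$good_bedpe"
printf 'chr1\t100\t200\tpair1\t100\t+\n' > "$bad_bedpe"

if check_bedpe_columns "$good_bedpe"; then
  echo "good file: looks like BEDPE"
fi
if ! check_bedpe_columns "$bad_bedpe"; then
  echo "bad file: fewer than 10 columns"
fi
```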

@@ -0,0 +1,19 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bedpetobam -h
```
Tool: bedtools bedpetobam (aka bedpeToBam)
Version: v2.31.1
Summary: Converts feature records to BAM format.
Usage: bedpetobam [OPTIONS] -i <bed/gff/vcf> -g <genome>
Options:
-mapq Set the mappinq quality for the BAM records.
(INT) Default: 255
-ubam Write uncompressed BAM output. Default writes compressed BAM.
Notes:
(1) BED files must be at least BED4 to create BAM (needs name field).

@@ -0,0 +1,20 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_ubam" == "false" ]] && unset par_ubam
# Build command arguments array
cmd_args=(
-i "$par_input"
-g "$par_genome"
${par_mapq:+-mapq "$par_mapq"}
${par_ubam:+-ubam}
)
# Execute bedtools bedpetobam
bedtools bedpetobam "${cmd_args[@]}" > "$par_output"
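The script above relies on `${var:+...}` expansions inside an array so that unset or empty parameters contribute no arguments at all. The sketch below isolates that pattern so it can be run without bedtools. The wrapper function `build_args` and its positional parameters are assumptions made for illustration; the component itself reads the `par_*` variables that Viash provides.

```bash
# Hypothetical wrapper around the argument-building pattern.
# Arguments: input, genome, mapq (may be empty), ubam ("true" or "false").
build_args() {
  local par_input="$1" par_genome="$2" par_mapq="$3" par_ubam="$4"

  # A "false" boolean is unset so that ${par_ubam:+...} expands to nothing.
  [[ "$par_ubam" == "false" ]] && unset par_ubam

  local cmd_args=(
    -i "$par_input"
    -g "$par_genome"
    ${par_mapq:+-mapq "$par_mapq"}
    ${par_ubam:+-ubam}
  )

  # Print one argument per line instead of calling bedtools.
  printf '%s\n' "${cmd_args[@]}"
}

build_args intervals.bedpe genome.txt 60 true
build_args intervals.bedpe genome.txt "" false
```

The first call prints six arguments including `-mapq`, `60`, and `-ubam`; the second prints only the four input and genome arguments.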

@@ -0,0 +1,129 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_bedpetobam"
# Create test data
log "Creating test data..."
# Create genome file
cat > "$meta_temp_dir/genome.txt" << 'EOF'
chr1 249250621
chr2 242193529
chr3 198295559
EOF
# Create BEDPE input file (paired-end BED format)
# Format: chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2
cat > "$meta_temp_dir/intervals.bedpe" << 'EOF'
chr1 100 200 chr1 300 400 pair1 100 + +
chr1 500 600 chr1 700 800 pair2 200 + -
chr2 150 250 chr2 350 450 pair3 300 - -
chr2 1000 1100 chr2 1200 1300 pair4 400 - +
EOF
# Create more detailed BEDPE file
cat > "$meta_temp_dir/detailed.bedpe" << 'EOF'
chr1 1000 2000 chr1 3000 4000 detailed1 500 + +
chr1 5000 6000 chr1 7000 8000 detailed2 600 + -
chr2 1500 2500 chr2 3500 4500 detailed3 700 - -
chr2 9000 10000 chr2 11000 12000 detailed4 800 - +
chr3 2000 3000 chr3 4000 5000 detailed5 900 + +
EOF
# Test 1: Basic BEDPE to BAM conversion
log "Starting TEST 1: Basic BEDPE to BAM conversion"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bedpe" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/output1.bam"
check_file_exists "$meta_temp_dir/output1.bam" "basic BAM output"
check_file_not_empty "$meta_temp_dir/output1.bam" "basic BAM output"
# BAM files are binary, so this test only checks existence and non-emptiness; the structure is inspected in TEST 5 when samtools is available
log "✅ TEST 1 completed successfully"
# Test 2: BAM conversion with custom MAPQ
log "Starting TEST 2: BAM conversion with custom MAPQ"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bedpe" \
--genome "$meta_temp_dir/genome.txt" \
--mapq 60 \
--output "$meta_temp_dir/output2.bam"
check_file_exists "$meta_temp_dir/output2.bam" "MAPQ BAM output"
check_file_not_empty "$meta_temp_dir/output2.bam" "MAPQ BAM output"
log "✅ TEST 2 completed successfully"
# Test 3: Uncompressed BAM output
log "Starting TEST 3: Uncompressed BAM output"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bedpe" \
--genome "$meta_temp_dir/genome.txt" \
--ubam \
--output "$meta_temp_dir/output3.bam"
check_file_exists "$meta_temp_dir/output3.bam" "uncompressed BAM output"
check_file_not_empty "$meta_temp_dir/output3.bam" "uncompressed BAM output"
# Uncompressed BAM should be larger than compressed (typically)
compressed_size=$(stat -c%s "$meta_temp_dir/output1.bam")
uncompressed_size=$(stat -c%s "$meta_temp_dir/output3.bam")
if [ "$uncompressed_size" -lt "$compressed_size" ]; then
log "Warning: Uncompressed BAM is smaller than compressed; this may indicate an issue or a very small dataset"
fi
log "✅ TEST 3 completed successfully"
# Test 4: More detailed BEDPE file conversion
log "Starting TEST 4: Detailed BEDPE file conversion"
"$meta_executable" \
--input "$meta_temp_dir/detailed.bedpe" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/output4.bam"
check_file_exists "$meta_temp_dir/output4.bam" "detailed BAM output"
check_file_not_empty "$meta_temp_dir/output4.bam" "detailed BAM output"
# Check file size is reasonable for 5 BEDPE pairs (10 alignments)
detailed_size=$(stat -c%s "$meta_temp_dir/output4.bam")
if [ "$detailed_size" -lt 200 ]; then
log_error "BAM file seems too small for 5 BEDPE pairs: $detailed_size bytes"
exit 1
fi
log "✅ TEST 4 completed successfully"
# Test 5: Verify BAM structure with samtools (if available)
log "Starting TEST 5: BAM structure verification"
if command -v samtools &> /dev/null; then
# Check BAM header
if samtools view -H "$meta_temp_dir/output1.bam" | grep -q "@SQ"; then
log "✓ BAM header contains sequence dictionary"
else
log_error "BAM header missing sequence dictionary"
exit 1
fi
# Count alignments (should be double the BEDPE pairs since each pair creates 2 alignments)
alignment_count=$(samtools view -c "$meta_temp_dir/output1.bam")
if [ "$alignment_count" -eq 8 ]; then
log "✓ BAM contains expected number of alignments: $alignment_count (4 BEDPE pairs = 8 alignments)"
else
log " Expected 8 alignments (4 BEDPE pairs), got $alignment_count"
fi
else
log " samtools not available, skipping BAM structure verification"
fi
log "✅ TEST 5 completed successfully"
log "🎉 All bedtools_bedpetobam tests completed successfully!"
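The size comparisons in this test use `stat -c%s`, which is GNU-style syntax (BSD and macOS `stat` use `-f%z` instead). This only matters if the test is ever run outside the container. A portable alternative could look like the sketch below; the helper name `file_size` is hypothetical and not part of `test_helpers.sh`.

```bash
# Hypothetical helper: byte size of a file without relying on GNU stat.
# tr strips the leading padding that some wc implementations emit.
file_size() {
  wc -c < "$1" | tr -d '[:space:]'
}

# Illustrative usage on a five-byte temporary file.
tmp_file="$(mktemp)"
printf 'abcde' > "$tmp_file"
echo "size: $(file_size "$tmp_file") bytes"
```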

@@ -1,12 +1,15 @@
name: bedtools_bedtobam
namespace: bedtools
description: Converts feature records (bed/gff/vcf) to BAM format.
description: |
Converts feature records to BAM format.
Converts genomic intervals from BED, GFF, or VCF formats into BAM format,
creating aligned sequence records that can be used with standard BAM tools.
keywords: [Converts, BED, GFF, VCF, BAM]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bedtobam.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/#
issue_tracker: https://github.com/arq5x/bedtools2/issues
homepage: https://bedtools.readthedocs.io/en/latest/
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
@@ -14,41 +17,65 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
alternatives: [-i]
type: file
description: Input file (bed/gff/vcf).
description: |
Input genomic intervals file in BED, GFF, or VCF format.
**Requirements:**
- BED files must be at least BED4 format (requires name field)
- File must contain valid genomic coordinates
required: true
example: input.bed
- name: --genome
alternatives: -g
alternatives: [-g]
type: file
description: |
Input genome file.
NOTE: This is not a fasta file. This is a two-column tab-delimited file
where the first column is the chromosome name and the second their sizes.
Genome file defining chromosome names and sizes.
**Format:** Two-column tab-delimited file:
```
chr1 249250621
chr2 243199373
```
**Note:** This is NOT a FASTA file. If needed, derive it from a FASTA by running `samtools faidx` and taking the first two columns of the resulting `.fai` index.
required: true
example: hg19.genome
- name: Outputs
arguments:
- name: --output
alternatives: -o
alternatives: [-o]
type: file
direction: output
description: Output BAM file to be written.
description: |
Output BAM file containing converted genomic intervals.
**Format:** Standard BAM format (compressed by default)
required: true
example: output.bam
- name: Options
arguments:
- name: --map_quality
alternatives: -mapq
alternatives: [-mapq]
type: integer
description: |
Set the mappinq quality for the BAM records.
Set the mapping quality for the BAM records.
**Range:** 0-255 (higher values indicate better quality)
**Default:** 255 (maximum quality)
min: 0
max: 255
default: 255
@@ -56,14 +83,21 @@ argument_groups:
- name: --bed12
type: boolean_true
description: |
The BED file is in BED12 format. The BAM CIGAR
string will reflect BED "blocks".
Process BED file as BED12 format with blocked intervals.
**Features:**
- BAM CIGAR string reflects BED blocks (exons/introns)
- Useful for representing spliced alignments
- Requires BED12 format input
- name: --uncompress_bam
alternatives: -ubam
alternatives: [-ubam]
type: boolean_true
description: |
Write uncompressed BAM output. Default writes compressed BAM.
Write uncompressed BAM output.
**Default behavior:** Writes compressed BAM
**Use case:** When downstream tools require uncompressed format
resources:
- type: bash_script
@@ -72,19 +106,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
test_setup:
- type: apt
packages: [samtools]
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable
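The `--genome` description in this config notes that the genome file is not a FASTA file and points to `samtools faidx`. The `.fai` index that `samtools faidx` writes has five tab-separated columns, of which the first two (name and length) form a genome file. A minimal sketch, assuming an index already exists; the helper name `fai_to_genome` and the offset values in the sample index are made up for illustration.

```bash
# Hypothetical helper: keep only sequence name and length from a .fai index.
fai_to_genome() {
  cut -f1,2 "$1"
}

# Illustrative .fai content (columns: name, length, offset, linebases, linewidth).
tmp_fai="$(mktemp)"
printf 'chr1\t249250621\t6\t60\t61\n' > "$tmp_fai"
printf 'chr2\t243199373\t253404644\t60\t61\n' >> "$tmp_fai"

fai_to_genome "$tmp_fai"
```

With a real reference, this would typically follow `samtools faidx reference.fa`, with the output redirected to the file passed as `--genome`.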

@@ -1,21 +1,22 @@
```bash
bedtools bedtobam
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bedtobam -h
```
Tool: bedtools bedtobam (aka bedToBam)
Version: v2.30.0
Version: v2.31.1
Summary: Converts feature records to BAM format.
Usage: bedtools bedtobam [OPTIONS] -i <bed/gff/vcf> -g <genome>
Options:
-mapq Set the mappinq quality for the BAM records.
(INT) Default: 255
-mapq Set the mappinq quality for the BAM records.
(INT) Default: 255
-bed12 The BED file is in BED12 format. The BAM CIGAR
string will reflect BED "blocks".
-bed12 The BED file is in BED12 format. The BAM CIGAR
string will reflect BED "blocks".
-ubam Write uncompressed BAM output. Default writes compressed BAM.
-ubam Write uncompressed BAM output. Default writes compressed BAM.
Notes:
(1) BED files must be at least BED4 to create BAM (needs name field).
(1) BED files must be at least BED4 to create BAM (needs name field).

View File

@@ -5,15 +5,18 @@
set -eo pipefail
# Unset parameters
# Unset false boolean parameters
[[ "$par_bed12" == "false" ]] && unset par_bed12
[[ "$par_uncompress_bam" == "false" ]] && unset par_uncompress_bam
# Execute bedtools bed to bam
bedtools bedtobam \
${par_bed12:+-bed12} \
${par_uncompress_bam:+-ubam} \
${par_map_quality:+-mapq "$par_map_quality"} \
-i "$par_input" \
-g "$par_genome" \
> "$par_output"
# Build command arguments array
cmd_args=(
-i "$par_input"
-g "$par_genome"
${par_map_quality:+-mapq "$par_map_quality"}
${par_bed12:+-bed12}
${par_uncompress_bam:+-ubam}
)
# Execute bedtools bedtobam
bedtools bedtobam "${cmd_args[@]}" > "$par_output"

@@ -1,188 +1,127 @@
#!/bin/bash
# exit on error
set -eo pipefail
## VIASH START
## VIASH END
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
diff -a "$2" "$1" \
|| (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################
# Create directories for tests
echo "Creating Test Data..."
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
}
trap clean_up EXIT
log "Starting tests for $meta_name"
# Create and populate input files
printf "chr1\t248956422\nchr3\t242193529\nchr2\t198295559\n" > "$TMPDIR/genome.txt"
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t37\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t37\t-\n" > "$TMPDIR/example.bed"
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t128\t228\t255,0,0\t1\t100\t0\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t428\t528\t255,0,0\t1\t100\t0\n" > "$TMPDIR/example.bed12"
# Create and populate example.gff file
printf "##gff-version 3\n" > "$TMPDIR/example.gff"
printf "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "$TMPDIR/example.gff"
printf "chr3\t.\tmRNA\t1000\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "$TMPDIR/example.gff"
printf "chr1\t.\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=transcript1\n" >> "$TMPDIR/example.gff"
printf "chr2\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "$TMPDIR/example.gff"
printf "chr1\t.\tCDS\t1000\t1200\t.\t+\t0\tID=cds1;Parent=transcript1\n" >> "$TMPDIR/example.gff"
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "$TMPDIR/example.gff"
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Expected output sam files for each test
cat <<EOF > "$TMPDIR/expected.sam"
@HD VN:1.0 SO:unsorted
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
@SQ SN:chr1 AS:../genome.txt LN:248956422
@SQ SN:chr3 AS:../genome.txt LN:242193529
@SQ SN:chr2 AS:../genome.txt LN:198295559
my_read/1 0 chr1 129 255 100M * 0 0 * *
my_read/2 16 chr1 429 255 100M * 0 0 * *
EOF
cat <<EOF > "$TMPDIR/expected12.sam"
@HD VN:1.0 SO:unsorted
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
@SQ SN:chr1 AS:../genome.txt LN:248956422
@SQ SN:chr3 AS:../genome.txt LN:242193529
@SQ SN:chr2 AS:../genome.txt LN:198295559
my_read/1 0 chr1 129 255 100M * 0 0 * *
my_read/2 16 chr1 429 255 100M * 0 0 * *
EOF
cat <<EOF > "$TMPDIR/expected_mapquality.sam"
@HD VN:1.0 SO:unsorted
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
@SQ SN:chr1 AS:../genome.txt LN:248956422
@SQ SN:chr3 AS:../genome.txt LN:242193529
@SQ SN:chr2 AS:../genome.txt LN:198295559
my_read/1 0 chr1 129 10 100M * 0 0 * *
my_read/2 16 chr1 429 10 100M * 0 0 * *
EOF
cat <<EOF > "$TMPDIR/expected_gff.sam"
@HD VN:1.0 SO:unsorted
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
@SQ SN:chr1 AS:../genome.txt LN:248956422
@SQ SN:chr3 AS:../genome.txt LN:242193529
@SQ SN:chr2 AS:../genome.txt LN:198295559
gene 0 chr1 1000 255 1001M * 0 0 * *
mRNA 0 chr3 1000 255 1001M * 0 0 * *
exon 0 chr1 1000 255 201M * 0 0 * *
exon 0 chr2 1500 255 201M * 0 0 * *
CDS 0 chr1 1000 255 201M * 0 0 * *
CDS 0 chr1 1500 255 201M * 0 0 * *
# Create test genome file
log "Creating test genome file..."
cat > "$test_dir/test.genome" << 'EOF'
chr1 248956422
chr2 242193529
chr3 198295559
EOF
# Test 1: Default conversion BED to BAM
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
# Create test BED file (BED4 minimum for bedtobam)
log "Creating test BED file..."
cat > "$test_dir/test.bed" << 'EOF'
chr1 1000 2000 gene1 100 +
chr2 3000 4000 gene2 200 -
chr3 5000 6000 gene3 150 +
EOF
echo "> Run bedtools_bedtobam on BED file"
# Create test BED12 file
log "Creating test BED12 file..."
cat > "$test_dir/test.bed12" << 'EOF'
chr1 1000 3000 gene1 100 + 1000 3000 255,0,0 2 500,500 0,1500
chr2 2000 5000 gene2 200 - 2000 5000 0,255,0 3 400,300,400 0,1500,2600
EOF
# --- Test Case 1: Basic BED to BAM conversion ---
log "Starting TEST 1: Basic BED to BAM conversion"
log "Executing $meta_name with basic BED file..."
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bam"
--input "$test_dir/test.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output1.bam"
samtools view -h output.bam > output.sam
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.bam" "output BAM file"
check_file_not_empty "$meta_temp_dir/output1.bam" "output BAM file"
# checks
assert_file_exists "output.bam"
assert_file_not_empty "output.bam"
assert_identical_content "output.sam" "../expected.sam"
echo "- test1 succeeded -"
# Check if it's a valid BAM file by reading header
if command -v samtools >/dev/null 2>&1; then
samtools view -H "$meta_temp_dir/output1.bam" > "$meta_temp_dir/header1.txt" 2>/dev/null || true
if [ -s "$meta_temp_dir/header1.txt" ]; then
check_file_contains "$meta_temp_dir/header1.txt" "@HD" "BAM header"
log "✓ Valid BAM format detected"
else
log "Note: Cannot validate BAM format (samtools not available or BAM corrupt)"
fi
else
log "Note: samtools not available for BAM validation"
fi
popd > /dev/null
log "✅ TEST 1 completed successfully"
# Test 2: BED12 file
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
# --- Test Case 2: BED12 format conversion ---
log "Starting TEST 2: BED12 to BAM conversion"
echo "> Run bedtools_bedtobam on BED12 file"
log "Executing $meta_name with BED12 format..."
"$meta_executable" \
--input "../example.bed12" \
--genome "../genome.txt" \
--output "output.bam" \
--bed12 \
--input "$test_dir/test.bed12" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output2.bam" \
--bed12
samtools view -h output.bam > output.sam
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.bam" "BED12 output BAM file"
check_file_not_empty "$meta_temp_dir/output2.bam" "BED12 output BAM file"
# checks
assert_file_exists "output.bam"
assert_file_not_empty "output.bam"
assert_identical_content "output.sam" "../expected12.sam"
echo "- test2 succeeded -"
log "✅ TEST 2 completed successfully"
popd > /dev/null
# --- Test Case 3: Custom mapping quality ---
log "Starting TEST 3: Custom mapping quality"
# Test 3: Uncompressed BAM file
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
echo "> Run bedtools_bedtobam on BED file with uncompressed BAM output"
log "Executing $meta_name with custom mapping quality..."
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bam" \
--input "$test_dir/test.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output3.bam" \
--map_quality 30
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3.bam" "output BAM with custom MAPQ"
check_file_not_empty "$meta_temp_dir/output3.bam" "output BAM with custom MAPQ"
log "✅ TEST 3 completed successfully"
# --- Test Case 4: Uncompressed BAM ---
log "Starting TEST 4: Uncompressed BAM output"
log "Executing $meta_name with uncompressed BAM..."
"$meta_executable" \
--input "$test_dir/test.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output4.bam" \
--uncompress_bam
# checks
assert_file_exists "output.bam"
assert_file_not_empty "output.bam"
# Cannot assert_identical_content because the uncompress option does not work on this version of bedtools.
log "Validating TEST 4 outputs..."
check_file_exists "$meta_temp_dir/output4.bam" "uncompressed BAM file"
check_file_not_empty "$meta_temp_dir/output4.bam" "uncompressed BAM file"
echo "- test3 succeeded -"
# Uncompressed BAM should generally be larger than compressed
compressed_size=$(stat -c%s "$meta_temp_dir/output1.bam")
uncompressed_size=$(stat -c%s "$meta_temp_dir/output4.bam")
log "Compressed BAM size: $compressed_size bytes"
log "Uncompressed BAM size: $uncompressed_size bytes"
popd > /dev/null
log "✅ TEST 4 completed successfully"
# Test 4: Map quality
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
echo "> Run bedtools_bedtobam on BED file with map quality"
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bam" \
--map_quality 10 \
samtools view -h output.bam > output.sam
# checks
assert_file_exists "output.bam"
assert_file_not_empty "output.bam"
assert_identical_content "output.sam" "../expected_mapquality.sam"
echo "- test4 succeeded -"
popd > /dev/null
# Test 5: gff to bam conversion
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
echo "> Run bedtools_bedtobam on GFF file"
"$meta_executable" \
--input "../example.gff" \
--genome "../genome.txt" \
--output "output.bam"
samtools view -h output.bam > output.sam
# checks
assert_file_exists "output.bam"
assert_file_not_empty "output.bam"
assert_identical_content "output.sam" "../expected_gff.sam"
echo "- test5 succeeded -"
popd > /dev/null
echo "---- All tests succeeded! ----"
exit 0
print_test_summary "All tests completed successfully"
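The custom mapping quality test in the rewritten script only checks that the output file exists and is not empty. Where samtools is available, the MAPQ value itself could be checked on the SAM text, as sketched below. The helper name `all_mapq_equal` is hypothetical, and the SAM record is illustrative rather than actual tool output.

```bash
# Hypothetical helper: succeed only if every alignment record on stdin
# (non-header SAM lines) carries the given MAPQ in column 5.
all_mapq_equal() {
  awk -F'\t' -v q="$1" '!/^@/ && $5 != q { bad = 1 } END { exit bad + 0 }'
}

# Illustrative SAM text: one header line and one alignment with MAPQ 30.
sam_text="$(printf '@HD\tVN:1.0\tSO:unsorted\ngene1\t0\tchr1\t1001\t30\t1000M\t*\t0\t0\t*\t*\n')"

if printf '%s\n' "$sam_text" | all_mapq_equal 30; then
  echo "all records have MAPQ 30"
fi
```

In the test this would read from `samtools view -h "$meta_temp_dir/output3.bam"`, guarded by the same `command -v samtools` check used in TEST 1.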

@@ -0,0 +1,221 @@
name: bedtools_closest
namespace: bedtools
description: |
Find the closest feature in file B for each feature in file A.
For each interval in file A, this tool identifies the nearest feature in
file B, regardless of whether they overlap. Useful for associating genomic
features with their nearest neighbors, such as finding the closest gene
to each SNP or the nearest regulatory element to each promoter.
**Default behavior:** Reports closest feature regardless of strand or overlap
**Distance reporting:** Optional distance calculation with various orientations
**Multiple hits:** Configurable handling of ties and k-nearest neighbors
keywords: [Closest, Nearest, Distance, BED, GFF, VCF, Association]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input_a
alternatives: [-a]
type: file
description: |
Query file in BED, GFF, or VCF format.
For each feature in this file, the closest feature in file B
will be identified and reported.
required: true
example: queries.bed
- name: --input_b
alternatives: [-b]
type: file
multiple: true
description: |
Database file(s) in BED, GFF, or VCF format.
**Single file:** Find closest features in one database
**Multiple files:** Find closest features across multiple databases
**Format:** May be the same format as input A or a different one
required: true
example: ["database1.bed", "database2.bed"]
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with closest feature results.
Contains input A features with additional columns showing
the closest features from file(s) B, and optionally distance
and other metadata.
required: true
example: closest_features.bed
- name: Distance Options
arguments:
- name: --distance
alternatives: [-d]
type: boolean_true
description: |
Report distance to closest feature as extra column.
**Distance calculation:** Never negative; 0 for overlapping features
**Use case:** When you need quantitative proximity measurements
- name: --distance_mode
alternatives: [-D]
type: string
choices: ["ref", "a", "b"]
description: |
Report signed distance with orientation awareness.
**"ref":** Distance relative to reference genome coordinates
**"a":** Distance relative to strand of feature A
**"b":** Distance relative to strand of feature B
**Negative values:** Upstream features
**Positive values:** Downstream features
example: "ref"
- name: Filtering Options
arguments:
- name: --ignore_overlaps
alternatives: [-io]
type: boolean_true
description: |
Ignore overlapping features in B.
Only consider features in B that do not overlap with A.
Useful for finding nearby but non-overlapping features.
- name: --ignore_upstream
alternatives: [-iu]
type: boolean_true
description: |
Ignore upstream features in B.
**Requires:** --distance_mode parameter
**Effect:** Only consider downstream features
**Orientation:** Follows --distance_mode orientation rules
- name: --ignore_downstream
alternatives: [-id]
type: boolean_true
description: |
Ignore downstream features in B.
**Requires:** --distance_mode parameter
**Effect:** Only consider upstream features
**Orientation:** Follows --distance_mode orientation rules
- name: --force_upstream
alternatives: [-fu]
type: boolean_true
description: |
Choose first upstream feature when ties exist.
**Requires:** --distance_mode parameter
**Tie handling:** Among equally close features, prefer upstream
**Orientation:** Follows --distance_mode orientation rules
- name: --force_downstream
alternatives: [-fd]
type: boolean_true
description: |
Choose first downstream feature when ties exist.
**Requires:** --distance_mode parameter
**Tie handling:** Among equally close features, prefer downstream
**Orientation:** Follows --distance_mode orientation rules
- name: --strand
alternatives: [-s]
type: boolean_true
description: |
Require same strandedness.
Only consider features in B that are on the same strand as
the corresponding feature in A.
- name: --different_strand
alternatives: [-S]
type: boolean_true
description: |
Require different strandedness.
Only consider features in B that are on the opposite strand
from the corresponding feature in A.
- name: Advanced Options
arguments:
- name: --k_closest
alternatives: [-k]
type: integer
description: |
Report k closest hits for each query.
**Default:** 1 (single closest feature)
**Multiple hits:** Reports multiple closest features per query
**Tie handling:** With --tie_mode "all", all ties are still reported
default: 1
example: 3
- name: --tie_mode
alternatives: [-t]
type: string
choices: ["all", "first", "last"]
description: |
How to handle ties for closest features.
**"all":** Report all equally close features (default)
**"first":** Report first tie found in file B
**"last":** Report last tie found in file B
default: "all"
example: "first"
- name: --different_names
alternatives: [-N]
type: boolean_true
description: |
Require different names between query and hit.
For BED files, compares the 4th column (name field).
Useful to avoid self-hits in self-comparisons.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable
- type: nextflow
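Several options in this config (`--ignore_upstream`, `--ignore_downstream`, `--force_upstream`, `--force_downstream`) are documented as requiring `--distance_mode`. A wrapper script could fail early when that constraint is violated. The sketch below is one possible guard, not the component's actual script; the function name `require_distance_mode` and its calling convention are assumptions.

```bash
# Hypothetical guard: the first argument is the distance mode (may be empty),
# the remaining arguments are "true"/"false" values of dependent flags.
# Returns 1 if any dependent flag is "true" while the distance mode is empty.
require_distance_mode() {
  local par_distance_mode="$1"
  shift
  local flag
  for flag in "$@"; do
    if [[ "$flag" == "true" && -z "$par_distance_mode" ]]; then
      echo "error: this option requires --distance_mode" >&2
      return 1
    fi
  done
  return 0
}

# Illustrative calls: the first passes, the second reports an error.
require_distance_mode "ref" true false && echo "ok with distance mode"
require_distance_mode "" true false || echo "rejected without distance mode"
```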

@@ -0,0 +1,131 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools closest -h
```
Tool: bedtools closest (aka closestBed)
Version: v2.31.1
Summary: For each feature in A, finds the closest
feature (upstream or downstream) in B.
Usage: bedtools closest [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
Options:
-d In addition to the closest feature in B,
report its distance to A as an extra column.
- The reported distance for overlapping features will be 0.
-D Like -d, report the closest feature in B, and its distance to A
as an extra column. Unlike -d, use negative distances to report
upstream features.
The options for defining which orientation is "upstream" are:
- "ref" Report distance with respect to the reference genome.
B features with a lower (start, stop) are upstream
- "a" Report distance with respect to A.
When A is on the - strand, "upstream" means B has a
higher (start,stop).
- "b" Report distance with respect to B.
When B is on the - strand, "upstream" means A has a
higher (start,stop).
-io Ignore features in B that overlap A. That is, we want close,
yet not touching features only.
-iu Ignore features in B that are upstream of features in A.
This option requires -D and follows its orientation
rules for determining what is "upstream".
-id Ignore features in B that are downstream of features in A.
This option requires -D and follows its orientation
rules for determining what is "downstream".
-fu Choose first from features in B that are upstream of features in A.
This option requires -D and follows its orientation
rules for determining what is "upstream".
-fd Choose first from features in B that are downstream of features in A.
This option requires -D and follows its orientation
rules for determining what is "downstream".
-t How ties for closest feature are handled. This occurs when two
features in B have exactly the same "closeness" with A.
By default, all such features in B are reported.
Here are all the options:
- "all" Report all ties (default).
- "first" Report the first tie that occurred in the B file.
- "last" Report the last tie that occurred in the B file.
-mdb How multiple databases are resolved.
- "each" Report closest records for each database (default).
- "all" Report closest records among all databases.
-k Report the k closest hits. Default is 1. If tieMode = "all",
- all ties will still be reported.
-N Require that the query and the closest hit have different names.
For BED, the 4th column is compared.
-s Require same strandedness. That is, only report hits in B
that overlap A on the _same_ strand.
- By default, overlaps are reported without respect to strand.
-S Require different strandedness. That is, only report hits in B
that overlap A on the _opposite_ strand.
- By default, overlaps are reported without respect to strand.
-f Minimum overlap required as a fraction of A.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-F Minimum overlap required as a fraction of B.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-r Require that the fraction overlap be reciprocal for A AND B.
- In other words, if -f is 0.90 and -r is used, this requires
that B overlap 90% of A and A _also_ overlaps 90% of B.
-e Require that the minimum fraction be satisfied for A OR B.
- In other words, if -e is used with -f 0.90 and -F 0.10 this requires
that either 90% of A is covered OR 10% of B is covered.
Without -e, both fractions would have to be satisfied.
-split Treat "split" BAM or BED12 entries as distinct BED intervals.
-g Provide a genome file to enforce consistent chromosome sort order
across input files. Only applies when used with -sorted option.
-nonamecheck For sorted data, don't throw an error if the file has different naming conventions
for the same chromosome. ex. "chr1" vs "chr01".
-names When using multiple databases, provide an alias for each that
will appear instead of a fileId when also printing the DB record.
-filenames When using multiple databases, show each complete filename
instead of a fileId when also printing the DB record.
-sortout When using multiple databases, sort the output DB hits
for each record.
-bed If using BAM input, write output as BED.
-header Print the header from the A file prior to results.
-nobuf Disable buffered output. Using this option will cause each line
of output to be printed as it is generated, rather than saved
in a buffer. This will make printing large output files
noticeably slower, but can be useful in conjunction with
other software tools and scripts that need to process one
line of bedtools output at a time.
-iobuf Specify amount of memory to use for input buffer.
Takes an integer argument. Optional suffixes K/M/G supported.
Note: currently has no effect with compressed files.
Notes:
Reports "none" for chrom and "-1" for all other fields when a feature
is not found in B on the same chromosome as the feature in A.
E.g. none -1 -1
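
The `-t` tie modes described above can be illustrated with a small sketch. It is not how bedtools implements tie handling; `pick_ties` is a hypothetical helper, and the candidate names and distances are invented for illustration. Each input line is `<feature> <distance>`, listed in B-file order.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper (not part of bedtools): from candidates on stdin,
# print the feature(s) at minimum distance according to the tie mode.
pick_ties() {
  local mode="$1"
  awk -v mode="$mode" '
    {
      name[NR] = $1
      dist[NR] = $2 + 0
      if (NR == 1 || dist[NR] < min) min = dist[NR]
    }
    END {
      for (i = 1; i <= NR; i++) {
        if (dist[i] != min) continue
        if (mode == "all") print name[i]               # every tie
        else if (mode == "first") { print name[i]; exit }  # first tie in file order
        else if (mode == "last") last = name[i]        # remember the latest tie
      }
      if (mode == "last" && last != "") print last
    }'
}

# Invented candidates: featB and featC tie at distance 20.
candidates=$'featA 50\nfeatB 20\nfeatC 20\nfeatD 75'

ties_all=$(pick_ties all <<< "$candidates" | paste -sd, -)
ties_first=$(pick_ties first <<< "$candidates")
ties_last=$(pick_ties last <<< "$candidates")

printf 'all=%s first=%s last=%s\n' "$ties_all" "$ties_first" "$ties_last"
```

With these invented candidates, "all" yields both tied features, "first" yields `featB`, and "last" yields `featC`.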

View File

@@ -0,0 +1,52 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
unset_if_false=(
par_distance
par_ignore_overlaps
par_ignore_upstream
par_ignore_downstream
par_force_upstream
par_force_downstream
par_strand
par_different_strand
par_different_names
)
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
# Convert semicolon-separated input_b files to array
IFS=';' read -ra input_b_array <<< "$par_input_b"
# Build command arguments array
cmd_args=(
-a "$par_input_a"
${par_distance:+-d}
${par_distance_mode:+-D "$par_distance_mode"}
${par_ignore_overlaps:+-io}
${par_ignore_upstream:+-iu}
${par_ignore_downstream:+-id}
${par_force_upstream:+-fu}
${par_force_downstream:+-fd}
${par_strand:+-s}
${par_different_strand:+-S}
${par_k_closest:+-k "$par_k_closest"}
${par_tie_mode:+-t "$par_tie_mode"}
${par_different_names:+-N}
)
# Add multiple input_b files
for file in "${input_b_array[@]}"; do
cmd_args+=(-b "$file")
done
# Execute bedtools closest
bedtools closest "${cmd_args[@]}" > "$par_output"

View File

@@ -0,0 +1,173 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_closest"
# Create test data
log "Creating test data..."
# Create query intervals file
cat > "$meta_temp_dir/queries.bed" << 'EOF'
chr1 100 200 query1 100 +
chr1 400 500 query2 200 +
chr1 800 900 query3 300 -
chr2 200 300 query4 400 -
EOF
# Create database file with features at various distances
cat > "$meta_temp_dir/database.bed" << 'EOF'
chr1 250 350 feature1 500 +
chr1 450 550 feature2 600 +
chr1 700 800 feature3 700 -
chr2 150 250 feature4 800 +
chr2 600 700 feature5 900 -
chr2 950 1050 feature6 1000 +
EOF
# Create second database file for multi-file testing
cat > "$meta_temp_dir/database2.bed" << 'EOF'
chr1 1050 1150 db2_feature1
chr1 1250 1350 db2_feature2
chr1 1450 1550 db2_feature3
EOF
# Create distant features for signed distance testing (non-overlapping)
cat > "$meta_temp_dir/test_b_distant.bed" << 'EOF'
chr1 50 90 upstream1
chr1 250 290 downstream1
chr1 450 490 upstream2
chr1 650 690 downstream2
EOF
# Test 1: Basic closest feature finding
log "Starting TEST 1: Basic closest feature finding"
"$meta_executable" \
--input_a "$meta_temp_dir/queries.bed" \
--input_b "$meta_temp_dir/database.bed" \
--output "$meta_temp_dir/output1.bed"
check_file_exists "$meta_temp_dir/output1.bed" "basic closest output"
check_file_not_empty "$meta_temp_dir/output1.bed" "basic closest output"
check_file_line_count "$meta_temp_dir/output1.bed" 4 "basic closest line count"
# Check that closest features are reported
check_file_contains "$meta_temp_dir/output1.bed" "feature" "closest features found"
log "✅ TEST 1 completed successfully"
# Test 2: Closest features with distance reporting
log "Starting TEST 2: Closest features with distance reporting"
"$meta_executable" \
--input_a "$meta_temp_dir/queries.bed" \
--input_b "$meta_temp_dir/database.bed" \
--distance_mode "ref" \
--output "$meta_temp_dir/output2.bed"
check_file_exists "$meta_temp_dir/output2.bed" "distance output"
check_file_not_empty "$meta_temp_dir/output2.bed" "distance output"
check_file_line_count "$meta_temp_dir/output2.bed" 4 "distance line count"
# Check that distance column is added (should have more columns than input)
input_cols=$(head -1 "$meta_temp_dir/queries.bed" | awk '{print NF}')
output_cols=$(head -1 "$meta_temp_dir/output2.bed" | awk '{print NF}')
if [ "$output_cols" -le "$input_cols" ]; then
log_error "Expected more columns in output with distance, got $output_cols vs input $input_cols"
exit 1
fi
log "✅ TEST 2 completed successfully"
# Test 3: Find closest with strand consideration
log "Starting TEST 3: Closest with strand consideration"
"$meta_executable" \
--input_a "$meta_temp_dir/queries.bed" \
--input_b "$meta_temp_dir/database.bed" \
--strand \
--output "$meta_temp_dir/output3.bed"
check_file_exists "$meta_temp_dir/output3.bed" "strand output"
check_file_not_empty "$meta_temp_dir/output3.bed" "strand output"
log "✅ TEST 3 completed successfully"
# Test 4: Find k-nearest neighbors (k=2)
log "Starting TEST 4: K-nearest neighbors (k=2)"
"$meta_executable" \
--input_a "$meta_temp_dir/queries.bed" \
--input_b "$meta_temp_dir/database.bed" \
--k_closest 2 \
--output "$meta_temp_dir/output4.bed"
check_file_exists "$meta_temp_dir/output4.bed" "k-nearest output"
check_file_not_empty "$meta_temp_dir/output4.bed" "k-nearest output"
# Should have at least as many lines as the basic test (up to k per query, more when ties occur)
basic_lines=$(wc -l < "$meta_temp_dir/output1.bed")
knearest_lines=$(wc -l < "$meta_temp_dir/output4.bed")
if [ "$knearest_lines" -lt "$basic_lines" ]; then
log_error "Expected at least $basic_lines lines for k-nearest, got $knearest_lines"
exit 1
fi
log "✅ TEST 4 completed successfully"
# Test 5: Signed distance reporting against non-overlapping features
log "Starting TEST 5: Distance reporting with signed distance"
"$meta_executable" \
--input_a "$meta_temp_dir/queries.bed" \
--input_b "$meta_temp_dir/test_b_distant.bed" \
--distance_mode "ref" \
--output "$meta_temp_dir/output5.bed"
check_file_exists "$meta_temp_dir/output5.bed" "signed distance output"
check_file_not_empty "$meta_temp_dir/output5.bed" "signed distance output"
check_file_line_count "$meta_temp_dir/output5.bed" 4 "signed distance line count"
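# Check that the distance column (last field) holds a negative value for a real hit.
# Rows without a hit on the same chromosome report -1 in the B fields, and the strand
# column of A contains "-", so a plain grep for "-" would always match.
if ! awk '$7 != "." && $7 != "none" && $NF < 0 { found = 1 } END { exit !found }' "$meta_temp_dir/output5.bed"; then
log "Warning: No negative distances found, may not have upstream features"
fi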
log "✅ TEST 5 completed successfully"
# Test 6: Multiple database files
log "Starting TEST 6: Multiple database files"
# Replace the database2.bed created earlier with features on the same chromosomes as the queries
cat > "$meta_temp_dir/database2.bed" << 'EOF'
chr1 300 400 enhancer1 10 +
chr1 500 600 enhancer2 20 +
chr2 150 250 enhancer3 15 -
chr2 350 450 enhancer4 25 -
EOF
# Test multiple databases
"$meta_executable" \
--input_a "$meta_temp_dir/queries.bed" \
--input_b "$meta_temp_dir/database.bed;$meta_temp_dir/database2.bed" \
--output "$meta_temp_dir/output6.bed"
check_file_exists "$meta_temp_dir/output6.bed" "multiple database output"
check_file_not_empty "$meta_temp_dir/output6.bed" "multiple database output"
# Check that we have results from multiple databases (should have database IDs)
line_count=$(wc -l < "$meta_temp_dir/output6.bed")
if [ "$line_count" -lt 4 ]; then
log "❌ Expected at least 4 lines for multiple databases, got $line_count"
exit 1
fi
# Check for database ID column (7th column should contain database numbers)
if ! cut -f7 "$meta_temp_dir/output6.bed" | grep -E "^[12]$" > /dev/null; then
log "❌ Expected database IDs (1, 2) in 7th column"
log "Actual output:"
cat "$meta_temp_dir/output6.bed"
exit 1
fi
log "✓ Found multiple database output with database IDs"
log "✅ TEST 6 completed successfully"
log "🎉 All bedtools_closest tests completed successfully!"

View File

@@ -0,0 +1,99 @@
name: bedtools_cluster
namespace: bedtools
description: |
Cluster overlapping or nearby genomic intervals.
This tool groups genomic intervals into clusters based on overlap
or proximity within a specified distance. Each cluster is assigned
a unique cluster ID, making it useful for analyzing genomic feature
distributions and relationships.
keywords: [genomics, intervals, clustering, overlap, proximity, grouping]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/cluster.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: [-i]
type: file
description: |
Input file in BED, GFF, or VCF format.
**BED format:** Standard genomic interval format
**GFF format:** Gene feature format with annotations
**VCF format:** Variant call format
**Requirements:** Must be sorted by chromosome and position
required: true
example: intervals.bed
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with cluster assignments.
Contains original intervals with an additional column showing
the cluster ID for each interval. Intervals in the same cluster
have the same cluster ID number.
required: true
example: clustered.bed
- name: Clustering Options
arguments:
- name: --distance
alternatives: [-d]
type: integer
description: |
Maximum distance between features for clustering.
**Default:** 0 (only overlapping and book-ended features clustered)
**Positive values:** Cluster features within specified distance
**Use case:** Group nearby but non-overlapping features
default: 0
example: 1000
- name: --strand
alternatives: [-s]
type: boolean_true
description: |
Force strandedness in clustering.
**Default:** Clustering ignores strand information
**When enabled:** Only cluster features on the same strand
**Use case:** Strand-specific analysis of genomic features
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,20 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools cluster -h
```
Tool: bedtools cluster
Version: v2.31.1
Summary: Clusters overlapping/nearby BED/GFF/VCF intervals.
Usage: bedtools cluster [OPTIONS] -i <bed/gff/vcf>
Options:
-s Force strandedness. That is, only merge features
that are the same strand.
- By default, merging is done without respect to strand.
-d Maximum distance between features allowed for features
to be merged.
- Def. 0. That is, overlapping & book-ended features are merged.
- (INTEGER)
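
The clustering behaviour summarised above can be sketched in a few lines of awk. This is an illustration under stated assumptions, not the bedtools implementation: `cluster_sketch` is a hypothetical helper, the input must already be sorted by chromosome and start, strand is ignored, and a feature is assumed to join the current cluster when its start lies at most `d` bases past the furthest end seen so far (which covers overlapping and book-ended features for `d = 0`).

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper (not part of bedtools): append a cluster ID to each
# sorted interval read from stdin. $1 is the maximum allowed distance.
cluster_sketch() {
  local max_dist="${1:-0}"
  awk -v d="$max_dist" 'BEGIN { OFS = "\t" }
    {
      start = $2 + 0
      stop = $3 + 0
      if ($1 != chrom || start > cluster_end + d) {
        id++                      # new chromosome or gap too large: new cluster
        cluster_end = stop
      } else if (stop > cluster_end) {
        cluster_end = stop        # extend the current cluster
      }
      chrom = $1
      print $0, id
    }'
}

# Coordinates taken from the overlapping.bed test data used for this component.
clustered=$(printf '%s\t%s\t%s\n' \
  chr1 100 200  chr1 150 250  chr1 180 280 \
  chr1 500 600  chr1 800 900 \
  chr2 100 200  chr2 300 400 | cluster_sketch 0)

printf '%s\n' "$clustered"
cluster_ids=$(printf '%s\n' "$clustered" | awk '{ print $NF }' | paste -sd, -)
```

In this sketch the ID counter keeps increasing across chromosomes, so the three overlapping chr1 intervals share ID 1 and every other interval gets its own ID.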

View File

@@ -0,0 +1,16 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_strand" == "false" ]] && unset par_strand
# Execute bedtools cluster
bedtools cluster \
-i "$par_input" \
${par_distance:+-d "$par_distance"} \
${par_strand:+-s} \
> "$par_output"

View File

@@ -0,0 +1,154 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_cluster"
# Create test data
log "Creating test data..."
# Create overlapping intervals for basic clustering
cat > "$meta_temp_dir/overlapping.bed" << 'EOF'
chr1 100 200 feature1 100 +
chr1 150 250 feature2 200 +
chr1 180 280 feature3 300 +
chr1 500 600 feature4 400 -
chr1 800 900 feature5 500 +
chr2 100 200 feature6 600 +
chr2 300 400 feature7 700 -
EOF
# Create intervals with different strands
cat > "$meta_temp_dir/stranded.bed" << 'EOF'
chr1 100 200 pos1 100 +
chr1 150 250 neg1 200 -
chr1 180 280 pos2 300 +
chr1 300 400 neg2 400 -
chr1 500 600 pos3 500 +
chr1 550 650 neg3 600 -
EOF
# Create intervals for distance-based clustering
cat > "$meta_temp_dir/nearby.bed" << 'EOF'
chr1 100 200 interval1 100 +
chr1 300 400 interval2 200 +
chr1 450 550 interval3 300 +
chr1 1000 1100 interval4 400 +
chr1 1200 1300 interval5 500 +
chr2 100 200 interval6 600 +
chr2 1000 1100 interval7 700 +
EOF
# Test 1: Basic clustering of overlapping intervals
log "Starting TEST 1: Basic clustering of overlapping intervals"
"$meta_executable" \
--input "$meta_temp_dir/overlapping.bed" \
--output "$meta_temp_dir/output1.bed"
check_file_exists "$meta_temp_dir/output1.bed" "basic clustering output"
check_file_not_empty "$meta_temp_dir/output1.bed" "basic clustering output"
check_file_line_count "$meta_temp_dir/output1.bed" 7 "basic clustering line count"
# Check that cluster IDs are added (should have one more column than input)
input_cols=$(head -1 "$meta_temp_dir/overlapping.bed" | awk '{print NF}')
output_cols=$(head -1 "$meta_temp_dir/output1.bed" | awk '{print NF}')
if [ $output_cols -ne $((input_cols + 1)) ]; then
log_error "Expected $((input_cols + 1)) columns in output, got $output_cols"
exit 1
fi
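# Check that the overlapping intervals feature1, feature2 and feature3 share one cluster ID
overlap_clusters=$(grep -E "feature[123]" "$meta_temp_dir/output1.bed" | awk '{print $NF}' | sort -u | wc -l)
if [ "$overlap_clusters" -ne 1 ]; then
log_error "Expected feature1-3 to share one cluster ID, got $overlap_clusters distinct IDs"
exit 1
fi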
log "✅ TEST 1 completed successfully"
# Test 2: Distance-based clustering
log "Starting TEST 2: Distance-based clustering"
"$meta_executable" \
--input "$meta_temp_dir/nearby.bed" \
--distance 100 \
--output "$meta_temp_dir/output2.bed"
check_file_exists "$meta_temp_dir/output2.bed" "distance clustering output"
check_file_not_empty "$meta_temp_dir/output2.bed" "distance clustering output"
check_file_line_count "$meta_temp_dir/output2.bed" 7 "distance clustering line count"
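# With distance 100, interval2 (300-400) and interval3 (450-550) lie 50 bp apart,
# so they must share a cluster ID
cluster2=$(grep "interval2" "$meta_temp_dir/output2.bed" | awk '{print $NF}')
cluster3=$(grep "interval3" "$meta_temp_dir/output2.bed" | awk '{print $NF}')
if [ "$cluster2" != "$cluster3" ]; then
log_error "Expected interval2 and interval3 to share a cluster ID with --distance 100"
exit 1
fi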
log "✅ TEST 2 completed successfully"
# Test 3: Strand-specific clustering
log "Starting TEST 3: Strand-specific clustering"
"$meta_executable" \
--input "$meta_temp_dir/stranded.bed" \
--strand \
--output "$meta_temp_dir/output3.bed"
check_file_exists "$meta_temp_dir/output3.bed" "strand clustering output"
check_file_not_empty "$meta_temp_dir/output3.bed" "strand clustering output"
check_file_line_count "$meta_temp_dir/output3.bed" 6 "strand clustering line count"
# With strand consideration, + and - strand features should get different cluster IDs
# even if they overlap
pos_cluster=$(grep "pos1" "$meta_temp_dir/output3.bed" | awk '{print $NF}')
neg_cluster=$(grep "neg1" "$meta_temp_dir/output3.bed" | awk '{print $NF}')
if [ "$pos_cluster" = "$neg_cluster" ]; then
log_error "Expected different cluster IDs for + and - strand overlapping features"
exit 1
fi
log "✅ TEST 3 completed successfully"
# Test 4: Large distance clustering
log "Starting TEST 4: Large distance clustering"
"$meta_executable" \
--input "$meta_temp_dir/nearby.bed" \
--distance 1000 \
--output "$meta_temp_dir/output4.bed"
check_file_exists "$meta_temp_dir/output4.bed" "large distance clustering output"
check_file_not_empty "$meta_temp_dir/output4.bed" "large distance clustering output"
check_file_line_count "$meta_temp_dir/output4.bed" 7 "large distance clustering line count"
# With distance 1000, most chr1 intervals should cluster together
chr1_clusters=$(grep "^chr1" "$meta_temp_dir/output4.bed" | awk '{print $NF}' | sort -u | wc -l)
if [ $chr1_clusters -gt 2 ]; then
log "Warning: Expected few clusters on chr1 with distance 1000, got $chr1_clusters"
fi
log "✅ TEST 4 completed successfully"
# Test 5: Multiple chromosome handling
log "Starting TEST 5: Multiple chromosome handling"
# This test uses the overlapping.bed which has both chr1 and chr2
"$meta_executable" \
--input "$meta_temp_dir/overlapping.bed" \
--output "$meta_temp_dir/output5.bed"
check_file_exists "$meta_temp_dir/output5.bed" "multi-chromosome output"
check_file_not_empty "$meta_temp_dir/output5.bed" "multi-chromosome output"
# Check that both chromosomes are present
check_file_contains "$meta_temp_dir/output5.bed" "chr1" "chr1 features present"
check_file_contains "$meta_temp_dir/output5.bed" "chr2" "chr2 features present"
# Cluster IDs are expected to keep increasing from one chromosome to the next,
# so the lowest chr2 ID should be above the highest chr1 ID (informational only)
chr1_max_cluster=$(grep "^chr1" "$meta_temp_dir/output5.bed" | awk '{print $NF}' | sort -n | tail -1)
chr2_min_cluster=$(grep "^chr2" "$meta_temp_dir/output5.bed" | awk '{print $NF}' | sort -n | head -1)
if [ "$chr2_min_cluster" -le "$chr1_max_cluster" ]; then
log " Note: Cluster IDs overlap between chromosomes (chr1 max=$chr1_max_cluster, chr2 min=$chr2_min_cluster)"
fi
log "✅ TEST 5 completed successfully"
log "🎉 All bedtools_cluster tests completed successfully!"

View File

@@ -0,0 +1,100 @@
name: bedtools_complement
namespace: bedtools
description: |
Find genomic intervals that are NOT covered by input intervals.
This tool returns the complement of genomic intervals - the regions
of the genome that are NOT covered by the input features. Useful for
finding gaps, uncovered regions, or background intervals.
keywords: [genomics, intervals, complement, gaps, uncovered, background]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/complement.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: [-i]
type: file
description: |
Input file in BED, GFF, or VCF format.
**BED format:** Standard genomic interval format
**GFF format:** Gene feature format with annotations
**VCF format:** Variant call format
**Requirements:** Should be sorted by chromosome and position for optimal performance
required: true
example: covered_regions.bed
- name: --genome
alternatives: [-g]
type: file
description: |
Genome file defining chromosome names and sizes.
**Format:** Tab-delimited file with chromosome name and size
**Example line:** chr1 249250621
**Sources:** Can be created with samtools faidx or UCSC Table Browser
**Purpose:** Defines the complete genomic space for complement calculation
required: true
example: genome.txt
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with complement intervals.
Contains genomic intervals representing the regions NOT covered
by the input intervals. Output is in BED format with chromosome,
start, and end coordinates.
required: true
example: uncovered_regions.bed
- name: Options
arguments:
- name: --limit_chromosomes
alternatives: [-L]
type: boolean_true
description: |
Limit output to chromosomes present in input file.
**Default:** Output includes all chromosomes from genome file
**When enabled:** Only output complement for chromosomes that have
records in the input file
**Use case:** Focus analysis on chromosomes of interest
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,43 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools complement -h
```
Tool: bedtools complement (aka complementBed)
Version: v2.31.1
Summary: Returns the base pair complement of a feature file.
Usage: bedtools complement [OPTIONS] -i <bed/gff/vcf> -g <genome>
Options:
-L Limit output to solely the chromosomes with records in the input file.
Notes:
(1) The genome file should tab delimited and structured as follows:
<chromName><TAB><chromSize>
For example, Human (hg19):
chr1 249250621
chr2 243199373
...
chr18_gl000207_random 4262
Tip 1. Use samtools faidx to create a genome file from a FASTA:
One can the samtools faidx command to index a FASTA file.
The resulting .fai index is suitable as a genome file,
as bedtools will only look at the first two, relevant columns
of the .fai file.
For example:
samtools faidx GRCh38.fa
bedtools complement -i my.bed -g GRCh38.fa.fai
Tip 2. Use UCSC Table Browser to create a genome file:
One can use the UCSC Genome Browser's MySQL database to extract
chromosome sizes. For example, H. sapiens:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg19.chromInfo" > hg19.genome

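The notes above describe the genome file and what the complement covers. A minimal awk sketch of the same idea follows; it is an illustration, not the bedtools implementation. `complement_sketch` is a hypothetical helper, the BED input is assumed to be sorted by chromosome and start, and the sample coordinates are reused from this component's test data.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper (not part of bedtools): print the regions of each
# chromosome in the genome file that no interval in the BED file covers.
complement_sketch() {
  local bed="$1" genome="$2"
  awk 'BEGIN { OFS = "\t" }
    FILENAME == ARGV[1] {           # first file: sorted BED intervals
      n[$1]++
      start[$1, n[$1]] = $2 + 0
      stop[$1, n[$1]] = $3 + 0
      next
    }
    {                               # second file: <chromName><TAB><chromSize>
      pos = 0
      for (i = 1; i <= n[$1]; i++) {
        if (start[$1, i] > pos) print $1, pos, start[$1, i]   # gap before interval
        if (stop[$1, i] > pos) pos = stop[$1, i]              # skip covered bases
      }
      if (pos < $2) print $1, pos, $2                         # tail of chromosome
    }' "$bed" "$genome"
}

work_dir=$(mktemp -d)
printf 'chr1\t1000\nchr2\t800\nchr3\t500\n' > "$work_dir/genome.txt"
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t600\t700\nchr2\t50\t150\nchr2\t300\t500\n' \
  > "$work_dir/covered.bed"

complement_out=$(complement_sketch "$work_dir/covered.bed" "$work_dir/genome.txt")
printf '%s\n' "$complement_out"
rm -rf "$work_dir"
```

Chromosomes without any input record (here chr3) come out as a single interval spanning the whole chromosome, which corresponds to the default behaviour without `-L`.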
View File

@@ -0,0 +1,16 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_limit_chromosomes" == "false" ]] && unset par_limit_chromosomes
# Execute bedtools complement
bedtools complement \
-i "$par_input" \
-g "$par_genome" \
${par_limit_chromosomes:+-L} \
> "$par_output"

View File

@@ -0,0 +1,149 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_complement"
# Create test data
log "Creating test data..."
# Create genome file
cat > "$meta_temp_dir/genome.txt" << 'EOF'
chr1 1000
chr2 800
chr3 500
EOF
# Create simple intervals covering some regions
cat > "$meta_temp_dir/covered.bed" << 'EOF'
chr1 100 200
chr1 300 400
chr1 600 700
chr2 50 150
chr2 300 500
EOF
# Create intervals on only one chromosome
cat > "$meta_temp_dir/chr1_only.bed" << 'EOF'
chr1 100 200
chr1 500 600
chr1 800 900
EOF
# Create overlapping intervals to test merging behavior
cat > "$meta_temp_dir/overlapping.bed" << 'EOF'
chr1 100 300
chr1 250 400
chr1 600 800
chr2 100 200
chr2 150 250
EOF
# Test 1: Basic complement finding
log "Starting TEST 1: Basic complement finding"
"$meta_executable" \
--input "$meta_temp_dir/covered.bed" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/output1.bed"
check_file_exists "$meta_temp_dir/output1.bed" "basic complement output"
check_file_not_empty "$meta_temp_dir/output1.bed" "basic complement output"
# Should have complement regions for all chromosomes
check_file_contains "$meta_temp_dir/output1.bed" "chr1" "chr1 complement regions"
check_file_contains "$meta_temp_dir/output1.bed" "chr2" "chr2 complement regions"
check_file_contains "$meta_temp_dir/output1.bed" "chr3" "chr3 complement regions (entire chromosome)"
# Chr3 should be completely uncovered (0-500)
check_file_contains "$meta_temp_dir/output1.bed" "chr3 0 500" "complete chr3 complement"
log "✅ TEST 1 completed successfully"
# Test 2: Complement with chromosome limiting
log "Starting TEST 2: Complement with chromosome limiting"
"$meta_executable" \
--input "$meta_temp_dir/chr1_only.bed" \
--genome "$meta_temp_dir/genome.txt" \
--limit_chromosomes \
--output "$meta_temp_dir/output2.bed"
check_file_exists "$meta_temp_dir/output2.bed" "limited complement output"
check_file_not_empty "$meta_temp_dir/output2.bed" "limited complement output"
# Should only contain chr1 complement (no chr2, chr3)
check_file_contains "$meta_temp_dir/output2.bed" "chr1" "chr1 complement regions"
if grep -q "chr2\|chr3" "$meta_temp_dir/output2.bed"; then
log_error "Expected only chr1 with -L option, but found chr2 or chr3"
exit 1
fi
log "✅ TEST 2 completed successfully"
# Test 3: Complement of overlapping intervals
log "Starting TEST 3: Complement of overlapping intervals"
"$meta_executable" \
--input "$meta_temp_dir/overlapping.bed" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/output3.bed"
check_file_exists "$meta_temp_dir/output3.bed" "overlapping complement output"
check_file_not_empty "$meta_temp_dir/output3.bed" "overlapping complement output"
# bedtools complement should handle overlapping input intervals correctly
check_file_contains "$meta_temp_dir/output3.bed" "chr1" "chr1 complement with overlaps"
check_file_contains "$meta_temp_dir/output3.bed" "chr2" "chr2 complement with overlaps"
log "✅ TEST 3 completed successfully"
# Test 4: Verify complement coordinates
log "Starting TEST 4: Verify complement coordinates"
"$meta_executable" \
--input "$meta_temp_dir/covered.bed" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/output4.bed"
check_file_exists "$meta_temp_dir/output4.bed" "coordinate verification output"
# Check that complement starts at 0 for chr1 (nothing covered at start);
# the patterns accept any whitespace so they match tab-delimited output
if ! grep -qE "^chr1[[:space:]]+0[[:space:]]+100$" "$meta_temp_dir/output4.bed"; then
log_error "Expected chr1 complement to start at position 0"
exit 1
fi
# Check that complement goes to chromosome end (1000 for chr1)
if ! grep -qE "^chr1[[:space:]]+700[[:space:]]+1000$" "$meta_temp_dir/output4.bed"; then
log_error "Expected chr1 complement to end at chromosome end (1000)"
exit 1
fi
log "✅ TEST 4 completed successfully"
# Test 5: Empty input handling
log "Starting TEST 5: Empty input handling"
# Create empty input file
touch "$meta_temp_dir/empty.bed"
"$meta_executable" \
--input "$meta_temp_dir/empty.bed" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/output5.bed"
check_file_exists "$meta_temp_dir/output5.bed" "empty input output"
check_file_not_empty "$meta_temp_dir/output5.bed" "empty input output"
# With no input intervals, complement should be entire genome
total_genome_size=$(awk '{sum += $2} END {print sum}' "$meta_temp_dir/genome.txt")
total_complement_size=$(awk '{sum += $3 - $2} END {print sum}' "$meta_temp_dir/output5.bed")
if [ "$total_complement_size" -ne "$total_genome_size" ]; then
log_error "Expected complement size to equal genome size ($total_genome_size), got $total_complement_size"
exit 1
fi
log "✅ TEST 5 completed successfully"
log "🎉 All bedtools_complement tests completed successfully!"

View File

@@ -0,0 +1,245 @@
name: bedtools_coverage
namespace: bedtools
description: |
Calculate coverage of genomic intervals from one file over intervals in another.
This tool reports the depth and breadth of coverage of features from file B
over the intervals in file A. It provides detailed coverage statistics including
overlap counts, covered bases, and coverage fractions.
keywords: [genomics, intervals, coverage, depth, breadth, overlap, statistics]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input_a
alternatives: [-a]
type: file
description: |
Query intervals file in BED, GFF, or VCF format.
**Purpose:** Intervals for which coverage will be calculated
**BED format:** Standard genomic interval format
**GFF format:** Gene feature format with annotations
**VCF format:** Variant call format
required: true
example: target_regions.bed
- name: --input_b
alternatives: [-b]
type: file
multiple: true
description: |
Coverage source file(s) in BED, GFF, VCF, or BAM format.
**Purpose:** Features that provide coverage over query intervals
**Multiple files:** Can specify multiple coverage sources
**BAM support:** Binary alignment files for sequencing coverage
required: true
example: ["alignments.bam", "features.bed"]
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with coverage statistics.
**Default output:** For each interval in A, reports:
1. Number of overlapping features from B
2. Number of bases in A with non-zero coverage
3. Length of interval in A
4. Fraction of bases in A with non-zero coverage
required: true
example: coverage_stats.txt
- name: Coverage Reporting
arguments:
- name: --histogram
alternatives: [-hist]
type: boolean_true
description: |
Report coverage histogram for each feature and summary.
**Output format:** depth, bases at depth, feature size, percentage
**Use case:** Detailed coverage distribution analysis
- name: --depth_per_position
alternatives: [-d]
type: boolean_true
description: |
Report depth at each position in each interval.
**Output:** One-based positions with coverage depth
**Use case:** Position-specific coverage analysis
**Note:** Generates large output for long intervals
- name: --counts_only
alternatives: [-counts]
type: boolean_true
description: |
Only report overlap counts, no fractions.
**Simplified output:** Just the number of overlapping features
**Use case:** When only overlap counts are needed
- name: --mean_depth
alternatives: [-mean]
type: boolean_true
description: |
Report mean coverage depth for each interval.
**Output:** Average depth across all positions in interval
**Use case:** Summary coverage statistics
- name: Strand Options
arguments:
- name: --same_strand
alternatives: [-s]
type: boolean_true
description: |
Require same strandedness for overlaps.
**Default:** Overlaps reported regardless of strand
**When enabled:** Only count overlaps on same strand
- name: --different_strand
alternatives: [-S]
type: boolean_true
description: |
Require different strandedness for overlaps.
**Default:** Overlaps reported regardless of strand
**When enabled:** Only count overlaps on opposite strand
- name: Overlap Requirements
arguments:
- name: --min_overlap_a
alternatives: [-f]
type: double
description: |
Minimum overlap required as fraction of A.
**Range:** 0.0 to 1.0
**Default:** 1E-9 (essentially 1bp)
**Example:** 0.50 requires 50% of A to be overlapped
example: 0.5
- name: --min_overlap_b
alternatives: [-F]
type: double
description: |
Minimum overlap required as fraction of B.
**Range:** 0.0 to 1.0
**Default:** 1E-9 (essentially 1bp)
**Example:** 0.80 requires 80% of B to overlap A
example: 0.8
- name: --reciprocal
alternatives: [-r]
type: boolean_true
description: |
Require the -f overlap fraction to be reciprocal for A AND B.
**Example:** With -f 0.90 and -r, B must overlap 90% of A and A must overlap 90% of B
**Use case:** Stringent overlap requirements
- name: --either
alternatives: [-e]
type: boolean_true
description: |
Require that the minimum fraction be satisfied for A OR B.
**Default:** Without -e, both the -f and -F fractions must be satisfied
**When enabled:** Satisfying either fraction is sufficient
- name: Format Options
arguments:
- name: --split
type: boolean_true
description: |
Treat split BAM/BED12 entries as distinct intervals.
**BAM:** Handle spliced alignments as separate blocks
**BED12:** Process each block independently
- name: --bed_output
alternatives: [-bed]
type: boolean_true
description: |
Write output in BED format when using BAM input.
**Default:** BAM input produces BAM-style output
**When enabled:** Force BED format output
- name: --header
type: boolean_true
description: |
Print header from input A file before results.
**Use case:** Preserve metadata from input file
- name: Performance Options
arguments:
- name: --sorted
type: boolean_true
description: |
Use chromsweep algorithm for sorted input.
**Requirements:** Input must be sorted by chromosome and position
**Performance:** Faster processing for large files
- name: --genome
alternatives: [-g]
type: file
description: |
Genome file for consistent chromosome ordering.
**Format:** Tab-delimited chromosome names and sizes
**Use case:** Ensure consistent sort order with -sorted option
example: genome.txt
- name: --no_name_check
alternatives: [-nonamecheck]
type: boolean_true
description: |
Don't error on different chromosome naming conventions.
**Example:** Allows mixing "chr1" and "chr01"
**Use case:** Working with files from different sources
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,89 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools coverage -h
```
Tool: bedtools coverage (aka coverageBed)
Version: v2.31.1
Summary: Returns the depth and breadth of coverage of features from B
on the intervals in A.
Usage: bedtools coverage [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
Options:
-hist Report a histogram of coverage for each feature in A
as well as a summary histogram for _all_ features in A.
Output (tab delimited) after each feature in A:
1) depth
2) # bases at depth
3) size of A
4) % of A at depth
-d Report the depth at each position in each A feature.
Positions reported are one based. Each position
and depth follow the complete A feature.
-counts Only report the count of overlaps, don't compute fraction, etc.
-mean Report the mean depth of all positions in each A feature.
-s Require same strandedness. That is, only report hits in B
that overlap A on the _same_ strand.
- By default, overlaps are reported without respect to strand.
-S Require different strandedness. That is, only report hits in B
that overlap A on the _opposite_ strand.
- By default, overlaps are reported without respect to strand.
-f Minimum overlap required as a fraction of A.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-F Minimum overlap required as a fraction of B.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-r Require that the fraction overlap be reciprocal for A AND B.
- In other words, if -f is 0.90 and -r is used, this requires
that B overlap 90% of A and A _also_ overlaps 90% of B.
-e Require that the minimum fraction be satisfied for A OR B.
- In other words, if -e is used with -f 0.90 and -F 0.10 this requires
that either 90% of A is covered OR 10% of B is covered.
Without -e, both fractions would have to be satisfied.
-split Treat "split" BAM or BED12 entries as distinct BED intervals.
-g Provide a genome file to enforce consistent chromosome sort order
across input files. Only applies when used with -sorted option.
-nonamecheck For sorted data, don't throw an error if the file has different naming conventions
for the same chromosome. ex. "chr1" vs "chr01".
-sorted Use the "chromsweep" algorithm for sorted (-k1,1 -k2,2n) input.
-bed If using BAM input, write output as BED.
-header Print the header from the A file prior to results.
-nobuf Disable buffered output. Using this option will cause each line
of output to be printed as it is generated, rather than saved
in a buffer. This will make printing large output files
noticeably slower, but can be useful in conjunction with
other software tools and scripts that need to process one
line of bedtools output at a time.
-iobuf Specify amount of memory to use for input buffer.
Takes an integer argument. Optional suffixes K/M/G supported.
Note: currently has no effect with compressed files.
Default Output:
After each entry in A, reports:
1) The number of features in B that overlapped the A interval.
2) The number of bases in A that had non-zero coverage.
3) The length of the entry in A.
4) The fraction of bases in A that had non-zero coverage.
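
As a rough illustration of the default output layout listed above, the sketch below reports intervals whose covered fraction (the last appended field) falls below a threshold. The function name `low_coverage_intervals` is hypothetical, the input is assumed to be tab-delimited with the interval name in column 4, and the demo rows are hand-written in the default layout rather than captured from a real bedtools run.

```bash
#!/bin/bash
# Hypothetical helper: list intervals whose covered fraction is below a threshold.
# Usage: low_coverage_intervals <coverage_file> <min_fraction>
low_coverage_intervals() {
  local coverage_file="$1" min_fraction="$2"
  # $4 is the BED name column (assumes A has at least 4 columns);
  # $NF is the appended "fraction of bases in A with non-zero coverage".
  awk -F'\t' -v min="$min_fraction" \
    '$NF + 0 < min + 0 { print $4 "\t" $NF }' "$coverage_file"
}

# Demo on illustrative rows: 6 BED columns from A, then count,
# covered bases, interval length, covered fraction.
demo=$(mktemp)
printf 'chr1\t100\t300\ttarget1\t100\t+\t2\t150\t200\t0.7500000\n' > "$demo"
printf 'chr3\t0\t100\ttarget5\t500\t+\t0\t0\t100\t0.0000000\n' >> "$demo"
low_coverage_intervals "$demo" 0.5   # only target5 is below 0.5
rm -f "$demo"
```

Reading the fraction from the last field rather than a fixed column number keeps the sketch independent of how many columns the A file has.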


@@ -0,0 +1,57 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
unset_if_false=(
par_histogram
par_depth_per_position
par_counts_only
par_mean_depth
par_same_strand
par_different_strand
par_reciprocal
par_either
par_split
par_bed_output
par_header
par_sorted
par_no_name_check
)
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
# Build input B arguments array from semicolon-separated string
input_b_args=()
IFS=';' read -ra input_b_files <<< "$par_input_b"
for file in "${input_b_files[@]}"; do
input_b_args+=(-b "$file")
done
# Execute bedtools coverage
bedtools coverage \
-a "$par_input_a" \
"${input_b_args[@]}" \
${par_histogram:+-hist} \
${par_depth_per_position:+-d} \
${par_counts_only:+-counts} \
${par_mean_depth:+-mean} \
${par_same_strand:+-s} \
${par_different_strand:+-S} \
${par_min_overlap_a:+-f "$par_min_overlap_a"} \
${par_min_overlap_b:+-F "$par_min_overlap_b"} \
${par_reciprocal:+-r} \
${par_either:+-e} \
${par_split:+-split} \
${par_bed_output:+-bed} \
${par_header:+-header} \
${par_sorted:+-sorted} \
${par_genome:+-g "$par_genome"} \
${par_no_name_check:+-nonamecheck} \
> "$par_output"


@@ -0,0 +1,205 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_coverage"
# Create test data
log "Creating test data..."
# Create target intervals (query file A)
cat > "$meta_temp_dir/targets.bed" << 'EOF'
chr1 100 300 target1 100 +
chr1 500 800 target2 200 +
chr2 200 400 target3 300 -
chr2 600 900 target4 400 -
EOF
# Create coverage features (file B) - some overlapping, some not
cat > "$meta_temp_dir/features.bed" << 'EOF'
chr1 150 250 feature1 500 +
chr1 200 350 feature2 600 +
chr1 550 750 feature3 700 +
chr2 250 350 feature4 800 -
chr2 650 850 feature5 900 +
chr3 100 200 feature6 1000 +
EOF
# Create additional coverage file for multi-file testing
cat > "$meta_temp_dir/features2.bed" << 'EOF'
chr1 120 180 extra1 300 +
chr1 600 700 extra2 400 +
chr2 300 500 extra3 500 -
EOF
# Create strand-specific test data
cat > "$meta_temp_dir/stranded_targets.bed" << 'EOF'
chr1 100 200 pos_target 100 +
chr1 300 400 neg_target 200 -
EOF
cat > "$meta_temp_dir/stranded_features.bed" << 'EOF'
chr1 120 180 pos_feature 300 +
chr1 320 380 neg_feature 400 -
chr1 140 160 pos_feature2 500 +
chr1 340 360 neg_feature2 600 -
EOF
# Test 1: Basic coverage calculation
log "Starting TEST 1: Basic coverage calculation"
"$meta_executable" \
--input_a "$meta_temp_dir/targets.bed" \
--input_b "$meta_temp_dir/features.bed" \
--output "$meta_temp_dir/output1.txt"
check_file_exists "$meta_temp_dir/output1.txt" "basic coverage output"
check_file_not_empty "$meta_temp_dir/output1.txt" "basic coverage output"
check_file_line_count "$meta_temp_dir/output1.txt" 4 "basic coverage line count"
# Check that coverage statistics are added (should have 4 extra columns)
input_cols=$(head -1 "$meta_temp_dir/targets.bed" | awk '{print NF}')
output_cols=$(head -1 "$meta_temp_dir/output1.txt" | awk '{print NF}')
expected_cols=$((input_cols + 4))
if [ "$output_cols" -ne "$expected_cols" ]; then
log_error "Expected $expected_cols columns in output, got $output_cols"
exit 1
fi
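# Check that some targets have coverage: the overlap count is the first
# appended column, so test it directly instead of matching any integer
if ! awk -v col="$((input_cols + 1))" '$col > 0 { found = 1 } END { exit !found }' "$meta_temp_dir/output1.txt"; then
log_error "Expected some targets to have non-zero coverage counts"
exit 1
fi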
log "✅ TEST 1 completed successfully"
# Test 2: Coverage histogram
log "Starting TEST 2: Coverage histogram"
"$meta_executable" \
--input_a "$meta_temp_dir/targets.bed" \
--input_b "$meta_temp_dir/features.bed" \
--histogram \
--output "$meta_temp_dir/output2.txt"
check_file_exists "$meta_temp_dir/output2.txt" "histogram output"
check_file_not_empty "$meta_temp_dir/output2.txt" "histogram output"
# Histogram output should have depth information
check_file_contains "$meta_temp_dir/output2.txt" "target1" "target intervals in histogram"
# Should contain histogram data (depth, bases, size, percentage)
if ! grep -q -E "\s[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\.[0-9]+$" "$meta_temp_dir/output2.txt"; then
log_error "Expected histogram format with depth data"
exit 1
fi
log "✅ TEST 2 completed successfully"
# Test 3: Counts only
log "Starting TEST 3: Counts only output"
"$meta_executable" \
--input_a "$meta_temp_dir/targets.bed" \
--input_b "$meta_temp_dir/features.bed" \
--counts_only \
--output "$meta_temp_dir/output3.txt"
check_file_exists "$meta_temp_dir/output3.txt" "counts only output"
check_file_not_empty "$meta_temp_dir/output3.txt" "counts only output"
check_file_line_count "$meta_temp_dir/output3.txt" 4 "counts only line count"
# Counts only should have fewer columns (just original + count)
counts_cols=$(head -1 "$meta_temp_dir/output3.txt" | awk '{print NF}')
if [ "$counts_cols" -ne "$((input_cols + 1))" ]; then
log_error "Expected $((input_cols + 1)) columns for counts only, got $counts_cols"
exit 1
fi
log "✅ TEST 3 completed successfully"
# Test 4: Mean depth reporting
log "Starting TEST 4: Mean depth reporting"
"$meta_executable" \
--input_a "$meta_temp_dir/targets.bed" \
--input_b "$meta_temp_dir/features.bed" \
--mean_depth \
--output "$meta_temp_dir/output4.txt"
check_file_exists "$meta_temp_dir/output4.txt" "mean depth output"
check_file_not_empty "$meta_temp_dir/output4.txt" "mean depth output"
# Should contain mean depth values (floating point numbers)
if ! grep -q -E "\s[0-9]+\.[0-9]+$" "$meta_temp_dir/output4.txt"; then
log_error "Expected mean depth values (floating point)"
exit 1
fi
log "✅ TEST 4 completed successfully"
# Test 5: Strand-specific coverage
log "Starting TEST 5: Strand-specific coverage"
"$meta_executable" \
--input_a "$meta_temp_dir/stranded_targets.bed" \
--input_b "$meta_temp_dir/stranded_features.bed" \
--same_strand \
--output "$meta_temp_dir/output5.txt"
check_file_exists "$meta_temp_dir/output5.txt" "same strand output"
check_file_not_empty "$meta_temp_dir/output5.txt" "same strand output"
# Compare with opposite strand requirement
"$meta_executable" \
--input_a "$meta_temp_dir/stranded_targets.bed" \
--input_b "$meta_temp_dir/stranded_features.bed" \
--different_strand \
--output "$meta_temp_dir/output5b.txt"
# Results should be different between same and different strand requirements
if diff -q "$meta_temp_dir/output5.txt" "$meta_temp_dir/output5b.txt" >/dev/null; then
log "Warning: Same and different strand outputs are identical - may not have strand-specific overlaps"
fi
log "✅ TEST 5 completed successfully"
# Test 6: Multiple input files
log "Starting TEST 6: Multiple input files"
"$meta_executable" \
--input_a "$meta_temp_dir/targets.bed" \
--input_b "$meta_temp_dir/features.bed" \
--input_b "$meta_temp_dir/features2.bed" \
--output "$meta_temp_dir/output6.txt"
check_file_exists "$meta_temp_dir/output6.txt" "multiple files output"
check_file_not_empty "$meta_temp_dir/output6.txt" "multiple files output"
check_file_line_count "$meta_temp_dir/output6.txt" 4 "multiple files line count"
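# Overlap count of the first target (column 7) is expected to be at least as high
# with the additional file; the values are logged for inspection, not asserted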
single_file_coverage=$(awk '{print $7}' "$meta_temp_dir/output1.txt" | head -1)
multi_file_coverage=$(awk '{print $7}' "$meta_temp_dir/output6.txt" | head -1)
log " Single file coverage: $single_file_coverage, Multi-file coverage: $multi_file_coverage"
log "✅ TEST 6 completed successfully"
# Test 7: Minimum overlap fraction
log "Starting TEST 7: Minimum overlap fraction"
"$meta_executable" \
--input_a "$meta_temp_dir/targets.bed" \
--input_b "$meta_temp_dir/features.bed" \
--min_overlap_a 0.5 \
--output "$meta_temp_dir/output7.txt"
check_file_exists "$meta_temp_dir/output7.txt" "min overlap output"
check_file_not_empty "$meta_temp_dir/output7.txt" "min overlap output"
# Compare with no minimum requirement - the overlap count must not increase
no_min_overlaps=$(awk '{sum += $7} END {print sum}' "$meta_temp_dir/output1.txt")
min_overlaps=$(awk '{sum += $7} END {print sum}' "$meta_temp_dir/output7.txt")
if [ "$min_overlaps" -gt "$no_min_overlaps" ]; then
log_error "Expected no more overlaps with a minimum fraction requirement than without"
exit 1
fi
log "✅ TEST 7 completed successfully"
log "🎉 All bedtools_coverage tests completed successfully!"


@@ -0,0 +1,87 @@
name: bedtools_expand
namespace: bedtools
description: |
Expand rows by splitting comma-separated values into separate rows.
This tool replicates lines based on columns containing comma-separated values,
creating one row for each value. Useful for expanding collapsed data formats
like BED12 blocks or multi-value annotations into individual entries.
keywords: [genomics, intervals, expand, split, comma-separated, replicate]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/expand.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: [-i]
type: file
description: |
Input file with comma-separated values to expand.
**Format:** Tab-delimited file with one or more columns containing
comma-separated values
**Example:** BED file with comma-separated scores or annotations
required: true
example: collapsed_data.bed
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with expanded rows.
Contains one row for each comma-separated value, with other
columns replicated across all expanded rows.
required: true
example: expanded_data.bed
- name: Expansion Options
arguments:
- name: --columns
alternatives: [-c]
type: string
description: |
Column(s) to expand (1-based indexing).
**Single column:** Specify one column number (e.g., "4")
**Multiple columns:** Comma-separated list (e.g., "4,5")
**Behavior:** Values in specified columns are split and expanded
**Requirement:** All specified columns must have same number of values
required: true
example: "4,5"
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,34 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools expand -h
```
Tool: bedtools expand
Version: v2.31.1
Summary: Replicate lines in a file based on columns of comma-separated values.
Usage: bedtools expand -c [COLS]
Options:
-i Input file. Assumes "stdin" if omitted.
-c Specify the column (1-based) that should be summarized.
- Required.
Examples:
$ cat test.txt
chr1 10 20 1,2,3 10,20,30
chr1 40 50 4,5,6 40,50,60
$ bedtools expand test.txt -c 5
chr1 10 20 1,2,3 10
chr1 10 20 1,2,3 20
chr1 10 20 1,2,3 30
chr1 40 50 4,5,6 40
chr1 40 50 4,5,6 50
chr1 40 50 4,5,6 60
$ bedtools expand test.txt -c 4,5
chr1 10 20 1 10
chr1 10 20 2 20
chr1 10 20 3 30
chr1 40 50 4 40
chr1 40 50 5 50
chr1 40 50 6 60
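
To make the replication rule in the examples above concrete, here is a minimal awk sketch that mimics `bedtools expand -c` for a single column. It is an approximation for illustration only, not how bedtools implements it; the function name `expand_column` is hypothetical, and tab-delimited input is assumed.

```bash
#!/bin/bash
# Hypothetical helper: replicate each line once per comma-separated value
# found in column <col> (1-based), mimicking `bedtools expand -c <col>`
# for a single column. Assumes tab-delimited input.
expand_column() {
  local file="$1" col="$2"
  awk -F'\t' -v OFS='\t' -v c="$col" '{
    n = split($c, values, ",")
    if (n == 0) { print; next }      # empty field: pass the line through
    for (i = 1; i <= n; i++) {
      $c = values[i]                 # overwrite the column with one value
      print
    }
  }' "$file"
}

# Demo using the first row of the help example above.
demo=$(mktemp)
printf 'chr1\t10\t20\t1,2,3\t10,20,30\n' > "$demo"
expand_column "$demo" 5
rm -f "$demo"
```

The sketch handles one column at a time; the paired behaviour of `-c 4,5`, where values at the same list position travel together, is not reproduced here.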


@@ -0,0 +1,15 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# Build command arguments array
cmd_args=(
-i "$par_input"
-c "$par_columns"
)
# Execute bedtools expand
bedtools expand "${cmd_args[@]}" > "$par_output"


@@ -0,0 +1,138 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_expand"
# Create test data
log "Creating test data..."
# Create simple test file with comma-separated values in one column
cat > "$meta_temp_dir/simple.bed" << 'EOF'
chr1 100 200 1,2,3
chr1 300 400 4,5,6
chr2 500 600 7,8
EOF
# Create test file with comma-separated values in multiple columns
cat > "$meta_temp_dir/multi_column.bed" << 'EOF'
chr1 10 20 1,2,3 10,20,30
chr1 40 50 4,5,6 40,50,60
chr2 70 80 7,8,9 70,80,90
EOF
# Create BED file with single values (no expansion needed)
cat > "$meta_temp_dir/no_expansion.bed" << 'EOF'
chr1 100 200 single_value
chr2 300 400 another_value
EOF
# Create file with unequal comma-separated lists (edge case; not exercised by the tests below)
cat > "$meta_temp_dir/unequal.bed" << 'EOF'
chr1 100 200 1,2,3 10,20
chr1 300 400 4,5 40,50,60
EOF
# Test 1: Basic single column expansion
log "Starting TEST 1: Basic single column expansion"
"$meta_executable" \
--input "$meta_temp_dir/simple.bed" \
--columns "4" \
--output "$meta_temp_dir/output1.bed"
check_file_exists "$meta_temp_dir/output1.bed" "single column expansion output"
check_file_not_empty "$meta_temp_dir/output1.bed" "single column expansion output"
check_file_line_count "$meta_temp_dir/output1.bed" 8 "single column expansion line count"
# Check that expansion worked correctly
check_file_contains "$meta_temp_dir/output1.bed" "chr1 100 200 1" "first expanded value"
check_file_contains "$meta_temp_dir/output1.bed" "chr1 100 200 2" "second expanded value"
check_file_contains "$meta_temp_dir/output1.bed" "chr1 100 200 3" "third expanded value"
check_file_contains "$meta_temp_dir/output1.bed" "chr2 500 600 7" "chr2 first value"
check_file_contains "$meta_temp_dir/output1.bed" "chr2 500 600 8" "chr2 second value"
log "✅ TEST 1 completed successfully"
# Test 2: Multi-column expansion
log "Starting TEST 2: Multi-column expansion"
"$meta_executable" \
--input "$meta_temp_dir/multi_column.bed" \
--columns "4,5" \
--output "$meta_temp_dir/output2.bed"
check_file_exists "$meta_temp_dir/output2.bed" "multi-column expansion output"
check_file_not_empty "$meta_temp_dir/output2.bed" "multi-column expansion output"
check_file_line_count "$meta_temp_dir/output2.bed" 9 "multi-column expansion line count"
# Check that paired expansion worked correctly
check_file_contains "$meta_temp_dir/output2.bed" "chr1 10 20 1 10" "first paired expansion"
check_file_contains "$meta_temp_dir/output2.bed" "chr1 10 20 2 20" "second paired expansion"
check_file_contains "$meta_temp_dir/output2.bed" "chr1 10 20 3 30" "third paired expansion"
log "✅ TEST 2 completed successfully"
# Test 3: No expansion needed (single values)
log "Starting TEST 3: Single values (no expansion needed)"
"$meta_executable" \
--input "$meta_temp_dir/no_expansion.bed" \
--columns "4" \
--output "$meta_temp_dir/output3.bed"
check_file_exists "$meta_temp_dir/output3.bed" "no expansion output"
check_file_not_empty "$meta_temp_dir/output3.bed" "no expansion output"
check_file_line_count "$meta_temp_dir/output3.bed" 2 "no expansion line count"
# Should be identical to input since no comma-separated values
check_file_contains "$meta_temp_dir/output3.bed" "single_value" "single value preserved"
check_file_contains "$meta_temp_dir/output3.bed" "another_value" "another value preserved"
log "✅ TEST 3 completed successfully"
# Test 4: Different column positions
log "Starting TEST 4: Different column positions"
"$meta_executable" \
--input "$meta_temp_dir/multi_column.bed" \
--columns "5" \
--output "$meta_temp_dir/output4.bed"
check_file_exists "$meta_temp_dir/output4.bed" "column 5 expansion output"
check_file_not_empty "$meta_temp_dir/output4.bed" "column 5 expansion output"
check_file_line_count "$meta_temp_dir/output4.bed" 9 "column 5 expansion line count"
# Check that only column 5 was expanded, column 4 remains comma-separated
check_file_contains "$meta_temp_dir/output4.bed" "chr1 10 20 1,2,3 10" "column 4 not expanded"
check_file_contains "$meta_temp_dir/output4.bed" "chr1 10 20 1,2,3 20" "column 5 expanded"
log "✅ TEST 4 completed successfully"
# Test 5: Large expansion test
log "Starting TEST 5: Large expansion test"
# Create file with more comma-separated values
cat > "$meta_temp_dir/large.bed" << 'EOF'
chr1 100 200 1,2,3,4,5,6,7,8,9,10
EOF
"$meta_executable" \
--input "$meta_temp_dir/large.bed" \
--columns "4" \
--output "$meta_temp_dir/output5.bed"
check_file_exists "$meta_temp_dir/output5.bed" "large expansion output"
check_file_not_empty "$meta_temp_dir/output5.bed" "large expansion output"
check_file_line_count "$meta_temp_dir/output5.bed" 10 "large expansion line count"
# Check that all values are expanded
for i in {1..10}; do
if ! grep -q "chr1 100 200 $i$" "$meta_temp_dir/output5.bed"; then
log_error "Expected value $i not found in large expansion"
exit 1
fi
done
log "✅ TEST 5 completed successfully"
log "🎉 All bedtools_expand tests completed successfully!"


@@ -0,0 +1,234 @@
name: bedtools_fisher
namespace: bedtools
description: |
Calculate Fisher's exact test statistic between two feature files.
This tool performs Fisher's exact test to assess the statistical significance
of overlaps between genomic intervals in two files. It calculates the probability
of observing the given overlap pattern by chance, providing a p-value for
statistical inference.
keywords: [genomics, intervals, fisher, statistics, overlap, significance, test]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/fisher.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author]
argument_groups:
- name: Inputs
arguments:
- name: --input_a
alternatives: [-a]
type: file
description: |
First input file for comparison.
**Format:** BED, GFF, VCF file with genomic intervals
**Requirement:** Must be sorted by chromosome, then start position
**Usage:** File A for Fisher's exact test comparison
required: true
example: intervals_a.bed
- name: --input_b
alternatives: [-b]
type: file
description: |
Second input file for comparison.
**Format:** BED, GFF, VCF file with genomic intervals
**Requirement:** Must be sorted by chromosome, then start position
**Usage:** File B for Fisher's exact test comparison
required: true
example: intervals_b.bed
- name: --genome
alternatives: [-g]
type: file
description: |
Genome file defining chromosome sizes.
**Format:** Tab-delimited file with chromosome name and size
**Purpose:** Enforces consistent chromosome sort order
**Example:** chr1\t249250621
required: true
example: genome.txt
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with Fisher's exact test results.
Contains statistical results including p-values for overlap
significance between input files.
required: true
example: fisher_results.txt
- name: Overlap Options
arguments:
- name: --merge_overlaps
alternatives: [-m]
type: boolean_true
description: |
Merge overlapping intervals before analysis.
**Effect:** Collapses overlapping intervals in both files
**Usage:** Prevents double-counting of overlapping features
**Default:** false (no merging)
- name: --min_overlap_a
alternatives: [-f]
type: double
description: |
Minimum overlap required as fraction of A.
**Range:** 0.0 to 1.0
**Default:** 1E-9 (effectively 1bp)
**Example:** 0.50 requires 50% of A to be overlapped
example: 0.5
- name: --min_overlap_b
alternatives: [-F]
type: double
description: |
Minimum overlap required as fraction of B.
**Range:** 0.0 to 1.0
**Default:** 1E-9 (effectively 1bp)
**Example:** 0.50 requires 50% of B to be overlapped
example: 0.5
- name: --reciprocal
alternatives: [-r]
type: boolean_true
description: |
Require the overlap fraction to be reciprocal for A and B.
**Effect:** The -f fraction must be satisfied for both A and B
**Example:** With -f 0.90 -r, requires B overlaps 90% of A AND A overlaps 90% of B
**Default:** false
- name: --either
alternatives: [-e]
type: boolean_true
description: |
Require minimum fraction satisfied for A OR B.
**Effect:** Only one of -f or -F thresholds needs to be satisfied
**Alternative:** Without -e, both fractions must be satisfied
**Default:** false (both required)
- name: Strand Options
arguments:
- name: --same_strand
alternatives: [-s]
type: boolean_true
description: |
Require same strandedness for overlaps.
**Effect:** Only report overlaps on the same strand
**Default:** false (strand-independent)
- name: --opposite_strand
alternatives: [-S]
type: boolean_true
description: |
Require different strandedness for overlaps.
**Effect:** Only report overlaps on opposite strands
**Default:** false (strand-independent)
- name: Format Options
arguments:
- name: --split
type: boolean_true
description: |
Treat split BAM or BED12 entries as distinct intervals.
**Effect:** Split multi-block entries into individual intervals
**Usage:** For BAM alignments with gaps or BED12 entries
**Default:** false
- name: --bed_output
alternatives: [--bed]
type: boolean_true
description: |
Write output in BED format when using BAM input.
**Effect:** Forces BED output format for BAM inputs
**Default:** false
- name: --header
type: boolean_true
description: |
Print header from file A prior to results.
**Effect:** Includes original header from input file A
**Default:** false
- name: Advanced Options
arguments:
- name: --no_name_check
alternatives: [--nonamecheck]
type: boolean_true
description: |
Skip chromosome naming convention checks for sorted data.
**Effect:** Allows different naming (e.g., "chr1" vs "chr01")
**Usage:** For files with inconsistent chromosome naming
**Default:** false (strict checking)
- name: --no_buffer
alternatives: [--nobuf]
type: boolean_true
description: |
Disable buffered output.
**Effect:** Print each line immediately instead of buffering
**Usage:** For real-time processing or piping
**Trade-off:** Slower performance but immediate output
**Default:** false (buffered output)
- name: --io_buffer
alternatives: [--iobuf]
type: string
description: |
Specify input buffer memory size.
**Format:** Integer with optional K/M/G suffix
**Example:** "128M" for 128 megabytes
**Note:** No effect with compressed files
example: "128M"
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,68 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools fisher -h
```
Tool: bedtools fisher (aka fisher)
Version: v2.31.1
Summary: Calculate Fisher statistic b/w two feature files.
Usage: bedtools fisher [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> -g <genome file>
Options:
-m Merge overlapping intervals before
- looking at overlap.
-s Require same strandedness. That is, only report hits in B
that overlap A on the _same_ strand.
- By default, overlaps are reported without respect to strand.
-S Require different strandedness. That is, only report hits in B
that overlap A on the _opposite_ strand.
- By default, overlaps are reported without respect to strand.
-f Minimum overlap required as a fraction of A.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-F Minimum overlap required as a fraction of B.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-r Require that the fraction overlap be reciprocal for A AND B.
- In other words, if -f is 0.90 and -r is used, this requires
that B overlap 90% of A and A _also_ overlaps 90% of B.
-e Require that the minimum fraction be satisfied for A OR B.
- In other words, if -e is used with -f 0.90 and -F 0.10 this requires
that either 90% of A is covered OR 10% of B is covered.
Without -e, both fractions would have to be satisfied.
-split Treat "split" BAM or BED12 entries as distinct BED intervals.
-g Provide a genome file to enforce consistent chromosome sort order
across input files. Only applies when used with -sorted option.
-nonamecheck For sorted data, don't throw an error if the file has different naming conventions
for the same chromosome. ex. "chr1" vs "chr01".
-bed If using BAM input, write output as BED.
-header Print the header from the A file prior to results.
-nobuf Disable buffered output. Using this option will cause each line
of output to be printed as it is generated, rather than saved
in a buffer. This will make printing large output files
noticeably slower, but can be useful in conjunction with
other software tools and scripts that need to process one
line of bedtools output at a time.
-iobuf Specify amount of memory to use for input buffer.
Takes an integer argument. Optional suffixes K/M/G supported.
Note: currently has no effect with compressed files.
Notes:
(1) Input files must be sorted by chrom, then start position.
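
Note (1) can be checked before calling the tool. The sketch below sorts a BED file by chromosome and start position and tests whether a file is already in that order, using the `-k1,1 -k2,2n` keys that the bedtools coverage help quotes for `-sorted` input. The function names are hypothetical, and `LC_ALL=C` is an assumption made here to keep chromosome ordering byte-wise and reproducible; the genome file passed to `-g` is expected to list chromosomes in the same order.

```bash
#!/bin/bash
# Hypothetical helpers for preparing sorted inputs.
tab=$(printf '\t')

# Print <bed_file> sorted by chromosome (byte-wise), then numeric start.
sort_bed() {
  LC_ALL=C sort -t "$tab" -k1,1 -k2,2n "$1"
}

# Succeed (exit 0) if <bed_file> is already in that order.
# -c only checks; -s ignores differences beyond the two sort keys.
is_sorted_bed() {
  LC_ALL=C sort -t "$tab" -c -s -k1,1 -k2,2n "$1" 2>/dev/null
}

# Demo: sort an unsorted file only when needed.
demo=$(mktemp)
printf 'chr2\t100\t200\nchr1\t300\t400\nchr1\t100\t200\n' > "$demo"
if ! is_sorted_bed "$demo"; then
  sort_bed "$demo" > "$demo.sorted"
fi
cat "$demo.sorted"
rm -f "$demo" "$demo.sorted"
```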


@@ -0,0 +1,48 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
unset_if_false=(
par_merge_overlaps
par_reciprocal
par_either
par_same_strand
par_opposite_strand
par_split
par_bed_output
par_header
par_no_name_check
par_no_buffer
)
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
# Build command arguments array
cmd_args=(
-a "$par_input_a"
-b "$par_input_b"
-g "$par_genome"
${par_merge_overlaps:+-m}
${par_min_overlap_a:+-f "$par_min_overlap_a"}
${par_min_overlap_b:+-F "$par_min_overlap_b"}
${par_reciprocal:+-r}
${par_either:+-e}
${par_same_strand:+-s}
${par_opposite_strand:+-S}
${par_split:+-split}
${par_bed_output:+-bed}
${par_header:+-header}
${par_no_name_check:+-nonamecheck}
${par_no_buffer:+-nobuf}
${par_io_buffer:+-iobuf "$par_io_buffer"}
)
# Execute bedtools fisher
bedtools fisher "${cmd_args[@]}" > "$par_output"


@@ -0,0 +1,121 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_fisher"
# Create test data
log "Creating test data..."
# Create genome file
cat > "$meta_temp_dir/genome.txt" << 'EOF'
chr1 1000000
chr2 1000000
EOF
# Create file A - sorted intervals
cat > "$meta_temp_dir/intervals_a.bed" << 'EOF'
chr1 100 200 region1 10 +
chr1 300 400 region2 20 +
chr1 500 600 region3 15 -
chr2 100 200 region4 25 +
chr2 400 500 region5 30 -
EOF
# Create file B - sorted intervals with some overlaps
cat > "$meta_temp_dir/intervals_b.bed" << 'EOF'
chr1 150 250 feature1 5 +
chr1 350 450 feature2 8 +
chr1 450 550 feature3 12 -
chr2 50 150 feature4 6 +
chr2 450 550 feature5 9 -
EOF
# Create file C - larger overlap set for significance testing
cat > "$meta_temp_dir/intervals_c.bed" << 'EOF'
chr1 90 210 overlap1 10 +
chr1 290 410 overlap2 15 +
chr1 490 610 overlap3 20 -
chr2 90 210 overlap4 12 +
chr2 390 510 overlap5 18 -
chr2 600 700 overlap6 25 +
EOF
# TEST 1: Basic Fisher's exact test
log "Starting TEST 1: Basic Fisher's exact test"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/fisher_basic.txt"
check_file_exists "$meta_temp_dir/fisher_basic.txt" "basic fisher output"
check_file_not_empty "$meta_temp_dir/fisher_basic.txt" "basic fisher output"
log "✅ TEST 1 completed successfully"
# TEST 2: Fisher test with minimum overlap fraction
log "Starting TEST 2: Fisher test with overlap fractions"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--genome "$meta_temp_dir/genome.txt" \
--min_overlap_a 0.5 \
--min_overlap_b 0.3 \
--output "$meta_temp_dir/fisher_fractions.txt"
check_file_exists "$meta_temp_dir/fisher_fractions.txt" "fisher with fractions output"
check_file_not_empty "$meta_temp_dir/fisher_fractions.txt" "fisher with fractions output"
log "✅ TEST 2 completed successfully"
# TEST 3: Fisher test with reciprocal overlap
log "Starting TEST 3: Fisher test with reciprocal overlap"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--genome "$meta_temp_dir/genome.txt" \
--min_overlap_a 0.4 \
--reciprocal \
--output "$meta_temp_dir/fisher_reciprocal.txt"
check_file_exists "$meta_temp_dir/fisher_reciprocal.txt" "fisher reciprocal output"
check_file_not_empty "$meta_temp_dir/fisher_reciprocal.txt" "fisher reciprocal output"
log "✅ TEST 3 completed successfully"
# TEST 4: Fisher test with merged intervals
log "Starting TEST 4: Fisher test with merged overlapping intervals"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_c.bed" \
--genome "$meta_temp_dir/genome.txt" \
--merge_overlaps \
--output "$meta_temp_dir/fisher_merged.txt"
check_file_exists "$meta_temp_dir/fisher_merged.txt" "fisher merged output"
check_file_not_empty "$meta_temp_dir/fisher_merged.txt" "fisher merged output"
log "✅ TEST 4 completed successfully"
# TEST 5: Fisher test with either overlap condition
log "Starting TEST 5: Fisher test with either overlap condition"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--genome "$meta_temp_dir/genome.txt" \
--min_overlap_a 0.8 \
--min_overlap_b 0.2 \
--either \
--output "$meta_temp_dir/fisher_either.txt"
check_file_exists "$meta_temp_dir/fisher_either.txt" "fisher either condition output"
check_file_not_empty "$meta_temp_dir/fisher_either.txt" "fisher either condition output"
log "✅ TEST 5 completed successfully"
log "All tests completed successfully!"

@@ -0,0 +1,157 @@
name: bedtools_flank
namespace: bedtools
description: |
Create flanking intervals for each genomic feature.
This tool generates new intervals representing the regions immediately
upstream and/or downstream of existing genomic features. Unlike slop which
extends existing intervals, flank creates entirely new intervals from the
flanking regions.
keywords: [genomics, intervals, flank, upstream, downstream, flanking, regions]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/flank.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: [-i]
type: file
description: |
Input file with genomic intervals.
**Format:** BED, GFF, VCF file with genomic intervals
**Usage:** Features for which flanking regions will be created
required: true
example: intervals.bed
- name: --genome
alternatives: [-g]
type: file
description: |
Genome file defining chromosome sizes.
**Format:** Tab-delimited file with chromosome name and size
**Purpose:** Prevents flanks from extending beyond chromosome boundaries
**Example:** chr1\t249250621
**Tip:** Can use samtools faidx output (.fai file)
required: true
example: genome.txt
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with flanking intervals.
Contains new intervals representing the flanking regions
of the input features.
required: true
example: flanking_regions.bed
- name: Flanking Options
arguments:
- name: --both
alternatives: [-b]
type: string
description: |
Create flanking intervals using specified distance in both directions.
**Input:** Integer (base pairs) or Float (if used with --pct)
**Effect:** Creates flanks of equal size upstream and downstream
**Example:** "1000" creates 1kb flanks on both sides
**Mutually exclusive:** Cannot use with --left or --right
example: "1000"
- name: --left
alternatives: [-l]
type: string
description: |
Distance for left/upstream flank from original start coordinate.
**Input:** Integer (base pairs) or Float (if used with --pct)
**Strand-aware:** When used with --strand, respects feature orientation
**Example:** "500" creates 500bp upstream flank
**Requires:** Must be used together with --right
example: "500"
- name: --right
alternatives: [-r]
type: string
description: |
Distance for right/downstream flank from original end coordinate.
**Input:** Integer (base pairs) or Float (if used with --pct)
**Strand-aware:** When used with --strand, respects feature orientation
**Example:** "300" creates 300bp downstream flank
**Requires:** Must be used together with --left
example: "300"
- name: Flanking Behavior
arguments:
- name: --strand
alternatives: [-s]
type: boolean_true
description: |
Define left and right flanks based on strand orientation.
**Effect:** For negative-strand features, left becomes downstream
**Example:** -l 500 on minus strand starts flank 500bp downstream
**Default:** false (ignore strand)
- name: --percent
alternatives: [-pct]
type: boolean_true
description: |
Define flanking distances as fraction of feature length.
**Effect:** Distances become proportional to feature size
**Example:** -l 0.5 on 1000bp feature creates 500bp upstream flank
**Input format:** Use decimals (e.g., "0.1" for 10%)
**Default:** false (absolute base pairs)
- name: Output Options
arguments:
- name: --header
type: boolean_true
description: |
Print header from input file prior to results.
**Effect:** Preserves original file header in output
**Default:** false
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,66 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools flank -h
```
Tool: bedtools flank (aka flankBed)
Version: v2.31.1
Summary: Creates flanking interval(s) for each BED/GFF/VCF feature.
Usage: bedtools flank [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)]
Options:
-b Create flanking interval(s) using -b base pairs in each direction.
- (Integer) or (Float, e.g. 0.1) if used with -pct.
-l The number of base pairs that a flank should start from
orig. start coordinate.
- (Integer) or (Float, e.g. 0.1) if used with -pct.
-r The number of base pairs that a flank should end from
orig. end coordinate.
- (Integer) or (Float, e.g. 0.1) if used with -pct.
-s Define -l and -r based on strand.
E.g. if used, -l 500 for a negative-stranded feature,
it will start the flank 500 bp downstream. Default = false.
-pct Define -l and -r as a fraction of the feature's length.
E.g. if used on a 1000bp feature, -l 0.50,
will add 500 bp "upstream". Default = false.
-header Print the header from the input file prior to results.
Notes:
(1) Starts will be set to 0 if options would force it below 0.
(2) Ends will be set to the chromosome length if requested flank would
force it above the max chrom length.
(3) In contrast to slop, which _extends_ intervals, bedtools flank
creates new intervals from the regions just up- and down-stream
of your existing intervals.
(4) The genome file should tab delimited and structured as follows:
<chromName><TAB><chromSize>
For example, Human (hg19):
chr1 249250621
chr2 243199373
...
chr18_gl000207_random 4262
Tip 1. Use samtools faidx to create a genome file from a FASTA:
One can the samtools faidx command to index a FASTA file.
The resulting .fai index is suitable as a genome file,
as bedtools will only look at the first two, relevant columns
of the .fai file.
For example:
samtools faidx GRCh38.fa
bedtools flank -i my.bed -g GRCh38.fa.fai
Tip 2. Use UCSC Table Browser to create a genome file:
One can use the UCSC Genome Browser's MySQL database to extract
chromosome sizes. For example, H. sapiens:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg19.chromInfo" > hg19.genome
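
Notes (1) and (2) above describe how flank coordinates are clamped to the chromosome. The sketch below reproduces that arithmetic in plain bash for the simplest case (integer `-l`/`-r`, no `-s`, no `-pct`). It is an illustration of the rules as stated in the help text, not bedtools' actual implementation, and `flank_coords` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print the left and right flank of one feature.
# usage: flank_coords <start> <end> <left> <right> <chrom_size>
flank_coords() {
  local start=$1 end=$2 left=$3 right=$4 chrom_size=$5

  # Note (1): a start below 0 is set to 0.
  local left_start=$(( start - left ))
  if (( left_start < 0 )); then
    left_start=0
  fi

  # Note (2): an end beyond the chromosome is set to the chromosome length.
  local right_end=$(( end + right ))
  if (( right_end > chrom_size )); then
    right_end=$chrom_size
  fi

  printf '%d\t%d\n' "$left_start" "$start"   # left flank
  printf '%d\t%d\n' "$end" "$right_end"      # right flank
}

# Feature chr1:1000-2000 with -l 1000 -r 300 on a 1,000,000 bp chromosome.
flank_coords 1000 2000 1000 300 1000000
# prints 0<TAB>1000 and 2000<TAB>2300
```

The sketch does not try to model how bedtools treats flanks that collapse to zero length after clamping.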

@@ -0,0 +1,37 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_strand" == "false" ]] && unset par_strand
[[ "$par_percent" == "false" ]] && unset par_percent
[[ "$par_header" == "false" ]] && unset par_header
# Validate flanking distance options (mutually exclusive groups)
if [ -n "$par_both" ]; then
flanking_args=(-b "$par_both")
elif [ -n "$par_left" ] && [ -n "$par_right" ]; then
flanking_args=(-l "$par_left" -r "$par_right")
elif [ -n "$par_left" ] || [ -n "$par_right" ]; then
echo "Error: --left and --right must be used together" >&2
exit 1
else
echo "Error: Must specify either --both or both --left and --right" >&2
exit 1
fi
# Build command arguments array
cmd_args=(
-i "$par_input"
-g "$par_genome"
"${flanking_args[@]}"
${par_strand:+-s}
${par_percent:+-pct}
${par_header:+-header}
)
# Execute bedtools flank
bedtools flank "${cmd_args[@]}" > "$par_output"
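
The branch order used above to choose between `-b` and `-l`/`-r` can be exercised in isolation. The sketch below wraps the same decision in a function so that each case can be checked without running bedtools; `resolve_flank_args` is a hypothetical name and the error message is simplified to a single branch.

```bash
#!/bin/bash
# Hypothetical wrapper mirroring the branch order of the script above.
# usage: resolve_flank_args <both> <left> <right>   (pass "" for unset)
resolve_flank_args() {
  local both=$1 left=$2 right=$3
  if [ -n "$both" ]; then
    echo "-b $both"
  elif [ -n "$left" ] && [ -n "$right" ]; then
    echo "-l $left -r $right"
  else
    echo "Error: specify --both, or --left together with --right" >&2
    return 1
  fi
}

resolve_flank_args "500" "" ""        # prints: -b 500
resolve_flank_args "" "1000" "300"    # prints: -l 1000 -r 300
resolve_flank_args "" "1000" "" 2>/dev/null || echo "rejected"   # prints: rejected
```

With this branch order, a non-empty `--both` takes precedence even when `--left` or `--right` is also given, so the mutual exclusivity described in the config is not actively enforced.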

@@ -0,0 +1,181 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_flank"
# Create test data
log "Creating test data..."
# Create genome file
cat > "$meta_temp_dir/genome.txt" << 'EOF'
chr1 1000000
chr2 1000000
chr3 500000
EOF
# Create basic intervals file
cat > "$meta_temp_dir/intervals.bed" << 'EOF'
chr1 1000 2000 feature1 100 +
chr1 5000 6000 feature2 200 -
chr2 10000 11000 feature3 150 +
chr2 20000 21000 feature4 300 -
chr3 100000 101000 feature5 250 +
EOF
# Create intervals near chromosome boundaries
cat > "$meta_temp_dir/boundary.bed" << 'EOF'
chr1 10 100 start_feature 50 +
chr1 999900 999950 end_feature 75 +
chr3 490000 495000 near_end 100 +
EOF
# Create variable-sized intervals for percentage testing
cat > "$meta_temp_dir/variable.bed" << 'EOF'
chr1 10000 12000 small_2kb 10 +
chr1 20000 30000 large_10kb 20 +
chr1 50000 51000 medium_1kb 15 +
EOF
# TEST 1: Basic flanking with both sides equal
log "Starting TEST 1: Basic flanking with both sides"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--genome "$meta_temp_dir/genome.txt" \
--both "500" \
--output "$meta_temp_dir/both_flanks.bed"
check_file_exists "$meta_temp_dir/both_flanks.bed" "both flanks output"
check_file_not_empty "$meta_temp_dir/both_flanks.bed" "both flanks output"
# Should create 10 intervals (5 features × 2 flanks each)
line_count=$(wc -l < "$meta_temp_dir/both_flanks.bed")
if [ "$line_count" -eq 10 ]; then
log "✓ both flanks output has expected line count (10): $meta_temp_dir/both_flanks.bed"
else
log "✗ both flanks output has unexpected line count ($line_count, expected 10): $meta_temp_dir/both_flanks.bed"
exit 1
fi
log "✅ TEST 1 completed successfully"
# TEST 2: Asymmetric flanking with left and right
log "Starting TEST 2: Asymmetric flanking with left and right"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--genome "$meta_temp_dir/genome.txt" \
--left "1000" \
--right "300" \
--output "$meta_temp_dir/asymmetric_flanks.bed"
check_file_exists "$meta_temp_dir/asymmetric_flanks.bed" "asymmetric flanks output"
check_file_not_empty "$meta_temp_dir/asymmetric_flanks.bed" "asymmetric flanks output"
# Check for different sized flanks (with --left 1000, the left flank of chr1:1000-2000 should span 0-1000)
if grep -q "chr1.*0.*1000" "$meta_temp_dir/asymmetric_flanks.bed"; then
log "✓ asymmetric flanks contains expected left flank: $meta_temp_dir/asymmetric_flanks.bed"
else
log "✗ asymmetric flanks missing expected left flank: $meta_temp_dir/asymmetric_flanks.bed"
cat "$meta_temp_dir/asymmetric_flanks.bed" >&2
exit 1
fi
# Check for right flank size (300bp downstream)
if grep -q "chr1.*2000.*2300" "$meta_temp_dir/asymmetric_flanks.bed"; then
log "✓ asymmetric flanks contains expected right flank: $meta_temp_dir/asymmetric_flanks.bed"
else
log "✗ asymmetric flanks missing expected right flank: $meta_temp_dir/asymmetric_flanks.bed"
cat "$meta_temp_dir/asymmetric_flanks.bed" >&2
exit 1
fi
log "✅ TEST 2 completed successfully"
# TEST 3: Strand-aware flanking
log "Starting TEST 3: Strand-aware flanking"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--genome "$meta_temp_dir/genome.txt" \
--left "800" \
--right "400" \
--strand \
--output "$meta_temp_dir/strand_flanks.bed"
check_file_exists "$meta_temp_dir/strand_flanks.bed" "strand-aware flanks output"
check_file_not_empty "$meta_temp_dir/strand_flanks.bed" "strand-aware flanks output"
log "✅ TEST 3 completed successfully"
# TEST 4: Percentage-based flanking
log "Starting TEST 4: Percentage-based flanking"
"$meta_executable" \
--input "$meta_temp_dir/variable.bed" \
--genome "$meta_temp_dir/genome.txt" \
--both "0.5" \
--percent \
--output "$meta_temp_dir/percent_flanks.bed"
check_file_exists "$meta_temp_dir/percent_flanks.bed" "percentage flanks output"
check_file_not_empty "$meta_temp_dir/percent_flanks.bed" "percentage flanks output"
log "✅ TEST 4 completed successfully"
# TEST 5: Boundary handling (near chromosome ends)
log "Starting TEST 5: Boundary handling"
"$meta_executable" \
--input "$meta_temp_dir/boundary.bed" \
--genome "$meta_temp_dir/genome.txt" \
--both "1000" \
--output "$meta_temp_dir/boundary_flanks.bed"
check_file_exists "$meta_temp_dir/boundary_flanks.bed" "boundary flanks output"
check_file_not_empty "$meta_temp_dir/boundary_flanks.bed" "boundary flanks output"
# Check that no coordinate is negative (a tab followed by "-").
# $'...' quoting is used so that \t is a real tab; plain grep does not interpret \t.
if grep -q $'^chr.*\t-' "$meta_temp_dir/boundary_flanks.bed"; then
log "✗ boundary flanks contains negative coordinates: $meta_temp_dir/boundary_flanks.bed"
exit 1
else
log "✓ boundary flanks handles negative coordinates correctly: $meta_temp_dir/boundary_flanks.bed"
fi
log "✅ TEST 5 completed successfully"
# TEST 6: Header preservation
log "Starting TEST 6: Header preservation"
# Create file with header
cat > "$meta_temp_dir/with_header.bed" << 'EOF'
track name="test_track" description="Test intervals"
chr1 2000 3000 header_test 100 +
chr1 8000 9000 header_test2 150 +
EOF
"$meta_executable" \
--input "$meta_temp_dir/with_header.bed" \
--genome "$meta_temp_dir/genome.txt" \
--both "200" \
--header \
--output "$meta_temp_dir/header_flanks.bed"
check_file_exists "$meta_temp_dir/header_flanks.bed" "header flanks output"
check_file_not_empty "$meta_temp_dir/header_flanks.bed" "header flanks output"
# Check that header is preserved
if grep -q "track name" "$meta_temp_dir/header_flanks.bed"; then
log "✓ header flanks preserves header: $meta_temp_dir/header_flanks.bed"
else
log "✗ header flanks missing header: $meta_temp_dir/header_flanks.bed"
exit 1
fi
log "✅ TEST 6 completed successfully"
log "All tests completed successfully!"

@@ -2,12 +2,14 @@ name: bedtools_genomecov
namespace: bedtools
description: |
Compute the coverage of a feature file among a genome.
keywords: [genome coverage, BED, GFF, VCF, BAM]
Calculates genome-wide coverage statistics from BED, GFF, VCF, or BAM files.
Can produce coverage histograms, per-base depth, or BedGraph format output.
keywords: [genome coverage, BED, GFF, VCF, BAM, depth, histogram, bedgraph]
links:
homepage: https://bedtools.readthedocs.io/en/latest/#
homepage: https://bedtools.readthedocs.io/en/latest/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
repository: https://github.com/arq5x/bedtools2
issue_tracker: https://github.com/arq5x/bedtools2/issues
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
@@ -15,33 +17,54 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
alternatives: [-i]
type: file
direction: input
description: |
The input file (BED/GFF/VCF) to be used.
Input genomic intervals file in BED, GFF, or VCF format.
**Supported formats:**
- BED format (standard genomic intervals)
- GFF/GTF format (gene annotations)
- VCF format (variant calls)
**Note:** Required when not using `--input_bam`
example: input.bed
- name: --input_bam
alternatives: -ibam
alternatives: [-ibam]
type: file
description: |
The input file is in BAM format.
Note: BAM _must_ be sorted by positions.
'--genome' option is ignored if you use '--input_bam' option!
Input BAM file for coverage calculation.
**Requirements:**
- BAM file must be sorted by position
- When using BAM input, `--genome` option is ignored
- Coordinates are determined from BAM header
example: input.bam
- name: --genome
alternatives: -g
alternatives: [-g]
type: file
direction: input
description: |
The genome file to be used.
Genome file defining chromosome names and sizes.
**Format:** Two-column tab-delimited file:
```
chr1 248956422
chr2 242193529
```
**Note:** Required when using `--input`, ignored when using `--input_bam`
example: genome.txt
- name: Outputs
@@ -50,44 +73,59 @@ argument_groups:
type: file
direction: output
description: |
The output BED file.
Output file containing coverage information.
**Output formats depend on options:**
- **Default:** Coverage histogram (depth vs count)
- **With `--depth`:** Per-base depth (1-based coordinates)
- **With `--bed_graph`:** BedGraph format for genome browsers
required: true
example: output.bed
example: coverage.txt
- name: Options
arguments:
- name: --depth
alternatives: -d
alternatives: [-d]
type: boolean_true
description: |
Report the depth at each genome position (with one-based coordinates).
Default behavior is to report a histogram.
Report the depth at each genome position with 1-based coordinates.
**Output format:** `chromosome position depth`
**Default behavior:** Reports coverage histogram instead
- name: --depth_zero
alternatives: -dz
alternatives: [-dz]
type: boolean_true
description: |
Report the depth at each genome position (with zero-based coordinates).
Reports only non-zero positions.
Default behavior is to report a histogram.
Report depth at each genome position with 0-based coordinates.
**Features:**
- Only reports positions with non-zero coverage
- Uses 0-based coordinate system
- More memory efficient than `--depth`
- name: --bed_graph
alternatives: -bg
alternatives: [-bg]
type: boolean_true
description: |
Report depth in BedGraph format. For details, see:
genome.ucsc.edu/goldenPath/help/bedgraph.html
Report depth in BedGraph format for genome browser visualization.
**Output format:** `chromosome start end depth`
See [BedGraph specification](https://genome.ucsc.edu/goldenPath/help/bedgraph.html) for details.
- name: --bed_graph_zero_coverage
alternatives: -bga
alternatives: [-bga]
type: boolean_true
description: |
Report depth in BedGraph format, as above (-bg).
However with this option, regions with zero
coverage are also reported. This allows one to
quickly extract all regions of a genome with 0
coverage by applying: "grep -w 0$" to the output.
Report depth in BedGraph format including zero-coverage regions.
**Features:**
- Same as `--bed_graph` but includes regions with 0 coverage
- Useful for finding uncovered regions: `grep -w 0$ output.bg`
- Generates larger output files
- name: --split
type: boolean_true
@@ -134,13 +172,13 @@ argument_groups:
Works for BAM files only
- name: --five_prime
alternatives: -5
alternatives: ["-5"]
type: boolean_true
description: |
Calculate coverage of 5' positions (instead of entire interval).
- name: --three_prime
alternatives: -3
alternatives: ["-3"]
type: boolean_true
description: |
Calculate coverage of 3' positions (instead of entire interval).
@@ -191,17 +229,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- path: test_data
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable

@@ -1,17 +1,20 @@
```bash
bedtools genomecov
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools genomecov -h
```
Tool: bedtools genomecov (aka genomeCoverageBed)
Version: v2.30.0
Version: v2.31.1
Summary: Compute the coverage of a feature file among a genome.
Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>
Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome> OR -ibam <bam/cram>
Options:
-ibam The input file is in BAM format.
Note: BAM _must_ be sorted by position
-g Provide a genome file to define chromosome lengths.
Note:Required when not using -ibam option.
-d Report the depth at each genome position (with one-based coordinates).
Default behavior is to report a histogram.
@@ -92,10 +95,20 @@ Notes:
(3) The input BAM (-ibam) file must be sorted by position.
A "samtools sort <BAM>" should suffice.
Tips:
Tip 1. Use samtools faidx to create a genome file from a FASTA:
One can the samtools faidx command to index a FASTA file.
The resulting .fai index is suitable as a genome file,
as bedtools will only look at the first two, relevant columns
of the .fai file.
For example:
samtools faidx GRCh38.fa
bedtools genomecov -i my.bed -g GRCh38.fa.fai
Tip 2. Use UCSC Table Browser to create a genome file:
One can use the UCSC Genome Browser's MySQL database to extract
chromosome sizes. For example, H. sapiens:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg19.chromInfo" > hg19.genome
"select chrom, size from hg19.chromInfo" > hg19.genome
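
Tip 1 states that bedtools only reads the first two columns of a `.fai` index. If an explicit two-column genome file is preferred, those columns can be extracted with `cut`. The sketch below uses made-up `.fai` content held in a variable, so the chromosome names and sizes are purely illustrative.

```bash
#!/bin/bash
# Made-up .fai content: name, length, offset, line bases, line width.
fai_content=$'chr1\t1000\t6\t60\t61\nchr2\t800\t1030\t60\t61'

# Keep only the chromosome name and length (columns 1 and 2).
genome_content=$(cut -f1,2 <<< "$fai_content")

printf '%s\n' "$genome_content"
# prints chr1<TAB>1000 and chr2<TAB>800
```

For a real index the equivalent would be `cut -f1,2 GRCh38.fa.fai > genome.txt`, although passing the `.fai` file directly to `-g` works as described in the tip.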

@@ -3,53 +3,63 @@
## VIASH START
## VIASH END
# Exit on error
set -eo pipefail
# Unset variables
# unset flags (using loop for many parameters)
unset_if_false=(
par_input_bam
par_depth
par_depth_zero
par_bed_graph
par_bed_graph_zero_coverage
par_split
par_ignore_deletion
par_pair_end_coverage
par_fragment_size
par_du
par_five_prime
par_three_prime
par_trackline
par_depth
par_depth_zero
par_bed_graph
par_bed_graph_zero_coverage
par_split
par_ignore_deletion
par_pair_end_coverage
par_fragment_size
par_du
par_five_prime
par_three_prime
par_trackline
)
for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
# Create input array
IFS=";" read -ra trackopts <<< $par_trackopts
# Convert semicolon-separated trackopts to array
if [[ -n "$par_trackopts" ]]; then
IFS=';' read -ra trackopts_array <<< "$par_trackopts"
fi
bedtools genomecov \
${par_depth:+-d} \
${par_depth_zero:+-dz} \
${par_bed_graph:+-bg} \
${par_bed_graph_zero_coverage:+-bga} \
${par_split:+-split} \
${par_ignore_deletion:+-ignoreD} \
${par_du:+-du} \
${par_five_prime:+-5} \
${par_three_prime:+-3} \
${par_trackline:+-trackline} \
${par_strand:+-strand "$par_strand"} \
${par_max:+-max "$par_max"} \
${par_scale:+-scale "$par_scale"} \
${par_trackopts:+-trackopts "${trackopts[*]}"} \
${par_input_bam:+-ibam "$par_input_bam"} \
${par_input:+-i "$par_input"} \
${par_genome:+-g "$par_genome"} \
${par_pair_end_coverage:+-pc} \
${par_fragment_size:+-fs} \
> "$par_output"
# Build command arguments
cmd_args=(
${par_input_bam:+-ibam "$par_input_bam"}
${par_input:+-i "$par_input"}
${par_genome:+-g "$par_genome"}
${par_depth:+-d}
${par_depth_zero:+-dz}
${par_bed_graph:+-bg}
${par_bed_graph_zero_coverage:+-bga}
${par_split:+-split}
${par_ignore_deletion:+-ignoreD}
${par_strand:+-strand "$par_strand"}
${par_pair_end_coverage:+-pc}
${par_fragment_size:+-fs}
${par_du:+-du}
${par_five_prime:+-5}
${par_three_prime:+-3}
${par_max:+-max "$par_max"}
${par_scale:+-scale "$par_scale"}
${par_trackline:+-trackline}
)
# Add multiple trackopts if provided
if [[ -n "$par_trackopts" ]]; then
for trackopt in "${trackopts_array[@]}"; do
cmd_args+=(-trackopts "$trackopt")
done
fi
# Execute bedtools genomecov
bedtools genomecov "${cmd_args[@]}" > "$par_output"
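
The new script splits the semicolon-separated `par_trackopts` value and adds one `-trackopts` flag per element. The sketch below shows what the resulting array looks like for a made-up value; the option strings are hypothetical and chosen only to make the splitting visible.

```bash
#!/bin/bash
# Hypothetical value; in a real component Viash sets par_trackopts.
par_trackopts='name=example;color=255,0,0'

cmd_args=()
if [[ -n "$par_trackopts" ]]; then
  # Setting IFS only for this read splits on ';' without changing the global IFS.
  IFS=';' read -ra trackopts_array <<< "$par_trackopts"
  for trackopt in "${trackopts_array[@]}"; do
    cmd_args+=(-trackopts "$trackopt")
  done
fi

printf '%s\n' "${cmd_args[@]}"
# prints four lines: -trackopts, name=example, -trackopts, color=255,0,0
```

Quoting `"$trackopt"` keeps each option a single array element even if it contains spaces.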

@@ -1,333 +1,166 @@
#!/bin/bash
# exit on error
set -eo pipefail
## VIASH START
meta_executable="target/executable/bedtools/bedtools_intersect/bedtools_intersect"
meta_resources_dir="src/bedtools/bedtools_intersect"
## VIASH END
# directory of the bam file
test_data="$meta_resources_dir/test_data"
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
diff -a "$2" "$1" \
|| (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################
# Create directories for tests
echo "Creating Test Data..."
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
log "Starting tests for $meta_name"
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Create test genome file
log "Creating test genome file..."
cat > "$test_dir/test.genome" << 'EOF'
chr1 10000
chr2 8000
chr3 5000
EOF
# Create test BED file
log "Creating test BED file..."
cat > "$test_dir/test.bed" << 'EOF'
chr1 100 200 feature1 100 +
chr1 300 500 feature2 200 -
chr2 1000 1500 feature3 150 +
chr2 2000 2200 feature4 180 -
chr3 500 800 feature5 120 +
EOF
# --- Test Case 1: Basic histogram output (default) ---
log "Starting TEST 1: Basic coverage histogram"
log "Executing $meta_name with default histogram output..."
"$meta_executable" \
--input "$test_dir/test.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output1.txt"
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.txt" "histogram output file"
check_file_not_empty "$meta_temp_dir/output1.txt" "histogram output file"
# Check histogram format (should have columns: chromosome, depth, count, total_bases, fraction)
line_count=$(wc -l < "$meta_temp_dir/output1.txt")
log "Histogram contains $line_count lines"
[ "$line_count" -gt 0 ] || { log_error "Histogram output is empty"; exit 1; }
# Check that it contains expected format
head -1 "$meta_temp_dir/output1.txt" | awk 'NF != 5 { exit 1 }' || {
log_error "Histogram format incorrect (expected 5 columns)"
exit 1
}
trap clean_up EXIT
# Create and populate input files
printf "chr1\t248956422\nchr2\t198295559\nchr3\t242193529\n" > "$TMPDIR/genome.txt"
printf "chr2\t128\t228\tmy_read/1\t37\t+\nchr2\t428\t528\tmy_read/2\t37\t-\n" > "$TMPDIR/example.bed"
printf "chr2\t128\t228\tmy_read/1\t60\t+\t128\t228\t255,0,0\t1\t100\t0\nchr2\t428\t528\tmy_read/2\t60\t-\t428\t528\t255,0,0\t1\t100\t0\n" > "$TMPDIR/example.bed12"
printf "chr2\t100\t103\n" > "$TMPDIR/example_dz.bed"
log "✅ TEST 1 completed successfully"
# expected outputs
cat > "$TMPDIR/expected_default.bed" <<EOF
chr2 0 198295359 198295559 0.999999
chr2 1 200 198295559 1.0086e-06
chr1 0 248956422 248956422 1
chr3 0 242193529 242193529 1
genome 0 689445310 689445510 1
genome 1 200 689445510 2.90088e-07
EOF
cat > "$TMPDIR/expected_ibam.bed" <<EOF
chr2:172936693-172938111 0 1218 1418 0.858956
chr2:172936693-172938111 1 200 1418 0.141044
genome 0 1218 1418 0.858956
genome 1 200 1418 0.141044
EOF
cat > "$TMPDIR/expected_ibam_pc.bed" <<EOF
chr2:172936693-172938111 0 1018 1418 0.717913
chr2:172936693-172938111 1 400 1418 0.282087
genome 0 1018 1418 0.717913
genome 1 400 1418 0.282087
EOF
cat > "$TMPDIR/expected_ibam_fs.bed" <<EOF
chr2:172936693-172938111 0 1218 1418 0.858956
chr2:172936693-172938111 1 200 1418 0.141044
genome 0 1218 1418 0.858956
genome 1 200 1418 0.141044
EOF
cat > "$TMPDIR/expected_dz.bed" <<EOF
chr2 100 1
chr2 101 1
chr2 102 1
EOF
cat > "$TMPDIR/expected_strand.bed" <<EOF
chr2 0 198295459 198295559 1
chr2 1 100 198295559 5.04298e-07
chr1 0 248956422 248956422 1
chr3 0 242193529 242193529 1
genome 0 689445410 689445510 1
genome 1 100 689445510 1.45044e-07
EOF
cat > "$TMPDIR/expected_5.bed" <<EOF
chr2 0 198295557 198295559 1
chr2 1 2 198295559 1.0086e-08
chr1 0 248956422 248956422 1
chr3 0 242193529 242193529 1
genome 0 689445508 689445510 1
genome 1 2 689445510 2.90088e-09
EOF
cat > "$TMPDIR/expected_bg_scale.bed" <<EOF
chr2 128 228 100
chr2 428 528 100
EOF
cat > "$TMPDIR/expected_trackopts.bed" <<EOF
track type=bedGraph name=example llama=Alpaco
chr2 128 228 1
chr2 428 528 1
EOF
cat > "$TMPDIR/expected_split.bed" <<EOF
chr2 0 198295359 198295559 0.999999
chr2 1 200 198295559 1.0086e-06
chr1 0 248956422 248956422 1
chr3 0 242193529 242193529 1
genome 0 689445310 689445510 1
genome 1 200 689445510 2.90088e-07
EOF
cat > "$TMPDIR/expected_ignoreD_du.bed" <<EOF
chr2:172936693-172938111 0 1218 1418 0.858956
chr2:172936693-172938111 1 200 1418 0.141044
genome 0 1218 1418 0.858956
genome 1 200 1418 0.141044
# --- Test Case 2: BedGraph format ---
log "Starting TEST 2: BedGraph format output"
log "Executing $meta_name with BedGraph format..."
"$meta_executable" \
--input "$test_dir/test.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output2.bg" \
--bed_graph
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.bg" "BedGraph output file"
check_file_not_empty "$meta_temp_dir/output2.bg" "BedGraph output file"
# Check BedGraph format (chromosome, start, end, depth)
head -1 "$meta_temp_dir/output2.bg" | awk 'NF != 4 { exit 1 }' || {
log_error "BedGraph format incorrect (expected 4 columns)"
exit 1
}
# Check that coordinates make sense (start < end)
awk '$2 >= $3 { print "Invalid coordinates: " $0; exit 1 }' "$meta_temp_dir/output2.bg" || {
log_error "Invalid BedGraph coordinates found"
exit 1
}
log "✅ TEST 2 completed successfully"
# --- Test Case 3: Per-base depth ---
log "Starting TEST 3: Per-base depth output"
log "Executing $meta_name with per-base depth..."
"$meta_executable" \
--input "$test_dir/test.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output3.depth" \
--depth
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3.depth" "depth output file"
check_file_not_empty "$meta_temp_dir/output3.depth" "depth output file"
# Check depth format (chromosome, position, depth)
head -1 "$meta_temp_dir/output3.depth" | awk 'NF != 3 { exit 1 }' || {
log_error "Depth format incorrect (expected 3 columns)"
exit 1
}
log "✅ TEST 3 completed successfully"
# --- Test Case 4: BedGraph with zero coverage ---
log "Starting TEST 4: BedGraph with zero coverage"
log "Executing $meta_name with BedGraph including zero coverage..."
"$meta_executable" \
--input "$test_dir/test.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output4.bga" \
--bed_graph_zero_coverage
log "Validating TEST 4 outputs..."
check_file_exists "$meta_temp_dir/output4.bga" "BedGraph+zero output file"
check_file_not_empty "$meta_temp_dir/output4.bga" "BedGraph+zero output file"
# This output should be larger than regular BedGraph since it includes zero coverage
bg_size=$(wc -l < "$meta_temp_dir/output2.bg")
bga_size=$(wc -l < "$meta_temp_dir/output4.bga")
log "BedGraph lines: $bg_size, BedGraph+zero lines: $bga_size"
# Check that we can find zero coverage regions
if grep -q " 0$" "$meta_temp_dir/output4.bga"; then
log "✓ Found zero coverage regions in output"
else
log "Note: No zero coverage regions found (this may be expected with test data)"
fi
log "✅ TEST 4 completed successfully"
# --- Test Case 5: Test strand-specific coverage ---
log "Starting TEST 5: Strand-specific coverage"
# Create BED file with strand information (6 columns minimum)
cat > "$test_dir/strand.bed" << 'EOF'
chr1 100 200 feature1 100 +
chr1 300 500 feature2 200 -
EOF
# Test 1:
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
echo "> Run bedtools_genomecov on BED file"
log "Executing $meta_name with strand-specific coverage..."
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bed"
--input "$test_dir/strand.bed" \
--genome "$test_dir/test.genome" \
--output "$meta_temp_dir/output5.txt" \
--strand "+"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_default.bed"
echo "- test1 succeeded -"
log "Validating TEST 5 outputs..."
check_file_exists "$meta_temp_dir/output5.txt" "strand-specific output file"
check_file_not_empty "$meta_temp_dir/output5.txt" "strand-specific output file"
popd > /dev/null
log "✅ TEST 5 completed successfully"
# Test 2: ibam option
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
echo "> Run bedtools_genomecov on BAM file with -ibam"
"$meta_executable" \
--input_bam "$test_data/example.bam" \
--output "output.bed" \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_ibam.bed"
echo "- test2 succeeded -"
popd > /dev/null
# Test 3: depth option
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
echo "> Run bedtools_genomecov on BED file with -dz"
"$meta_executable" \
--input "../example_dz.bed" \
--genome "../genome.txt" \
--output "output.bed" \
--depth_zero
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_dz.bed"
echo "- test3 succeeded -"
popd > /dev/null
# Test 4: strand option
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
echo "> Run bedtools_genomecov on BED file with -strand"
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bed" \
--strand "-" \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_strand.bed"
echo "- test4 succeeded -"
popd > /dev/null
# Test 5: 5' end option
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
echo "> Run bedtools_genomecov on BED file with -5"
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bed" \
--five_prime \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_5.bed"
echo "- test5 succeeded -"
popd > /dev/null
# Test 6: max option
mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null
echo "> Run bedtools_genomecov on BED file with -max"
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bed" \
--max 100 \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_default.bed"
echo "- test6 succeeded -"
popd > /dev/null
# Test 7: bedgraph and scale option
mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null
echo "> Run bedtools_genomecov on BED file with -bg and -scale"
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bed" \
--bed_graph \
--scale 100 \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_bg_scale.bed"
echo "- test7 succeeded -"
popd > /dev/null
# Test 8: trackopts option
mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null
echo "> Run bedtools_genomecov on BED file with -bg and -trackopts"
"$meta_executable" \
--input "../example.bed" \
--genome "../genome.txt" \
--output "output.bed" \
--bed_graph \
--trackopts "name=example" \
--trackopts "llama=Alpaco" \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_trackopts.bed"
echo "- test8 succeeded -"
popd > /dev/null
# Test 9: ibam pc options
mkdir "$TMPDIR/test9" && pushd "$TMPDIR/test9" > /dev/null
echo "> Run bedtools_genomecov on BAM file with -ibam, -pc"
"$meta_executable" \
--input_bam "$test_data/example.bam" \
--output "output.bed" \
--fragment_size \
--pair_end_coverage \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_ibam_pc.bed"
echo "- test9 succeeded -"
popd > /dev/null
# Test 10: ibam fs options
mkdir "$TMPDIR/test10" && pushd "$TMPDIR/test10" > /dev/null
echo "> Run bedtools_genomecov on BAM file with -ibam, -fs"
"$meta_executable" \
--input_bam "$test_data/example.bam" \
--output "output.bed" \
--fragment_size \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_ibam_fs.bed"
echo "- test10 succeeded -"
popd > /dev/null
# Test 11: split
mkdir "$TMPDIR/test11" && pushd "$TMPDIR/test11" > /dev/null
echo "> Run bedtools_genomecov on BED12 file with -split"
"$meta_executable" \
--input "../example.bed12" \
--genome "../genome.txt" \
--output "output.bed" \
--split \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_split.bed"
echo "- test11 succeeded -"
popd > /dev/null
# Test 12: ignore deletion and du
mkdir "$TMPDIR/test12" && pushd "$TMPDIR/test12" > /dev/null
echo "> Run bedtools_genomecov on BAM file with -ignoreD and -du"
"$meta_executable" \
--input_bam "$test_data/example.bam" \
--output "output.bed" \
--ignore_deletion \
--du \
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_ignoreD_du.bed"
echo "- test12 succeeded -"
popd > /dev/null
echo "---- All tests succeeded! ----"
exit 0
print_test_summary "All tests completed successfully"
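
The BedGraph checks in TEST 2 above (four columns, start strictly less than end) can be folded into a single pass. The following is a minimal sketch, not part of the commit: `validate_bedgraph` is a hypothetical helper name, and unlike the test above it checks the column count on every line rather than only the first.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper (not in test_helpers.sh): succeeds only when every
# line has exactly 4 fields and column 2 (start) is less than column 3 (end).
validate_bedgraph() {
  awk 'NF != 4 || $2 >= $3 { bad = 1; exit } END { exit bad }' "$1"
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Sample records modelled on expected_trackopts.bed, without the track line.
printf 'chr2\t128\t228\t1\nchr2\t428\t528\t1\n' > "$tmp/good.bg"
# Deliberately broken record: start is greater than end.
printf 'chr2\t228\t128\t1\n' > "$tmp/bad.bg"

validate_bedgraph "$tmp/good.bg" && echo "good.bg: valid"
validate_bedgraph "$tmp/bad.bg" || echo "bad.bg: invalid"
```

A `track` header line, as produced with `--trackopts`, would fail this check and would need to be skipped first.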

View File

@@ -1,7 +1,13 @@
name: bedtools_getfasta
namespace: bedtools
description: Extract sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.
keywords: [sequencing, fasta, BED, GFF, VCF]
description: |
Extract DNA sequences from a FASTA file based on feature coordinates.
Given intervals specified in BED/GFF/VCF format and a FASTA file, this tool
extracts the corresponding sequences from the FASTA file. Various output formats
are supported including FASTA (default), tab-delimited, and BED format with sequences.
keywords: [sequencing, fasta, BED, GFF, VCF, sequence extraction]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
repository: https://github.com/arq5x/bedtools2
@@ -12,20 +18,27 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/dries_schaumont.yaml
roles: [ author, maintainer ]
roles: [author, maintainer]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author]
argument_groups:
- name: Input arguments
arguments:
- name: --input_fasta
alternatives: [-fi]
type: file
required: true
description: |
FASTA file containing sequences for each interval specified in the input BED file.
The headers in the input FASTA file must exactly match the chromosome column in the BED file.
Input FASTA file containing sequences for extraction.
The headers in the input FASTA file must exactly match the chromosome
column in the BED file.
- name: "--input_bed"
alternatives: [-bed]
type: file
required: true
description: |
BED file containing intervals to extract from the FASTA file.
BED/GFF/VCF file containing intervals to extract from the FASTA file.
BED files containing a single region require a newline character
at the end of the line, otherwise a blank output file is produced.
- name: --rna
@@ -33,7 +46,7 @@ argument_groups:
description: |
The FASTA is RNA not DNA. Reverse complementation handled accordingly.
- name: Run arguments
- name: Processing options
arguments:
- name: "--strandedness"
type: boolean_true
@@ -41,47 +54,49 @@ argument_groups:
description: |
Force strandedness. If the feature occupies the antisense strand, the output sequence will
be reverse complemented. By default strandedness is not taken into account.
- name: "--split"
type: boolean_true
description: |
When input is in BED12 format, extract and concatenate the sequences from the blocks (e.g., exons) of each BED12 record.
Blocks are described in the 11th and 12th columns of the BED format.
- name: "--full_header"
type: boolean_true
alternatives: [-fullHeader]
description: |
Use full FASTA header. By default, only the word before the first space or tab is used.
- name: Output arguments
arguments:
- name: --output
alternatives: [-o]
alternatives: [-o, -fo]
required: true
type: file
direction: output
description: |
Output file where the output from the 'bedtools getfasta' commend will
be written to.
Output file where the extracted sequences will be written.
By default, output is in FASTA format unless --tab or --bed_out is specified.
- name: --name
type: boolean_true
description: |
Set the FASTA header for each extracted sequence to be the "name" and coordinate
columns from the BED feature (format: name::chr:start-end).
- name: "--name_only"
type: boolean_true
alternatives: [-nameOnly]
description: |
Set the FASTA header for each extracted sequence to be only the "name"
column from the BED feature.
- name: --tab
type: boolean_true
description: |
Report extract sequences in a tab-delimited format instead of in FASTA format.
Report extracted sequences in a tab-delimited format instead of FASTA format.
Output format: name<tab>sequence.
- name: --bed_out
type: boolean_true
alternatives: [-bedOut]
description: |
Report extract sequences in a tab-delimited BED format instead of in FASTA format.
- name: "--name"
type: boolean_true
description: |
Set the FASTA header for each extracted sequence to be the "name" and coordinate columns from the BED feature.
- name: "--name_only"
type: boolean_true
description: |
Set the FASTA header for each extracted sequence to be the "name" columns from the BED feature.
- name: "--split"
type: boolean_true
description: |
When --input is in BED12 format, create a separate fasta entry for each block in a BED12 record,
blocks being described in the 11th and 12th column of the BED.
- name: "--full_header"
type: boolean_true
description: |
Use full fasta header. By default, only the word before the first space or tab is used.
# Arguments not taken into account:
#
# -fo [Specify an output file name. By default, output goes to stdout.
#
Report extracted sequences in a tab-delimited BED format instead of FASTA format.
Output format: chr<tab>start<tab>end<tab>name<tab>sequence.
resources:
- type: bash_script
@@ -90,16 +105,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable
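
The new setup step rewrites the `bedtools --version` banner into a `name: version` line. The sketch below replays that `sed` expression on a sample string; the input `bedtools v2.31.1` is an assumption about what the container prints, and `format_version` is a hypothetical wrapper used only for illustration.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical wrapper around the pipeline used in the setup step:
# keep the first line and turn "bedtools v<version>" into "bedtools: <version>".
format_version() {
  head -1 | sed 's/.*bedtools v/bedtools: /'
}

# Assumed sample of the `bedtools --version` banner.
printf 'bedtools v2.31.1\n' | format_version
# prints: bedtools: 2.31.1
```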

View File

@@ -0,0 +1,30 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools getfasta -h
```
Tool: bedtools getfasta (aka fastaFromBed)
Version: v2.31.1
Summary: Extract DNA sequences from a fasta file based on feature coordinates.
Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
Options:
-fi Input FASTA file
-fo Output file (opt., default is STDOUT
-bed BED/GFF/VCF file of ranges to extract from -fi
-name Use the name field and coordinates for the FASTA header
-name+ (deprecated) Use the name field and coordinates for the FASTA header
-nameOnly Use the name field for the FASTA header
-split Given BED12 fmt., extract and concatenate the sequences
from the BED "blocks" (e.g., exons)
-tab Write output in TAB delimited format.
-bedOut Report extract sequences in a tab-delimited BED format instead of in FASTA format.
- Default is FASTA format.
-s Force strandedness. If the feature occupies the antisense,
strand, the sequence will be reverse complemented.
- By default, strand information is ignored.
-fullHeader Use full fasta header.
- By default, only the word before the first space or tab
is used.
-rna The FASTA is RNA not DNA. Reverse complementation handled accordingly.

View File

@@ -1,22 +1,42 @@
#!/usr/bin/env bash
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
unset_if_false=( par_rna par_strandedness par_tab par_bed_out par_name par_name_only par_split par_full_header )
# unset flags (using loop for many parameters)
unset_if_false=(
par_rna
par_strandedness
par_split
par_full_header
par_name
par_name_only
par_tab
par_bed_out
)
for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
bedtools getfasta \
-fi "$par_input_fasta" \
-bed "$par_input_bed" \
${par_rna:+-rna} \
${par_name:+-name} \
${par_name_only:+-nameOnly} \
${par_tab:+-tab} \
${par_bed_out:+-bedOut} \
${par_strandedness:+-s} \
${par_split:+-split} \
${par_full_header:+-fullHeader} > "$par_output"
# Build command arguments array
cmd_args=(
-fi "$par_input_fasta"
-bed "$par_input_bed"
-fo "$par_output"
${par_rna:+-rna}
${par_strandedness:+-s}
${par_split:+-split}
${par_full_header:+-fullHeader}
${par_name:+-name}
${par_name_only:+-nameOnly}
${par_tab:+-tab}
${par_bed_out:+-bedOut}
)
# Execute bedtools command
bedtools getfasta "${cmd_args[@]}"
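
The rewritten script relies on two bash features: unsetting a variable that holds `false`, and the `${var:+word}` expansion, which yields `word` only when the variable is set and non-empty. A minimal sketch of that pattern follows; `build_args` is a hypothetical function for demonstration and prints the arguments instead of running bedtools.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical demo function: takes two boolean-like strings and prints
# the flags that would end up in the argument array.
build_args() {
  local par_tab="$1" par_name="$2"

  # Flags that are "false" are unset so that ${var:+...} expands to nothing.
  [[ "$par_tab" == "false" ]] && unset par_tab
  [[ "$par_name" == "false" ]] && unset par_name

  # Empty expansions are dropped by word splitting, so no blank
  # arguments are stored in the array.
  local cmd_args=(
    ${par_tab:+-tab}
    ${par_name:+-name}
  )
  printf '%s\n' "${cmd_args[*]}"
}

build_args true false   # prints: -tab
build_args true true    # prints: -tab -name
```

Collecting the arguments in an array and expanding it as `"${cmd_args[@]}"` keeps values containing spaces, such as file paths, intact as single arguments.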

View File

@@ -1,119 +1,121 @@
#!/usr/bin/env bash
set -eo pipefail
#!/bin/bash
TMPDIR=$(mktemp -d)
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
}
trap clean_up EXIT
## VIASH START
## VIASH END
# Create dummy test fasta file
cat > "$TMPDIR/test.fa" <<EOF
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# Test execution with centralized functions
#############################################
log "Starting tests for $meta_name"
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Create test FASTA file
log "Creating test FASTA data..."
cat > "$test_dir/test.fa" << 'EOF'
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
>chr2
TTTTTTTTGGGGGGGGGGGGGGCGGATCGGGGGGGGGGGGGGAAA
EOF
TAB="$(printf '\t')"
# Create dummy bed file
cat > "$TMPDIR/test.bed" <<EOF
chr1${TAB}5${TAB}10${TAB}myseq
# Create test BED file
cat > "$test_dir/test.bed" << 'EOF'
chr1 5 10 seq1
chr2 15 20 seq2
EOF
# Create expected bed file
cat > "$TMPDIR/expected.fasta" <<EOF
>chr1:5-10
AAACC
EOF
# --- Test Case 1: Basic FASTA sequence extraction ---
log "Starting TEST 1: Basic FASTA sequence extraction"
log "Executing $meta_name with basic parameters..."
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--output "$TMPDIR/output.fasta"
--input_bed "$test_dir/test.bed" \
--input_fasta "$test_dir/test.fa" \
--output "$meta_temp_dir/output1.fasta"
cmp --silent "$TMPDIR/output.fasta" "$TMPDIR/expected.fasta" || { echo "files are different:"; exit 1; }
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.fasta" "output FASTA file"
check_file_not_empty "$meta_temp_dir/output1.fasta" "output FASTA file"
check_file_contains "$meta_temp_dir/output1.fasta" ">chr1:5-10"
check_file_contains "$meta_temp_dir/output1.fasta" "AAACC"
log "✅ TEST 1 completed successfully"
# --- Test Case 2: FASTA extraction with --name option ---
log "Starting TEST 2: FASTA extraction with --name option"
# Create expected bed file for --name
cat > "$TMPDIR/expected_with_name.fasta" <<EOF
>myseq::chr1:5-10
AAACC
EOF
log "Executing $meta_name with --name option..."
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--input_bed "$test_dir/test.bed" \
--input_fasta "$test_dir/test.fa" \
--name \
--output "$TMPDIR/output_with_name.fasta"
--output "$meta_temp_dir/output2.fasta"
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.fasta" "output FASTA file with names"
check_file_not_empty "$meta_temp_dir/output2.fasta" "output FASTA file with names"
check_file_contains "$meta_temp_dir/output2.fasta" ">seq1::chr1:5-10"
check_file_contains "$meta_temp_dir/output2.fasta" ">seq2::chr2:15-20"
log "✅ TEST 2 completed successfully"
cmp --silent "$TMPDIR/output_with_name.fasta" "$TMPDIR/expected_with_name.fasta" || { echo "Files when using --name are different."; exit 1; }
# Create expected bed file for --name_only
cat > "$TMPDIR/expected_with_name_only.fasta" <<EOF
>myseq
AAACC
EOF
# --- Test Case 3: FASTA extraction with --name_only option ---
log "Starting TEST 3: FASTA extraction with --name_only option"
log "Executing $meta_name with --name_only option..."
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--input_bed "$test_dir/test.bed" \
--input_fasta "$test_dir/test.fa" \
--name_only \
--output "$TMPDIR/output_with_name_only.fasta"
--output "$meta_temp_dir/output3.fasta"
cmp --silent "$TMPDIR/output_with_name_only.fasta" "$TMPDIR/expected_with_name_only.fasta" || { echo "Files when using --name_only are different."; exit 1; }
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3.fasta" "output FASTA file with name only"
check_file_not_empty "$meta_temp_dir/output3.fasta" "output FASTA file with name only"
check_file_contains "$meta_temp_dir/output3.fasta" ">seq1"
check_file_contains "$meta_temp_dir/output3.fasta" ">seq2"
log "✅ TEST 3 completed successfully"
# --- Test Case 4: Tab-delimited output ---
log "Starting TEST 4: Tab-delimited output with --tab option"
# Create expected tab-delimited file for --tab
cat > "$TMPDIR/expected_tab.out" <<EOF
myseq${TAB}AAACC
EOF
log "Executing $meta_name with --tab option..."
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--input_bed "$test_dir/test.bed" \
--input_fasta "$test_dir/test.fa" \
--name_only \
--tab \
--output "$TMPDIR/tab.out"
--output "$meta_temp_dir/output4.txt"
cmp --silent "$TMPDIR/expected_tab.out" "$TMPDIR/tab.out" || { echo "Files when using --tab are different."; exit 1; }
log "Validating TEST 4 outputs..."
check_file_exists "$meta_temp_dir/output4.txt" "tab-delimited output file"
check_file_not_empty "$meta_temp_dir/output4.txt" "tab-delimited output file"
check_file_contains "$meta_temp_dir/output4.txt" "seq1"
check_file_contains "$meta_temp_dir/output4.txt" "AAACC"
log "✅ TEST 4 completed successfully"
# --- Test Case 5: BED output format ---
log "Starting TEST 5: BED output format with --bed_out option"
# Create expected tab-delimited file for --bed_out
cat > "$TMPDIR/expected.bed" <<EOF
chr1${TAB}5${TAB}10${TAB}myseq${TAB}AAACC
EOF
log "Executing $meta_name with --bed_out option..."
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--input_bed "$test_dir/test.bed" \
--input_fasta "$test_dir/test.fa" \
--bed_out \
--output "$TMPDIR/output.bed"
--output "$meta_temp_dir/output5.bed"
log "Validating TEST 5 outputs..."
check_file_exists "$meta_temp_dir/output5.bed" "BED output file"
check_file_not_empty "$meta_temp_dir/output5.bed" "BED output file"
# BED format output contains sequences with coordinates
log "✅ TEST 5 completed successfully"
cmp --silent "$TMPDIR/expected.bed" "$TMPDIR/output.bed" || { echo "Files when using --bed_out are different."; exit 1; }
# Create dummy bed file for strandedness
cat > "$TMPDIR/test_strandedness.bed" <<EOF
chr1${TAB}20${TAB}25${TAB}forward${TAB}1${TAB}+
chr1${TAB}20${TAB}25${TAB}reverse${TAB}1${TAB}-
EOF
# Create expected tab-delimited file for --bed_out
cat > "$TMPDIR/expected_strandedness.fasta" <<EOF
>forward(+)
CGCTA
>reverse(-)
TAGCG
EOF
"$meta_executable" \
--input_bed "$TMPDIR/test_strandedness.bed" \
--input_fasta "$TMPDIR/test.fa" \
-s \
--name_only \
--output "$TMPDIR/output_strandedness.fasta"
cmp --silent "$TMPDIR/expected_strandedness.fasta" "$TMPDIR/output_strandedness.fasta" || { echo "Files when using -s are different."; exit 1; }
log "🎉 All tests completed successfully for $meta_name!"
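
The expected sequences in these tests follow from BED coordinates being 0-based and half-open: `chr1 5 10` covers indices 5 through 9. The sketch below reproduces two of the expected values with plain bash string operations; `extract` and `revcomp` are hypothetical helpers and do not call bedtools.

```bash
#!/bin/bash
set -eo pipefail

# chr1 sequence copied from the test FASTA above.
chr1="AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"

# Hypothetical helper: return the bases in [start, end), BED-style.
extract() {
  local seq="$1" start="$2" end="$3"
  printf '%s\n' "${seq:start:end-start}"
}

# Hypothetical helper: reverse the string, then complement each base.
revcomp() {
  local seq="$1" out="" i
  for (( i = ${#seq} - 1; i >= 0; i-- )); do
    out+="${seq:i:1}"
  done
  printf '%s\n' "$out" | tr 'ACGT' 'TGCA'
}

extract "$chr1" 5 10                  # prints: AAACC
revcomp "$(extract "$chr1" 20 25)"    # prints: TAGCG
```

The second value matches what the removed strandedness test expected for a feature on the `-` strand.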

View File

@@ -16,7 +16,9 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
@@ -139,16 +141,15 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable

View File

@@ -1,9 +1,9 @@
```bash
bedtools groupby
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools groupby -h
```
Tool: bedtools groupby
Version: v2.30.0
Version: v2.31.1
Summary: Summarizes a dataset column based upon
common column groupings. Akin to the SQL "group by" command.

View File

@@ -3,34 +3,30 @@
## VIASH START
## VIASH END
# Exit on error
set -eo pipefail
# Unset parameters
unset_if_false=(
par_full
par_inheader
par_outheader
par_header
par_ignorecase
# unset flags
[[ "$par_full" == "false" ]] && unset par_full
[[ "$par_inheader" == "false" ]] && unset par_inheader
[[ "$par_outheader" == "false" ]] && unset par_outheader
[[ "$par_header" == "false" ]] && unset par_header
[[ "$par_ignorecase" == "false" ]] && unset par_ignorecase
# Build command arguments array
cmd_args=(
-i "$par_input"
-g "$par_groupby"
-c "$par_column"
${par_operation:+-o "$par_operation"}
${par_full:+-full}
${par_inheader:+-inheader}
${par_outheader:+-outheader}
${par_header:+-header}
${par_ignorecase:+-ignorecase}
${par_precision:+-prec "$par_precision"}
${par_delimiter:+-delim "$par_delimiter"}
)
for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
done
bedtools groupby \
${par_full:+-full} \
${par_inheader:+-inheader} \
${par_outheader:+-outheader} \
${par_header:+-header} \
${par_ignorecase:+-ignorecase} \
${par_precision:+-prec "$par_precision"} \
${par_delimiter:+-delim "$par_delimiter"} \
-i "$par_input" \
-g "$par_groupby" \
-c "$par_column" \
${par_operation:+-o "$par_operation"} \
> "$par_output"
# Execute bedtools command
bedtools groupby "${cmd_args[@]}" > "$par_output"
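
The wrapper above passes a grouping column (`-g`), a value column (`-c`), and an operation (`-o`). To illustrate what a `sum` over column 5 grouped by column 1 computes, here is a rough awk equivalent. `group_sum` is a hypothetical helper, not the bedtools implementation, and the sample rows are illustrative. One known difference: bedtools groupby expects input already sorted by the grouping columns, whereas this sketch groups across the whole file.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper: sum column 5 for each distinct value of column 1,
# printing groups in first-seen order.
group_sum() {
  awk -F'\t' -v OFS='\t' '
    !($1 in sum) { order[++n] = $1 }
    { sum[$1] += $5 }
    END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }
  ' "$1"
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Illustrative tab-separated BED6 rows.
printf 'chr1\t100\t200\tfeature1\t10\t+\n'  > "$tmp/test.bed"
printf 'chr1\t300\t400\tfeature2\t20\t+\n' >> "$tmp/test.bed"
printf 'chr1\t500\t600\tfeature3\t30\t+\n' >> "$tmp/test.bed"
printf 'chr2\t100\t200\tfeature4\t15\t-\n' >> "$tmp/test.bed"
printf 'chr2\t300\t400\tfeature5\t25\t-\n' >> "$tmp/test.bed"
printf 'chr3\t100\t200\tfeature6\t35\t+\n' >> "$tmp/test.bed"

group_sum "$tmp/test.bed"
# prints three tab-separated lines: chr1 60, chr2 40, chr3 35
```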

View File

@@ -1,198 +1,125 @@
#!/bin/bash
# exit on error
set -eo pipefail
## VIASH START
meta_executable="target/executable/bedtools/bedtools_groupby/bedtools_groupby"
meta_resources_dir="src/bedtools/bedtools_groupby"
## VIASH END
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
diff -a "$2" "$1" \
|| (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################
# Create directories for tests
echo "Creating Test Data..."
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
}
trap clean_up EXIT
log "Starting tests for $meta_name"
# Create and populate example.bed
cat << EOF > $TMPDIR/example.bed
# Header
chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 +
chr21 9719758 9729320 variant1 chr21 9721905 9725582 ALR/Alpha 1010 +
chr21 9719758 9729320 variant1 chr21 9725582 9725977 L1PA3 3288 +
chr21 9719758 9729320 variant1 chr21 9726021 9729309 ALR/Alpha 1051 +
chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 -
chr21 9729310 9757478 variant2 chr21 9729809 9730866 L1P1 8367 +
chr21 9729310 9757478 variant2 chr21 9730866 9734026 ALR/Alpha 1036 -
chr21 9729310 9757478 variant2 chr21 9734037 9757471 ALR/Alpha 1182 -
chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 +
chr21 9795588 9796685 variant3 chr21 9795736 9795894 (GAATG)n 683 +
chr21 9795588 9796685 variant3 chr21 9795911 9796007 (GAATG)n 345 +
chr21 9795588 9796685 variant3 chr21 9796028 9796187 (GAATG)n 756 +
chr21 9795588 9796685 variant3 chr21 9796202 9796615 (GAATG)n 891 +
chr21 9795588 9796685 variant3 chr21 9796637 9796824 (GAATG)n 621 +
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Create test BED file with data for grouping
log "Creating test BED data..."
cat > "$test_dir/test.bed" << 'EOF'
chr1 100 200 feature1 10 +
chr1 300 400 feature2 20 +
chr1 500 600 feature3 30 +
chr2 100 200 feature4 15 -
chr2 300 400 feature5 25 -
chr3 100 200 feature6 35 +
EOF
# Create and populate expected output files for different tests
cat << EOF > $TMPDIR/expected.bed
chr21 9719758 9729320 6353
chr21 9729310 9757478 14482
chr21 9795588 9796685 3604
EOF
cat << EOF > $TMPDIR/expected_max.bed
chr21 9719758 9729320 variant1 3288
chr21 9729310 9757478 variant2 8367
chr21 9795588 9796685 variant3 891
EOF
cat << EOF > $TMPDIR/expected_full.bed
chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 + 6353
chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 - 14482
chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 + 3604
EOF
cat << EOF > $TMPDIR/expected_delimited.bed
chr21 9719758 9729320 variant1 1004;1010;3288;1051
chr21 9729310 9757478 variant2 3897;8367;1036;1182
chr21 9795588 9796685 variant3 308;683;345;756;891;621
EOF
cat << EOF > $TMPDIR/expected_precision.bed
chr21 9719758 9729320 variant1 1.6e+03
chr21 9729310 9757478 variant2 3.6e+03
chr21 9795588 9796685 variant3 6e+02
EOF
# --- Test Case 1: Basic grouping by column 1 (chromosome) with sum operation ---
log "Starting TEST 1: Basic grouping by chromosome with sum"
# Test 1: without operation option, default operation is sum
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
echo "> Run bedtools groupby on BED file"
log "Executing $meta_name with basic grouping..."
"$meta_executable" \
--input "../example.bed" \
--groupby "1,2,3" \
--column "9" \
--output "output.bed"
--input "$test_dir/test.bed" \
--groupby 1 \
--column 5 \
--operation sum \
--output "$meta_temp_dir/output1.txt"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected.bed"
echo "- test1 succeeded -"
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.txt" "grouped output file"
check_file_not_empty "$meta_temp_dir/output1.txt" "grouped output file"
check_file_contains "$meta_temp_dir/output1.txt" "chr1"
check_file_contains "$meta_temp_dir/output1.txt" "chr2"
check_file_contains "$meta_temp_dir/output1.txt" "chr3"
log "✅ TEST 1 completed successfully"
popd > /dev/null
# --- Test Case 2: Group by multiple columns with mean operation ---
log "Starting TEST 2: Group by chromosome and strand with mean"
# Test 2: with operation max option
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
echo "> Run bedtools groupby on BED file with max operation"
log "Executing $meta_name with multiple column grouping..."
"$meta_executable" \
--input "../example.bed" \
--groupby "1-4" \
--column "9" \
--operation "max" \
--output "output.bed"
--input "$test_dir/test.bed" \
--groupby 1,6 \
--column 5 \
--operation mean \
--output "$meta_temp_dir/output2.txt"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_max.bed"
echo "- test2 succeeded -"
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.txt" "multi-column grouped output"
check_file_not_empty "$meta_temp_dir/output2.txt" "multi-column grouped output"
check_file_contains "$meta_temp_dir/output2.txt" "chr1"
check_file_contains "$meta_temp_dir/output2.txt" "+"
check_file_contains "$meta_temp_dir/output2.txt" "-"
log "✅ TEST 2 completed successfully"
popd > /dev/null
# --- Test Case 3: Count operation ---
log "Starting TEST 3: Group by chromosome with count operation"
# Test 3: full option
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
echo "> Run bedtools groupby on BED file with full option"
log "Executing $meta_name with count operation..."
"$meta_executable" \
--input "../example.bed" \
--groupby "1-4" \
--column "9" \
--input "$test_dir/test.bed" \
--groupby 1 \
--column 5 \
--operation count \
--output "$meta_temp_dir/output3.txt"
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3.txt" "count output file"
check_file_not_empty "$meta_temp_dir/output3.txt" "count output file"
# chr1 should have 3 features, chr2 should have 2, chr3 should have 1
check_file_contains "$meta_temp_dir/output3.txt" "3"
check_file_contains "$meta_temp_dir/output3.txt" "2"
check_file_contains "$meta_temp_dir/output3.txt" "1"
log "✅ TEST 3 completed successfully"
# --- Test Case 4: Min/Max operations ---
log "Starting TEST 4: Group by chromosome with min operation"
log "Executing $meta_name with min operation..."
"$meta_executable" \
--input "$test_dir/test.bed" \
--groupby 1 \
--column 5 \
--operation min \
--output "$meta_temp_dir/output4.txt"
log "Validating TEST 4 outputs..."
check_file_exists "$meta_temp_dir/output4.txt" "min output file"
check_file_not_empty "$meta_temp_dir/output4.txt" "min output file"
log "✅ TEST 4 completed successfully"
# --- Test Case 5: Full output with additional options ---
log "Starting TEST 5: Group with full output and header"
log "Executing $meta_name with full output options..."
"$meta_executable" \
--input "$test_dir/test.bed" \
--groupby 1 \
--column 5 \
--operation sum \
--full \
--output "output.bed"
--output "$meta_temp_dir/output5.txt"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_full.bed"
echo "- test3 succeeded -"
log "Validating TEST 5 outputs..."
check_file_exists "$meta_temp_dir/output5.txt" "full output file"
check_file_not_empty "$meta_temp_dir/output5.txt" "full output file"
# Full output should include more columns from original data
log "✅ TEST 5 completed successfully"
popd > /dev/null
# Test 4: header option
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
echo "> Run bedtools groupby on BED file with header option"
"$meta_executable" \
--input "../example.bed" \
--groupby "1-4" \
--column "9" \
--header \
--output "output.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_file_contains "output.bed" "# Header"
echo "- test4 succeeded -"
popd > /dev/null
# Test 5: Delimiter and collapse
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
echo "> Run bedtools groupby on BED file with delimiter and collapse options"
"$meta_executable" \
--input "../example.bed" \
--groupby "1-4" \
--column "9" \
--operation "collapse" \
--delimiter ";" \
--output "output.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_delimited.bed"
echo "- test5 succeeded -"
popd > /dev/null
# Test 6: precision option
mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null
echo "> Run bedtools groupby on BED file with precision option"
"$meta_executable" \
--input "../example.bed" \
--groupby "1-4" \
--column "9" \
--operation "mean" \
--precision 2 \
--output "output.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../expected_precision.bed"
echo "- test6 succeeded -"
popd > /dev/null
echo "---- All tests succeeded! ----"
exit 0
log "🎉 All tests completed successfully for $meta_name!"

@@ -0,0 +1,159 @@
name: bedtools_igv
namespace: bedtools
description: |
Create IGV batch script to generate automated screenshots of genomic regions.
This tool generates a batch script that can be run within IGV (Integrative Genomics Viewer)
to automatically create image snapshots at each interval defined in a BED/GFF/VCF file.
Useful for creating automated visualizations of genomic features or regions of interest.
keywords: [genomics, visualization, igv, screenshots, batch, automation, intervals]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/igv.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: [-i]
type: file
description: |
Input file with genomic intervals for visualization.
**Format:** BED, GFF, or VCF file with genomic regions
**Usage:** Each interval will generate one IGV screenshot
**Column 4:** Optional name field used for image filenames (with --use_name)
required: true
example: regions_of_interest.bed
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output IGV batch script file.
**Format:** Plain text script with IGV commands
**Usage:** Run this script within IGV to generate automated screenshots
**Extension:** Typically .txt or .igv
required: true
example: igv_batch_script.txt
- name: Output Configuration
arguments:
- name: --output_path
alternatives: [--path]
type: string
description: |
Full path where IGV snapshots should be written.
**Format:** Directory path (must exist before running script)
**Default:** Current directory (./)
**Example:** "/path/to/igv/images/"
**Note:** Include trailing slash for directories
example: "./igv_images/"
- name: --image_format
alternatives: [--img]
type: string
description: |
Image format for generated screenshots.
**Options:** png, eps, svg
**Default:** png
**Recommendation:** PNG for most use cases
choices: [png, eps, svg]
example: "png"
- name: IGV Session Options
arguments:
- name: --session_file
alternatives: [--sess]
type: file
description: |
Path to existing IGV session file to load before taking snapshots.
**Format:** IGV session file (.xml)
**Purpose:** Pre-loads genome, tracks, and display settings
**Optional:** If not provided, assumes genome and tracks are already loaded
example: "my_analysis.xml"
- name: Display Options
arguments:
- name: --sort_reads
alternatives: [--sort]
type: string
description: |
BAM read sorting method to apply for each image.
**Options:** base, position, strand, quality, sample, readGroup
**Default:** No sorting applied
**Usage:** Only relevant when BAM tracks are loaded in IGV
choices: [base, position, strand, quality, sample, readGroup]
example: "position"
- name: --collapse_reads
alternatives: [--clps]
type: boolean_true
description: |
Collapse aligned reads before taking snapshots.
**Effect:** Shows read coverage instead of individual reads
**Usage:** Useful for high-coverage regions
**Default:** false (show individual reads)
- name: --flank_size
alternatives: [--slop]
type: integer
description: |
Number of flanking base pairs on left and right of each region.
**Range:** 0 or positive integer
**Default:** 0 (no flanking)
**Purpose:** Include context around regions of interest
**Example:** 1000 adds 1kb padding on each side
example: 1000
- name: --use_name
alternatives: [--name]
type: boolean_true
description: |
Use the name field (column 4) from input file for image filenames.
**Effect:** Images named using BED name field instead of coordinates
**Default:** false (use "chr:start-end.ext" format)
**Requirement:** Input file must have name field (column 4)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,42 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools igv -h
```
Tool: bedtools igv (aka bedToIgv)
Version: v2.31.1
Summary: Creates a batch script to create IGV images
at each interval defined in a BED/GFF/VCF file.
Usage: bedtools igv [OPTIONS] -i <bed/gff/vcf>
Options:
-path The full path to which the IGV snapshots should be written.
(STRING) Default: ./
-sess The full path to an existing IGV session file to be
loaded prior to taking snapshots.
(STRING) Default is for no session to be loaded.
-sort The type of BAM sorting you would like to apply to each image.
Options: base, position, strand, quality, sample, and readGroup
Default is to apply no sorting at all.
-clps Collapse the aligned reads prior to taking a snapshot.
Default is to no collapse.
-name Use the "name" field (column 4) for each image's filename.
Default is to use the "chr:start-pos.ext".
-slop Number of flanking base pairs on the left & right of the image.
- (INT) Default = 0.
-img The type of image to be created.
Options: png, eps, svg
Default is png.
Notes:
(1) The resulting script is meant to be run from within IGV.
(2) Unless you use the -sess option, it is assumed that prior to
running the script, you've loaded the proper genome and tracks.

@@ -0,0 +1,25 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags
[[ "$par_collapse_reads" == "false" ]] && unset par_collapse_reads
[[ "$par_use_name" == "false" ]] && unset par_use_name
# Build command arguments array
cmd_args=(
-i "$par_input"
${par_output_path:+-path "$par_output_path"}
${par_session_file:+-sess "$par_session_file"}
${par_sort_reads:+-sort "$par_sort_reads"}
${par_collapse_reads:+-clps}
${par_use_name:+-name}
${par_flank_size:+-slop "$par_flank_size"}
${par_image_format:+-img "$par_image_format"}
)
# Execute bedtools igv
bedtools igv "${cmd_args[@]}" > "$par_output"
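
The `${var:+...}` expansions that build `cmd_args` above can be tried in isolation. The following is a minimal sketch, not part of the component: all parameter values are hypothetical, and it prints the resulting argument list instead of calling bedtools.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical parameter values, for illustration only.
par_input="regions.bed"
par_output_path="./igv images/"   # contains a space on purpose
par_flank_size="1000"
par_use_name="false"              # an unset boolean flag arrives as the string "false"
par_image_format=""               # empty, so the -img flag is omitted

# A "false" flag has to be unset first: the string "false" is non-empty,
# so ${par_use_name:+-name} would otherwise still expand to -name.
[[ "$par_use_name" == "false" ]] && unset par_use_name

cmd_args=(
  -i "$par_input"
  ${par_output_path:+-path "$par_output_path"}
  ${par_flank_size:+-slop "$par_flank_size"}
  ${par_use_name:+-name}
  ${par_image_format:+-img "$par_image_format"}
)

# Print one argument per line to show how the words were split.
printf '%s\n' "${cmd_args[@]}"
```

Each optional flag contributes either nothing or its complete set of words, and the inner double quotes keep a value with spaces as a single argument.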

@@ -0,0 +1,215 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_igv"
# Create test data following documentation guidelines
log "Creating test data..."
# Create basic intervals file with name field
cat > "$meta_temp_dir/intervals.bed" << 'EOF'
chr1 1000 2000 region1 100 +
chr1 5000 6000 region2 200 -
chr2 10000 11000 region3 150 +
chr2 20000 21000 region4 300 -
chr3 30000 31000 region5 250 +
EOF
# Create intervals without name field
cat > "$meta_temp_dir/simple.bed" << 'EOF'
chr1 2000 3000
chr1 7000 8000
chr2 15000 16000
EOF
# Create GFF test file
cat > "$meta_temp_dir/features.gff" << 'EOF'
##gff-version 3
chr1 source gene 1500 2500 . + . ID=gene1;Name=TestGene1
chr1 source exon 1500 1800 . + . ID=exon1;Parent=gene1
chr1 source exon 2200 2500 . + . ID=exon2;Parent=gene1
chr2 source gene 12000 13000 . - . ID=gene2;Name=TestGene2
EOF
# Create mock IGV session file
cat > "$meta_temp_dir/session.xml" << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<Session genome="hg19" locus="chr1:1000-2000">
<Files>
<DataFile name="Test Track" path="/path/to/test.bam"/>
</Files>
</Session>
EOF
# TEST 1: Basic IGV batch script generation
log "Starting TEST 1: Basic IGV batch script generation"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--output "$meta_temp_dir/basic_script.txt"
check_file_exists "$meta_temp_dir/basic_script.txt" "basic IGV script"
check_file_not_empty "$meta_temp_dir/basic_script.txt" "basic IGV script"
# Check that script contains expected IGV commands
if grep -q "snapshot" "$meta_temp_dir/basic_script.txt"; then
log "✓ basic script contains snapshot commands: $meta_temp_dir/basic_script.txt"
else
log "✗ basic script missing snapshot commands: $meta_temp_dir/basic_script.txt"
exit 1
fi
# Check that script contains goto commands for each region
region_count=$(grep -c "goto" "$meta_temp_dir/basic_script.txt" || true)
if [ "$region_count" -eq 5 ]; then
log "✓ basic script contains expected number of goto commands (5): $meta_temp_dir/basic_script.txt"
else
log "✗ basic script has unexpected goto command count ($region_count, expected 5): $meta_temp_dir/basic_script.txt"
exit 1
fi
log "✅ TEST 1 completed successfully"
# TEST 2: IGV script with output path and image format
log "Starting TEST 2: IGV script with custom output path and format"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--output_path "/custom/path/images/" \
--image_format "svg" \
--output "$meta_temp_dir/custom_script.txt"
check_file_exists "$meta_temp_dir/custom_script.txt" "custom IGV script"
check_file_not_empty "$meta_temp_dir/custom_script.txt" "custom IGV script"
# Check for custom output path in script
if grep -q "/custom/path/images/" "$meta_temp_dir/custom_script.txt"; then
log "✓ custom script contains specified output path: $meta_temp_dir/custom_script.txt"
else
log "✗ custom script missing specified output path: $meta_temp_dir/custom_script.txt"
exit 1
fi
# Check for SVG format specification
if grep -q "svg" "$meta_temp_dir/custom_script.txt"; then
log "✓ custom script specifies SVG format: $meta_temp_dir/custom_script.txt"
else
log "✗ custom script missing SVG format: $meta_temp_dir/custom_script.txt"
exit 1
fi
log "✅ TEST 2 completed successfully"
# TEST 3: IGV script with session file loading
log "Starting TEST 3: IGV script with session file"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--session_file "$meta_temp_dir/session.xml" \
--output "$meta_temp_dir/session_script.txt"
check_file_exists "$meta_temp_dir/session_script.txt" "session IGV script"
check_file_not_empty "$meta_temp_dir/session_script.txt" "session IGV script"
# Check for session loading command
if grep -q "session.xml" "$meta_temp_dir/session_script.txt"; then
log "✓ session script contains session file reference: $meta_temp_dir/session_script.txt"
else
log "✗ session script missing session file reference: $meta_temp_dir/session_script.txt"
exit 1
fi
log "✅ TEST 3 completed successfully"
# TEST 4: IGV script with read sorting and collapse
log "Starting TEST 4: IGV script with read display options"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--sort_reads "position" \
--collapse_reads \
--output "$meta_temp_dir/display_script.txt"
check_file_exists "$meta_temp_dir/display_script.txt" "display options IGV script"
check_file_not_empty "$meta_temp_dir/display_script.txt" "display options IGV script"
# Check for sorting command
if grep -q "sort" "$meta_temp_dir/display_script.txt"; then
log "✓ display script contains sorting commands: $meta_temp_dir/display_script.txt"
else
log "✗ display script missing sorting commands: $meta_temp_dir/display_script.txt"
exit 1
fi
log "✅ TEST 4 completed successfully"
# TEST 5: IGV script with flanking regions and name-based filenames
log "Starting TEST 5: IGV script with flanking and named files"
"$meta_executable" \
--input "$meta_temp_dir/intervals.bed" \
--flank_size 500 \
--use_name \
--output "$meta_temp_dir/flanked_script.txt"
check_file_exists "$meta_temp_dir/flanked_script.txt" "flanked IGV script"
check_file_not_empty "$meta_temp_dir/flanked_script.txt" "flanked IGV script"
# Check for expanded regions (should include flanking) - chr1:5000-6000 with 500bp flanking = chr1:4500-6500
if grep -q "4500-6500" "$meta_temp_dir/flanked_script.txt"; then
log "✓ flanked script contains expanded regions: $meta_temp_dir/flanked_script.txt"
else
log "✗ flanked script missing expanded regions: $meta_temp_dir/flanked_script.txt"
cat "$meta_temp_dir/flanked_script.txt" >&2
exit 1
fi
log "✅ TEST 5 completed successfully"
# TEST 6: IGV script with GFF input
log "Starting TEST 6: IGV script with GFF input"
"$meta_executable" \
--input "$meta_temp_dir/features.gff" \
--output "$meta_temp_dir/gff_script.txt"
check_file_exists "$meta_temp_dir/gff_script.txt" "GFF IGV script"
check_file_not_empty "$meta_temp_dir/gff_script.txt" "GFF IGV script"
# Should contain regions from GFF file
gene_count=$(grep -c "goto" "$meta_temp_dir/gff_script.txt" || true)
if [ "$gene_count" -ge 2 ]; then
log "✓ GFF script contains expected regions (≥2): $meta_temp_dir/gff_script.txt"
else
log "✗ GFF script has too few regions ($gene_count, expected ≥2): $meta_temp_dir/gff_script.txt"
exit 1
fi
log "✅ TEST 6 completed successfully"
# TEST 7: IGV script with minimal BED input (no name field)
log "Starting TEST 7: IGV script with simple BED input"
"$meta_executable" \
--input "$meta_temp_dir/simple.bed" \
--image_format "png" \
--output "$meta_temp_dir/simple_script.txt"
check_file_exists "$meta_temp_dir/simple_script.txt" "simple BED IGV script"
check_file_not_empty "$meta_temp_dir/simple_script.txt" "simple BED IGV script"
# Should work with 3-column BED format
simple_count=$(grep -c "goto" "$meta_temp_dir/simple_script.txt" || true)
if [ "$simple_count" -eq 3 ]; then
log "✓ simple script handles 3-column BED correctly (3 regions): $meta_temp_dir/simple_script.txt"
else
log "✗ simple script region count mismatch ($simple_count, expected 3): $meta_temp_dir/simple_script.txt"
exit 1
fi
log "✅ TEST 7 completed successfully"
log "All tests completed successfully!"
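
The tests above count `goto` lines with `grep -c ... || true`. The `|| true` matters because `grep -c` prints `0` but exits with status 1 when nothing matches, which would abort a script running under `set -e`. A minimal sketch of that pattern, assuming a hypothetical helper named `count_matches` (it is not part of `test_helpers.sh`) and made-up file contents:

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper: count the lines of a file that contain a fixed string.
count_matches() {
  local pattern="$1" file="$2"
  # grep -c still prints 0 when nothing matches, but its exit status is 1;
  # "|| true" keeps set -e from aborting the script in that case.
  grep -c -F -- "$pattern" "$file" || true
}

# Made-up stand-in for a generated batch script.
tmp_file="$(mktemp)"
printf 'goto chr1:1000-2000\nsnapshot region1.png\ngoto chr2:10000-11000\n' > "$tmp_file"

goto_count="$(count_matches "goto" "$tmp_file")"
missing_count="$(count_matches "collapse" "$tmp_file")"
echo "goto=$goto_count missing=$missing_count"

rm -f "$tmp_file"
```

With the count captured in a variable, the numeric comparison decides whether the test passes, rather than grep's exit status.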

@@ -1,152 +1,149 @@
name: bedtools_intersect
namespace: bedtools
description: |
Find overlaps between genomic features from two sets of intervals.
bedtools intersect allows one to screen for overlaps between two sets of genomic features.
Moreover, it allows one to have fine control as to how the intersections are reported.
bedtools intersect works with both BED/GFF/VCF and BAM files as input.
keywords: [feature intersection, BAM, BED, GFF, VCF]
keywords: [feature intersection, BAM, BED, GFF, VCF, overlap]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: GPL-2.0, MIT
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
- name: Input arguments
arguments:
- name: --input_a
alternatives: -a
alternatives: [-a]
type: file
direction: input
required: true
description: |
The input file (BED/GFF/VCF/BAM) to be used as the -a file.
required: true
example: input_a.bed
- name: --input_b
alternatives: -b
alternatives: [-b]
type: file
direction: input
multiple: true
required: true
description: |
The input file(s) (BED/GFF/VCF/BAM) to be used as the -b file(s).
required: true
example: input_b.bed
- name: Outputs
- name: Output arguments
arguments:
- name: --output
type: file
direction: output
description: |
The output BED file.
required: true
example: output.bed
- name: Options
description: |
The output BED file.
- name: Output format options
arguments:
- name: --write_a
alternatives: -wa
alternatives: [-wa]
type: boolean_true
description: Write the original A entry for each overlap.
description: |
Write the original A entry for each overlap.
- name: --write_b
alternatives: -wb
alternatives: [-wb]
type: boolean_true
description: |
Write the original B entry for each overlap.
Useful for knowing _what_ A overlaps. Restricted by -f and -r.
- name: --left_outer_join
alternatives: -loj
alternatives: [-loj]
type: boolean_true
description: |
Perform a "left outer join". That is, for each feature in A report each overlap with B.
If no overlaps are found, report a NULL feature for B.
- name: --write_overlap
alternatives: -wo
alternatives: [-wo]
type: boolean_true
description: |
Write the original A and B entries plus the number of base pairs of overlap between the two features.
- Overlaps restricted by -f and -r.
Only A features with overlap are reported.
Overlaps restricted by -f and -r. Only A features with overlap are reported.
- name: --write_overlap_plus
alternatives: -wao
alternatives: [-wao]
type: boolean_true
description: |
Write the original A and B entries plus the number of base pairs of overlap between the two features.
- Overlaps restricted by -f and -r.
However, A features w/o overlap are also reported with a NULL B feature and overlap = 0.
Overlaps restricted by -f and -r. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0.
- name: --report_A_if_no_overlap
alternatives: -u
alternatives: [-u]
type: boolean_true
description: |
Write the original A entry _once_ if _any_ overlaps are found in B.
- In other words, just report the fact >=1 hit was found.
- Overlaps restricted by -f and -r.
In other words, just report the fact >=1 hit was found.
Overlaps restricted by -f and -r.
- name: --number_of_overlaps_A
alternatives: -c
alternatives: [-c]
type: boolean_true
description: |
For each entry in A, report the number of overlaps with B.
- Reports 0 for A entries that have no overlap with B.
- Overlaps restricted by -f and -r.
Reports 0 for A entries that have no overlap with B.
Overlaps restricted by -f and -r.
- name: --report_no_overlaps_A
alternatives: -v
alternatives: [-v]
type: boolean_true
description: |
Only report those entries in A that have _no overlaps_ with B.
- Similar to "grep -v" (an homage).
Similar to "grep -v" (an homage).
- name: --uncompressed_bam
alternatives: -ubam
alternatives: [-ubam]
type: boolean_true
description: Write uncompressed BAM output. Default writes compressed BAM.
description: |
Write uncompressed BAM output. Default writes compressed BAM.
- name: Filtering options
arguments:
- name: --same_strand
alternatives: -s
alternatives: [-s]
type: boolean_true
description: |
Require same strandedness. That is, only report hits in B.
Require same strandedness. That is, only report hits in B
that overlap A on the _same_ strand.
- By default, overlaps are reported without respect to strand.
By default, overlaps are reported without respect to strand.
- name: --opposite_strand
alternatives: -S
alternatives: [-S]
type: boolean_true
description: |
Require different strandedness. That is, only report hits in B
Require different strandedness. That is, only report hits in B
that overlap A on the _opposite_ strand.
- By default, overlaps are reported without respect to strand.
By default, overlaps are reported without respect to strand.
- name: --min_overlap_A
alternatives: -f
alternatives: [-f]
type: double
description: |
Minimum overlap required as a fraction of A.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
example: 0.50
Default is 1E-9 (i.e., 1bp).
- name: --min_overlap_B
alternatives: -F
alternatives: [-F]
type: double
description: |
Minimum overlap required as a fraction of B.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
example: 0.50
Default is 1E-9 (i.e., 1bp).
- name: --reciprocal_overlap
alternatives: -r
@@ -214,7 +211,7 @@ argument_groups:
description: Print the header from the A file prior to results.
- name: --no_buffer_output
alternatives: --nobuf
alternatives: [--nobuf]
type: boolean_true
description: |
Disable buffered output. Using this option will cause each line
@@ -225,7 +222,7 @@ argument_groups:
line of bedtools output at a time.
- name: --io_buffer_size
alternatives: --iobuf
alternatives: [--iobuf]
type: integer
description: |
Specify amount of memory to use for input buffer.
@@ -239,13 +236,12 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt

@@ -1,9 +1,9 @@
```bash
bedtools intersect
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools intersect -h
```
Tool: bedtools intersect (aka intersectBed)
Version: v2.30.0
Version: v2.31.1
Summary: Report overlaps between two feature files.
Usage: bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>
@@ -116,4 +116,3 @@ Notes:
***** ERROR: No input file given. Exiting. *****

@@ -3,67 +3,71 @@
## VIASH START
## VIASH END
set -eo pipefail
unset_if_false=(
par_write_a
par_write_b
par_left_outer_join
par_write_overlap
par_write_overlap_plus
par_report_A_if_no_overlap
par_number_of_overlaps_A
par_report_no_overlaps_A
par_uncompressed_bam
par_same_strand
par_opposite_strand
par_reciprocal_overlap
par_either_overlap
par_split
par_nonamecheck
par_sorted
par_filenames
par_sortout
par_bed
par_no_buffer_output
par_header
par_write_a
par_write_b
par_left_outer_join
par_write_overlap
par_write_overlap_plus
par_report_A_if_no_overlap
par_number_of_overlaps_A
par_report_no_overlaps_A
par_uncompressed_bam
par_same_strand
par_opposite_strand
par_reciprocal_overlap
par_either_overlap
par_split
par_nonamecheck
par_sorted
par_filenames
par_sortout
par_bed
par_no_buffer_output
par_header
)
for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
# Create input array
IFS=";" read -ra input <<< $par_input_b
bedtools intersect \
${par_write_a:+-wa} \
${par_write_b:+-wb} \
${par_left_outer_join:+-loj} \
${par_write_overlap:+-wo} \
${par_write_overlap_plus:+-wao} \
${par_report_A_if_no_overlap:+-u} \
${par_number_of_overlaps_A:+-c} \
${par_report_no_overlaps_A:+-v} \
${par_uncompressed_bam:+-ubam} \
${par_same_strand:+-s} \
${par_opposite_strand:+-S} \
${par_min_overlap_A:+-f "$par_min_overlap_A"} \
${par_min_overlap_B:+-F "$par_min_overlap_B"} \
${par_reciprocal_overlap:+-r} \
${par_either_overlap:+-e} \
${par_split:+-split} \
${par_genome:+-g "$par_genome"} \
${par_nonamecheck:+-nonamecheck} \
${par_sorted:+-sorted} \
${par_names:+-names "$par_names"} \
${par_filenames:+-filenames} \
${par_sortout:+-sortout} \
${par_bed:+-bed} \
${par_header:+-header} \
${par_no_buffer_output:+-nobuf} \
${par_io_buffer_size:+-iobuf "$par_io_buffer_size"} \
-a "$par_input_a" \
${par_input_b:+ -b ${input[*]}} \
> "$par_output"
cmd_args=(
bedtools intersect
${par_write_a:+-wa}
${par_write_b:+-wb}
${par_left_outer_join:+-loj}
${par_write_overlap:+-wo}
${par_write_overlap_plus:+-wao}
${par_report_A_if_no_overlap:+-u}
${par_number_of_overlaps_A:+-c}
${par_report_no_overlaps_A:+-v}
${par_uncompressed_bam:+-ubam}
${par_same_strand:+-s}
${par_opposite_strand:+-S}
${par_min_overlap_A:+-f "$par_min_overlap_A"}
${par_min_overlap_B:+-F "$par_min_overlap_B"}
${par_reciprocal_overlap:+-r}
${par_either_overlap:+-e}
${par_split:+-split}
${par_genome:+-g "$par_genome"}
${par_nonamecheck:+-nonamecheck}
${par_sorted:+-sorted}
${par_names:+-names "$par_names"}
${par_filenames:+-filenames}
${par_sortout:+-sortout}
${par_bed:+-bed}
${par_header:+-header}
${par_no_buffer_output:+-nobuf}
${par_io_buffer_size:+-iobuf "$par_io_buffer_size"}
-a "$par_input_a"
${par_input_b:+ -b ${input[*]}}
)
"${cmd_args[@]}" > "$par_output"

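As the `IFS=";" read -ra input` line above suggests, the `--input_b` values reach the script as a single `;`-separated string. A minimal sketch of that splitting step with hypothetical file names follows. It is a swapped-in variant rather than the script's own code: it appends the files with `"${input[@]}"` instead of `${input[*]}`, which keeps paths containing spaces intact, and it prints the arguments instead of running bedtools.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical values: two B files joined with ";", one of which contains a space.
par_input_a="featuresA.bed"
par_input_b="featuresB.bed;more features.bed"

# Split the ";"-separated string into an array; spaces are not separators here.
IFS=";" read -ra input <<< "$par_input_b"

cmd_args=(
  -a "$par_input_a"
  -b "${input[@]}"
)

# Print one argument per line to show that each file stays a single word.
printf '%s\n' "${cmd_args[@]}"
```

Quoting `"$par_input_b"` in the here-string and expanding the array with `[@]` inside double quotes are the two details that protect file names from word splitting.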

@@ -1,340 +1,81 @@
#!/bin/bash
# exit on error
set -e
## VIASH START
meta_executable="target/executable/bedtools/bedtools_intersect/bedtools_intersect"
meta_resources_dir="src/bedtools/bedtools_intersect"
## VIASH END
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment with strict error handling
setup_test_env
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
diff -a "$2" "$1" \
|| (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################
# Create directories for tests
echo "Creating Test Data..."
mkdir -p test_data
log "Starting tests for $meta_name"
# Create and populate featuresA.bed
printf "chr1\t100\t200\nchr1\t150\t250\nchr1\t300\t400\n" > "test_data/featuresA.bed"
# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"
# Create and populate featuresB.bed
printf "chr1\t180\t280\nchr1\t290\t390\nchr1\t500\t600\n" > "test_data/featuresB.bed"
# --- Test Case 1: Basic intersection ---
log "Starting TEST 1: Basic intersection"
# Create and populate featuresC.bed
printf "chr1\t120\t220\nchr1\t250\t350\nchr1\t500\t580\n" > "test_data/featuresC.bed"
# Create test BED files
log "Creating test BED data..."
cat > "$test_dir/featuresA.bed" << 'EOF'
chr1 100 200 feature1
chr1 300 400 feature2
chr2 500 600 feature3
EOF
# Create and populate examples gff files
# example1.gff
printf "##gff-version 3\n" > "test_data/example1.gff"
printf "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/example1.gff"
printf "chr1\t.\tmRNA\t1000\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/example1.gff"
printf "chr1\t.\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=transcript1\n" >> "test_data/example1.gff"
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/example1.gff"
printf "chr1\t.\tCDS\t1000\t1200\t.\t+\t0\tID=cds1;Parent=transcript1\n" >> "test_data/example1.gff"
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/example1.gff"
# example2.gff
printf "##gff-version 3\n" > "test_data/example2.gff"
printf "chr1\t.\tgene\t1200\t1800\t.\t-\t.\tID=gene2;Name=Gene2\n" >> "test_data/example2.gff"
printf "chr1\t.\tmRNA\t1400\t2000\t.\t-\t.\tID=transcript2;Parent=gene2\n" >> "test_data/example2.gff"
printf "chr1\t.\texon\t1400\t2000\t.\t-\t.\tID=exon3;Parent=transcript2\n" >> "test_data/example2.gff"
printf "chr1\t.\texon\t1600\t2000\t.\t-\t.\tID=exon4;Parent=transcript2\n" >> "test_data/example2.gff"
printf "chr1\t.\tCDS\t3000\t3200\t.\t-\t1\tID=cds3;Parent=transcript2\n" >> "test_data/example2.gff"
printf "chr1\t.\tCDS\t3500\t3700\t.\t-\t0\tID=cds4;Parent=transcript2\n" >> "test_data/example2.gff"
cat > "$test_dir/featuresB.bed" << 'EOF'
chr1 150 250 overlapping1
chr1 350 450 overlapping2
chr2 550 650 overlapping3
EOF
# Create and populate expected output files for different tests
printf "chr1\t180\t200\nchr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_default.bed"
printf "chr1\t100\t200\nchr1\t150\t250\nchr1\t300\t400\n" > "test_data/expected_wa.bed"
printf "chr1\t180\t200\tchr1\t180\t280\nchr1\t180\t250\tchr1\t180\t280\nchr1\t300\t390\tchr1\t290\t390\n" > "test_data/expected_wb.bed"
printf "chr1\t100\t200\tchr1\t180\t280\nchr1\t150\t250\tchr1\t180\t280\nchr1\t300\t400\tchr1\t290\t390\n" > "test_data/expected_loj.bed"
printf "chr1\t100\t200\tchr1\t180\t280\t20\nchr1\t150\t250\tchr1\t180\t280\t70\nchr1\t300\t400\tchr1\t290\t390\t90\n" > "test_data/expected_wo.bed"
printf "chr1\t100\t200\nchr1\t150\t250\nchr1\t300\t400\n" > "test_data/expected_u.bed"
printf "chr1\t100\t200\t1\nchr1\t150\t250\t1\nchr1\t300\t400\t1\n" > "test_data/expected_c.bed"
printf "chr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_f50.bed"
printf "chr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_f30.bed"
printf "chr1\t180\t200\nchr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_f10.bed"
printf "chr1\t180\t200\nchr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_r.bed"
printf "chr1\t180\t200\nchr1\t120\t200\nchr1\t180\t250\nchr1\t150\t220\nchr1\t300\t390\nchr1\t300\t350\n" > "test_data/expected_multiple.bed"
# expected gff file
printf "chr1\t.\tgene\t1200\t1800\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\tgene\t1400\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\tgene\t1400\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\tgene\t1600\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\tmRNA\t1200\t1800\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\tmRNA\t1400\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\tmRNA\t1400\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\tmRNA\t1600\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
printf "chr1\t.\texon\t1200\t1200\t.\t+\t.\tID=exon1;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\texon\t1600\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\tCDS\t1200\t1200\t.\t+\t0\tID=cds1;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
printf "chr1\t.\tCDS\t1600\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
# Test 1: Default intersect
mkdir test1
cd test1
echo "> Run bedtools_intersect on BED files with default intersect"
log "Executing $meta_name with basic parameters..."
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed"
--input_a "$test_dir/featuresA.bed" \
--input_b "$test_dir/featuresB.bed" \
--output "$meta_temp_dir/output1.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_default.bed"
echo "- test1 succeeded -"
log "Validating TEST 1 outputs..."
check_file_exists "$meta_temp_dir/output1.bed" "output intersection file"
check_file_not_empty "$meta_temp_dir/output1.bed" "output intersection file"
check_file_contains "$meta_temp_dir/output1.bed" "chr1"
log "✅ TEST 1 completed successfully"
cd ..
# --- Test Case 2: Intersection with -wa option ---
log "Starting TEST 2: Intersection with -wa (write A) option"
# Test 2: Write A option
mkdir test2
cd test2
echo "> Run bedtools_intersect on BED files with -wa option"
log "Executing $meta_name with -wa option..."
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--write_a
--input_a "$test_dir/featuresA.bed" \
--input_b "$test_dir/featuresB.bed" \
--write_a \
--output "$meta_temp_dir/output2.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_wa.bed"
echo "- test2 succeeded -"
log "Validating TEST 2 outputs..."
check_file_exists "$meta_temp_dir/output2.bed" "output file with -wa"
check_file_not_empty "$meta_temp_dir/output2.bed" "output file with -wa"
check_file_contains "$meta_temp_dir/output2.bed" "feature"
log "✅ TEST 2 completed successfully"
cd ..
# --- Test Case 3: Intersection with -wb option ---
log "Starting TEST 3: Intersection with -wb (write B) option"
# Test 3: -wb option
mkdir test3
cd test3
echo "> Run bedtools_intersect on BED files with -wb option"
log "Executing $meta_name with -wb option..."
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--write_b
--input_a "$test_dir/featuresA.bed" \
--input_b "$test_dir/featuresB.bed" \
--write_b \
--output "$meta_temp_dir/output3.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_wb.bed"
echo "- test3 succeeded -"
cd ..
# Test 4: -loj option
mkdir test4
cd test4
echo "> Run bedtools_intersect on BED files with -loj option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--left_outer_join
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_loj.bed"
echo "- test4 succeeded -"
cd ..
# Test 5: -wo option
mkdir test5
cd test5
echo "> Run bedtools_intersect on BED files with -wo option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--write_overlap
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_wo.bed"
echo "- test5 succeeded -"
cd ..
# Test 6: -u option
mkdir test6
cd test6
echo "> Run bedtools_intersect on BED files with -u option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--report_A_if_no_overlap true
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_u.bed"
echo "- test6 succeeded -"
cd ..
# Test 7: -c option
mkdir test7
cd test7
echo "> Run bedtools_intersect on BED files with -c option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--number_of_overlaps_A true
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_c.bed"
echo "- test7 succeeded -"
cd ..
# Test 8: -f 0.50 option
mkdir test8
cd test8
echo "> Run bedtools_intersect on BED files with -f 0.50 option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--min_overlap_A 0.50
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_f50.bed"
echo "- test8 succeeded -"
cd ..
# Test 9: -f 0.30 option
mkdir test9
cd test9
echo "> Run bedtools_intersect on BED files with -f 0.30 option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--min_overlap_A 0.30
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_f30.bed"
echo "- test9 succeeded -"
cd ..
# Test 10: -f 0.10 option
mkdir test10
cd test10
echo "> Run bedtools_intersect on BED files with -f 0.10 option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--min_overlap_A 0.10
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_f10.bed"
echo "- test10 succeeded -"
cd ..
# Test 11: -r option
mkdir test11
cd test11
echo "> Run bedtools_intersect on BED files with -r option"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--output "output.bed" \
--reciprocal_overlap true
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_r.bed"
echo "- test11 succeeded -"
cd ..
# Test 12: Multiple files
mkdir test12
cd test12
echo "> Run bedtools_intersect on multiple BED files"
"$meta_executable" \
--input_a "../test_data/featuresA.bed" \
--input_b "../test_data/featuresB.bed" \
--input_b "../test_data/featuresC.bed" \
--output "output.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected_multiple.bed"
echo "- test12 succeeded -"
cd ..
# Test 13: GFF file format
mkdir test13
cd test13
echo "> Run bedtools_intersect on GFF files"
"$meta_executable" \
--input_a "../test_data/example1.gff" \
--input_b "../test_data/example2.gff" \
--output "output.bed"
# checks
assert_file_exists "output.bed"
assert_file_not_empty "output.bed"
assert_identical_content "output.bed" "../test_data/expected.gff"
echo "- test13 succeeded -"
cd ..
echo "---- All tests succeeded! ----"
exit 0
log "Validating TEST 3 outputs..."
check_file_exists "$meta_temp_dir/output3.bed" "output file with -wb"
check_file_not_empty "$meta_temp_dir/output3.bed" "output file with -wb"
check_file_contains "$meta_temp_dir/output3.bed" "overlapping"
log "✅ TEST 3 completed successfully"


@@ -0,0 +1,227 @@
name: bedtools_jaccard
namespace: bedtools
description: |
Calculate Jaccard similarity statistic between two genomic feature files.
The Jaccard index measures similarity between finite sample sets, defined as
the size of the intersection divided by the size of the union. Values range
from 0 (no intersection) to 1 (identical sets). This tool calculates the
Jaccard statistic for genomic intervals, providing a quantitative measure
of overlap between two interval sets.
keywords: [genomics, intervals, jaccard, similarity, statistics, overlap, intersection, union]
links:
homepage: https://bedtools.readthedocs.io/
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/jaccard.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input_a
alternatives: [-a]
type: file
description: |
First input file for Jaccard comparison.
**Format:** BED, GFF, VCF file with genomic intervals
**Requirement:** Must be sorted by chromosome, then start position
**Usage:** File A for Jaccard similarity calculation
required: true
example: intervals_a.bed
- name: --input_b
alternatives: [-b]
type: file
description: |
Second input file for Jaccard comparison.
**Format:** BED, GFF, VCF file with genomic intervals
**Requirement:** Must be sorted by chromosome, then start position
**Usage:** File B for Jaccard similarity calculation
required: true
example: intervals_b.bed
- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
Output file with Jaccard similarity statistics.
**Format:** Tab-delimited, with a header line followed by one row of statistics
**Columns:** intersection, union-intersection, jaccard, n_intersections
**Range:** Jaccard values from 0.0 to 1.0
required: true
example: jaccard_results.txt
- name: Overlap Options
arguments:
- name: --min_overlap_a
alternatives: [-f]
type: double
description: |
Minimum overlap required as fraction of A.
**Range:** 0.0 to 1.0
**Default:** 1E-9 (effectively 1bp)
**Example:** 0.50 requires 50% of A to be overlapped
example: 0.5
- name: --min_overlap_b
alternatives: [-F]
type: double
description: |
Minimum overlap required as fraction of B.
**Range:** 0.0 to 1.0
**Default:** 1E-9 (effectively 1bp)
**Example:** 0.50 requires 50% of B to be overlapped
example: 0.5
- name: --reciprocal
alternatives: [-r]
type: boolean_true
description: |
Require that the fraction of overlap be reciprocal for A and B.
**Requirement:** Used together with -f (--min_overlap_a)
**Effect:** Requires B overlaps specified fraction of A AND A overlaps same fraction of B
**Example:** With -f 0.90 -r, requires B overlaps 90% of A AND A overlaps 90% of B
**Default:** false
- name: --either
alternatives: [-e]
type: boolean_true
description: |
Require minimum fraction satisfied for A OR B.
**Effect:** Only one of -f or -F thresholds needs to be satisfied
**Alternative:** Without -e, both fractions must be satisfied
**Default:** false (both required)
- name: Strand Options
arguments:
- name: --same_strand
alternatives: [-s]
type: boolean_true
description: |
Require same strandedness for overlaps.
**Effect:** Only consider overlaps on the same strand
**Default:** false (strand-independent)
- name: --opposite_strand
alternatives: [-S]
type: boolean_true
description: |
Require different strandedness for overlaps.
**Effect:** Only consider overlaps on opposite strands
**Default:** false (strand-independent)
**Note:** Not exercised by this component's tests because of an issue observed with `bedtools jaccard -S`
- name: Format Options
arguments:
- name: --split
type: boolean_true
description: |
Treat split BAM or BED12 entries as distinct intervals.
**Effect:** Split multi-block entries into individual intervals
**Usage:** For BAM alignments with gaps or BED12 entries
**Default:** false
- name: --bed_output
alternatives: [--bed]
type: boolean_true
description: |
Write output in BED format when using BAM input.
**Effect:** Forces BED output format for BAM inputs
**Default:** false
- name: --header
type: boolean_true
description: |
Print header from file A prior to results.
**Effect:** Includes original header from input file A
**Default:** false
- name: Advanced Options
arguments:
- name: --genome
alternatives: [-g]
type: file
description: |
Genome file for consistent chromosome sorting.
**Format:** Tab-delimited file with chromosome name and size
**Usage:** Only applies when used with sorted data
**Purpose:** Enforces consistent chromosome sort order
example: genome.txt
- name: --no_name_check
alternatives: [--nonamecheck]
type: boolean_true
description: |
Skip chromosome naming convention checks for sorted data.
**Effect:** Allows different naming (e.g., "chr1" vs "chr01")
**Usage:** For files with inconsistent chromosome naming
**Default:** false (strict checking)
- name: --no_buffer
alternatives: [--nobuf]
type: boolean_true
description: |
Disable buffered output.
**Effect:** Print each line immediately instead of buffering
**Usage:** For real-time processing or piping
**Trade-off:** Slower performance but immediate output
**Default:** false (buffered output)
- name: --io_buffer
alternatives: [--iobuf]
type: string
description: |
Specify input buffer memory size.
**Format:** Integer with optional K/M/G suffix
**Example:** "128M" for 128 megabytes
**Note:** No effect with compressed files
example: "128M"
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: docker
run: |
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
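
The description above defines the Jaccard statistic as the intersection length divided by the union length. As a rough illustration only, the following sketch computes it with plain `awk` for two small BED files, without calling bedtools. It assumes tab-delimited input and that intervals within each file do not overlap one another (bedtools handles that case itself); the file names and the `workdir` variable are hypothetical, and the three-column output is a simplification of the real tool's output.

```bash
#!/bin/bash
# Minimal sketch: Jaccard = intersection_bp / union_bp for two tiny BED files.
# Assumption: intervals inside each file do not overlap each other.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t500\t600\nchr2\t100\t250\nchr2\t400\t500\n' > "$workdir/a.bed"
printf 'chr1\t150\t250\nchr1\t350\t450\nchr1\t550\t650\nchr2\t150\t300\nchr2\t450\t550\n' > "$workdir/b.bed"

jaccard_line=$(awk -F'\t' '
  # First file: remember every interval of A and sum its length.
  NR == FNR { n++; a_chrom[n] = $1; a_start[n] = $2 + 0; a_end[n] = $3 + 0; len_a += $3 - $2; next }
  # Second file: sum the length of B and add up the overlap with each interval of A.
  {
    len_b += $3 - $2
    for (i = 1; i <= n; i++) {
      if (a_chrom[i] != $1) continue
      lo = (a_start[i] > $2 + 0) ? a_start[i] : $2 + 0
      hi = (a_end[i] < $3 + 0) ? a_end[i] : $3 + 0
      if (hi > lo) inter += hi - lo
    }
  }
  END {
    uni = len_a + len_b - inter   # union = |A| + |B| - intersection
    printf "%d\t%d\t%.3f\n", inter, uni, inter / uni
  }
' "$workdir/a.bed" "$workdir/b.bed")

echo "$jaccard_line"
# prints: 300	800	0.375
```

Here the two files share 300 bp and together cover 800 bp, so the statistic is 0.375. The pairwise loop is quadratic and only meant for reading; it is not how bedtools implements the calculation.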


@@ -0,0 +1,67 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools jaccard -h
```
Tool: bedtools jaccard (aka jaccard)
Version: v2.31.1
Summary: Calculate Jaccard statistic b/w two feature files.
Jaccard is the length of the intersection over the union.
Values range from 0 (no intersection) to 1 (self intersection).
Usage: bedtools jaccard [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
Options:
-s Require same strandedness. That is, only report hits in B
that overlap A on the _same_ strand.
- By default, overlaps are reported without respect to strand.
-S Require different strandedness. That is, only report hits in B
that overlap A on the _opposite_ strand.
- By default, overlaps are reported without respect to strand.
-f Minimum overlap required as a fraction of A.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-F Minimum overlap required as a fraction of B.
- Default is 1E-9 (i.e., 1bp).
- FLOAT (e.g. 0.50)
-r Require that the fraction overlap be reciprocal for A AND B.
- In other words, if -f is 0.90 and -r is used, this requires
that B overlap 90% of A and A _also_ overlaps 90% of B.
-e Require that the minimum fraction be satisfied for A OR B.
- In other words, if -e is used with -f 0.90 and -F 0.10 this requires
that either 90% of A is covered OR 10% of B is covered.
Without -e, both fractions would have to be satisfied.
-split Treat "split" BAM or BED12 entries as distinct BED intervals.
-g Provide a genome file to enforce consistent chromosome sort order
across input files. Only applies when used with -sorted option.
-nonamecheck For sorted data, don't throw an error if the file has different naming conventions
for the same chromosome. ex. "chr1" vs "chr01".
-bed If using BAM input, write output as BED.
-header Print the header from the A file prior to results.
-nobuf Disable buffered output. Using this option will cause each line
of output to be printed as it is generated, rather than saved
in a buffer. This will make printing large output files
noticeably slower, but can be useful in conjunction with
other software tools and scripts that need to process one
line of bedtools output at a time.
-iobuf Specify amount of memory to use for input buffer.
Takes an integer argument. Optional suffixes K/M/G supported.
Note: currently has no effect with compressed files.
Notes:
(1) Input files must be sorted by chrom, then start position.
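
Note (1) above requires input sorted by chromosome, then start position. A minimal sketch of one way to prepare such a file with the standard `sort` utility is shown below; the file names and the `workdir` variable are hypothetical, and `LC_ALL=C` is only an assumption made here to get a reproducible byte-wise chromosome order.

```bash
#!/bin/bash
# Sketch: sort a BED file by chromosome (column 1), then numerically by start (column 2).
workdir=$(mktemp -d)

printf 'chr2\t400\t500\tfeature_a5\nchr1\t300\t400\tfeature_a2\nchr1\t100\t200\tfeature_a1\n' > "$workdir/unsorted.bed"

LC_ALL=C sort -k1,1 -k2,2n "$workdir/unsorted.bed" > "$workdir/sorted.bed"

sorted_first=$(head -n1 "$workdir/sorted.bed")
sorted_last=$(tail -n1 "$workdir/sorted.bed")
cat "$workdir/sorted.bed"
```

Both inputs should be sorted the same way before they are passed as `-a` and `-b`.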


@@ -0,0 +1,46 @@
#!/bin/bash
## VIASH START
## VIASH END
set -eo pipefail
# unset flags (using loop for many parameters)
unset_if_false=(
par_reciprocal
par_either
par_same_strand
par_opposite_strand
par_split
par_bed_output
par_header
par_no_name_check
par_no_buffer
)
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
# Build command arguments array
cmd_args=(
-a "$par_input_a"
-b "$par_input_b"
${par_min_overlap_a:+-f "$par_min_overlap_a"}
${par_min_overlap_b:+-F "$par_min_overlap_b"}
${par_reciprocal:+-r}
${par_either:+-e}
${par_same_strand:+-s}
${par_opposite_strand:+-S}
${par_split:+-split}
${par_bed_output:+-bed}
${par_header:+-header}
${par_genome:+-g "$par_genome"}
${par_no_name_check:+-nonamecheck}
${par_no_buffer:+-nobuf}
${par_io_buffer:+-iobuf "$par_io_buffer"}
)
# Execute bedtools jaccard
bedtools jaccard "${cmd_args[@]}" > "$par_output"
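
The script above relies on two bash idioms: unsetting boolean parameters whose value is `false`, and the `${var:+word}` expansion, which produces `word` only when `var` is set and non-empty. The following stand-alone sketch shows how the two combine; the parameter values are made up for illustration, and no bedtools call is made.

```bash
#!/bin/bash
# Illustrative values only; in the component these come from Viash.
par_min_overlap_a="0.5"
par_reciprocal="false"
par_header="true"
unset par_genome

# Unset flags that are "false" so that ${var:+...} skips them.
for par in par_reciprocal par_header; do
  test_val="${!par}"          # indirect expansion: value of the variable named by $par
  [[ "$test_val" == "false" ]] && unset "$par"
done

# Each ${var:+...} contributes its words only if var is set and non-empty.
cmd_args=(
  ${par_min_overlap_a:+-f "$par_min_overlap_a"}
  ${par_reciprocal:+-r}
  ${par_header:+-header}
  ${par_genome:+-g "$par_genome"}
)

printf '%s\n' "${cmd_args[@]}"
# prints three lines: -f, 0.5, -header
```

Only `-f 0.5` and `-header` survive: `par_reciprocal` was unset by the loop and `par_genome` was never set, so their expansions produce nothing.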


@@ -0,0 +1,279 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"
# Initialize test environment
setup_test_env
log "Starting tests for bedtools_jaccard"
####################################################################################################
log "Creating test data..."
cat <<'EOF' > "$meta_temp_dir/intervals_a.bed"
chr1 100 200 feature_a1
chr1 300 400 feature_a2
chr1 500 600 feature_a3
chr2 100 250 feature_a4
chr2 400 500 feature_a5
EOF
cat <<'EOF' > "$meta_temp_dir/intervals_b.bed"
chr1 150 250 feature_b1
chr1 350 450 feature_b2
chr1 550 650 feature_b3
chr2 150 300 feature_b4
chr2 450 550 feature_b5
EOF
# Create genome file for testing
cat <<'EOF' > "$meta_temp_dir/genome.txt"
chr1 1000
chr2 1000
EOF
####################################################################################################
log "TEST 1: Basic Jaccard calculation"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--output "$meta_temp_dir/jaccard_basic.txt"
check_file_exists "$meta_temp_dir/jaccard_basic.txt" "basic Jaccard output"
check_file_not_empty "$meta_temp_dir/jaccard_basic.txt" "basic Jaccard output"
log "Checking output format (should contain intersection, union, jaccard columns)"
check_file_contains "$meta_temp_dir/jaccard_basic.txt" "^[0-9]"
####################################################################################################
log "TEST 2: Jaccard with minimum overlap fraction for A"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--min_overlap_a 0.5 \
--output "$meta_temp_dir/jaccard_overlap_a.txt"
check_file_exists "$meta_temp_dir/jaccard_overlap_a.txt" "overlap A Jaccard output"
check_file_not_empty "$meta_temp_dir/jaccard_overlap_a.txt" "overlap A Jaccard output"
####################################################################################################
log "TEST 3: Jaccard with minimum overlap fraction for B"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--min_overlap_b 0.5 \
--output "$meta_temp_dir/jaccard_overlap_b.txt"
check_file_exists "$meta_temp_dir/jaccard_overlap_b.txt" "overlap B Jaccard output"
check_file_not_empty "$meta_temp_dir/jaccard_overlap_b.txt" "overlap B Jaccard output"
####################################################################################################
log "TEST 4: Jaccard with reciprocal overlap requirement"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--min_overlap_a 0.5 \
--reciprocal \
--output "$meta_temp_dir/jaccard_reciprocal.txt"
check_file_exists "$meta_temp_dir/jaccard_reciprocal.txt" "reciprocal Jaccard output"
check_file_not_empty "$meta_temp_dir/jaccard_reciprocal.txt" "reciprocal Jaccard output"
####################################################################################################
log "TEST 5: Create stranded test data and test strand options"
cat <<'EOF' > "$meta_temp_dir/stranded_a.bed"
chr1 100 200 feature_a1 0 +
chr1 300 400 feature_a2 0 -
chr1 500 600 feature_a3 0 +
EOF
cat <<'EOF' > "$meta_temp_dir/stranded_b.bed"
chr1 150 250 feature_b1 0 +
chr1 350 450 feature_b2 0 +
chr1 550 650 feature_b3 0 -
EOF
"$meta_executable" \
--input_a "$meta_temp_dir/stranded_a.bed" \
--input_b "$meta_temp_dir/stranded_b.bed" \
--same_strand \
--output "$meta_temp_dir/jaccard_same_strand.txt"
check_file_exists "$meta_temp_dir/jaccard_same_strand.txt" "strand-specific output"
check_file_not_empty "$meta_temp_dir/jaccard_same_strand.txt" "strand-specific output"
####################################################################################################
log "TEST 6: Opposite strand requirement (skipped due to bedtools bug)"
log "Skipping opposite strand test due to bedtools jaccard -S option issue"
####################################################################################################
log "TEST 7: Test either flag (-e)"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--min_overlap_a 0.8 \
--min_overlap_b 0.2 \
--either \
--output "$meta_temp_dir/jaccard_either.txt"
check_file_exists "$meta_temp_dir/jaccard_either.txt" "either flag output"
check_file_not_empty "$meta_temp_dir/jaccard_either.txt" "either flag output"
####################################################################################################
log "TEST 8: Test with genome file"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--genome "$meta_temp_dir/genome.txt" \
--output "$meta_temp_dir/jaccard_genome.txt"
check_file_exists "$meta_temp_dir/jaccard_genome.txt" "genome file output"
check_file_not_empty "$meta_temp_dir/jaccard_genome.txt" "genome file output"
####################################################################################################
log "TEST 9: Create BED12 format data and test split option"
cat <<'EOF' > "$meta_temp_dir/bed12_a.bed"
chr1 100 600 feature_a1 0 + 100 600 0 2 100,100 0,400
chr1 800 1200 feature_a2 0 - 800 1200 0 2 100,100 0,300
EOF
cat <<'EOF' > "$meta_temp_dir/bed12_b.bed"
chr1 150 650 feature_b1 0 + 150 650 0 2 100,100 0,400
chr1 850 1250 feature_b2 0 - 850 1250 0 2 100,100 0,300
EOF
"$meta_executable" \
--input_a "$meta_temp_dir/bed12_a.bed" \
--input_b "$meta_temp_dir/bed12_b.bed" \
--split \
--output "$meta_temp_dir/jaccard_split.txt"
check_file_exists "$meta_temp_dir/jaccard_split.txt" "split output"
check_file_not_empty "$meta_temp_dir/jaccard_split.txt" "split output"
####################################################################################################
log "TEST 10: Test header option with GFF input"
cat <<'EOF' > "$meta_temp_dir/gff_a.gff"
##gff-version 3
chr1 test gene 100 200 . + . ID=gene1
chr1 test gene 300 400 . - . ID=gene2
EOF
cat <<'EOF' > "$meta_temp_dir/gff_b.gff"
##gff-version 3
chr1 test exon 150 250 . + . ID=exon1
chr1 test exon 350 450 . + . ID=exon2
EOF
"$meta_executable" \
--input_a "$meta_temp_dir/gff_a.gff" \
--input_b "$meta_temp_dir/gff_b.gff" \
--header \
--output "$meta_temp_dir/jaccard_header.txt"
check_file_exists "$meta_temp_dir/jaccard_header.txt" "header output"
check_file_contains "$meta_temp_dir/jaccard_header.txt" "gff-version"
####################################################################################################
log "TEST 11: Test no-buffer option"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--no_buffer \
--output "$meta_temp_dir/jaccard_nobuf.txt"
check_file_exists "$meta_temp_dir/jaccard_nobuf.txt" "no-buffer output"
check_file_not_empty "$meta_temp_dir/jaccard_nobuf.txt" "no-buffer output"
####################################################################################################
log "TEST 12: Test IO buffer option"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--io_buffer "64M" \
--output "$meta_temp_dir/jaccard_iobuf.txt"
check_file_exists "$meta_temp_dir/jaccard_iobuf.txt" "IO buffer output"
check_file_not_empty "$meta_temp_dir/jaccard_iobuf.txt" "IO buffer output"
####################################################################################################
log "TEST 13: Validate Jaccard values are in proper range"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_b.bed" \
--output "$meta_temp_dir/jaccard_range.txt"
log "Checking Jaccard value is between 0 and 1"
jaccard_value=$(tail -n1 "$meta_temp_dir/jaccard_range.txt" | cut -f3)
log "Jaccard value: $jaccard_value"
# Check that the value is numeric and within range using awk (non-numeric values fail)
if echo "$jaccard_value" | awk '!/^[0-9]*\.?[0-9]+$/ {exit 1} {exit !($1 >= 0 && $1 <= 1)}'; then
log "✓ Jaccard value is in valid range [0,1]"
else
log "Error: Jaccard value $jaccard_value is out of range [0,1]"
exit 1
fi
####################################################################################################
log "TEST 14: Test identical files (should give Jaccard = 1.0)"
"$meta_executable" \
--input_a "$meta_temp_dir/intervals_a.bed" \
--input_b "$meta_temp_dir/intervals_a.bed" \
--output "$meta_temp_dir/jaccard_identical.txt"
log "Checking that identical files give Jaccard = 1"
jaccard_identical=$(tail -n1 "$meta_temp_dir/jaccard_identical.txt" | cut -f3)
log "Jaccard for identical files: $jaccard_identical"
if echo "$jaccard_identical" | awk '!/^[0-9]*\.?[0-9]+$/ {exit 1} {exit !($1 == 1.0)}'; then
log "✓ Identical files correctly give Jaccard = 1.0"
else
log "Warning: Identical files gave Jaccard = $jaccard_identical (expected 1.0)"
fi
####################################################################################################
log "TEST 15: Test no-name-check option with different chromosome naming"
cat <<'EOF' > "$meta_temp_dir/chr_mixed_a.bed"
chr1 100 200 feature_a1
chr01 300 400 feature_a2
EOF
cat <<'EOF' > "$meta_temp_dir/chr_mixed_b.bed"
chr1 150 250 feature_b1
chr01 350 450 feature_b2
EOF
"$meta_executable" \
--input_a "$meta_temp_dir/chr_mixed_a.bed" \
--input_b "$meta_temp_dir/chr_mixed_b.bed" \
--no_name_check \
--output "$meta_temp_dir/jaccard_nonamecheck.txt"
check_file_exists "$meta_temp_dir/jaccard_nonamecheck.txt" "no-name-check output"
check_file_not_empty "$meta_temp_dir/jaccard_nonamecheck.txt" "no-name-check output"
####################################################################################################
log "All tests completed successfully!"


@@ -1,14 +1,19 @@
name: bedtools_links
namespace: bedtools
description: |
Creates an HTML file with links to an instance of the UCSC Genome Browser for all features / intervals in a file.
This is useful for cases when one wants to manually inspect through a large set of annotations or features.
keywords: [Links, BED, GFF, VCF]
description: |
This tool generates an HTML page containing links to the UCSC Genome Browser
for each feature/interval in the input file. This is particularly useful for
manually inspecting large sets of genomic annotations or features through
the browser interface.
**Default behavior:** Links point to human genome (hg18) on the main UCSC site
**Customization:** Supports custom mirror sites and different organisms/builds
keywords: [HTML, Links, UCSC, Browser, BED, GFF, VCF]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/links.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/#
issue_tracker: https://github.com/arq5x/bedtools2/issues
homepage: https://bedtools.readthedocs.io/en/latest/
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
@@ -16,57 +21,80 @@ requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
roles: [author]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [author, maintainer]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
alternatives: [-i]
type: file
description: Input file (bed/gff/vcf).
description: |
Input file in BED, GFF, or VCF format containing genomic intervals.
Each feature/interval will be converted to a clickable link pointing
to the UCSC Genome Browser. File format is auto-detected based on
content and extension.
required: true
example: intervals.bed
- name: Outputs
arguments:
- name: --output
alternatives: -o
alternatives: [-o]
type: file
direction: output
description: Output HTML file to be written.
description: |
Output HTML file containing clickable browser links.
The generated HTML page will contain one link per input feature,
formatted for easy navigation to the UCSC Genome Browser.
required: true
example: browser_links.html
- name: Options
description: |
By default, the links created will point to human (hg18) UCSC browser.
If you have a local mirror, you can override this behavior by supplying
the -base, -org, and -db options.
For example, if the URL of your local mirror for mouse MM9 is called:
http://mymirror.myuniversity.edu, then you would use the following:
--base_url http://mymirror.myuniversity.edu
--organism mouse
--database mm9
arguments:
- name: --base_url
alternatives: -base
alternatives: [-base]
type: string
description: |
The “basename” for the UCSC browser.
default: http://genome.ucsc.edu
description: |
Base URL for the UCSC Genome Browser instance.
**Default:** http://genome.ucsc.edu (official UCSC site)
**Custom mirrors:** Use your institution's mirror URL
**Example:** http://mymirror.myuniversity.edu
example: "http://genome.ucsc.edu"
- name: --organism
alternatives: -org
alternatives: [-org]
type: string
description: |
The organism (e.g. mouse, human).
default: human
description: |
Target organism for genome browser links.
**Common values:**
- human (default)
- mouse
- rat
- fly
- worm
Must match organism names used by your UCSC browser instance.
example: "human"
- name: --database
alternatives: -db
alternatives: [-db]
type: string
description: |
The genome build.
default: hg18
description: |
Genome assembly/build identifier.
**Human examples:** hg19, hg38, hg18 (default)
**Mouse examples:** mm9, mm10, mm39
**Other:** Assembly names as recognized by UCSC browser
Must correspond to available assemblies for the specified organism.
example: "hg18"
resources:
- type: bash_script
@@ -75,16 +103,16 @@ resources:
test_resources:
- type: bash_script
path: test.sh
- type: file
path: /src/_utils/test_helpers.sh
engines:
- type: docker
image: debian:stable-slim
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
run:
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
runners:
- type: executable
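
The new setup command above rewrites the first line of `bedtools --version` into a `name: version` entry. As a small sketch of what the `sed` expression does, the snippet below applies it to a sample string instead of calling bedtools; the sample text is an assumption about the tool's version output for 2.31.1.

```bash
#!/bin/bash
# Assumed sample of the first line printed by `bedtools --version`.
sample_version_output="bedtools v2.31.1"

# Same pipeline as in the setup step, fed from the sample string.
version_line=$(echo "$sample_version_output" | head -1 | sed 's/.*bedtools v/bedtools: /')

echo "$version_line"
# prints: bedtools: 2.31.1
```

The substitution drops everything up to and including `bedtools v`, so the leading `v` no longer appears in `/var/software_versions.txt`.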
