Build branch biobox/main with version main to biobox on branch main (7158daa)
Build pipeline: viash-hub.biobox.main-tb4cv
Source commit: 7158daa5f6
Source message: Fix bases2fastq component, update to latest practices (#190)
* wip updates
* refactor component
* assume bases2fastq follows semver
* fix version command
* add entry to changelog
* move to minor changes
91
CHANGELOG.md
@@ -1,3 +1,94 @@
# Unreleased

<!-- Add new changes here before release -->

## BREAKING CHANGES

* `fq_subsample` has been removed after its functionality was previously copied to `fq/fq_subsample`. Please use the latter instead. (PR #182).

## NEW FUNCTIONALITY

* `fq`:
  - `fq/fq_filter`: Filter FASTQ files based on record names or sequence patterns (PR #182).
  - `fq/fq_generate`: Generate a random FASTQ file pair for testing and simulation purposes (PR #182).

* `bwa`: Added BWA support for single-end and paired-end read alignment (PR #183).
  - `bwa/bwa_index`: Create BWA index files for reference genome alignment.
  - `bwa/bwa_mem`: BWA-MEM algorithm for sequence alignment supporting single-end and paired-end reads.
  - `bwa/bwa_aln`: BWA aln algorithm for aligning short sequence reads to a reference genome.
  - `bwa/bwa_samse`: BWA samse - generate single-end alignment in SAM format from BWA aln SAI files.
  - `bwa/bwa_sampe`: BWA sampe - generate paired-end alignment in SAM format from BWA aln SAI files.

* `bowtie2`: Add support for Bowtie2 alignment and indexing (PR #184).
  - `bowtie2/bowtie2_build`: Build Bowtie2 index files from reference sequences.
  - `bowtie2/bowtie2_align`: Align single-end and paired-end reads using Bowtie2.
  - `bowtie2/bowtie2_inspect`: Extract information from Bowtie2 index files.

* `bedtools`: Major expansion with 32 new components providing comprehensive genomic interval analysis (PR #188):
  - `bedtools/bedtools_annotate`: Annotate coverage based on overlaps with interval files
  - `bedtools/bedtools_bedpetobam`: Convert BEDPE to BAM format
  - `bedtools/bedtools_closest`: Find closest features between two interval files
  - `bedtools/bedtools_cluster`: Cluster nearby intervals
  - `bedtools/bedtools_complement`: Report intervals not covered by features
  - `bedtools/bedtools_coverage`: Compute coverage of features
  - `bedtools/bedtools_expand`: Expand blocked BED features
  - `bedtools/bedtools_fisher`: Compute Fisher's exact test for overlaps
  - `bedtools/bedtools_flank`: Create flanking intervals around features
  - `bedtools/bedtools_igv`: Create IGV batch scripts for visualization
  - `bedtools/bedtools_jaccard`: Compute Jaccard statistic between interval sets
  - `bedtools/bedtools_makewindows`: Make windows across genome or intervals
  - `bedtools/bedtools_map`: Map values from overlapping intervals
  - `bedtools/bedtools_maskfasta`: Mask FASTA sequences using intervals
  - `bedtools/bedtools_multicov`: Count coverage across multiple BAM files
  - `bedtools/bedtools_multiinter`: Identify common intervals across multiple files
  - `bedtools/bedtools_overlap`: Compute overlaps between paired-end reads and intervals
  - `bedtools/bedtools_pairtobed`: Find overlaps between paired-end reads and intervals
  - `bedtools/bedtools_pairtopair`: Find overlaps between paired-end read sets
  - `bedtools/bedtools_random`: Generate random intervals
  - `bedtools/bedtools_reldist`: Compute relative distances between features
  - `bedtools/bedtools_sample`: Sample random subsets of intervals
  - `bedtools/bedtools_shift`: Shift intervals by specified amounts
  - `bedtools/bedtools_shuffle`: Shuffle intervals while preserving size
  - `bedtools/bedtools_slop`: Extend intervals by specified amounts
  - `bedtools/bedtools_spacing`: Report spacing between intervals
  - `bedtools/bedtools_split`: Split BED12 features into individual intervals
  - `bedtools/bedtools_subtract`: Remove overlapping features
  - `bedtools/bedtools_summary`: Summarize interval statistics
  - `bedtools/bedtools_tag`: Tag BAM alignments with overlapping intervals
  - `bedtools/bedtools_unionbedg`: Combine multiple BEDGRAPH files
  - `bedtools/bedtools_window`: Find overlapping features within specified windows

## MAJOR CHANGES

* `bedtools`: Enhanced 11 existing bedtools components with improved functionality and standardized interfaces (PR #188):
  - `bedtools/bedtools_bamtobed`: Enhanced with additional output format options
  - `bedtools/bedtools_bamtofastq`: Improved paired-end read handling
  - `bedtools/bedtools_bed12tobed6`: Standardized parameter handling
  - `bedtools/bedtools_bedtobam`: Enhanced genome file support
  - `bedtools/bedtools_genomecov`: Added scale and split options
  - `bedtools/bedtools_getfasta`: Improved FASTA extraction features
  - `bedtools/bedtools_groupby`: Enhanced grouping and operation options
  - `bedtools/bedtools_intersect`: Expanded intersection mode support
  - `bedtools/bedtools_links`: Improved link generation functionality
  - `bedtools/bedtools_merge`: Enhanced merging options and distance parameters
  - `bedtools/bedtools_sort`: Standardized sorting options

## MINOR CHANGES

* `bases2fastq`: Updated component with comprehensive argument support and latest practices (PR #190).

## DOCUMENTATION

* Major restructuring of the documentation pages (PR #185):
  - `CONTRIBUTING.md`: Streamlined guide with detailed sections moved to dedicated docs/ guides.
  - `README.md`: Streamlined content to guide people towards what they need.
  - `docs/COMPONENT_DEVELOPMENT.md`: New comprehensive guide covering component creation process.
  - `docs/SCRIPT_DEVELOPMENT.md`: New detailed guide for script development best practices.
  - `docs/TESTING.md`: New comprehensive testing guide.
  - `docs/DOCKER_GUIDE.md`: New Docker and engine best practices guide.

* `.github/PULL_REQUEST_TEMPLATE.md`: Fixed repository references to point to correct biobox repository instead of base template (PR #185).

# biobox 0.3.2

## NEW FUNCTIONALITY

484
CONTRIBUTING.md
@@ -1,445 +1,145 @@
|
||||
# Contributing Guidelines
|
||||
|
||||
# Contributing guidelines
|
||||
We encourage contributions from the community! This guide will help you get started with creating new components for the biobox repository.
|
||||
|
||||
We encourage contributions from the community. To contribute:
|
||||
**Quick overview:** Fork → Develop → Test → Submit PR
|
||||
|
||||
1. **Fork the Repository**: Start by forking this repository to your account.
|
||||
2. **Develop Your Component**: Create your Viash component, ensuring it aligns with our best practices (detailed below).
|
||||
3. **Submit a Pull Request**: After testing your component, submit a pull request for review.
|
||||
## Quick Start
|
||||
|
||||
## Procedure of adding a component
|
||||
|
||||
### Step 1: Find a component to contribute
|
||||
|
||||
* Find a tool to contribute to this repo.
|
||||
|
||||
* Check whether it is already in the [Project board](https://github.com/orgs/viash-hub/projects/1).
|
||||
|
||||
* Check whether there is a corresponding [Snakemake wrapper](https://github.com/snakemake/snakemake-wrappers/blob/master/bio) or [nf-core module](https://github.com/nf-core/modules/tree/master/modules/nf-core) which we can use as inspiration.
|
||||
|
||||
* Create an issue to show that you are working on this component.
|
||||
|
||||
|
||||
### Step 2: Add config template
|
||||
|
||||
Change all occurrences of `xxx` to the name of the component.
|
||||
|
||||
Create a file at `src/xxx/config.vsh.yaml` with contents:
|
||||
### Essential Config Template
|
||||
|
||||
```yaml
|
||||
name: xxx
|
||||
description: xxx
|
||||
name: your_tool
|
||||
namespace: category
|
||||
description: Brief description of what the tool does
|
||||
keywords: [tag1, tag2]
|
||||
links:
|
||||
homepage: yyy
|
||||
documentation: yyy
|
||||
issue_tracker: yyy
|
||||
repository: yyy
|
||||
references:
|
||||
doi: 12345/12345678.yz
|
||||
license: MIT/Apache-2.0/GPL-3.0/...
|
||||
homepage: https://tool-homepage.com
|
||||
documentation: https://tool-docs.com
|
||||
repository: https://github.com/user/repo
|
||||
references:
|
||||
doi: 10.1000/journal.12345
|
||||
license: MIT/Apache-2.0/GPL-3.0
|
||||
requirements:
|
||||
commands: [your-tool, dependency-tool]
|
||||
authors:
|
||||
- __merge__: /src/_authors/your_name.yaml
|
||||
roles: [author, maintainer]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments: <...>
|
||||
- name: Outputs
|
||||
arguments: <...>
|
||||
- name: Arguments
|
||||
arguments: <...>
|
||||
arguments: [...]
|
||||
- name: Outputs
|
||||
arguments: [...]
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- <...>
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/tool:version--build_string
|
||||
setup:
|
||||
- type: docker
|
||||
run:
|
||||
- tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
```
|
||||
|
||||
### Step 3: Fill in the metadata
|
||||
|
||||
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
|
||||
|
||||
```yaml
|
||||
name: arriba
|
||||
description: Detect gene fusions from RNA-Seq data
|
||||
keywords: [Gene fusion, RNA-Seq]
|
||||
links:
|
||||
homepage: https://arriba.readthedocs.io/en/latest/
|
||||
documentation: https://arriba.readthedocs.io/en/latest/
|
||||
repository: https://github.com/suhrig/arriba
|
||||
issue_tracker: https://github.com/suhrig/arriba/issues
|
||||
references:
|
||||
doi: 10.1101/gr.257246.119
|
||||
bibtex: |
|
||||
@article{
|
||||
... a bibtex entry in case the doi is not available ...
|
||||
}
|
||||
license: MIT
|
||||
```
|
||||
|
||||
### Step 4: Find a suitable container
|
||||
|
||||
Google `biocontainer <name of component>` and find the container that is most suitable. Typically the link will be `https://quay.io/repository/biocontainers/xxx?tab=tags`.
|
||||
|
||||
If no such container is found, you can create a custom container in the next step.
|
||||
|
||||
|
||||
### Step 5: Create help file
|
||||
|
||||
To help develop the component, we store the `--help` output of the tool in a file at `src/xxx/help.txt`.
|
||||
|
||||
````bash
|
||||
cat <<EOF > src/xxx/help.txt
|
||||
```sh
|
||||
xxx --help
|
||||
```
|
||||
EOF
|
||||
|
||||
docker run quay.io/biocontainers/xxx:tag xxx --help >> src/xxx/help.txt
|
||||
````
|
||||
|
||||
Notes:
|
||||
|
||||
* This help file has no functional purpose, but it is useful for the developer to see the help output of the tool.
|
||||
|
||||
* Some tools might not have a `--help` argument but instead have a `-h` argument. For example, for `arriba`, the help message is obtained by running `arriba -h`:
|
||||
|
||||
```bash
|
||||
docker run quay.io/biocontainers/arriba:2.4.0--h0033a41_2 arriba -h
|
||||
```
|
||||
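
The `--help` versus `-h` fallback described in the notes can be scripted. The sketch below is an illustration only: `capture_help` is a hypothetical helper, not part of Viash or biobox, and it assumes the tool exits with a non-zero status when it does not recognise `--help`.

```bash
#!/bin/bash

# Hypothetical helper (not part of Viash or biobox): write a tool's help text
# to a file, trying `--help` first and falling back to `-h`.
# Assumption: the tool exits non-zero when it does not recognise `--help`.
capture_help() {
  local out_file="$1"
  shift
  # "$@" is the command used to invoke the tool, for example:
  #   docker run quay.io/biocontainers/xxx:tag xxx
  if ! "$@" --help > "$out_file" 2>&1; then
    "$@" -h > "$out_file" 2>&1
  fi
  # Succeed only if some help text was actually captured.
  [ -s "$out_file" ]
}
```

A call could look like `capture_help src/xxx/help.txt docker run quay.io/biocontainers/xxx:tag xxx`. Unlike the commands above, this overwrites the file rather than appending to it.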
|
||||
|
||||
### Step 6: Create or fetch test data
|
||||
|
||||
To help develop the component, it is useful to have some test data available. In most cases, we can use the test data from the Snakemake wrappers.
|
||||
|
||||
To make sure we can reproduce the test data in the future, we store the command to fetch the test data in a file at `src/xxx/test_data/script.sh`.
|
||||
### Essential Commands
|
||||
|
||||
```bash
|
||||
cat <<EOF > src/xxx/test_data/script.sh
|
||||
# Create component structure
|
||||
mkdir -p src/namespace/tool_name
|
||||
touch src/namespace/tool_name/{script.sh,test.sh,config.vsh.yaml}
|
||||
|
||||
# clone repo
|
||||
if [ ! -d /tmp/snakemake-wrappers ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
|
||||
fi
|
||||
# Generate help file
|
||||
docker run container tool --help > src/namespace/tool_name/help.txt
|
||||
|
||||
# copy test data
|
||||
cp -r /tmp/snakemake-wrappers/bio/xxx/test/* src/xxx/test_data
|
||||
EOF
|
||||
# Test your component
|
||||
viash test src/namespace/tool_name/config.vsh.yaml
|
||||
|
||||
# Build for testing
|
||||
viash build src/namespace/tool_name/config.vsh.yaml --setup cachedbuild
|
||||
```
|
||||
|
||||
The test data should be suitable for testing this component. Ensure that the test data is small enough: ideally <1KB, preferably <10KB, if need be <100KB.
|
||||
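
The size guideline can be checked before committing. This is a minimal sketch, assuming the test data lives in a single directory; `test_data_bytes` and `check_test_data_size` are hypothetical helper names, and 1KB is read as 1024 bytes.

```bash
#!/bin/bash

# Hypothetical helper: total size in bytes of all regular files in a directory.
test_data_bytes() {
  find "$1" -type f -exec cat {} + | wc -c | tr -d '[:space:]'
}

# Hypothetical helper: report which tier of the guideline a directory falls in
# (<1KB ideal, <10KB preferred, <100KB acceptable). Assumes 1KB = 1024 bytes.
check_test_data_size() {
  local bytes
  bytes=$(test_data_bytes "$1")
  if   [ "$bytes" -lt 1024 ];   then echo "ideal ($bytes bytes)"
  elif [ "$bytes" -lt 10240 ];  then echo "preferred ($bytes bytes)"
  elif [ "$bytes" -lt 102400 ]; then echo "acceptable ($bytes bytes)"
  else
    echo "too large ($bytes bytes)"
    return 1
  fi
}
```

For example, `check_test_data_size src/xxx/test_data` prints the tier and returns a non-zero status when the directory is 100KB or larger.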
### Key Best Practices
|
||||
|
||||
### Step 7: Add arguments for the input files
|
||||
- **Follow modern standards**: Use current coding patterns and component structure
|
||||
- **Ensure reproducibility**: Pin versions and document dependencies clearly
|
||||
- **Generate test data**: Create self-contained tests that don't rely on external files
|
||||
- **Write clean code**: Use consistent naming and clear, maintainable scripts
|
||||
|
||||
By looking at the help file, we add the input arguments to the config file. Here is an example of the input arguments of an existing component.
|
||||
For detailed implementation guidelines, check out our development guides:
|
||||
|
||||
For instance, in the [arriba help file](src/arriba/help.txt), we see the following:
|
||||
## Development Guides
|
||||
|
||||
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
|
||||
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
|
||||
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
|
||||
-o fusions.tsv [-O fusions.discarded.tsv] \
|
||||
[OPTIONS]
|
||||
### 🔧 [Component Development Guide](docs/COMPONENT_DEVELOPMENT.md)
|
||||
How to create components: config templates, metadata, arguments, containers, help files, and Docker setup.
|
||||
|
||||
-x FILE File in SAM/BAM/CRAM format with main alignments as generated by STAR
|
||||
(Aligned.out.sam). Arriba extracts candidate reads from this file.
|
||||
### 📝 [Script Development Guide](docs/SCRIPT_DEVELOPMENT.md)
|
||||
Writing good scripts: array-based commands, error handling, conditional parameters, boolean flags, and parameter patterns.
|
||||
|
||||
Based on this information, we can add the following input arguments to the config file.
|
||||
### ✅ [Testing Guide](docs/TESTING.md)
|
||||
Testing your components: self-contained tests, generating test data, output validation, and testing multiple scenarios.
|
||||
|
||||
```yaml
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --bam
|
||||
alternatives: -x
|
||||
type: file
|
||||
description: |
|
||||
File in SAM/BAM/CRAM format with main alignments as generated by STAR
|
||||
(`Aligned.out.sam`). Arriba extracts candidate reads from this file.
|
||||
required: true
|
||||
example: Aligned.out.bam
|
||||
```
|
||||
### 🐳 [Docker Guide](docs/DOCKER_GUIDE.md)
|
||||
Working with containers: choosing biocontainers, version pinning, detecting software versions, and container best practices.
|
||||
|
||||
Check the [documentation](https://viash.io/reference/config/functionality/arguments) for more information on the format of input arguments.
|
||||
## Contribution Process
|
||||
|
||||
Several notes:
|
||||
### Submitting Your Component
|
||||
|
||||
* Argument names should be formatted in `--snake_case`. This means arguments like `--foo-bar` should be formatted as `--foo_bar`, and short arguments like `-f` should receive a longer name like `--foo`.
|
||||
1. **Test thoroughly**: Ensure your component passes all tests
|
||||
```bash
|
||||
viash test src/namespace/tool_name/config.vsh.yaml
|
||||
```
|
||||
|
||||
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
|
||||
2. **Add changelog entry**: Document your changes in `CHANGELOG.md` under the "Unreleased" section
|
||||
|
||||
* The description should be formatted in markdown.
|
||||
3. **Review your changes**: Check your code for:
|
||||
- Consistent naming and coding conventions
|
||||
- Clear, maintainable code structure
|
||||
- Proper error handling
|
||||
- Robust edge case management
|
||||
- Complete documentation and helpful comments
|
||||
|
||||
### Step 8: Add arguments for the output files
|
||||
4. **Create a pull request**: Submit your changes.
|
||||
- Include a clear description of the changes you've made
|
||||
- Link to any relevant issues or discussions
|
||||
- Review the changes critically before submitting the PR
|
||||
|
||||
By looking at the help file, we now also add output arguments to the config file.
|
||||
### Review Process
|
||||
|
||||
For example, in the [arriba help file](src/arriba/help.txt), we see the following:
|
||||
- All contributions go through code review
|
||||
- Components must pass automated tests
|
||||
- Docker containers must be properly versioned
|
||||
- Documentation must be complete and accurate
|
||||
|
||||
## Getting Help
|
||||
|
||||
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
|
||||
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
|
||||
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
|
||||
-o fusions.tsv [-O fusions.discarded.tsv] \
|
||||
[OPTIONS]
|
||||
### Resources
|
||||
|
||||
-o FILE Output file with fusions that have passed all filters.
|
||||
- **[Viash Documentation](https://viash.io/)**
|
||||
- **[GitHub Discussions](https://github.com/viash-io/biobox/discussions)**
|
||||
- **[Issue Tracker](https://github.com/viash-io/biobox/issues)**
|
||||
|
||||
-O FILE Output file with fusions that were discarded due to filtering.
|
||||
### Common Questions
|
||||
|
||||
Based on this information, we can add the following output arguments to the config file.
|
||||
**Q: How do I find the right Docker container?**
|
||||
A: Search for "biocontainer [tool_name]" or check [quay.io/biocontainers](https://quay.io/organization/biocontainers)
|
||||
|
||||
```yaml
|
||||
argument_groups:
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --fusions
|
||||
alternatives: -o
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with fusions that have passed all filters.
|
||||
required: true
|
||||
example: fusions.tsv
|
||||
- name: --fusions_discarded
|
||||
alternatives: -O
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with fusions that were discarded due to filtering.
|
||||
required: false
|
||||
example: fusions.discarded.tsv
|
||||
```
|
||||
**Q: My component fails to build. What should I check?**
|
||||
A: Verify the Docker image exists, check syntax in config.vsh.yaml, and ensure all required commands are available
|
||||
|
||||
Note:
|
||||
**Q: How do I handle tools with complex argument patterns?**
|
||||
A: Check existing similar components for patterns, or ask in GitHub Discussions
|
||||
|
||||
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
|
||||
**Q: Can I create custom Docker containers?**
|
||||
A: Yes, but biocontainers are preferred when available. See the [Docker Guide](docs/DOCKER_GUIDE.md) for details.
|
||||
|
||||
### Step 9: Add the remaining arguments
|
||||
---
|
||||
|
||||
Finally, add all other arguments to the config file. There are a few exceptions:
|
||||
|
||||
* Arguments related to specifying CPU and memory requirements are handled separately and should not be added to the config file.
|
||||
|
||||
* Arguments that only print information, such as the version (`-v`, `--version`) or the help (`-h`, `--help`), should not be added to the config file.
|
||||
|
||||
* If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`
|
||||
|
||||
Note:
|
||||
|
||||
* Prefer using `boolean_true` over `boolean_false`. This avoids confusion when specifying values for this argument in a Nextflow workflow.
|
||||
For example, consider the CLI option `--no-indels` for `cutadapt`. If the config for `cutadapt` would specify an argument `no_indels` of type `boolean_false`,
|
||||
the script of the component must pass a `--no-indels` argument to `cutadapt` when `par_no_indels` is set to `false`. This becomes problematic when setting a value for this argument using `fromState` in a Nextflow workflow: with `fromState: ["no_indels": true]`, the value that gets passed to the script is `true` and the `--no-indels` flag would *not* be added to the options for `cutadapt`. This is inconsistent with what one might expect when interpreting `["no_indels": true]`.
|
||||
When using `boolean_true`, the reasoning becomes simpler because its value no longer represents the effect of the argument, but whether or not the flag is set.
|
||||
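
The `boolean_true` case can be sketched in Bash. This is an illustration, not the actual `cutadapt` component: it assumes a hypothetical `boolean_true` argument named `--no_indels`, exposed to the script as `par_no_indels`, and `build_cutadapt_cmd` is a made-up function name.

```bash
#!/bin/bash

# Assumption: the config declares `--no_indels` with type `boolean_true`, so
# par_no_indels is "true" when the user sets the flag and "false" otherwise.
build_cutadapt_cmd() {
  local par_no_indels="$1"
  local cmd=(cutadapt)
  # The value mirrors whether the flag is set: "true" adds --no-indels.
  if [[ "$par_no_indels" == "true" ]]; then
    cmd+=(--no-indels)
  fi
  # Print the command instead of running it.
  echo "${cmd[*]}"
}

build_cutadapt_cmd true    # prints: cutadapt --no-indels
build_cutadapt_cmd false   # prints: cutadapt
```

With this mapping, `fromState: ["no_indels": true]` and the presence of `--no-indels` on the command line mean the same thing.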
|
||||
### Step 10: Add a Docker engine
|
||||
|
||||
To ensure reproducibility of components, we require that all components are run in a Docker container.
|
||||
|
||||
```yaml
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/xxx:0.1.0--py_0
|
||||
```
|
||||
|
||||
The container should have your tool installed, as well as `ps`.
|
||||
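
One way to verify this is to probe for the required commands. The sketch below is a minimal example: `missing_commands` is a hypothetical helper, and the `docker run` line in the usage note assumes the image provides `sh`.

```bash
#!/bin/bash

# Hypothetical helper: print every command from the argument list that is
# not available on PATH, and return non-zero if any is missing.
missing_commands() {
  local cmd status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" > /dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}
```

Inside a container the same probe can be run directly, for example `docker run --rm quay.io/biocontainers/xxx:0.1.0--py_0 sh -c 'command -v xxx && command -v ps'`.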
|
||||
If you didn't find a suitable container in the previous step, you can create a custom container. For example:
|
||||
|
||||
```yaml
|
||||
engines:
|
||||
- type: docker
|
||||
image: python:3.10
|
||||
setup:
|
||||
- type: python
|
||||
packages: numpy
|
||||
```
|
||||
|
||||
For more information on how to do this, see the [documentation](https://viash.io/guide/component/add-dependencies.html#steps-for-creating-a-custom-docker-platform).
|
||||
|
||||
Here is a list of base containers we can recommend:
|
||||
|
||||
* Bash: [`bash`](https://hub.docker.com/_/bash), [`ubuntu`](https://hub.docker.com/_/ubuntu)
|
||||
* C#: [`ghcr.io/data-intuitive/dotnet-script`](https://github.com/data-intuitive/ghcr-dotnet-script/pkgs/container/dotnet-script)
|
||||
* JavaScript: [`node`](https://hub.docker.com/_/node)
|
||||
* Python: [`python`](https://hub.docker.com/_/python), [`nvcr.io/nvidia/pytorch`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
|
||||
* R: [`eddelbuettel/r2u`](https://hub.docker.com/r/eddelbuettel/r2u), [`rocker/tidyverse`](https://hub.docker.com/r/rocker/tidyverse)
|
||||
* Scala: [`sbtscala/scala-sbt`](https://hub.docker.com/r/sbtscala/scala-sbt)
|
||||
|
||||
### Step 11: Write a runner script
|
||||
|
||||
Next, create a Bash script at `src/xxx/script.sh` that runs the tool with the arguments defined in the config.
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# unset flags
|
||||
[[ "$par_option" == "false" ]] && unset par_option
|
||||
|
||||
xxx \
|
||||
--input "$par_input" \
|
||||
--output "$par_output" \
|
||||
${par_option:+--option}
|
||||
```
|
||||
|
||||
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
|
||||
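
Because everything between the two markers is replaced at build time, the block is a convenient place for placeholder values when running the script by hand during development. This is a minimal sketch, assuming arguments named `--input`, `--output` and `--option` as in the template above; the file names are made up.

```bash
#!/bin/bash

## VIASH START
# Placeholder values for running the script directly during development.
# Viash replaces this whole block when the component is built.
par_input="test_data/reads_R1.fastq"
par_output="output.txt"
par_option="true"
## VIASH END

# unset flags
[[ "$par_option" == "false" ]] && unset par_option

# Collect the command in an array and print it instead of running it,
# since `xxx` is only a stand-in tool name.
cmd=(
  xxx
  --input "$par_input"
  --output "$par_output"
  ${par_option:+--option}
)
echo "${cmd[*]}"
```

Run as is, the script prints `xxx --input test_data/reads_R1.fastq --output output.txt --option`.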
|
||||
As an example, this is what the Bash script for the `arriba` component looks like:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# unset flags
|
||||
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
|
||||
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
|
||||
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
|
||||
|
||||
arriba \
|
||||
-x "$par_bam" \
|
||||
-a "$par_genome" \
|
||||
-g "$par_gene_annotation" \
|
||||
-o "$par_fusions" \
|
||||
${par_known_fusions:+-k "${par_known_fusions}"} \
|
||||
${par_blacklist:+-b "${par_blacklist}"} \
|
||||
# ...
|
||||
${par_extra_information:+-X} \
|
||||
${par_fill_gaps:+-I}
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
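* If your arguments can contain special characters (e.g. `$`), wrap the expansion in double quotes so that the value reaches the tool unchanged. Example: `-x "$par_bam"`. The `@Q` transformation described under Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html) adds literal quote characters and is only appropriate when the resulting string is parsed by the shell again (for example with `eval`).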
|
||||
|
||||
* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k "${par_known_fusions}"}`
|
||||
|
||||
* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo "$par_disable_filters" | tr ';' ',')`.
|
||||
|
||||
* If you have a lot of boolean variables that you would like to unset when the value is `false`, you can avoid duplicate code by using the following syntax:
|
||||
|
||||
```bash
|
||||
unset_if_false=(
|
||||
par_argument_1
|
||||
par_argument_2
|
||||
par_argument_3
|
||||
par_argument_4
|
||||
)
|
||||
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset "$par"
|
||||
done
|
||||
```
|
||||
|
||||
This code is equivalent to:
|
||||
|
||||
```bash
|
||||
[[ "$par_argument_1" == "false" ]] && unset par_argument_1
|
||||
[[ "$par_argument_2" == "false" ]] && unset par_argument_2
|
||||
[[ "$par_argument_3" == "false" ]] && unset par_argument_3
|
||||
[[ "$par_argument_4" == "false" ]] && unset par_argument_4
|
||||
```
|
||||
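
The notes above can be combined into one sketch that builds the command as a Bash array, so each value stays a single argument regardless of spaces or `$` characters. The parameter values are made up for illustration, and the `-f` option for the comma-separated filter list is an assumption about the tool's interface rather than something stated in this guide.

```bash
#!/bin/bash

# Made-up example values, standing in for what Viash would set.
par_bam='sample $1/Aligned.out.bam'          # contains a space and a `$`
par_known_fusions='known fusions.tsv'        # optional, set
par_blacklist=''                             # optional, not set
par_disable_filters='duplicates;mismatches'  # multiple values joined with `;`

# Replace the Viash separator `;` with `,` using parameter expansion.
par_disable_filters="${par_disable_filters//;/,}"

# Build the command as an array; quoted expansions keep each value intact.
cmd=(arriba -x "$par_bam")
[[ -n "$par_known_fusions" ]]   && cmd+=(-k "$par_known_fusions")
[[ -n "$par_blacklist" ]]       && cmd+=(-b "$par_blacklist")
# Assumption: `-f` takes the comma-separated list of filters to disable.
[[ -n "$par_disable_filters" ]] && cmd+=(-f "$par_disable_filters")

# Show one argument per line; a real script would run "${cmd[@]}" instead.
printf '%s\n' "${cmd[@]}"
```

The array form avoids nested quoting inside `${par:+...}` expansions, at the cost of one extra line per optional argument.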
|
||||
|
||||
### Step 12: Create test script
|
||||
|
||||
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
|
||||
|
||||
```yaml
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
```
|
||||
|
||||
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_doesnt_exist() {
|
||||
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_empty() {
|
||||
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains() {
|
||||
! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_contains_regex() {
|
||||
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains_regex() {
|
||||
! grep -q -E "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
#############################################
|
||||
|
||||
echo "> Run $meta_name with test data"
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/reads_R1.fastq" \
|
||||
--output "output.txt" \
|
||||
--option
|
||||
|
||||
echo ">> Check if output exists"
|
||||
assert_file_exists "output.txt"
|
||||
|
||||
echo ">> Check if output is not empty"
|
||||
assert_file_not_empty "output.txt"
|
||||
|
||||
echo ">> Check if output is correct"
|
||||
assert_file_contains "output.txt" "some expected output"
|
||||
|
||||
echo "> All tests succeeded!"
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
* Always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.
|
||||
|
||||
* If possible, generate your own test data instead of copying it from an external resource.
|
||||
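
Both notes can be illustrated together: the sketch below writes a tiny FASTQ file inside the test and validates it with regular expressions instead of an exact string. The read name, sequence and file name are invented for the example.

```bash
#!/bin/bash

# Generate a minimal one-record FASTQ file instead of downloading test data.
test_dir=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > "$test_dir/reads_R1.fastq"

# Same helper as in the test script above.
assert_file_contains_regex() {
  grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}

# Check structure rather than exact content: a header line starting with `@`
# and a sequence line consisting only of nucleotide characters.
assert_file_contains_regex "$test_dir/reads_R1.fastq" '^@read[0-9]+$'
assert_file_contains_regex "$test_dir/reads_R1.fastq" '^[ACGTN]+$'

echo "> Generated test data looks valid"
```

Generating the input in the test keeps the component self-contained and makes the expected content obvious to the reader of the test.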
|
||||
### Step 13: Create a `/var/software_versions.txt` file
|
||||
|
||||
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
|
||||
|
||||
For now, this is managed by creating a file `/var/software_versions.txt` in the `setup` section of the Docker engine.
|
||||
|
||||
```yaml
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/xxx:0.1.0--py_0
|
||||
setup:
|
||||
- type: docker
|
||||
# note: /var/software_versions.txt should contain:
|
||||
# xxx: "0.1.0"
|
||||
run: |
|
||||
echo "xxx: \"0.1.0\"" > /var/software_versions.txt
|
||||
```
|
||||
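
When the tool can report its own version, the entry can be derived from that output instead of hard-coding the number. The sketch below shows only the text transformation, applied to a made-up version string; `format_version` is a hypothetical helper, and it assumes the first line of the tool's output has the form `<name> version <number>`.

```bash
#!/bin/bash

# Hypothetical helper: turn the first line of `--version` output into the
# `name: "version"` form used in /var/software_versions.txt.
# Assumption: that line looks like "<anything> version <number>".
format_version() {
  local name="$1"
  head -n 1 | sed -E "s/.*version ([^ ]+).*/${name}: \"\1\"/"
}

# Made-up sample output, standing in for `xxx --version 2>&1`.
echo "xxx version 0.1.0" | format_version xxx   # prints: xxx: "0.1.0"
```

Tools format their version output differently, so the pattern usually needs adjusting per tool.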
Happy contributing!
|
||||
|
||||
212
README.md
@@ -11,132 +11,114 @@ Issues](https://img.shields.io/github/issues/viash-hub/biobox.svg)](https://gith
|
||||
[](https://viash.io)
|
||||
|
||||
A curated collection of high-quality, standalone bioinformatics
|
||||
components built with [Viash](https://viash.io).
|
||||
**A curated collection of high-quality, production-ready bioinformatics
|
||||
components**
|
||||
|
||||
## Introduction
|
||||
Built with [Viash](https://viash.io), biobox provides reliable,
|
||||
containerized tools for genomics and bioinformatics workflows. Each
|
||||
component is thoroughly tested, fully documented, and designed for
|
||||
seamless integration into both standalone and Nextflow pipelines.
|
||||
|
||||
`biobox` offers a suite of reliable bioinformatics components, similar
|
||||
to [nf-core/modules](https://github.com/nf-core/modules) and
|
||||
[snakemake-wrappers/bio](https://github.com/snakemake/snakemake-wrappers/tree/master/bio),
|
||||
but built using the [Viash](https://viash.io) framework.
|
||||
## Why Choose biobox?
|
||||
|
||||
This approach emphasizes **reusability**, **reproducibility**, and
|
||||
adherence to **best practices**. Key features of `biobox` components
|
||||
include:
|
||||
✅ **Production Ready**: All components are containerized with pinned
|
||||
versions and comprehensive testing
|
||||
✅ **Nextflow Native**: Drop-in compatibility with Nextflow workflows
|
||||
✅ **Complete Documentation**: Full parameter exposure with detailed
|
||||
help and examples
|
||||
✅ **Quality Assured**: Unit tested with automated CI/CD validation
|
||||
✅ **Modern Standards**: Built with current best practices and
|
||||
maintained dependencies
|
||||
|
||||
- **Standalone & Nextflow Ready:** Run components directly via the
|
||||
command line or seamlessly integrate them into Nextflow workflows.
|
||||
- **High Quality Standards:**
|
||||
- Comprehensive documentation for components and parameters.
|
||||
- Full exposure of underlying tool arguments.
|
||||
- Containerized (Docker) for dependency management and
|
||||
reproducibility.
|
||||
- Unit tested for verified functionality.
|
||||
## Featured Tools
|
||||
|
||||
## Example Usage
|
||||
Our collection spans the complete bioinformatics pipeline:
|
||||
|
||||
Viash components in biobox can be run in various ways:
|
||||
**Alignment & Mapping**: BWA, Bowtie2, STAR, Kallisto, Salmon
|
||||
**Quality Control**: FastQC, Falco, MultiQC, Qualimap, NanoPlot
|
||||
**Preprocessing**: Cutadapt, fastp, Trimgalore, UMI-tools
|
||||
**Variant Calling**: BCFtools, LoFreq, SnpEff
|
||||
**File Manipulation**: SAMtools, Bedtools, seqtk
|
||||
**Assembly & Annotation**: BUSCO, AGAT, GFFread
|
||||
**Single Cell**: CellRanger, BD Rhapsody
|
||||
|
||||
``` mermaid lang="mermaid"
|
||||
flowchart TD
|
||||
A[biobox v0.3.1] --> B(Viash Hub Launch)
|
||||
A --> C(Viash CLI)
|
||||
A --> D(Nextflow CLI)
|
||||
A --> E(Seqera Cloud)
|
||||
A --> F(As a dependency)
|
||||
[View all components →](https://www.viash-hub.com/packages/biobox)
|
||||
|
||||
## Quick Start
|
||||
|
||||
You can run Viash components from biobox in several ways:
|
||||
|
||||
**🌐 Via Viash Hub Web UI**: Interactive interface with documentation
|
||||
and examples
|
||||
**⚡ As Standalone Executables**: Direct command-line execution
|
||||
**🔄 Via Nextflow**: Local or cloud-based pipeline workflows
|
||||
|
||||
For detailed instructions on each method, visit the **[Viash Hub
|
||||
documentation →](https://viash-hub.com/packages/biobox)** where each
|
||||
component page shows exactly how to run it in different environments.
|
||||
|
||||
``` mermaid
|
||||
flowchart LR
|
||||
A[biobox Components] --> B[🌐 Web UI]
|
||||
A --> C[⚡ Standalone]
|
||||
A --> D[🔄 Nextflow Local]
|
||||
A --> E[☁️ Nextflow Cloud]
|
||||
|
||||
style A fill:#7a4baa,color:#fff
|
||||
style B fill:#e1f5fe,color:#000
|
||||
style C fill:#e8f5e8,color:#000
|
||||
style D fill:#fff3e0,color:#000
|
||||
style E fill:#f3e5f5,color:#000
|
||||
```
|
||||
|
||||
### 1. Via the Viash Hub Launch interface
|
||||
|
||||
You can run this component directly from the Viash Hub [Launch
|
||||
interface](https://www.viash-hub.com/launch?package=biobox&version=v0.3.1&component=arriba&runner=Executable).
|
||||
|
||||

|
||||
|
||||
### 2. Via the Viash CLI
|
||||
|
||||
You can run this component directly from the command line using the
|
||||
Viash CLI.
|
||||
|
||||
``` bash
|
||||
viash run vsh://biobox@v0.3.1/arriba -- --help
|
||||
|
||||
viash run vsh://biobox@v0.3.1/arriba -- \
|
||||
--bam path/to/input.bam \
|
||||
--genome path/to/genome.fa \
|
||||
--gene_annotation path/to/annotation.gtf \
|
||||
--fusions path/to/output.txt
|
||||
```
|
||||
|
||||
This will run the component with the specified input files and output
|
||||
the results to the specified output file.
|
||||
|
||||
### 3. Via the Nextflow CLI or Seqera Cloud
|
||||
|
||||
You can run this component as a Nextflow pipeline.
|
||||
|
||||
``` bash
|
||||
nextflow run https://packages.viash-hub.com/vsh/biobox \
|
||||
-revision v0.3.1 \
|
||||
-main-script target/nextflow/arriba/main.nf \
|
||||
-latest -resume \
|
||||
-profile docker \
|
||||
--bam path/to/input.bam \
|
||||
--genome path/to/genome.fa \
|
||||
--gene_annotation path/to/annotation.gtf \
|
||||
--publish_dir path/to/output
|
||||
```
|
||||
|
||||
**Note:** Make sure that the [Nextflow
|
||||
SCM](https://www.nextflow.io/docs/latest/git.html#git-configuration) is
|
||||
set up properly. You can do this by adding the following lines to your
|
||||
`~/.nextflow/scm` file:
|
||||
|
||||
``` groovy
|
||||
providers.vsh.platform = 'gitlab'
|
||||
providers.vsh.server = 'https://packages.viash-hub.com'
|
||||
```
|
||||
|
||||
**Tip:** This will also work with Seqera Cloud or other
|
||||
Nextflow-compatible platforms.
|
||||
|
||||
### 4. As a dependency
|
||||
|
||||
In your Viash config file (`config.vsh.yaml`), you can add this
|
||||
component as a dependency:
|
||||
|
||||
``` yaml
|
||||
dependencies:
|
||||
- name: arriba
|
||||
repository: vsh://biobox@v0.3.1
|
||||
```
|
||||
|
||||
**Tip:** See the [Viash
|
||||
documentation](https://viash.io/guide/nextflow_vdsl3/create-a-pipeline.html#pipeline-as-a-component)
|
||||
for more details on how to use Viash components as a dependency in your
|
||||
own Nextflow workflows.
|
||||
You can run components directly from Viash Hub’s launch interface. See
|
||||
[Viash Hub](https://www.viash-hub.com/packages/biobox) for more
|
||||
information.
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions are welcome! We aim to build a comprehensive collection of
|
||||
high-quality bioinformatics components. If you’d like to contribute,
|
||||
please follow these general steps:
|
||||
We welcome contributions! biobox thrives on community input to expand
|
||||
our collection of high-quality bioinformatics components.
|
||||
|
||||
1. Find a component to contribute
|
||||
2. Add config template
|
||||
3. Fill in the metadata
|
||||
4. Find a suitable container
|
||||
5. Create help file
|
||||
6. Create or fetch test data
|
||||
7. Add arguments for the input files
|
||||
8. Add arguments for the output files
|
||||
9. Add arguments for the other arguments
|
||||
10. Add a Docker engine
|
||||
11. Write a runner script
|
||||
12. Create test script
|
||||
13. Create a `/var/software_versions.txt` file
|
||||
### Quick Contribution Process
|
||||
|
||||
See the
|
||||
[CONTRIBUTING](https://github.com/viash-hub/biobox/blob/main/CONTRIBUTING.md)
|
||||
file for more details.
|
||||
1. **Fork** the repository
|
||||
2. **Create** your component following our guidelines
|
||||
3. **Test** thoroughly with `viash test`
|
||||
4. **Submit** a pull request
|
||||
|
||||
### What We’re Looking For
|
||||
|
||||
- **Popular bioinformatics tools** missing from our collection
|
||||
- **Improvements** to existing components
|
||||
- **Bug fixes** and documentation enhancements
|
||||
- **Best practice** implementations
|
||||
|
||||
### Getting Started
|
||||
|
||||
Check out our comprehensive guides:
|
||||
|
||||
- **[Contributing
|
||||
Guidelines](https://github.com/viash-hub/biobox/blob/main/CONTRIBUTING.md)** -
|
||||
Complete development guide
|
||||
- **[Component Standards](docs/COMPONENT_DEVELOPMENT.md)** - Quality
|
||||
requirements
|
||||
- **[Testing Guide](docs/TESTING.md)** - Validation best practices
|
||||
|
||||
**New to Viash?** Start with our [beginner-friendly
|
||||
issues](https://github.com/viash-hub/biobox/labels/good%20first%20issue)
|
||||
or join our [community
|
||||
discussions](https://github.com/viash-hub/biobox/discussions).
|
||||
|
||||
## Community & Support
|
||||
|
||||
- **Documentation**: [Viash Documentation](https://viash.io)
|
||||
- **Discussions**: [GitHub
|
||||
Discussions](https://github.com/viash-hub/biobox/discussions)
|
||||
- **Issues**: [Bug Reports & Feature
|
||||
Requests](https://github.com/viash-hub/biobox/issues)
|
||||
|
||||
------------------------------------------------------------------------
|
||||
|
||||
**Ready to streamline your bioinformatics workflows?** [Get started with
|
||||
biobox today →](https://www.viash-hub.com/packages/biobox)
|
||||
|
||||
160
README.qmd
@@ -7,9 +7,15 @@ license <- paste0(package$links$repository, "/blob/main/LICENSE")
|
||||
contributing <- paste0(package$links$repository, "/blob/main/CONTRIBUTING.md")
|
||||
|
||||
pkg <- package$name
|
||||
ver <- if (!is.null(package$version)) package$version else "v0.3.1"
|
||||
comp <- "arriba"
|
||||
ver <- if (!is.null(package$version)) package$version else "v0.4.0"
|
||||
comp <- "bowtie2_align"
|
||||
|
||||
# Count components
|
||||
component_dirs <- list.dirs("src", recursive = FALSE, full.names = FALSE)
|
||||
component_dirs <- component_dirs[!startsWith(component_dirs, "_")]
|
||||
n_tools <- length(component_dirs)
|
||||
```
|
||||
|
||||
# 🌱📦 `r pkg`
|
||||
|
||||
[](https://www.viash-hub.com/packages/`r pkg`)
|
||||
@@ -18,106 +24,98 @@ comp <- "arriba"
|
||||
[](`r package$links$issue_tracker`)
|
||||
[`-blue.svg)](https://viash.io)
|
||||
|
||||
`r package$summary`
|
||||
**A curated collection of high-quality, production-ready bioinformatics components**
|
||||
|
||||
## Introduction
|
||||
Built with [Viash](https://viash.io), `r pkg` provides reliable, containerized tools for genomics and bioinformatics workflows. Each component is thoroughly tested, fully documented, and designed for seamless integration into both standalone and Nextflow pipelines.
|
||||
|
||||
`r package$description`
|
||||
## Why Choose `r pkg`?
|
||||
|
||||
## Example Usage
|
||||
✅ **Production Ready**: All components are containerized with pinned versions and comprehensive testing
|
||||
✅ **Nextflow Native**: Drop-in compatibility with Nextflow workflows
|
||||
✅ **Complete Documentation**: Full parameter exposure with detailed help and examples
|
||||
✅ **Quality Assured**: Unit tested with automated CI/CD validation
|
||||
✅ **Modern Standards**: Built with current best practices and maintained dependencies
|
||||
|
||||
Viash components in `r pkg` can be run in various ways:
|
||||
## Featured Tools
|
||||
|
||||
Our collection spans the complete bioinformatics pipeline:
|
||||
|
||||
**Alignment & Mapping**: BWA, Bowtie2, STAR, Kallisto, Salmon
|
||||
**Quality Control**: FastQC, Falco, MultiQC, Qualimap, NanoPlot
|
||||
**Preprocessing**: Cutadapt, fastp, Trimgalore, UMI-tools
|
||||
**Variant Calling**: BCFtools, LoFreq, SnpEff
|
||||
**File Manipulation**: SAMtools, Bedtools, seqtk
|
||||
**Assembly & Annotation**: BUSCO, AGAT, GFFread
|
||||
**Single Cell**: CellRanger, BD Rhapsody
|
||||
|
||||
[View all components →](https://www.viash-hub.com/packages/`r pkg`)
|
||||
|
||||
## Quick Start
|
||||
|
||||
You can run Viash components from `r pkg` in several ways:
|
||||
|
||||
**🌐 Via Viash Hub Web UI**: Interactive interface with documentation and examples
|
||||
**⚡ As Standalone Executables**: Direct command-line execution
|
||||
**🔄 Via Nextflow**: Local or cloud-based pipeline workflows
|
||||
|
||||
For detailed instructions on each method, visit the **[Viash Hub documentation →](https://viash-hub.com/packages/`r pkg`)** where each component page shows exactly how to run it in different environments.
|
||||
|
||||
```{r mmd, echo=FALSE, results='asis'}
|
||||
cat(
|
||||
"```mermaid\n",
|
||||
"flowchart TD\n",
|
||||
" A[", pkg, " ", ver, "] --> B(Viash Hub Launch)\n",
|
||||
" A --> C(Viash CLI)\n",
|
||||
" A --> D(Nextflow CLI)\n",
|
||||
" A --> E(Seqera Cloud)\n",
|
||||
" A --> F(As a dependency)\n",
|
||||
"flowchart LR\n",
|
||||
" A[", pkg, " Components] --> B[🌐 Web UI]\n",
|
||||
" A --> C[⚡ Standalone]\n",
|
||||
" A --> D[🔄 Nextflow Local]\n",
|
||||
" A --> E[☁️ Nextflow Cloud]\n",
|
||||
" \n",
|
||||
" style A fill:#7a4baa,color:#fff\n",
|
||||
" style B fill:#e1f5fe,color:#000\n",
|
||||
" style C fill:#e8f5e8,color:#000\n",
|
||||
" style D fill:#fff3e0,color:#000\n",
|
||||
" style E fill:#f3e5f5,color:#000\n",
|
||||
"```\n",
|
||||
sep = ""
|
||||
)
|
||||
```
|
||||
|
||||
### 1. Via the Viash Hub Launch interface
|
||||
You can run components directly from Viash Hub's launch interface. See [Viash Hub](https://www.viash-hub.com/packages/`r pkg`) for more information.
|
||||
|
||||
You can run this component directly from the Viash Hub [Launch interface](https://www.viash-hub.com/launch?package=`r pkg`&version=`r ver`&component=`r comp`&runner=Executable).
|
||||
|
||||

|
||||
|
||||
### 2. Via the Viash CLI
|
||||
|
||||
You can run this component directly from the command line using the Viash CLI.
|
||||
|
||||
```bash
|
||||
viash run vsh://`r pkg`@`r ver`/`r comp` -- --help
|
||||
|
||||
viash run vsh://`r pkg`@`r ver`/`r comp` -- \
|
||||
--bam path/to/input.bam \
|
||||
--genome path/to/genome.fa \
|
||||
--gene_annotation path/to/annotation.gtf \
|
||||
--fusions path/to/output.txt
|
||||
```
|
||||
|
||||
This will run the component with the specified input files and output the results to the specified output file.
|
||||
|
||||
### 3. Via the Nextflow CLI or Seqera Cloud
|
||||
|
||||
You can run this component as a Nextflow pipeline.
|
||||
|
||||
```bash
|
||||
nextflow run https://packages.viash-hub.com/vsh/`r pkg` \
|
||||
-revision `r ver` \
|
||||
-main-script target/nextflow/`r comp`/main.nf \
|
||||
-latest -resume \
|
||||
-profile docker \
|
||||
--bam path/to/input.bam \
|
||||
--genome path/to/genome.fa \
|
||||
--gene_annotation path/to/annotation.gtf \
|
||||
--publish_dir path/to/output
|
||||
```
|
||||
|
||||
**Note:** Make sure that the [Nextflow SCM](https://www.nextflow.io/docs/latest/git.html#git-configuration) is set up properly. You can do this by adding the following lines to your `~/.nextflow/scm` file:
|
||||
|
||||
```groovy
|
||||
providers.vsh.platform = 'gitlab'
|
||||
providers.vsh.server = 'https://packages.viash-hub.com'
|
||||
```
|
||||
|
||||
**Tip:** This will also work with Seqera Cloud or other Nextflow-compatible platforms.
|
||||
|
||||
### 4. As a dependency
|
||||
|
||||
In your Viash config file (`config.vsh.yaml`), you can add this component as a dependency:
|
||||
|
||||
```yaml
|
||||
dependencies:
|
||||
- name: `r comp`
|
||||
repository: vsh://`r pkg`@`r ver`
|
||||
```
|
||||
|
||||
**Tip:** See the [Viash documentation](https://viash.io/guide/nextflow_vdsl3/create-a-pipeline.html#pipeline-as-a-component) for more details on how to use Viash components as a dependency in your own Nextflow workflows.
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions are welcome! We aim to build a comprehensive collection of high-quality bioinformatics components. If you'd like to contribute, please follow these general steps:
|
||||
We welcome contributions! `r pkg` thrives on community input to expand our collection of high-quality bioinformatics components.
|
||||
|
||||
### Quick Contribution Process
|
||||
|
||||
```{r echo=FALSE}
|
||||
lines <- readr::read_lines("CONTRIBUTING.md")
|
||||
1. **Fork** the repository
|
||||
2. **Create** your component following our guidelines
|
||||
3. **Test** thoroughly with `viash test`
|
||||
4. **Submit** a pull request
|
||||
|
||||
index_start <- grep("^### Step [0-9]*:", lines)
|
||||
### What We're Looking For
|
||||
|
||||
index_end <- c(index_start[-1] - 1, length(lines))
|
||||
- **Popular bioinformatics tools** missing from our collection
|
||||
- **Improvements** to existing components
|
||||
- **Bug fixes** and documentation enhancements
|
||||
- **Best practice** implementations
|
||||
|
||||
name <- gsub("^### Step [0-9]*: *", "", lines[index_start])
|
||||
### Getting Started
|
||||
|
||||
knitr::asis_output(
|
||||
paste(paste0(" 1. ", name, "\n"), collapse = "")
|
||||
)
|
||||
```
|
||||
Check out our comprehensive guides:
|
||||
|
||||
See the [CONTRIBUTING](`r contributing`) file for more details.
|
||||
- **[Contributing Guidelines](`r contributing`)** - Complete development guide
|
||||
- **[Component Standards](docs/COMPONENT_DEVELOPMENT.md)** - Quality requirements
|
||||
- **[Testing Guide](docs/TESTING.md)** - Validation best practices
|
||||
|
||||
**New to Viash?** Start with our [beginner-friendly issues](https://github.com/viash-hub/biobox/labels/good%20first%20issue) or join our [community discussions](https://github.com/viash-hub/biobox/discussions).
|
||||
|
||||
## Community & Support
|
||||
|
||||
- **Documentation**: [Viash Documentation](https://viash.io)
|
||||
- **Discussions**: [GitHub Discussions](https://github.com/viash-hub/biobox/discussions)
|
||||
- **Issues**: [Bug Reports & Feature Requests](https://github.com/viash-hub/biobox/issues)
|
||||
|
||||
---
|
||||
|
||||
**Ready to streamline your bioinformatics workflows?** [Get started with `r pkg` today →](https://www.viash-hub.com/packages/`r pkg`)
|
||||
|
||||
29
_viash.yaml
@@ -17,34 +17,33 @@ keywords: [bioinformatics, modules, sequencing]
|
||||
links:
|
||||
issue_tracker: https://github.com/viash-hub/biobox/issues
|
||||
repository: https://github.com/viash-hub/biobox
|
||||
|
||||
viash_version: 0.9.4
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author, maintainer]
|
||||
- __merge__: /src/_authors/angela_o_pisco.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/dorien_roosen.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/dries_schaumont.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/jakub_majercik.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/kai_waldrant.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/leila_paquay.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/toni_verbeiren.yaml
|
||||
roles: [ author ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/weiwei_schultz.yaml
|
||||
roles: [ author ]
|
||||
|
||||
roles: [author]
|
||||
config_mods: |
|
||||
.requirements.commands := ['ps']
|
||||
version: main
|
||||
organization: vsh
|
||||
|
||||
268
docs/COMPONENT_DEVELOPMENT.md
Normal file
@@ -0,0 +1,268 @@
|
||||
# Component Development Guide
|
||||
|
||||
This guide provides detailed step-by-step instructions for creating a new component in biobox.
|
||||
|
||||
## Table of Contents
|
||||
- [Initial Setup](#initial-setup)
|
||||
- [Configuration](#configuration)
|
||||
- [Arguments](#arguments)
|
||||
- [Implementation](#implementation)
|
||||
- [Testing](#testing)
|
||||
- [Documentation](#documentation)
|
||||
|
||||
## Initial Setup
|
||||
|
||||
### Step 1: Find a component to contribute
|
||||
|
||||
* Find a tool to contribute to this repo.
|
||||
* Check whether it is already in the [Project board](https://github.com/orgs/viash-hub/projects/1).
|
||||
* Check whether there is a corresponding [Snakemake wrapper](https://github.com/snakemake/snakemake-wrappers/blob/master/bio) or [nf-core module](https://github.com/nf-core/modules/tree/master/modules/nf-core) which we can use as inspiration.
|
||||
* Create an issue to show that you are working on this component.
|
||||
|
||||
### Step 2: Find a suitable container
|
||||
|
||||
Search for `biocontainer <name of component>` and pick the most suitable container. Typically the link will be `https://quay.io/repository/biocontainers/xxx?tab=tags`.
|
||||
|
||||
If no such container is found, you can create a custom container in a later step.
|
||||
|
||||
### Step 3: Create help file
|
||||
|
||||
To help develop the component, we store the `--help` output of the tool in a file at `src/xxx/help.txt`.
|
||||
|
||||
```bash
# Write the header. printf is used because backticks inside an unquoted
# heredoc would be treated as command substitutions.
printf '%s\n' '```sh' 'xxx --help' '```' > src/xxx/help.txt

docker run quay.io/biocontainers/xxx:tag xxx --help >> src/xxx/help.txt
```
|
||||
|
||||
**Notes:**
|
||||
* The help file is not used at runtime; it is a reference for the developer while writing the component.
|
||||
* Some tools might not have a `--help` argument but instead have a `-h` argument.
|
||||
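As a rough sketch of the `--help` / `-h` note above, the helper below tries `--help` first and falls back to `-h` when the first attempt fails or prints nothing. `capture_help` and `fake_tool` are hypothetical names used only for illustration; they are not part of Viash or biobox, and the success check (exit code 0 plus non-empty output) is an assumption that will not hold for every tool.

```bash
#!/usr/bin/env bash
# capture_help <outfile> <command> [args...]
# Hypothetical helper: tries `--help`, then `-h`, and stores the output.
# If both attempts fail, the output of the last attempt (`-h`) is kept.
capture_help() {
  local outfile="$1"
  shift
  local flag output
  for flag in --help -h; do
    if output="$("$@" "$flag" 2>&1)" && [ -n "$output" ]; then
      break
    fi
  done
  {
    printf '$ %s %s\n' "$*" "$flag"   # record which invocation was used
    printf '%s\n' "$output"
  } > "$outfile"
}

# Stand-in for a real tool that only understands `-h`.
fake_tool() {
  if [ "$1" = "-h" ]; then
    echo "usage: fake_tool [-h]"
  else
    return 2
  fi
}

help_file="$(mktemp)"
capture_help "$help_file" fake_tool
cat "$help_file"
```

With a real container, the same helper could be called as `capture_help src/xxx/help.txt docker run quay.io/biocontainers/xxx:tag xxx`.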
|
||||
## Configuration
|
||||
|
||||
### Metadata Setup
|
||||
|
||||
Fill in the relevant metadata fields in the config:
|
||||
|
||||
```yaml
name: bowtie2_build
namespace: bowtie2
description: |
  Build Bowtie2 index files from reference sequences.
keywords: [Alignment, Indexing]
links:
  homepage: https://bowtie-bio.sourceforge.net/bowtie2/index.shtml
  documentation: https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
  repository: https://github.com/BenLangmead/bowtie2
references:
  doi: 10.1038/nmeth.1923
license: GPL-3.0
requirements:
  commands: [bowtie2-build]
authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author, maintainer]
```
|
||||
|
||||
### Requirements Specification
|
||||
|
||||
The `requirements` section documents the dependencies needed by your component:
|
||||
|
||||
```yaml
requirements:
  commands: [bowtie2-build, bowtie2]
```
|
||||
|
||||
**Why specify commands:**
|
||||
- Documents which executables the component expects
|
||||
- Enables validation that the Docker container has required tools
|
||||
- Helps users understand dependencies
|
||||
- Facilitates automated testing and CI/CD
|
||||
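The validation mentioned above can be approximated with a small shell check. This is only a sketch: `check_commands` is a hypothetical helper, and how Viash itself uses `requirements.commands` is not shown here. It relies on the POSIX `command -v` builtin to test whether each executable is on the `PATH`.

```bash
#!/usr/bin/env bash
# check_commands <cmd>...
# Hypothetical helper: reports every command that is not found on PATH
# and returns non-zero if at least one is missing.
check_commands() {
  local missing=0
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# `sh` and `ls` stand in for the entries of requirements.commands.
if check_commands sh ls; then
  echo "all required commands found"
fi
```

Inside a container, the arguments would be the entries listed under `requirements.commands`, for example `check_commands bowtie2-build bowtie2`.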
|
||||
## Arguments
|
||||
|
||||
### Input Arguments
|
||||
|
||||
Using the help file as a reference, add the input arguments to the config file:
|
||||
|
||||
```yaml
argument_groups:
  - name: Inputs
    arguments:
      - name: --bam
        alternatives: -x
        type: file
        description: |
          File in SAM/BAM/CRAM format with main alignments as generated by STAR
          (`Aligned.out.sam`). Arriba extracts candidate reads from this file.
        required: true
        example: Aligned.out.bam
```
|
||||
|
||||
**Key principles:**
|
||||
* Argument names should be formatted in `--snake_case`
|
||||
* Input arguments can have `multiple: true` to allow multiple files
|
||||
* **Descriptions must be formatted in markdown** - they will be used downstream for rendering documentation
|
||||
* You can make minor changes to the formatting of arguments to improve clarity and better utilize markdown structure
|
||||
* Use markdown features like code blocks, lists, emphasis, and links to enhance readability
|
||||
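The `--snake_case` naming rule from the list above can be checked mechanically. The function below is a minimal sketch with a hypothetical name (`is_snake_case_arg`); the exact pattern (lowercase letters and digits separated by single underscores) is an assumption about what counts as snake case, not a rule taken from Viash.

```bash
#!/usr/bin/env bash
# is_snake_case_arg <name>
# Hypothetical helper: succeeds when <name> looks like --snake_case.
is_snake_case_arg() {
  local LC_ALL=C   # keep [a-z] limited to ASCII lowercase letters
  local re='^--[a-z][a-z0-9]*(_[a-z0-9]+)*$'
  [[ "$1" =~ $re ]]
}

for name in --gene_annotation --geneAnnotation --gene-annotation; do
  if is_snake_case_arg "$name"; then
    echo "ok:  $name"
  else
    echo "fix: $name"
  fi
done
```

A check like this could run over all argument names of a config before opening a pull request.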
|
||||
### Output Arguments
|
||||
|
||||
Add output arguments based on the tool's help:
|
||||
|
||||
```yaml
argument_groups:
  - name: Outputs
    arguments:
      - name: --fusions
        alternatives: -o
        type: file
        direction: output
        description: |
          Output file with fusions that have passed all filters.
        required: true
        example: fusions.tsv
```
|
||||
|
||||
**Note:** Preferably, outputs should be files rather than directories.
|
||||
|
||||
### Other Arguments
|
||||
|
||||
Add all other arguments with these exceptions:
|
||||
* Arguments related to CPU and memory requirements are handled separately
|
||||
* Version (`-v`, `--version`) or help (`-h`, `--help`) arguments should be excluded
|
||||
* If the help file lists defaults, mention them in the argument's description rather than setting them as `default` values
|
||||
|
||||
**Boolean handling:**
|
||||
* Prefer using `boolean_true` over `boolean_false` to avoid confusion in Nextflow workflows
|
||||
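One way a wrapper script might turn `boolean_true` arguments into command-line flags is sketched below. It assumes that such arguments reach the script as the strings `true` or `false` in `par_<name>` variables; `par_verbose`, `par_quiet`, and `tool` are made-up names for illustration.

```bash
#!/usr/bin/env bash
# Assumption: boolean_true arguments arrive as "true" or "false".
par_verbose="true"
par_quiet="false"

# Unset every flag that is "false" so that ${var:+...} expands to nothing.
for flag in par_verbose par_quiet; do
  if [ "${!flag}" = "false" ]; then
    unset "$flag"
  fi
done

# Left unquoted on purpose: an unset flag must disappear entirely.
cmd=(tool ${par_verbose:+--verbose} ${par_quiet:+--quiet})
echo "${cmd[*]}"
```

Because only `boolean_true` is used, a flag is either present or absent, which keeps the mapping between the config and the wrapped command easy to follow.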
|
||||
### Description Formatting Guidelines
|
||||
|
||||
Argument descriptions should always be written in **markdown format** as they are used downstream for documentation rendering. Here are best practices:
|
||||
|
||||
**Good markdown formatting examples:**
|
||||
|
||||
```yaml
description: |
  Input FASTQ file containing reads. Supports compressed files (`.gz`, `.bz2`).

  **Supported formats:**
  - FASTQ (`.fastq`, `.fq`)
  - Compressed FASTQ (`.fastq.gz`, `.fq.gz`)

  See the [FASTQ format specification](https://en.wikipedia.org/wiki/FASTQ_format) for details.
```
|
||||
|
||||
```yaml
description: |
  Maximum number of mismatches allowed during alignment.

  **Default behavior:**
  - For reads ≤50bp: 2 mismatches
  - For reads >50bp: 3 mismatches

  Set to `0` for exact matches only.
```
|
||||
|
||||
**Formatting improvements you can make:**
|
||||
- Add code formatting for file extensions, parameters, and values
|
||||
- Use lists and bullet points for multiple options
|
||||
- Add emphasis with **bold** or *italic* text
|
||||
- Include links to external documentation
|
||||
- Structure complex descriptions with headers
|
||||
- Use code blocks for examples
|
||||
|
||||
**Original tool help vs. improved description:**
|
||||
|
||||
```yaml
# Original: "Input file in BAM format"
# Improved:
description: |
  Input file in BAM format containing aligned sequences.

  The file must be coordinate-sorted and indexed. Use `samtools sort`
  and `samtools index` if needed.
```
|
||||
|
||||
## Meta Variables
|
||||
|
||||
**Important:** Never add `threads`, `cores`, `cpus`, or `memory` as regular parameters. Instead, use Viash's built-in meta variables.
|
||||
|
||||
### Available Meta Variables
|
||||
|
||||
Viash provides several meta variables that are automatically available in your scripts:
|
||||
|
||||
- **`meta_cpus`** (integer): Maximum number of logical CPUs the component can use
|
||||
- **`meta_memory_*`** (long): Maximum memory allocation in various units:
|
||||
- `meta_memory_b`, `meta_memory_kb`, `meta_memory_mb`
|
||||
- `meta_memory_gb`, `meta_memory_tb`, `meta_memory_pb`
|
||||
- `meta_memory_kib`, `meta_memory_mib`, `meta_memory_gib`, `meta_memory_tib`, `meta_memory_pib`
|
||||
- **`meta_temp_dir`** (string): Temporary directory for the component
|
||||
- **`meta_resources_dir`** (string): Path to component resources
|
||||
- **`meta_name`** (string): Component name (useful for logging)
|
||||
- **`meta_executable`** (string): Path to the wrapped executable
|
||||
- **`meta_config`** (string): Path to the processed config YAML
|
||||
|
||||
### Usage Example
|
||||
|
||||
```bash
# Use meta_cpus instead of a threads parameter
./tool --threads "${meta_cpus:-1}" --input "$par_input" --output "$par_output"

# Use meta_memory_gb for memory-intensive tools
./tool --memory "${meta_memory_gb:-8}G" --input "$par_input" --output "$par_output"
```
|
||||
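Instead of falling back to a hard-coded default, a script can also pass resource options only when the meta variables are actually set. The sketch below assumes a wrapped tool that accepts `--threads` and `--memory`; `build_resource_args` is a hypothetical helper, not something Viash provides.

```bash
#!/usr/bin/env bash
# build_resource_args
# Hypothetical helper: prints resource flags derived from meta_cpus and
# meta_memory_gb, skipping any variable that is unset or empty.
build_resource_args() {
  local args=()
  if [ -n "${meta_cpus:-}" ]; then
    args+=(--threads "$meta_cpus")
  fi
  if [ -n "${meta_memory_gb:-}" ]; then
    args+=(--memory "${meta_memory_gb}G")
  fi
  echo "${args[*]}"
}

# Values Viash would normally inject; set here so the sketch runs standalone.
meta_cpus=4
meta_memory_gb=16
echo "./tool $(build_resource_args) --input in.txt"
```

This leaves the tool's own defaults in place whenever no limits are given on the command line.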
|
||||
### Setting Meta Values
|
||||
|
||||
```bash
# When running with viash
viash run config.vsh.yaml --cpus 8 --memory 16GB -- --input file.txt

# When using built executables
./my_tool ---cpus 8 ---memory 16GB --input file.txt
```
|
||||
|
||||
For more details, see the [Viash Variables Documentation](https://viash.io/guide/component/variables.html).
|
||||
|
||||
## Implementation
|
||||
|
||||
See [Script Development Guide](SCRIPT_DEVELOPMENT.md) for detailed script writing guidelines.
|
||||
|
||||
## Testing
|
||||
|
||||
See [Testing Guide](TESTING.md) for comprehensive testing practices.
|
||||
|
||||
## Documentation
|
||||
|
||||
### Version Documentation
|
||||
|
||||
Add version detection to the Docker engine setup:
|
||||
|
||||
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:2.5.4--he96a11b_6
    setup:
      - type: docker
        run:
          # Quoted: an unquoted ": " would be parsed by YAML as a key/value separator
          - "xxx --version 2>&1 | head -1 | sed 's/.*version /xxx: /' > /var/software_versions.txt"
```
|
||||
|
||||
**Common version extraction patterns:**
|
||||
|
||||
```bash
|
||||
# For tools that output "Tool version X.Y.Z"
|
||||
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
|
||||
|
||||
# For tools that output just the version number
|
||||
echo "tool: $(tool --version 2>&1 | head -1)" > /var/software_versions.txt
|
||||
|
||||
# For tools with complex version output
|
||||
tool --version 2>&1 | grep -E "^[0-9]" | head -1 | sed 's/^/tool: /' > /var/software_versions.txt
|
||||
```
|
||||
310
docs/DOCKER_GUIDE.md
Normal file
@@ -0,0 +1,310 @@
|
||||
# Docker and Engine Best Practices
|
||||
|
||||
This guide covers best practices for setting up Docker engines and managing dependencies in biobox components.
|
||||
|
||||
## Table of Contents
|
||||
- [Preferred Approach: Biocontainers](#preferred-approach-biocontainers)
|
||||
- [Finding Biocontainers](#finding-biocontainers)
|
||||
- [Version Detection](#version-detection)
|
||||
- [Docker Run Syntax](#docker-run-syntax)
|
||||
- [Custom Containers](#custom-containers)
|
||||
- [Recommended Base Containers](#recommended-base-containers)
|
||||
- [Multi-tool Containers](#multi-tool-containers)
|
||||
- [Container Optimization](#container-optimization)
|
||||
- [Testing Docker Setup](#testing-docker-setup)
|
||||
|
||||
## Preferred Approach: Biocontainers
|
||||
|
||||
### Basic Setup
|
||||
|
||||
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/bowtie2:2.5.4--he96a11b_6
    setup:
      - type: docker
        run:
          # Quoted: an unquoted ": " would be parsed by YAML as a key/value separator
          - "bowtie2 --version 2>&1 | head -1 | sed 's/.*version /bowtie2: /' > /var/software_versions.txt"
```
|
||||
|
||||
### Key Requirements
|
||||
|
||||
1. **Use specific versions**: Always pin to specific versions with build strings
|
||||
2. **Include version detection**: Add setup commands to create `/var/software_versions.txt`
|
||||
3. **Verify command availability**: Ensure the container has the required commands from `requirements.commands`
|
||||
|
||||
## Finding Biocontainers
|
||||
|
||||
### Search Strategy
|
||||
|
||||
1. **Google search**: `biocontainer <tool_name>`
|
||||
2. **Direct URL**: `https://quay.io/repository/biocontainers/<tool_name>?tab=tags`
|
||||
3. **Check version compatibility**: Choose the most recent stable version
|
||||
4. **Verify build string**: Include the complete version tag with build string
|
||||
|
||||
### Version Selection
|
||||
|
||||
```yaml
|
||||
# Good: Specific version with build string
|
||||
image: quay.io/biocontainers/samtools:1.17--hd87286a_2
|
||||
|
||||
# Bad: Latest or incomplete version
|
||||
image: quay.io/biocontainers/samtools:latest
|
||||
image: quay.io/biocontainers/samtools:1.17
|
||||
```
|
||||
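The good and bad tags above differ in a way that is easy to test: a pinned biocontainers tag carries a version followed by `--` and a build string. The check below is a sketch under that assumption; `is_pinned_tag` is a hypothetical helper, and the pattern may be too strict for tools whose version numbers contain hyphens.

```bash
#!/usr/bin/env bash
# is_pinned_tag <image>
# Hypothetical helper: succeeds when the image tag looks like
# <version>--<build string>, e.g. 1.17--hd87286a_2.
is_pinned_tag() {
  local LC_ALL=C
  local image="$1"
  local tag="${image##*:}"
  if [ "$tag" = "$image" ]; then
    return 1   # no tag at all
  fi
  local re='^[0-9][^-]*--[A-Za-z0-9_]+$'
  [[ "$tag" =~ $re ]]
}

for image in \
  quay.io/biocontainers/samtools:1.17--hd87286a_2 \
  quay.io/biocontainers/samtools:latest \
  quay.io/biocontainers/samtools:1.17; do
  if is_pinned_tag "$image"; then
    echo "pinned:     $image"
  else
    echo "not pinned: $image"
  fi
done
```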
|
||||
## Version Detection
|
||||
|
||||
### Common Patterns
|
||||
|
||||
```bash
|
||||
# Pattern 1: Tool outputs "Tool version X.Y.Z"
|
||||
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
|
||||
|
||||
# Pattern 2: Tool outputs just version number
|
||||
echo "tool: $(tool --version 2>&1 | head -1)" > /var/software_versions.txt
|
||||
|
||||
# Pattern 3: Complex version output, extract numeric part
|
||||
tool --version 2>&1 | grep -E "^[0-9]" | head -1 | sed 's/^/tool: /' > /var/software_versions.txt
|
||||
|
||||
# Pattern 4: Version in specific format
|
||||
tool --version 2>&1 | awk '{print "tool: " $NF}' > /var/software_versions.txt
|
||||
```
|
||||
|
||||
### Real Examples
|
||||
|
||||
```bash
|
||||
# bowtie2
|
||||
bowtie2 --version 2>&1 | head -1 | sed 's/.*version /bowtie2: /' > /var/software_versions.txt
|
||||
|
||||
# samtools
|
||||
samtools --version 2>&1 | head -1 | sed 's/samtools /samtools: /' > /var/software_versions.txt
|
||||
|
||||
# fastqc
|
||||
fastqc --version 2>&1 | sed 's/FastQC v/fastqc: /' > /var/software_versions.txt
|
||||
```
|
||||
|
||||
### Testing Version Detection
|
||||
|
||||
Always test your version detection command:
|
||||
|
||||
```bash
|
||||
# Test in the container
|
||||
docker run quay.io/biocontainers/tool:version bash -c "
|
||||
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /'
|
||||
"
|
||||
```
|
||||
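Before pulling a container, the extraction patterns themselves can be dry-run against sample strings. The sketch below wraps two of the patterns in small functions that read from standard input. The function names and the sample version lines are made up for illustration and were not captured from real tools; `head -1` is added to the last-field pattern so that only the first line is used.

```bash
#!/usr/bin/env bash
# from_version_keyword <name>
# First line of stdin: "... version X.Y.Z" becomes "<name>: X.Y.Z".
from_version_keyword() {
  head -1 | sed "s/.*version /$1: /"
}

# from_last_field <name>
# First line of stdin: the last whitespace-separated field becomes the version.
from_last_field() {
  head -1 | awk -v name="$1" '{print name ": " $NF}'
}

# Illustrative sample lines standing in for real `tool --version` output.
echo "mytool version 1.2.3" | from_version_keyword mytool
echo "MyTool release 4.5.6" | from_last_field mytool
```

Once a pattern gives the expected line, the same pipeline can be placed in the engine's `run` list with `> /var/software_versions.txt` appended.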
|
||||
## Docker Run Syntax
|
||||
|
||||
### List vs Multiline Strings
|
||||
|
||||
**Preferred: List format**
|
||||
```yaml
run:
  # Single commands
  - command1 arg1 arg2
  - command2 arg1 arg2
  # Chained commands
  - command1 && command2 && command3
```
|
||||
|
||||
**Alternative: Multiline strings (for complex commands)**
|
||||
```yaml
run: |
  command1 arg1 arg2 && \
  command2 arg1 arg2 && \
  command3 arg1 arg2
```
|
||||
|
||||
**Important:** Comments inside multiline strings (`run: |`) become part of the Dockerfile `RUN` commands and will break the build. Place comments before the `run:` key, or use the list format.
|
||||
|
||||
## Custom Containers
|
||||
|
||||
### When to Use Custom Containers
|
||||
|
||||
Use custom containers when:
|
||||
- No suitable biocontainer exists
|
||||
- You need to install additional dependencies
|
||||
- You need a specific base environment (R, Python, etc.)
|
||||
|
||||
### Python-based Tools
|
||||
|
||||
```yaml
|
||||
engines:
  - type: docker
    image: python:3.10-slim
    setup:
      - type: python
        packages:
          - numpy~=x.x.x
          - pandas~=x.x.x
          - scipy~=x.x.x
|
||||
```
|
||||
|
||||
### R-based Tools
|
||||
|
||||
```yaml
|
||||
engines:
  - type: docker
    image: rocker/r2u:24.04
    setup:
      - type: r
        cran: [devtools, BiocManager]
        bioc: [Biostrings, GenomicRanges]
|
||||
```
|
||||
|
||||
### Compilation from Source
|
||||
|
||||
```yaml
|
||||
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: apt
        packages: [build-essential, cmake, git, wget]
      - type: docker
        run:
          - wget https://github.com/user/tool/archive/v1.0.tar.gz && tar -xzf v1.0.tar.gz
          - cd tool-1.0 && make && make install
          # Quoted because an unquoted YAML scalar cannot contain ": "
          - 'echo "tool: 1.0" > /var/software_versions.txt'
|
||||
```
|
||||
|
||||
## Recommended Base Containers
|
||||
|
||||
### General Purpose
|
||||
- **Ubuntu**: `ubuntu:22.04` - Good for compilation and apt packages
|
||||
- **Alpine**: `alpine:latest` - Minimal size, apk packages
|
||||
- **Debian**: `debian:bookworm-slim` - Stable, well-supported
|
||||
|
||||
### Language-Specific
|
||||
|
||||
#### Python
|
||||
```yaml
|
||||
# Basic Python
|
||||
image: python:3.10-slim
|
||||
|
||||
# With scientific packages
|
||||
image: python:3.10
|
||||
|
||||
# GPU-enabled
|
||||
image: nvcr.io/nvidia/pytorch:23.08-py3
|
||||
```
|
||||
|
||||
#### R
|
||||
```yaml
|
||||
# Fast package installation
|
||||
image: rocker/r2u:24.04
|
||||
|
||||
# Tidyverse included
|
||||
image: rocker/tidyverse:4.3.0
|
||||
|
||||
# Bioconductor base
|
||||
image: bioconductor/bioconductor_docker:RELEASE_3_17
|
||||
```
|
||||
|
||||
#### Node.js
|
||||
```yaml
|
||||
# LTS version
|
||||
image: node:18-slim
|
||||
|
||||
# Alpine variant
|
||||
image: node:18-alpine
|
||||
```
|
||||
|
||||
#### Other Languages
|
||||
```yaml
|
||||
# Java
|
||||
image: openjdk:11-jre-slim
|
||||
|
||||
# Go
|
||||
image: golang:1.20-alpine
|
||||
|
||||
# Rust
|
||||
image: rust:1.70-slim
|
||||
|
||||
# Ruby
|
||||
image: ruby:3.1-slim
|
||||
```
|
||||
|
||||
## Multi-tool Containers
|
||||
|
||||
### Installing Multiple Tools
|
||||
|
||||
```yaml
|
||||
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: apt
        packages: [wget, curl, build-essential]
      - type: docker
        run:
          # Install tool 1
          - wget https://tool1.com/download && install_tool1
          # Install tool 2
          - wget https://tool2.com/download && install_tool2
          # Create version file (quoted because the commands contain ": ")
          - 'echo "tool1: $(tool1 --version)" > /var/software_versions.txt'
          - 'echo "tool2: $(tool2 --version)" >> /var/software_versions.txt'
|
||||
```
|
||||
|
||||
## Container Optimization
|
||||
|
||||
### Layer Efficiency
|
||||
|
||||
```yaml
|
||||
# Good: Combine related commands
setup:
  - type: docker
    run: |
      apt-get update && \
      apt-get install -y wget curl && \
      wget https://tool.com/download && \
      install_tool && \
      apt-get clean && \
      rm -rf /var/lib/apt/lists/*

# Bad: Separate layers for each command
setup:
  - type: apt
    packages: [wget, curl]
  - type: docker
    run: wget https://tool.com/download
  - type: docker
    run: install_tool
  - type: docker
    run: apt-get clean
|
||||
```
|
||||
|
||||
## Testing Docker Setup
|
||||
|
||||
### Viash Docker Debugging
|
||||
|
||||
```bash
|
||||
# Inspect the generated Dockerfile
|
||||
viash run config.vsh.yaml -- ---dockerfile
|
||||
|
||||
# Build with cached layers (faster)
|
||||
viash run config.vsh.yaml -- ---setup cachedbuild ---verbose
|
||||
|
||||
# Build from scratch (clean build)
|
||||
viash run config.vsh.yaml -- ---setup build ---verbose
|
||||
|
||||
# Enter interactive debugging session
|
||||
viash run config.vsh.yaml -- ---debug
|
||||
|
||||
# Check installed tools (inside container)
|
||||
which tool
|
||||
tool --version
|
||||
|
||||
# Verify version file
|
||||
cat /var/software_versions.txt
|
||||
```
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **Command not found**: Tool not in PATH or not installed
|
||||
2. **Version detection fails**: Command syntax varies between tools
|
||||
3. **Permission issues**: Tools installed in wrong location
|
||||
4. **Missing dependencies**: Tool requires additional libraries
|
||||
434
docs/SCRIPT_DEVELOPMENT.md
Normal file
@@ -0,0 +1,434 @@
|
||||
# Script Development Guide
|
||||
|
||||
This guide covers best practices for writing runner scripts in biobox components.
|
||||
|
||||
## Table of Contents
|
||||
- [Script Structure and Template](#script-structure-and-template)
- [Code Style Guidelines](#code-style-guidelines)
|
||||
- [Key Principles](#key-principles)
|
||||
- [Real-World Example](#real-world-example)
|
||||
- [Advanced Patterns](#advanced-patterns)
|
||||
- [Common Pitfalls](#common-pitfalls)
|
||||
- [Testing Your Script](#testing-your-script)
|
||||
|
||||
## Script Structure and Template
|
||||
|
||||
All Viash component scripts follow a standard structure with best practices for error handling and parameter management.
|
||||
|
||||
### Basic Template
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# unset flags
|
||||
[[ "$par_option1" == "false" ]] && unset par_option1
|
||||
[[ "$par_option2" == "false" ]] && unset par_option2
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
  --input "$par_input"
  --output "$par_output"
  ${par_option1:+--option1}
  ${par_option2:+--option2}
  ${meta_cpus:+--threads "$meta_cpus"}
  ${meta_memory_gb:+--memory "${meta_memory_gb}G"}
)
|
||||
|
||||
# Execute command
|
||||
xxx "${cmd_args[@]}"
|
||||
```
|
||||
|
||||
### Understanding the Viash Code Block
|
||||
|
||||
The `## VIASH START` and `## VIASH END` comments mark a special placeholder block where Viash injects runtime parameters and metadata when the component is executed.
|
||||
|
||||
**At runtime**, Viash replaces this placeholder with:
|
||||
- `par_*` variables containing argument values (e.g., `par_input`, `par_output`)
|
||||
- `meta_*` variables containing runtime metadata (e.g., `meta_name`, `meta_cpus`, `meta_temp_dir`)
|
||||
|
||||
**For debugging**, you can put example code between these markers to test your script locally:
|
||||
|
||||
```bash
|
||||
## VIASH START
|
||||
par_input="test_input.txt"
|
||||
par_output="test_output.txt"
|
||||
par_verbose="true"
|
||||
meta_cpus="4"
|
||||
meta_memory_gb="8"
|
||||
meta_temp_dir="/tmp"
|
||||
## VIASH END
|
||||
```
|
||||
|
||||
This allows you to run your script directly with `bash script.sh` during development.
|
||||
|
||||
## Code Style Guidelines
|
||||
|
||||
### Indentation
|
||||
|
||||
**Use 2-space indentation consistently throughout your scripts:**
|
||||
|
||||
```bash
|
||||
# Correct - 2 spaces
unset_if_false=(
  par_verbose
  par_quiet
  par_force
)

for par in "${unset_if_false[@]}"; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset $par
done

cmd_args=(
  --input "$par_input"
  --output "$par_output"
  ${par_verbose:+--verbose}
)
|
||||
```
|
||||
|
||||
```bash
|
||||
# Incorrect - 4 spaces or tabs
unset_if_false=(
    par_verbose
    par_quiet
    par_force
)

for par in "${unset_if_false[@]}"; do
    test_val="${!par}"
    [[ "$test_val" == "false" ]] && unset $par
done
|
||||
```
|
||||
|
||||
**Why 2 spaces:**
|
||||
- Consistent with other biobox components
|
||||
- Better readability in terminal and code editors
|
||||
- Reduces line width for complex nested structures
|
||||
- Standard practice in many shell script projects
|
||||
|
||||
## Key Principles
|
||||
|
||||
### 1. Error Handling
|
||||
|
||||
Always use `set -eo pipefail`:
|
||||
- `set -e`: Exit immediately if a command exits with a non-zero status
|
||||
- `set -o pipefail`: Exit if any command in a pipeline fails
|
||||
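The effect of `pipefail` is easiest to see on a pipeline whose first command fails. This is a minimal sketch that runs the same pipeline in two child shells, so the settings of the current shell are not changed.

```bash
# Default behavior: only the last command's exit status counts
status_default=$(bash -c 'false | true; echo $?')

# With pipefail: the pipeline fails if any command in it fails
status_pipefail=$(bash -c 'set -o pipefail; false | true; echo $?')

echo "default: $status_default, pipefail: $status_pipefail"
# default: 0, pipefail: 1
```

Without `pipefail`, a failing tool piped into `head` or `gzip` would go unnoticed, and `set -e` alone would not stop the script.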
|
||||
### 2. Array-Based Arguments
|
||||
|
||||
**Preferred approach:**
|
||||
```bash
|
||||
cmd_args=(
|
||||
--input "$par_input"
|
||||
--output "$par_output"
|
||||
${par_option:+--option "$par_option"}
|
||||
)
|
||||
|
||||
xxx "${cmd_args[@]}"
|
||||
```
|
||||
|
||||
**Avoid repetitive appending:**
|
||||
```bash
|
||||
# Don't do this
|
||||
cmd_args+=("--input")
|
||||
cmd_args+=("$par_input")
|
||||
cmd_args+=("--output")
|
||||
cmd_args+=("$par_output")
|
||||
```
|
||||
|
||||
### 3. Conditional Parameter Inclusion
|
||||
|
||||
Use Bash parameter expansion for optional parameters:
|
||||
|
||||
```bash
|
||||
# Include parameter only if variable is set and not empty
|
||||
${meta_cpus:+--threads "$meta_cpus"}
|
||||
|
||||
# Include flag only if boolean is true (after unsetting false values)
|
||||
${par_verbose:+--verbose}
|
||||
```
|
||||
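As a quick check of how the `${var:+...}` expansion behaves inside an array, the sketch below builds an argument list where one optional value is set and two are not. The parameter values are made up for illustration.

```bash
# Example values (assumed); par_verbose and meta_memory_gb are deliberately unset
par_input="reads.fastq"
meta_cpus="4"
unset par_verbose meta_memory_gb

cmd_args=(
  --input "$par_input"
  ${meta_cpus:+--threads "$meta_cpus"}
  ${meta_memory_gb:+--memory "${meta_memory_gb}G"}
  ${par_verbose:+--verbose}
)

# One element per line: unset parameters contribute nothing, not even an empty string
printf '%s\n' "${cmd_args[@]}"
echo "number of arguments: ${#cmd_args[@]}"   # number of arguments: 4
```

Because the expansion is unquoted on the outside, an unset variable produces zero array elements, while a set variable produces the flag and its value as two separate elements.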
|
||||
### 4. Boolean Handling
|
||||
|
||||
Unset boolean parameters that are "false":
|
||||
|
||||
```bash
|
||||
# Single parameter
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
|
||||
# For multiple parameters, you can use either approach:
|
||||
|
||||
# Option 1: Individual approach (recommended for 1-4 parameters)
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
[[ "$par_quiet" == "false" ]] && unset par_quiet
|
||||
[[ "$par_force" == "false" ]] && unset par_force
|
||||
[[ "$par_recursive" == "false" ]] && unset par_recursive
|
||||
|
||||
# Option 2: Loop approach (recommended for 5+ parameters)
|
||||
unset_if_false=(
|
||||
par_verbose
|
||||
par_quiet
|
||||
par_force
|
||||
par_recursive
|
||||
par_follow_symlinks
|
||||
par_ignore_case
|
||||
par_preserve_permissions
|
||||
)
|
||||
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset $par
|
||||
done
|
||||
```
|
||||
|
||||
**When to use which approach:**
|
||||
|
||||
- **Individual approach**: Recommended for 1-4 boolean parameters, clearer and more direct
|
||||
- **Loop approach**: Recommended for many parameters (5+), reduces code duplication
|
||||
|
||||
The individual approach is preferred for fewer parameters because:
|
||||
- Each parameter is explicit and easy to find
|
||||
- No variable indirection complexity (`${!par}`)
|
||||
- Simple to add/remove individual parameters
|
||||
- More readable at a glance
|
||||
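The loop approach can be tried in isolation with made-up values. This sketch uses an `if` statement instead of `&&`, which is an assumption about style rather than the form used in biobox; it behaves the same but leaves the loop with a zero exit status.

```bash
# Example values as a boolean argument might provide them (assumed)
par_verbose="true"
par_quiet="false"
par_force="false"

for par in par_verbose par_quiet par_force; do
  # ${!par} reads the variable whose name is stored in $par
  if [[ "${!par}" == "false" ]]; then
    unset "$par"
  fi
done

flags=(
  ${par_verbose:+--verbose}
  ${par_quiet:+--quiet}
  ${par_force:+--force}
)

echo "${flags[@]}"   # --verbose
```

After the loop, only the parameters that were `true` remain set, so the `${par:+--flag}` expansions add exactly the requested flags.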
|
||||
### 5. Meta Variables Usage
|
||||
|
||||
**Important:** Never use `par_threads`, `par_cores`, `par_cpus`, or `par_memory` parameters. Use Viash's built-in meta variables instead.
|
||||
|
||||
**Available meta variables:**
|
||||
- `meta_cpus`: Number of CPU cores available
|
||||
- `meta_memory_*`: Memory limits in various units (b, kb, mb, gb, tb, pb, kib, mib, gib, tib, pib)
|
||||
- `meta_temp_dir`: Temporary directory for the component
|
||||
- `meta_resources_dir`: Path to component resources
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# CPU cores: omit the flag when unset, or fall back to a default
${meta_cpus:+--threads "$meta_cpus"}
--cores "${meta_cpus:-1}"

# Memory with unit suffix: omit the flag when unset, or fall back to a default
${meta_memory_gb:+--memory "${meta_memory_gb}G"}
--max-memory "${meta_memory_mb:-1024}M"
|
||||
|
||||
# Temporary directory
|
||||
--tmp-dir "${meta_temp_dir:-/tmp}"
|
||||
```
|
||||
|
||||
**Why use meta variables:**
|
||||
- Integrates seamlessly with workflow systems like Nextflow
|
||||
- Automatically managed by Viash runtime
|
||||
- Consistent across all components
|
||||
- Prevents parameter duplication and conflicts
|
||||
|
||||
For complete details, see [Viash Variables Documentation](https://viash.io/guide/component/variables.html).
|
||||
|
||||
### 6. Proper Quoting
|
||||
|
||||
Always quote variables that might contain spaces or special characters:
|
||||
|
||||
```bash
|
||||
# Correct
|
||||
--input "$par_input"
|
||||
--output "$par_output"
|
||||
|
||||
# @Q adds literal shell quoting; use it only when the value is re-parsed by a shell (eval, bash -c)
|
||||
--pattern "${par_pattern@Q}"
|
||||
```
|
||||
|
||||
### 7. Multiple Parameter Values
|
||||
|
||||
When using arguments with `multiple: true` in your Viash configuration, values are passed as semicolon-separated strings that need to be split into bash arrays.
|
||||
|
||||
#### In script.sh - Converting to Arrays
|
||||
|
||||
```bash
|
||||
# Convert semicolon-separated values to bash array
|
||||
IFS=';' read -ra files_array <<< "$par_files"
|
||||
|
||||
# Example: Use in command arguments
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
-files "${files_array[@]}"
|
||||
-o "$par_output"
|
||||
)
|
||||
|
||||
# Execute command
|
||||
bedtools annotate "${cmd_args[@]}"
|
||||
```
|
||||
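To see what the `IFS=';' read -ra` split produces, it can be run on a sample value outside of any component. The file names below are invented for illustration; one of them contains a space to show that elements stay intact.

```bash
# Example value in the semicolon-separated form described above (assumed)
par_files="db1.bed;my data/db2.bed;db3.bed"

# IFS is changed only for this one read command
IFS=';' read -ra files_array <<< "$par_files"

echo "count: ${#files_array[@]}"   # count: 3
for f in "${files_array[@]}"; do
  echo "file: $f"
done
```

Quoting the array as `"${files_array[@]}"` keeps `my data/db2.bed` as a single argument when it is passed on to the tool.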
|
||||
#### In test.sh - Passing Multiple Values
|
||||
|
||||
When testing components with `multiple: true` parameters, you can use either format:
|
||||
|
||||
```bash
|
||||
# Method 1: Repeated flags (recommended for readability)
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/query.bed" \
|
||||
--files "$meta_temp_dir/db1.bed" \
|
||||
--files "$meta_temp_dir/db2.bed" \
|
||||
--output "$meta_temp_dir/result.bed"
|
||||
|
||||
# Method 2: Semicolon-separated values
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/query.bed" \
|
||||
--files "$meta_temp_dir/db1.bed;$meta_temp_dir/db2.bed" \
|
||||
--output "$meta_temp_dir/result.bed"
|
||||
```
|
||||
|
||||
Both methods work identically - Viash automatically converts repeated flags to semicolon-separated strings internally.
|
||||
|
||||
#### Complete Example
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# Convert semicolon-separated files to array
|
||||
IFS=';' read -ra files_array <<< "$par_files"
|
||||
|
||||
# Convert semicolon-separated names to array if provided
|
||||
if [[ -n "${par_names}" ]]; then
|
||||
IFS=';' read -ra names_array <<< "$par_names"
|
||||
fi
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
${par_names:+-names "${names_array[@]}"}
|
||||
-files "${files_array[@]}"
|
||||
)
|
||||
|
||||
# Execute command
|
||||
bedtools annotate "${cmd_args[@]}" > "$par_output"
|
||||
```
|
||||
|
||||
## Real-World Example
|
||||
|
||||
Here's an example from the bowtie2_build component:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# unset flags
|
||||
[[ "$par_large_index" == "false" ]] && unset par_large_index
|
||||
[[ "$par_noauto" == "false" ]] && unset par_noauto
|
||||
[[ "$par_packed" == "false" ]] && unset par_packed
|
||||
|
||||
# Create output directory
|
||||
mkdir -p "$par_output"
|
||||
|
||||
# Determine index basename
|
||||
if [ -n "$par_index_name" ]; then
|
||||
index_basename="$par_index_name"
|
||||
else
|
||||
index_basename=$(basename "$par_input" .fasta)
|
||||
fi
|
||||
|
||||
# Build command arguments
|
||||
cmd_args=(
|
||||
${par_fasta:+-f}
|
||||
${par_cmdline:+-c}
|
||||
${par_large_index:+--large-index}
|
||||
${par_noauto:+-a}
|
||||
${par_packed:+-p}
|
||||
${par_bmax:+--bmax "$par_bmax"}
|
||||
${par_offrate:+-o "$par_offrate"}
|
||||
"$par_input"
|
||||
"$par_output/$index_basename"
|
||||
)
|
||||
|
||||
# Execute bowtie2-build
|
||||
bowtie2-build "${cmd_args[@]}"
|
||||
```
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Multiple Input Handling
|
||||
|
||||
If your tool accepts multiple inputs with custom separators:
|
||||
|
||||
```bash
|
||||
# Convert Viash's semicolon separator to comma
|
||||
par_disable_filters=$(echo "$par_disable_filters" | tr ';' ',')
|
||||
|
||||
cmd_args=(
|
||||
--disable-filters "$par_disable_filters"
|
||||
)
|
||||
```
|
||||
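The same conversion can also be written with Bash pattern substitution, which avoids spawning `echo` and `tr`. This is an alternative sketch, not the form used in the example above, and the filter names are made up.

```bash
# Example value (assumed)
par_disable_filters="low_quality;duplicates;adapter"

# ${var//pattern/replacement} replaces every occurrence of ';' with ','
filters_csv="${par_disable_filters//;/,}"

echo "$filters_csv"   # low_quality,duplicates,adapter
```

Both forms give the same result; the parameter expansion is simply a pure-Bash variant.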
|
||||
### Complex File Handling
|
||||
|
||||
```bash
|
||||
# Ensure output directory exists
|
||||
mkdir -p "$(dirname "$par_output")"
|
||||
|
||||
# Handle relative paths
|
||||
input_path=$(realpath "$par_input")
|
||||
output_path=$(realpath "$par_output")
|
||||
```
|
||||
|
||||
### Resource Management
|
||||
|
||||
```bash
|
||||
# Use available resources
|
||||
cmd_args=(
|
||||
${meta_cpus:+--threads "$meta_cpus"}
|
||||
${meta_memory_mb:+--memory "${meta_memory_mb}M"}
|
||||
)
|
||||
```
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Unquoted Variables
|
||||
```bash
|
||||
# Wrong - can break with spaces
|
||||
cmd_args=(--input $par_input)
|
||||
|
||||
# Correct
|
||||
cmd_args=(--input "$par_input")
|
||||
```
|
||||
|
||||
### 2. Improper Boolean Handling
|
||||
```bash
|
||||
# Wrong - will include false booleans
|
||||
cmd_args=(${par_verbose:+--verbose})
|
||||
|
||||
# Correct - unset false values first
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
cmd_args=(${par_verbose:+--verbose})
|
||||
```
|
||||
|
||||
### 3. Array Expansion
|
||||
```bash
|
||||
# Wrong - expands only the first array element
|
||||
tool $cmd_args
|
||||
|
||||
# Correct - expands array elements
|
||||
tool "${cmd_args[@]}"
|
||||
```
|
||||
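The difference between the two expansions can be verified by counting the arguments a command actually receives. `count_args` is a hypothetical helper defined only for this sketch.

```bash
cmd_args=(--input "my file.txt" --verbose)

# Hypothetical helper: prints how many arguments it was called with
count_args() { echo "$#"; }

wrong=$(count_args $cmd_args)            # $cmd_args is only the first element
right=$(count_args "${cmd_args[@]}")     # one argument per array element

echo "wrong: $wrong, right: $right"      # wrong: 1, right: 3
```

With `$cmd_args` the tool would be called with `--input` alone, so the remaining arguments are silently dropped.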
|
||||
## Testing Your Script
|
||||
|
||||
Always test your script with:
|
||||
- Empty/missing optional parameters
|
||||
- Parameters with spaces
|
||||
- Boolean true/false values
|
||||
- Edge cases specific to your tool
|
||||
|
||||
See the [Testing Guide](TESTING.md) for testing best practices.
|
||||
536
docs/TESTING.md
Normal file
@@ -0,0 +1,536 @@
|
||||
# Testing Guide
|
||||
|
||||
This guide covers best practices for writing comprehensive test scripts for biobox components.
|
||||
|
||||
> **📌 Important:** All new test scripts should use the **centralized test helpers** located at `src/_utils/test_helpers.sh`. This eliminates code duplication and ensures consistency across all components.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Core Principles](#core-principles)
|
||||
- [Test Script Structure](#test-script-structure)
|
||||
- [Centralized Test Helpers](#centralized-test-helpers)
|
||||
- [Test Scenarios](#test-scenarios)
|
||||
- [Best Practices](#best-practices)
|
||||
- [Viash Testing Features](#viash-testing-features)
|
||||
- [Static Test Data](#static-test-data)
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Generate Test Data in Scripts
|
||||
|
||||
**Preferred approach:** Generate test data within the test script using the centralized helper functions.
|
||||
|
||||
```bash
|
||||
# Generate test data using centralized helpers
|
||||
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
|
||||
create_test_fastq "$meta_temp_dir/reads.fastq" 10 35
|
||||
```
|
||||
|
||||
**Avoid:**
|
||||
- Storing static test files in the repository
|
||||
- Fetching test data from external sources
|
||||
- Large test datasets
|
||||
|
||||
### 2. Self-Contained Tests
|
||||
|
||||
Tests should be completely self-contained and not depend on external resources:
|
||||
|
||||
```yaml
|
||||
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: /src/_utils/test_helpers.sh
|
||||
```
|
||||
|
||||
Only add static test files if absolutely necessary:
|
||||
|
||||
```yaml
|
||||
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: /src/_utils/test_helpers.sh
  - type: file
    path: test_data  # Only if data generation is impractical
|
||||
```
|
||||
|
||||
## Test Script Structure
|
||||
|
||||
### Configuration Setup
|
||||
|
||||
Add the test helpers as a resource in your component configuration:
|
||||
|
||||
```yaml
|
||||
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: /src/_utils/test_helpers.sh
|
||||
```
|
||||
|
||||
### Basic Test Template
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# --- Test Case 1: Basic functionality ---
|
||||
log "Starting TEST 1: Basic functionality"
|
||||
|
||||
# Create and validate test data
|
||||
test_data_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_data_dir"
|
||||
create_test_fasta "$test_data_dir/input.fasta" 3 50
|
||||
check_file_exists "$test_data_dir/input.fasta" "input FASTA file"
|
||||
|
||||
log "Executing $meta_name with basic parameters..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data_dir/input.fasta" \
|
||||
--output "$meta_temp_dir/test1"
|
||||
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_dir_exists "$meta_temp_dir/test1" "output directory"
|
||||
check_file_exists "$meta_temp_dir/test1/result.txt" "result file"
|
||||
check_file_not_empty "$meta_temp_dir/test1/result.txt" "result file"
|
||||
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# --- Test Case 2: Advanced parameters ---
|
||||
log "Starting TEST 2: Advanced parameters"
|
||||
|
||||
# Create different test data
|
||||
create_test_fastq "$test_data_dir/input.fastq" 10 35
|
||||
check_file_exists "$test_data_dir/input.fastq" "input FASTQ file"
|
||||
|
||||
log "Executing $meta_name with advanced parameters..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data_dir/input.fastq" \
|
||||
--output "$meta_temp_dir/test2" \
|
||||
--threads 2 \
|
||||
--verbose
|
||||
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/test2/advanced_result.txt" "advanced result file"
|
||||
check_file_contains "$meta_temp_dir/test2/advanced_result.txt" "expected_pattern" "advanced result file"
|
||||
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
print_test_summary "All tests completed successfully"
|
||||
```
|
||||
|
||||
## Centralized Test Helpers
|
||||
|
||||
The centralized test helpers located at `src/_utils/test_helpers.sh` provide comprehensive testing functionality to ensure consistency across all biobox components.
|
||||
|
||||
### Available Functions
|
||||
|
||||
#### Logging Functions
|
||||
- `log "message"` - Log with timestamp
|
||||
- `log_warn "message"` - Warning message
|
||||
- `log_error "message"` - Error message
|
||||
|
||||
#### File/Directory Validation
|
||||
- `check_file_exists path "description"` - Verify file exists
|
||||
- `check_dir_exists path "description"` - Verify directory exists
|
||||
- `check_file_not_exists path "description"` - Verify file doesn't exist
|
||||
- `check_dir_not_exists path "description"` - Verify directory doesn't exist
|
||||
- `check_file_empty path "description"` - Verify file is empty
|
||||
- `check_file_not_empty path "description"` - Verify file is not empty
|
||||
|
||||
#### Content Validation
|
||||
- `check_file_contains path "text" "description"` - Verify file contains text
|
||||
- `check_file_not_contains path "text" "description"` - Verify file doesn't contain text
|
||||
- `check_file_matches_regex path "pattern" "description"` - Verify file matches regex
|
||||
- `check_file_line_count path count "description"` - Verify line count
|
||||
|
||||
#### Test Data Generation
|
||||
- `create_test_fasta path [num_seqs] [seq_length]` - Generate FASTA file
|
||||
- `create_test_fastq path [num_reads] [read_length]` - Generate FASTQ file
|
||||
- `create_test_gtf path [num_genes]` - Generate GTF file
|
||||
- `create_test_gff path [num_features]` - Generate GFF file
|
||||
- `create_test_bed path [num_intervals]` - Generate BED file
|
||||
- `create_test_csv path [num_rows]` - Generate CSV file
|
||||
- `create_test_tsv path [num_rows]` - Generate TSV file
|
||||
|
||||
#### Utility Functions
|
||||
- `setup_test_env` - Initialize test environment with strict error handling
|
||||
- `print_test_summary "test_name"` - Print completion message
|
||||
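To give a feel for how such helpers work, here is a minimal sketch of three of them. The real implementations in `src/_utils/test_helpers.sh` are not shown in this guide, so the bodies below are assumptions; in particular, these versions `return 1` on failure, whereas the real helpers may exit the test script.

```bash
# Illustrative re-implementations (assumed, not the actual biobox helpers)
log() { echo "$(date '+%Y-%m-%d %H:%M:%S') [TEST] $*"; }
log_error() { log "ERROR: $*" >&2; }

check_file_exists() {
  local path="$1" description="${2:-file}"
  if [[ ! -f "$path" ]]; then
    log_error "Missing $description: $path"
    return 1
  fi
  log "Found $description: $path"
}

check_file_contains() {
  local path="$1" text="$2" description="${3:-file}"
  # -F matches the text literally, -q suppresses grep output
  if ! grep -qF -- "$text" "$path"; then
    log_error "$description does not contain '$text': $path"
    return 1
  fi
  log "$description contains '$text'"
}

# Usage on a throwaway file
demo_file=$(mktemp)
echo "Number of sequences: 3" > "$demo_file"
check_file_exists "$demo_file" "demo output"
check_file_contains "$demo_file" "Number of sequences" "demo output"
rm -f "$demo_file"
```

Passing a description as the last argument is what makes failure messages readable in the test log.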
|
||||
### Usage Example
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source centralized helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
setup_test_env
|
||||
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# Generate test data
|
||||
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
|
||||
check_file_exists "$meta_temp_dir/input.fasta" "input FASTA file"
|
||||
|
||||
# Run component
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/input.fasta" \
|
||||
--output "$meta_temp_dir/output.txt"
|
||||
|
||||
# Validate output
|
||||
check_file_exists "$meta_temp_dir/output.txt" "result file"
|
||||
check_file_contains "$meta_temp_dir/output.txt" "expected_pattern" "result file"
|
||||
|
||||
print_test_summary "Basic functionality test"
|
||||
```
|
||||
|
||||
## Test Scenarios
|
||||
|
||||
### 1. Basic Functionality
|
||||
|
||||
Test the component with minimal, essential parameters:
|
||||
|
||||
```bash
|
||||
log "Starting TEST 1: Basic functionality"
|
||||
|
||||
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/input.fasta" \
|
||||
--output "$meta_temp_dir/output.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output.txt" "output file"
|
||||
check_file_not_empty "$meta_temp_dir/output.txt" "output file"
|
||||
|
||||
log "✅ TEST 1 completed successfully"
|
||||
```
|
||||
|
||||
### 2. Multiple Input Files
|
||||
|
||||
Test with multiple input files or complex input scenarios:
|
||||
|
||||
```bash
|
||||
log "Starting TEST 2: Multiple input files"
|
||||
|
||||
create_test_fasta "$meta_temp_dir/input1.fasta" 2 30
|
||||
create_test_fasta "$meta_temp_dir/input2.fasta" 2 30
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/input1.fasta;$meta_temp_dir/input2.fasta" \
|
||||
--output "$meta_temp_dir/output.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output.txt" "merged output file"
|
||||
|
||||
log "✅ TEST 2 completed successfully"
|
||||
```
|
||||
|
||||
### 3. Optional Parameters
|
||||
|
||||
Test with optional parameters and advanced features:
|
||||
|
||||
```bash
|
||||
log "Starting TEST 3: Optional parameters"
|
||||
|
||||
create_test_fastq "$meta_temp_dir/input.fastq" 10 35
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/input.fastq" \
|
||||
--output "$meta_temp_dir/output.txt" \
|
||||
--threads 2 \
|
||||
--verbose
|
||||
|
||||
check_file_exists "$meta_temp_dir/output.txt" "output file with options"
|
||||
check_file_contains "$meta_temp_dir/output.txt" "verbose" "verbose output"
|
||||
|
||||
log "✅ TEST 3 completed successfully"
|
||||
```
|
||||
|
||||
### 4. Edge Cases
|
||||
|
||||
Test with edge cases like empty files or unusual inputs:
|
||||
|
||||
```bash
|
||||
log "Starting TEST 4: Edge case - empty input"
|
||||
|
||||
# Create empty input file
|
||||
touch "$meta_temp_dir/empty.fasta"
|
||||
|
||||
# Test should handle empty input gracefully
|
||||
if "$meta_executable" \
|
||||
--input "$meta_temp_dir/empty.fasta" \
|
||||
--output "$meta_temp_dir/output.txt" 2>/dev/null; then
|
||||
log_warn "Component succeeded with empty input - checking output"
|
||||
check_file_exists "$meta_temp_dir/output.txt" "output file for empty input"
|
||||
else
|
||||
log "Expected behavior: Component properly rejected empty input"
|
||||
fi
|
||||
|
||||
log "✅ TEST 4 completed successfully"
|
||||
```
|
||||
|
||||
### 5. Error Handling
|
||||
|
||||
Test proper error handling for invalid inputs:
|
||||
|
||||
```bash
|
||||
log "Starting TEST 5: Error handling"
|
||||
|
||||
# Test with non-existent input file
|
||||
if "$meta_executable" \
|
||||
--input "/non/existent/file.txt" \
|
||||
--output "$meta_temp_dir/output.txt" 2>/dev/null; then
|
||||
log_error "Component should have failed with non-existent input"
|
||||
exit 1
|
||||
else
|
||||
log "✅ Component properly handled non-existent input file"
|
||||
fi
|
||||
|
||||
log "✅ TEST 5 completed successfully"
|
||||
```
|
||||
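When several failure cases are tested, the `if ...; then exit 1; else ...; fi` pattern can be wrapped in a small function. `expect_failure` is a hypothetical helper written for this sketch and is not part of the centralized test helpers listed above.

```bash
# Hypothetical helper: succeeds only if the given command fails
expect_failure() {
  local description="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "FAIL: $description (command unexpectedly succeeded)" >&2
    return 1
  fi
  echo "OK: $description"
}

# Usage: `cat` stands in for "$meta_executable" in this standalone example
expect_failure "non-existent input is rejected" cat /non/existent/file.txt
```

Running the command inside an `if` condition also keeps `set -e` from aborting the test script when the expected failure occurs.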
|
||||
## Best Practices
|
||||
|
||||
### 1. Use Centralized Test Helpers
|
||||
|
||||
Always use the centralized test helpers instead of defining functions individually:
|
||||
|
||||
```bash
|
||||
# ✅ Recommended: Use centralized helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
setup_test_env
|
||||
|
||||
# ❌ NOT recommended: Defining functions individually
|
||||
set -euo pipefail
|
||||
log() { echo "$(date '+%Y-%m-%d %H:%M:%S') [TEST] $*"; }
|
||||
```
|
||||
|
||||
### 2. Strict Error Handling
|
||||
|
||||
The centralized helpers automatically provide strict error handling via `setup_test_env`:
|
||||
|
||||
```bash
|
||||
# Automatically enabled by setup_test_env:
|
||||
set -euo pipefail # Exit on errors, undefined variables, pipe failures
|
||||
export LC_ALL=C # Consistent locale for reproducible results
|
||||
```
|
||||
|
||||
### 3. Descriptive Validation
|
||||
|
||||
Use descriptive validation functions with meaningful descriptions:
|
||||
|
||||
```bash
|
||||
# ✅ Good: Descriptive validation
|
||||
check_file_exists "$output_file" "filtered feature matrix"
|
||||
check_file_not_exists "$bam_file" "BAM file (should be disabled by default)"
|
||||
check_file_contains "$result_file" "expected_pattern" "analysis results"
|
||||
|
||||
# ❌ Less helpful: Basic validation without context
|
||||
check_file_exists "$output_file"
|
||||
```
|
||||
|
||||
### 4. Organized Structure
|
||||
|
||||
Use `$meta_temp_dir` and create organized test structure:
|
||||
|
||||
```bash
|
||||
# Create organized test structure
|
||||
test_data_dir="$meta_temp_dir/test_data"
|
||||
test_output_dir="$meta_temp_dir/test_output"
|
||||
mkdir -p "$test_data_dir" "$test_output_dir"
|
||||
|
||||
create_test_fasta "$test_data_dir/input.fasta" 3 50
|
||||
```
|
||||
|
||||
### 5. Clear Test Output
|
||||
|
||||
Use consistent logging with clear test boundaries:
|
||||
|
||||
```bash
|
||||
log "Starting TEST 1: Basic functionality"
|
||||
log "Executing $meta_name with basic parameters..."
|
||||
log "Validating TEST 1 outputs..."
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# Final summary
|
||||
print_test_summary "All tests completed successfully"
|
||||
```
|
||||
|
||||
### 6. Comprehensive Content Validation
|
||||
|
||||
Don't just check that files exist - validate their content:
|
||||
|
||||
```bash
|
||||
# Check existence and content
|
||||
check_file_exists "$meta_temp_dir/output.txt" "analysis results"
|
||||
check_file_not_empty "$meta_temp_dir/output.txt" "analysis results"
|
||||
check_file_contains "$meta_temp_dir/output.txt" "Number of sequences" "result summary"
|
||||
check_file_line_count "$meta_temp_dir/output.txt" 10 "expected number of results"
|
||||
```
|
||||
|
||||
### 7. Multiple Test Scenarios
|
||||
|
||||
Include comprehensive test coverage:
|
||||
|
||||
```bash
|
||||
# Test 1: Basic functionality
|
||||
log "Starting TEST 1: Basic functionality"
|
||||
# ... test implementation ...
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# Test 2: Advanced options
|
||||
log "Starting TEST 2: Advanced options"
|
||||
# ... test implementation ...
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
# Test 3: Edge cases
|
||||
log "Starting TEST 3: Edge case handling"
|
||||
# ... test implementation ...
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
print_test_summary "All tests completed successfully"
|
||||
```
|
||||
|
||||
## Viash Testing Features
|
||||
|
||||
### Running Tests
|
||||
|
||||
```bash
|
||||
# Test a single component
|
||||
viash test config.vsh.yaml
|
||||
|
||||
# Test with specific resources
|
||||
viash test config.vsh.yaml --cpus 4 --memory 8GB
|
||||
|
||||
# Test with specific setup strategy
|
||||
viash test config.vsh.yaml --setup build --verbose
|
||||
|
||||
# Keep temporary files for debugging
|
||||
viash test config.vsh.yaml --keep true
|
||||
|
||||
# Test all components in parallel
|
||||
viash ns test --parallel
|
||||
|
||||
# Test specific namespace
|
||||
viash ns test -q alignment --parallel
|
||||
```
|
||||
|
||||
### Test Execution Flow
|
||||
|
||||
When running `viash test`, Viash automatically:
|
||||
|
||||
1. **Creates temporary directory** (available as `$meta_temp_dir`)
|
||||
2. **Builds the main executable**
|
||||
3. **Builds/pulls Docker image** (if using Docker engine)
|
||||
4. **Iterates over all test scripts** in `test_resources`
|
||||
5. **Builds each test into executable** and runs it
|
||||
6. **Cleans up** temporary files (unless `--keep true`)
|
||||
7. **Returns exit code 0** if all tests succeed
|
||||
|
||||
### Meta Variables in Tests
|
||||
|
||||
Your test scripts automatically have access to important meta variables:
|
||||
|
||||
- `$meta_executable` - Path to the built component executable
|
||||
- `$meta_temp_dir` - Temporary directory for test files (cleaned up automatically unless `--keep true` is passed)
|
||||
- `$meta_name` - Component name for logging
|
||||
- `$meta_resources_dir` - Path to test resources
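
As a minimal sketch of how a test script might use these variables: the fallback values below are assumptions, added only so the script can also be run by hand while debugging. Viash sets the real values when it runs the test.

```bash
#!/bin/bash
set -euo pipefail

# Fallbacks are assumptions for manual runs; under `viash test` these
# variables are already defined and the defaults are never used.
meta_name="${meta_name:-my_component}"
meta_temp_dir="${meta_temp_dir:-$(mktemp -d)}"
meta_resources_dir="${meta_resources_dir:-$PWD}"
meta_executable="${meta_executable:-$meta_resources_dir/$meta_name}"

# Keep all files for this test in a dedicated subdirectory of the temp dir.
work_dir="$meta_temp_dir/${meta_name}_test"
mkdir -p "$work_dir"

echo "[$meta_name] working directory: $work_dir"
echo "[$meta_name] executable: $meta_executable"
```

The `${var:-default}` form leaves a variable untouched when it is already set, so the same script behaves identically inside and outside Viash apart from the paths.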
|
||||
|
||||
### Multiple Test Scripts
|
||||
|
||||
You can add multiple test scripts to cover different scenarios:
|
||||
|
||||
```yaml
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test_basic.sh
|
||||
- type: bash_script
|
||||
path: test_edge_cases.sh
|
||||
- type: bash_script
|
||||
path: test_large_data.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
```
|
||||
|
||||
### Advanced Testing Options
|
||||
|
||||
```bash
|
||||
# Test with different container setup strategies
|
||||
viash test config.vsh.yaml --setup cachedbuild # Use cached layers (faster)
|
||||
viash test config.vsh.yaml --setup build # Clean build from scratch
|
||||
viash test config.vsh.yaml --setup alwaysbuild # Always rebuild container
|
||||
|
||||
# Test with configuration modifications
|
||||
viash test config.vsh.yaml -c '.engines[0].image = "ubuntu:22.04"'
|
||||
|
||||
# Test with debug mode for troubleshooting
|
||||
viash test config.vsh.yaml --keep true --verbose
|
||||
```
|
||||
|
||||
For more details, see the [Viash Unit Testing Documentation](https://viash.io/guide/component/unit-testing.html).
|
||||
|
||||
## Static Test Data
|
||||
|
||||
### When to Use Static Test Data
|
||||
|
||||
Only use static test files when:
|
||||
|
||||
- The tool requires very specific, complex file formats that are difficult to generate
|
||||
- Generating equivalent test data is impractical or overly complex
|
||||
- You need real-world data to validate complex algorithms
|
||||
- Test data is very small (<1KB preferred, <10KB maximum)
|
||||
|
||||
### Guidelines for Static Test Data
|
||||
|
||||
If you must use static test data:
|
||||
|
||||
1. **Keep files small** - Prefer <1KB, maximum <10KB
|
||||
2. **Document the source** - How was it created?
|
||||
3. **Use minimal examples** - Strip down to essential features
|
||||
4. **Consider alternatives** - Can you generate equivalent data?
|
||||
|
||||
```bash
|
||||
# test_data/README.md
|
||||
# Test data for complex_tool component
|
||||
# Source: https://github.com/example/dataset
|
||||
# Generated with: tool --export-sample --format minimal
|
||||
# Date: 2025-01-01
|
||||
# Size: 847 bytes
|
||||
# Purpose: Tests complex file format parsing
|
||||
```
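
The size guideline can be checked mechanically. This is a sketch, assuming a 10 KB limit expressed as 10240 bytes. `find_oversized_test_data` is a hypothetical helper and the demo files are generated only for illustration.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper (not part of test_helpers.sh): print every file under
# a directory that is larger than max_bytes (default 10 KB = 10240 bytes).
find_oversized_test_data() {
  local data_dir="$1"
  local max_bytes="${2:-10240}"
  # With the "c" suffix, find compares sizes in bytes; "+" means "greater than".
  find "$data_dir" -type f -size +"${max_bytes}c"
}

# Demo on a throwaway directory with one small and one oversized file.
demo_data=$(mktemp -d)
head -c 847 /dev/zero > "$demo_data/sample.complex"
head -c 20000 /dev/zero > "$demo_data/too_big.bin"

oversized=$(find_oversized_test_data "$demo_data")
if [[ -n "$oversized" ]]; then
  echo "Files above the size limit:"
  echo "$oversized"
fi
```

Run against a component's `test_data` directory, an empty result means every static file is within the limit.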
|
||||
|
||||
### Referencing Static Test Data
|
||||
|
||||
```yaml
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
```
|
||||
|
||||
```bash
|
||||
# In your test script
|
||||
static_data="$meta_resources_dir/test_data/sample.complex"
|
||||
check_file_exists "$static_data" "static test data"
|
||||
|
||||
"$meta_executable" --input "$static_data" --output "$meta_temp_dir/output.txt"
|
||||
```
|
||||
410
src/_utils/test_helpers.sh
Normal file
@@ -0,0 +1,410 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Test Helper Functions for Biobox Components
|
||||
#
|
||||
# This file provides standardized helper functions for component testing.
|
||||
# Source this file in your test scripts with:
|
||||
# source "$meta_resources_dir/test_helpers.sh"
|
||||
#
|
||||
# Usage examples:
|
||||
# log "Starting test execution"
|
||||
# check_file_exists "$output" "result file"
|
||||
# check_file_not_exists "$bam_file" "BAM file (disabled by default)"
|
||||
# create_test_fasta "$temp_dir/input.fasta" 3 50
|
||||
#
|
||||
|
||||
#############################################
|
||||
# Logging Functions
|
||||
#############################################
|
||||
|
||||
# Log messages with timestamps and consistent formatting
|
||||
log() {
|
||||
echo "$(date '+%Y-%m-%d %H:%M:%S') [TEST] $*"
|
||||
}
|
||||
|
||||
# Log informational messages (alias for log)
|
||||
log_info() {
|
||||
log "$*"
|
||||
}
|
||||
|
||||
# Log warning messages
|
||||
log_warn() {
|
||||
echo "$(date '+%Y-%m-%d %H:%M:%S') [WARN] $*"
|
||||
}
|
||||
|
||||
# Log error messages
|
||||
log_error() {
|
||||
echo "$(date '+%Y-%m-%d %H:%M:%S') [ERROR] $*" >&2
|
||||
}
|
||||
|
||||
#############################################
|
||||
# File and Directory Validation Functions
|
||||
#############################################
|
||||
|
||||
# Check if a file exists with descriptive logging
|
||||
# Usage: check_file_exists "/path/to/file" "optional description"
|
||||
check_file_exists() {
|
||||
local file_path="$1"
|
||||
local description="${2:-File}"
|
||||
|
||||
if [[ -f "$file_path" ]]; then
|
||||
log "✓ Found $description: $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description does not exist: $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a directory exists with descriptive logging
|
||||
# Usage: check_dir_exists "/path/to/dir" "optional description"
|
||||
check_dir_exists() {
|
||||
local dir_path="$1"
|
||||
local description="${2:-Directory}"
|
||||
|
||||
if [[ -d "$dir_path" ]]; then
|
||||
log "✓ Found $description: $dir_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description does not exist: $dir_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a file does NOT exist (useful for testing disabled features)
|
||||
# Usage: check_file_not_exists "/path/to/file" "optional description"
|
||||
check_file_not_exists() {
|
||||
local file_path="$1"
|
||||
local description="${2:-File}"
|
||||
|
||||
if [[ ! -f "$file_path" ]]; then
|
||||
log "✓ Confirmed $description does not exist (as expected): $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description exists but shouldn't: $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a directory does NOT exist (useful for testing disabled features)
|
||||
# Usage: check_dir_not_exists "/path/to/dir" "optional description"
|
||||
check_dir_not_exists() {
|
||||
local dir_path="$1"
|
||||
local description="${2:-Directory}"
|
||||
|
||||
if [[ ! -d "$dir_path" ]]; then
|
||||
log "✓ Confirmed $description does not exist (as expected): $dir_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description exists but shouldn't: $dir_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a file is not empty
|
||||
# Usage: check_file_not_empty "/path/to/file" "optional description"
|
||||
check_file_not_empty() {
|
||||
local file_path="$1"
|
||||
local description="${2:-File}"
|
||||
|
||||
if [[ -s "$file_path" ]]; then
|
||||
log "✓ $description is not empty: $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description is empty but shouldn't be: $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a file is empty
|
||||
# Usage: check_file_empty "/path/to/file" "optional description"
|
||||
check_file_empty() {
|
||||
local file_path="$1"
|
||||
local description="${2:-File}"
|
||||
|
||||
if [[ ! -s "$file_path" ]]; then
|
||||
log "✓ $description is empty (as expected): $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description is not empty but should be: $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
#############################################
|
||||
# Content Validation Functions
|
||||
#############################################
|
||||
|
||||
# Check if a file contains specific text (grep treats search_text as a basic regular expression)
|
||||
# Usage: check_file_contains "/path/to/file" "search_text" "optional description"
|
||||
check_file_contains() {
|
||||
local file_path="$1"
|
||||
local search_text="$2"
|
||||
local description="${3:-File}"
|
||||
|
||||
if grep -q "$search_text" "$file_path" 2>/dev/null; then
|
||||
log "✓ $description contains expected text '$search_text': $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description does not contain '$search_text': $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a file does NOT contain specific text (grep treats search_text as a basic regular expression)
|
||||
# Usage: check_file_not_contains "/path/to/file" "search_text" "optional description"
|
||||
check_file_not_contains() {
|
||||
local file_path="$1"
|
||||
local search_text="$2"
|
||||
local description="${3:-File}"
|
||||
|
||||
if ! grep -q "$search_text" "$file_path" 2>/dev/null; then
|
||||
log "✓ $description does not contain '$search_text' (as expected): $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description contains '$search_text' but shouldn't: $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a file matches a regex pattern
|
||||
# Usage: check_file_matches_regex "/path/to/file" "regex_pattern" "optional description"
|
||||
check_file_matches_regex() {
|
||||
local file_path="$1"
|
||||
local regex_pattern="$2"
|
||||
local description="${3:-File}"
|
||||
|
||||
if grep -qE "$regex_pattern" "$file_path" 2>/dev/null; then
|
||||
log "✓ $description matches expected pattern '$regex_pattern': $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description does not match pattern '$regex_pattern': $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Check if a file has the expected number of lines
|
||||
# Usage: check_file_line_count "/path/to/file" expected_count "optional description"
|
||||
check_file_line_count() {
|
||||
local file_path="$1"
|
||||
local expected_count="$2"
|
||||
local description="${3:-File}"
|
||||
|
||||
local actual_count=$(wc -l < "$file_path" 2>/dev/null || echo "0")
|
||||
|
||||
if [[ "$actual_count" -eq "$expected_count" ]]; then
|
||||
log "✓ $description has expected line count ($expected_count): $file_path"
|
||||
return 0
|
||||
else
|
||||
log_error "✗ $description has $actual_count lines, expected $expected_count: $file_path"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
#############################################
|
||||
# Test Data Generation Functions
|
||||
#############################################
|
||||
|
||||
# Create a test FASTA file with specified sequences
|
||||
# Usage: create_test_fasta "/path/to/output.fasta" [num_sequences] [sequence_length]
|
||||
create_test_fasta() {
|
||||
local file_path="$1"
|
||||
local num_seqs="${2:-2}"
|
||||
local seq_length="${3:-64}"
|
||||
|
||||
log "Creating test FASTA file with $num_seqs sequences of length $seq_length: $file_path"
|
||||
|
||||
> "$file_path" # Create empty file
|
||||
|
||||
for i in $(seq 1 "$num_seqs"); do
|
||||
echo ">seq$i" >> "$file_path"
|
||||
# Generate a deterministic repeating ATCG sequence of the requested length
|
||||
head -c "$seq_length" /dev/zero | tr '\0' 'A' | sed 's/A/ATCG/g' | head -c "$seq_length" >> "$file_path"
|
||||
echo >> "$file_path"
|
||||
done
|
||||
|
||||
log "✓ Created test FASTA file: $file_path"
|
||||
}
|
||||
|
||||
# Create a test FASTQ file with specified reads
|
||||
# Usage: create_test_fastq "/path/to/output.fastq" [num_reads] [read_length]
|
||||
create_test_fastq() {
|
||||
local file_path="$1"
|
||||
local num_reads="${2:-4}"
|
||||
local read_length="${3:-35}"
|
||||
|
||||
log "Creating test FASTQ file with $num_reads reads of length $read_length: $file_path"
|
||||
|
||||
> "$file_path" # Create empty file
|
||||
|
||||
for i in $(seq 1 "$num_reads"); do
|
||||
echo "@read$i" >> "$file_path"
|
||||
# Generate random DNA sequence of exact length using bash
|
||||
local seq_line=""
|
||||
for j in $(seq 1 "$read_length"); do
|
||||
case $((RANDOM % 4)) in
|
||||
0) seq_line+="A";;
|
||||
1) seq_line+="T";;
|
||||
2) seq_line+="C";;
|
||||
3) seq_line+="G";;
|
||||
esac
|
||||
done
|
||||
echo "$seq_line" >> "$file_path"
|
||||
echo "+" >> "$file_path"
|
||||
# Generate quality scores (all 'I' = ASCII 73, i.e. Phred 40 in Phred+33 encoding)
|
||||
printf "%*s\n" "$read_length" "" | tr ' ' 'I' >> "$file_path"
|
||||
done
|
||||
|
||||
log "✓ Created test FASTQ file: $file_path"
|
||||
}
|
||||
|
||||
# Create a test GTF file with basic gene annotations
|
||||
# Usage: create_test_gtf "/path/to/output.gtf" [num_genes]
|
||||
create_test_gtf() {
|
||||
local file_path="$1"
|
||||
local num_genes="${2:-3}"
|
||||
|
||||
log "Creating test GTF file with $num_genes genes: $file_path"
|
||||
|
||||
> "$file_path" # Create empty file
|
||||
|
||||
for i in $(seq 1 "$num_genes"); do
|
||||
local start=$((1000 * i))
|
||||
local end=$((start + 999))
|
||||
local chr="chr$((i % 22 + 1))"
|
||||
|
||||
echo -e "${chr}\ttest\tgene\t${start}\t${end}\t.\t+\t.\tgene_id \"gene$i\"; gene_name \"GENE$i\"" >> "$file_path"
|
||||
echo -e "${chr}\ttest\ttranscript\t${start}\t${end}\t.\t+\t.\tgene_id \"gene$i\"; transcript_id \"transcript${i}\"; gene_name \"GENE$i\"" >> "$file_path"
|
||||
echo -e "${chr}\ttest\texon\t${start}\t$((start + 499))\t.\t+\t.\tgene_id \"gene$i\"; transcript_id \"transcript${i}\"; exon_number \"1\"" >> "$file_path"
|
||||
echo -e "${chr}\ttest\texon\t$((start + 500))\t${end}\t.\t+\t.\tgene_id \"gene$i\"; transcript_id \"transcript${i}\"; exon_number \"2\"" >> "$file_path"
|
||||
done
|
||||
|
||||
log "✓ Created test GTF file: $file_path"
|
||||
}
|
||||
|
||||
# Create a test GFF file with basic feature annotations
|
||||
# Usage: create_test_gff "/path/to/output.gff" [num_features]
|
||||
create_test_gff() {
|
||||
local file_path="$1"
|
||||
local num_features="${2:-3}"
|
||||
|
||||
log "Creating test GFF file with $num_features features: $file_path"
|
||||
|
||||
echo "##gff-version 3" > "$file_path"
|
||||
|
||||
for i in $(seq 1 "$num_features"); do
|
||||
local start=$((1000 * i))
|
||||
local end=$((start + 999))
|
||||
local chr="chr$((i % 22 + 1))"
|
||||
|
||||
echo -e "${chr}\ttest\tgene\t${start}\t${end}\t.\t+\t.\tID=gene$i;Name=GENE$i" >> "$file_path"
|
||||
done
|
||||
|
||||
log "✓ Created test GFF file: $file_path"
|
||||
}
|
||||
|
||||
# Create a test BED file with genomic intervals
|
||||
# Usage: create_test_bed "/path/to/output.bed" [num_intervals]
|
||||
create_test_bed() {
|
||||
local file_path="$1"
|
||||
local num_intervals="${2:-3}"
|
||||
|
||||
log "Creating test BED file with $num_intervals intervals: $file_path"
|
||||
|
||||
> "$file_path" # Create empty file
|
||||
|
||||
for i in $(seq 1 "$num_intervals"); do
|
||||
local start=$((1000 * i))
|
||||
local end=$((start + 999))
|
||||
local chr="chr$((i % 22 + 1))"
|
||||
|
||||
echo -e "${chr}\t${start}\t${end}\tregion$i\t0\t+" >> "$file_path"
|
||||
done
|
||||
|
||||
log "✓ Created test BED file: $file_path"
|
||||
}
|
||||
|
||||
# Create a simple test CSV file
|
||||
# Usage: create_test_csv "/path/to/output.csv" [num_rows]
|
||||
create_test_csv() {
|
||||
local file_path="$1"
|
||||
local num_rows="${2:-5}"
|
||||
|
||||
log "Creating test CSV file with $num_rows rows: $file_path"
|
||||
|
||||
echo "id,name,value,category" > "$file_path"
|
||||
|
||||
for i in $(seq 1 "$num_rows"); do
|
||||
echo "row$i,name$i,$((i * 10)),category$((i % 3 + 1))" >> "$file_path"
|
||||
done
|
||||
|
||||
log "✓ Created test CSV file: $file_path"
|
||||
}
|
||||
|
||||
# Create a simple test TSV file
|
||||
# Usage: create_test_tsv "/path/to/output.tsv" [num_rows]
|
||||
create_test_tsv() {
|
||||
local file_path="$1"
|
||||
local num_rows="${2:-5}"
|
||||
|
||||
log "Creating test TSV file with $num_rows rows: $file_path"
|
||||
|
||||
echo -e "id\tname\tvalue\tcategory" > "$file_path"
|
||||
|
||||
for i in $(seq 1 "$num_rows"); do
|
||||
echo -e "row$i\tname$i\t$((i * 10))\tcategory$((i % 3 + 1))" >> "$file_path"
|
||||
done
|
||||
|
||||
log "✓ Created test TSV file: $file_path"
|
||||
}
|
||||
|
||||
#############################################
|
||||
# Utility Functions
|
||||
#############################################
|
||||
|
||||
# Setup test environment with recommended settings
|
||||
setup_test_env() {
|
||||
# Enable strict error handling
|
||||
set -euo pipefail
|
||||
|
||||
# Set up consistent locale for reproducible results
|
||||
export LC_ALL=C
|
||||
|
||||
log "Test environment initialized with strict error handling"
|
||||
log "Using temporary directory: ${meta_temp_dir:-$PWD}"
|
||||
}
|
||||
|
||||
# Print test summary
|
||||
print_test_summary() {
|
||||
local test_name="${1:-Test}"
|
||||
log "🎉 $test_name completed successfully!"
|
||||
}
|
||||
|
||||
#############################################
|
||||
# Example Usage
|
||||
#############################################
|
||||
|
||||
# Example function showing how to use the helpers
|
||||
example_test_usage() {
|
||||
log "=== Example Test Usage ==="
|
||||
|
||||
# Setup
|
||||
setup_test_env
|
||||
|
||||
# Create test data
|
||||
create_test_fasta "$meta_temp_dir/input.fasta" 3 50
|
||||
|
||||
# Validate test data
|
||||
check_file_exists "$meta_temp_dir/input.fasta" "input FASTA file"
|
||||
check_file_not_empty "$meta_temp_dir/input.fasta" "input FASTA file"
|
||||
check_file_line_count "$meta_temp_dir/input.fasta" 6 # 3 sequences = 6 lines
|
||||
|
||||
# Example tool execution (commented out)
|
||||
# "$meta_executable" --input "$meta_temp_dir/input.fasta" --output "$meta_temp_dir/output"
|
||||
|
||||
# Validate outputs (examples)
|
||||
# check_file_exists "$meta_temp_dir/output.txt" "result file"
|
||||
# check_file_contains "$meta_temp_dir/output.txt" "expected_pattern" "result file"
|
||||
|
||||
print_test_summary "Example test"
|
||||
}
|
||||
@@ -3,8 +3,10 @@ description: |
|
||||
Bases2Fastq demultiplexes sequencing data generated by Element Biosciences instruments and converts base calls into FASTQ files.
|
||||
keywords: ["demultiplex", "fastq", "demux", "Element Biosciences"]
|
||||
links:
|
||||
homepage: https://www.elembio.com/
|
||||
documentation: https://docs.elembio.io/docs/bases2fastq/introduction/
|
||||
license: Proprietairy
|
||||
repository: https://github.com/Illumina/bases2fastq
|
||||
license: Proprietary
|
||||
requirements:
|
||||
commands: [bases2fastq]
|
||||
authors:
|
||||
@@ -158,19 +160,51 @@ argument_groups:
|
||||
type: boolean_true
|
||||
description: |
|
||||
Split FASTQ files by lane.
|
||||
- name: --strict
|
||||
- name: "--skip_qc_report"
|
||||
type: boolean_true
|
||||
description: |
|
||||
In strict mode any invalid or missing input file will terminate execution
|
||||
(overrides no_error_on_invalid and sets --error_on_missing)
|
||||
Do not generate HTML QC report.
|
||||
- name: "--skip_multi_qc"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Do not generate MultiQC HTML report.
|
||||
- name: "--settings"
|
||||
type: string
|
||||
multiple: true
|
||||
description: |
|
||||
Run manifest settings override. This option may be specified multiple times.
|
||||
|
||||
# --help, -h Display this usage statement
|
||||
# --input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY
|
||||
# --num-threads, -p NUMBER Number of threads (default 1)
|
||||
# --output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY
|
||||
# --settings SELECTION Run manifest settings override. This option may be specified multiple times.
|
||||
# --version, -v Display bases2fastq version
|
||||
# --skip-qc-report SELECTION Do not generate HTML QC report.
|
||||
# Cyto-fastq specific arguments
|
||||
- name: "Cyto-fastq Arguments"
|
||||
arguments:
|
||||
- name: "--batch"
|
||||
type: string
|
||||
description: |
|
||||
Restrict cyto-fastq generation to batch(es) that match comma delimited list (e.g. --batch B01,B02,B03).
|
||||
- name: "--cyto_fastq_mask"
|
||||
type: string
|
||||
multiple: true
|
||||
description: |
|
||||
Cycle mask for cyto fastq generation. This flag can be specified multiple times.
|
||||
- name: "--panel"
|
||||
type: file
|
||||
description: |
|
||||
Local or remote path to panel JSON
|
||||
- name: "--per_target_fastq"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Create per-target fastq for each cell assignment target site in each DISS batch according to FastqMasks in TargetCellAssignmentManifest.
|
||||
- name: "--tca_manifest"
|
||||
type: file
|
||||
description: |
|
||||
Location of TargetCellAssignmentManifest to use instead of default csv found in analysis directory
|
||||
|
||||
# Arguments not included as per contributing guidelines:
|
||||
# --help, -h Display this usage statement (handled by viash)
|
||||
# --input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY (not needed for biobox)
|
||||
# --num-threads, -p NUMBER Number of threads (use meta_cpus instead)
|
||||
# --output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY (not needed for biobox)
|
||||
# --version, -v Display bases2fastq version (handled by viash)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
@@ -179,21 +213,16 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: elembio/bases2fastq:2.1.0
|
||||
image: elembio/bases2fastq:2.2
|
||||
setup:
|
||||
- type: apt
|
||||
packages:
|
||||
- procps
|
||||
- tree
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bases2fastq: $(bases2fastq --version | cut -d' ' -f3)" > /var/software_versions.txt
|
||||
test_setup:
|
||||
- type: apt
|
||||
packages: curl
|
||||
bases2fastq --version 2>&1 | head -1 | sed 's/.*version \([0-9\\.]*\).*/bases2fastq: \1/' > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
|
||||
@@ -1,40 +1,51 @@
|
||||
```
|
||||
docker run --rm docker.io/elembio/bases2fastq:2.2 bases2fastq -h
|
||||
```
|
||||
|
||||
Usage: bases2fastq [OPTIONS] ANALYSIS_DIRECTORY OUTPUT_DIRECTORY
|
||||
|
||||
positional arguments:
|
||||
ANALYSIS_DIRECTORY Location of analysis directory
|
||||
OUTPUT_DIRECTORY Location to save output
|
||||
ANALYSIS_DIRECTORY Location of analysis directory
|
||||
OUTPUT_DIRECTORY Location to save output
|
||||
|
||||
optional arguments:
|
||||
--chemistry-version VERSION Run parameters override, chemistry version.
|
||||
--demux-only, -d Generate demux files and indexing stats without generating FASTQ
|
||||
--detect-adapters Detect adapters sequences, overriding any sequences present in run manifest.
|
||||
--error-on-missing Terminate execution for a missing file (by default, missing files are skipped and execution continues). Also set by --strict.
|
||||
--exclude-tile, -e SELECTION Regex matching tile names to exclude. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
|
||||
--filter-mask MASK Run parameters override, custom pass filter mask.
|
||||
--flowcell-id FLOWCELL_ID Run parameters override, flowcell ID.
|
||||
--force-index-orientation Do not attempt to find orientation for I1/I2 reads (reverse complement). Use orientation given in run manifest.
|
||||
--group-fastq Group all FASTQ/stats/metrics for a project are in the project folder (default false)
|
||||
--help, -h Display this usage statement
|
||||
--i1-cycles NUM_CYCLES Run parameters override, I1 cycles.
|
||||
--i2-cycles NUM_CYCLES Run parameters override, I2 cycles.
|
||||
--include-tile, -i SELECTION Regex matching tile names to include. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
|
||||
--input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY
|
||||
--kit-configuration KIT_CONFIG Run parameters override, kit configuration.
|
||||
--legacy-fastq Legacy naming for FASTQ files (e.g. SampleName_S1_L001_R1_001.fastq.gz)
|
||||
--log-level, -l LEVEL Severity level for logging. i.e. DEBUG, INFO, WARNING, ERROR (default INFO)
|
||||
--no-error-on-invalid Skip invalid files and continue execution (by default, execution is terminated for an invalid file). Overridden by --strict options.
|
||||
--no-projects Disable project directories (default false)
|
||||
--num-threads, -p NUMBER Number of threads (default 1)
|
||||
--num-unassigned NUMBER Max Number of unassigned sequences to report. Must be <= 1000 (default 30)
|
||||
--output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY
|
||||
--preparation-workflow WORKFLOW Run parameters override, preparation workflow.
|
||||
--qc-only Quickly generate run stats for single tile without generating FASTQ. Use --include-tile/--exclude-tile to define custom tile set.
|
||||
--r1-cycles NUM_CYCLES Run parameters override, R1 cycles.
|
||||
--r2-cycles NUM_CYCLES Run parameters override, R2 cycles.
|
||||
--run-manifest, -r PATH Location of run manifest to use instead of default RunManifest.csv found in analysis directory
|
||||
--settings SELECTION Run manifest settings override. This option may be specified multiple times.
|
||||
--skip-qc-report SELECTION Do not generate HTML QC report.
|
||||
--split-lanes Split FASTQ files by lane
|
||||
--strict, -s In strict mode any invalid or missing input file will terminate execution (overrides no-error-on-invalid and sets --error-on-missing)
|
||||
--version, -v Display bases2fastq version
|
||||
--chemistry-version VERSION Run parameters override, chemistry version.
|
||||
--demux-only, -d Generate demux files and indexing stats without generating FASTQ
|
||||
--detect-adapters Detect adapters sequences, overriding any sequences present in run manifest.
|
||||
--error-on-missing Terminate execution for a missing file (by default, missing files are skipped and execution continues).
|
||||
--exclude-tile, -e SELECTION Regex matching tile names to exclude. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
|
||||
--filter-mask MASK Run parameters override, custom pass filter mask.
|
||||
--flowcell-id FLOWCELL_ID Run parameters override, flowcell ID.
|
||||
--force-index-orientation Do not attempt to find orientation for I1/I2 reads (reverse complement). Use orientation given in run manifest.
|
||||
--group-fastq Group all FASTQ/stats/metrics for a project are in the project folder (default false)
|
||||
--help, -h Display this usage statement
|
||||
--i1-cycles NUM_CYCLES Run parameters override, I1 cycles.
|
||||
--i2-cycles NUM_CYCLES Run parameters override, I2 cycles.
|
||||
--include-tile, -i SELECTION Regex matching tile names to include. This flag can be specified multiple times. (e.g. L1.*C0[23]S.)
|
||||
--input-remote, NAME Rclone remote name for remote ANALYSIS_DIRECTORY
|
||||
--kit-configuration KIT_CONFIG Run parameters override, kit configuration.
|
||||
--legacy-fastq Legacy naming for FASTQ files (e.g. SampleName_S1_L001_R1_001.fastq.gz)
|
||||
--log-level, -l LEVEL Severity level for logging. i.e. DEBUG, INFO, WARNING, ERROR (default INFO)
|
||||
--no-error-on-invalid Skip invalid files and continue execution (by default, execution is terminated for an invalid file).
|
||||
--no-projects Disable project directories (default false)
|
||||
--num-threads, -p NUMBER Number of threads (default 1)
|
||||
--num-unassigned NUMBER Max Number of unassigned sequences to report. (default 30)
|
||||
--output-remote, NAME Rclone remote name for remote OUTPUT_DIRECTORY
|
||||
--preparation-workflow WORKFLOW Run parameters override, preparation workflow.
|
||||
--qc-only Quickly generate run stats for single tile without generating FASTQ. Use --include-tile/--exclude-tile to define custom tile set.
|
||||
--r1-cycles NUM_CYCLES Run parameters override, R1 cycles.
|
||||
--r2-cycles NUM_CYCLES Run parameters override, R2 cycles.
|
||||
--run-manifest, -r PATH Location of run manifest to use instead of default RunManifest.csv found in analysis directory
|
||||
--settings SELECTION Run manifest settings override. This option may be specified multiple times.
|
||||
--skip-multi-qc Do not generate MultiQC HTML report.
|
||||
--skip-qc-report Do not generate HTML QC report.
|
||||
--split-lanes Split FASTQ files by lane
|
||||
--version, -v Display bases2fastq version
|
||||
|
||||
cyto-fastq optional arguments:
|
||||
--batch BATCH Restrict cyto-fastq generation to batch(es) that match comma delimited list (e.g. --batch B01,B02,B03).
|
||||
--cyto-fastq-mask MASK Cycle mask for cyto fastq generation. This flag can be specified multiple times.
|
||||
--panel PANEL Local or remote path to panel JSON
|
||||
--per-target-fastq Create per-target fastq for each cell assignment target site in each DISS batch according to FastqMasks in TargetCellAssignmentManifest.
|
||||
--tca-manifest PATH Location of TargetCellAssignmentManifest to use instead of default csv found in analysis directory
|
||||
--well, -v Restrict cyto-fastq generation to well location(s) that match comma delimited list (e.g. --well A1,A2,B2)
|
||||
|
||||
@@ -8,92 +8,122 @@ set -eo pipefail
|
||||
|
||||
# Unset parameters
|
||||
unset_if_false=(
|
||||
par_demux_only
|
||||
par_detect_adapters
|
||||
par_error_on_missing
|
||||
par_group_fastq
|
||||
par_legacy_fastq
|
||||
par_no_error_on_invalid
|
||||
par_no_projects
|
||||
par_qc_only
|
||||
  par_split_lanes
  par_skip_qc_report
  par_strict
  par_force_index_orientation
  par_demux_only
  par_detect_adapters
  par_error_on_missing
  par_group_fastq
  par_legacy_fastq
  par_no_error_on_invalid
  par_no_projects
  par_qc_only
  par_split_lanes
  par_skip_qc_report
  par_skip_multi_qc
  par_force_index_orientation
  par_per_target_fastq
)

for par in ${unset_if_false[@]}; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset $par
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset $par
done

# NOTE: --preparation-workflow is bugged in bases2fastq
args=(
  ${par_demux_only:+--demux-only}
  ${par_detect_adapters:+--detect-adapters}
  ${par_error_on_missing:+--error-on-missing}
  ${par_group_fastq:+--group-fastq}
  ${par_legacy_fastq:+--legacy-fastq}
  ${par_no_error_on_invalid:+--no-error-on-invalid}
  ${par_no_projects:+--no-projects}
  ${par_split_lanes:+--split-lanes}
  ${par_strict:+--strict}
  ${par_force_index_orientation:+--force-index-orientation}
  ${par_chemistry_version:+--chemistry-version "$par_chemistry_version"}
  ${par_filter_mask:+--filter-mask "$par_filter_mask"}
  ${par_flowcell_id:+--flowcell-id "$par_flowcell_id"}
  ${par_i1_cycles:+--i1-cycles "$par_i1_cycles"}
  ${par_i2_cycles:+--i2-cycles "$par_i2_cycles"}
  ${par_r1_cycles:+--r1-cycles "$par_r1_cycles"}
  ${par_r2_cycles:+--r2-cycles "$par_r2_cycles"}
  ${par_kit_configuration:+--kit-configuration "$par_kit_configuration"}
  ${par_log_level:+--log-level "$par_log_level"}
  ${par_num_unassigned:+--num-unassigned "$par_num_unassigned"}
  ${par_preparation_workflow:+--preparation-workflow "$par_preparation_workflow"}
  ${meta_cpus:+--num-threads "$meta_cpus"}
  ${par_run_manifest:+--run-manifest "$par_run_manifest"}
)

# Create arrays for inputs that contain multiple arguments
IFS=";" read -ra exclude_tile <<< "$par_exclude_tile"
IFS=";" read -ra include_tile <<< "$par_include_tile"

if [ -z "$par_report" ]; then
  args+=( --skip-qc-report )
fi

for arg_value in "${exclude_tile[@]}"; do
  args+=( "--exclude-tile" "$arg_value" )
done

for arg_value in "${include_tile[@]}"; do
  args+=( "--include-tile" "$arg_value" )
done
IFS=";" read -ra settings <<< "$par_settings"
IFS=";" read -ra cyto_fastq_mask <<< "$par_cyto_fastq_mask"

echo "> Creating temporary directory."
# create temporary directory and clean up on exit
TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXX")
echo "> Created $TMPDIR"
function clean_up {
  [[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
  [[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
}
trap clean_up EXIT

# NOTE: --preparation-workflow is bugged in bases2fastq
args=(
  ${par_demux_only:+--demux-only}
  ${par_detect_adapters:+--detect-adapters}
  ${par_error_on_missing:+--error-on-missing}
  ${par_group_fastq:+--group-fastq}
  ${par_legacy_fastq:+--legacy-fastq}
  ${par_no_error_on_invalid:+--no-error-on-invalid}
  ${par_no_projects:+--no-projects}
  ${par_split_lanes:+--split-lanes}
  ${par_force_index_orientation:+--force-index-orientation}
  ${par_skip_qc_report:+--skip-qc-report}
  ${par_skip_multi_qc:+--skip-multi-qc}
  ${par_per_target_fastq:+--per-target-fastq}
  ${par_chemistry_version:+--chemistry-version "$par_chemistry_version"}
  ${par_filter_mask:+--filter-mask "$par_filter_mask"}
  ${par_flowcell_id:+--flowcell-id "$par_flowcell_id"}
  ${par_i1_cycles:+--i1-cycles "$par_i1_cycles"}
  ${par_i2_cycles:+--i2-cycles "$par_i2_cycles"}
  ${par_r1_cycles:+--r1-cycles "$par_r1_cycles"}
  ${par_r2_cycles:+--r2-cycles "$par_r2_cycles"}
  ${par_kit_configuration:+--kit-configuration "$par_kit_configuration"}
  ${par_log_level:+--log-level "$par_log_level"}
  ${par_num_unassigned:+--num-unassigned "$par_num_unassigned"}
  ${par_preparation_workflow:+--preparation-workflow "$par_preparation_workflow"}
  ${par_batch:+--batch "$par_batch"}
  ${par_panel:+--panel "$par_panel"}
  ${par_tca_manifest:+--tca-manifest "$par_tca_manifest"}
  ${meta_cpus:+--num-threads "$meta_cpus"}
  ${par_run_manifest:+--run-manifest "$par_run_manifest"}
)

if [ -z "$par_report" ]; then
  args+=( --skip-qc-report )
fi

for arg_value in "${exclude_tile[@]}"; do
  args+=( "--exclude-tile" "$arg_value" )
done

for arg_value in "${include_tile[@]}"; do
  args+=( "--include-tile" "$arg_value" )
done

for arg_value in "${settings[@]}"; do
  args+=( "--settings" "$arg_value" )
done

for arg_value in "${cyto_fastq_mask[@]}"; do
  args+=( "--cyto-fastq-mask" "$arg_value" )
done

args+=( "$par_analysis_directory" "$TMPDIR")
echo "> Running bases2fastq with arguments: ${args[@]}"
bases2fastq ${args[@]}
echo "> Done running sgdemux"

echo "> Output folder:"
tree "$TMPDIR"

echo "> Moving FASTQ files into final output directory"
mkdir -p "$par_output_directory/"
mv "$TMPDIR"/Samples/* --target-directory="$par_output_directory"

if [ ! -z "$par_report" ]; then
  echo "> Moving HTML report to the output ($par_report)"
  mv "$TMPDIR"/*.html "$par_report"
  # Find HTML files in TMPDIR
  html_files=("$TMPDIR"/*.html)
  if [ -f "${html_files[0]}" ]; then
    # If there's only one HTML file, move it to the specified report path
    if [ ${#html_files[@]} -eq 1 ]; then
      mv "${html_files[0]}" "$par_report"
    else
      # Multiple HTML files - find the main QC report and move it to the specified path
      # bases2fastq generates both QC report and MultiQC report
      for html_file in "${html_files[@]}"; do
        # The main QC report is usually not named multiqc_report.html
        if [[ ! "$(basename "$html_file")" =~ ^multiqc.*\.html$ ]]; then
          mv "$html_file" "$par_report"
          break
        fi
      done
    fi
  fi
else
  echo " > Leaving reports alone"
fi
@@ -106,6 +136,3 @@ if [ ! -z "$par_logs" ]; then
else
  echo "> Not moving logs"
fi

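The argument-building pattern used in this script can be tried in isolation. The block below is a minimal sketch, not the component's code: the `par_*` values and tile names are made up for illustration, and `printf` stands in for the `bases2fastq` call. It relies on `${var:+word}` expanding to nothing when the variable is unset or empty, and it expands the array as `"${args[@]}"` so that each element stays a single argument.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical parameter values for illustration only.
par_strict="true"
par_split_lanes=""            # empty: ${var:+...} expands to nothing
par_log_level="DEBUG"
par_exclude_tile="tile_a;tile_b"

args=(
  ${par_strict:+--strict}
  ${par_split_lanes:+--split-lanes}
  ${par_log_level:+--log-level "$par_log_level"}
)

# Split the semicolon-separated list and repeat the option once per value.
IFS=";" read -ra exclude_tile <<< "$par_exclude_tile"
for arg_value in "${exclude_tile[@]}"; do
  args+=( --exclude-tile "$arg_value" )
done

# printf stands in for the real command and prints one argument per line.
printf '%s\n' "${args[@]}"
```

Quoting the array expansion at the call site is a design choice of this sketch; it matters as soon as a value such as a manifest path contains spaces.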
@@ -2,25 +2,8 @@

set -eou pipefail

# Helper functions
assert_file_exists() {
  [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}

assert_file_not_exists() {
  [ ! -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}

assert_directory_exists() {
  [ -d "$1" ] || { echo "Directory '$1' does not exist" && exit 1; }
}

assert_file_not_empty() {
  [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Example output
# Note that the format of the fastq file names and organization into subfolders
@@ -87,16 +70,21 @@ function clean_up {
}
trap clean_up EXIT

log_info "Downloading and extracting test data"

# Unpack test input files
log_info "Downloading test data from Element Biosciences"
TAR_DIR="$TMPDIR/tar"
mkdir -p "$TAR_DIR"
curl http://element-public-data.s3.amazonaws.com/bases2fastq-share/bases2fastq-v2/20230404-bases2fastq-sim-151-151-9-9.tar.gz \
  -o "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz"
wget http://element-public-data.s3.amazonaws.com/bases2fastq-share/bases2fastq-v2/20230404-bases2fastq-sim-151-151-9-9.tar.gz \
  -O "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz"

log_info "Extracting test data"
BCL_DIR="$TMPDIR/bcl"
mkdir "$BCL_DIR"
tar -xvf "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz" -C "$BCL_DIR"
tar -xzf "$TAR_DIR/20230404-bases2fastq-sim-151-151-9-9.tar.gz" -C "$BCL_DIR"

log_info "Running test 1 with multiple options"
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
expected_out_dir="$TMPDIR/test1/out"
expected_report="$TMPDIR/report.html"
@@ -123,13 +111,13 @@ expected_logs="$TMPDIR/logs"
  --log_level DEBUG \
  --no_projects \
  --num_unassigned 30 \
  --strict \
  --run_manifest "$BCL_DIR/20230404-bases2fastq-sim-151-151-9-9/RunManifest.csv"

assert_directory_exists "$expected_out_dir"
assert_directory_exists "$expected_logs"
assert_file_exists "$expected_report"
assert_file_not_empty "$expected_report"
log_info "Validating test 1 outputs"
check_dir_exists "$expected_out_dir" "Output directory"
check_dir_exists "$expected_logs" "Logs directory"
check_file_exists "$expected_report" "HTML report"
check_file_not_empty "$expected_report" "HTML report (should contain data)"

expected_samples=(
  Undetermined_S0
@@ -140,15 +128,17 @@ expected_samples=(
  sample_4_S5
)

log_info "Checking FASTQ files for all samples and lanes"
for sample in "${expected_samples[@]}"; do
  for lane in "L001" "L002"; do
    for orientation in "R1" "R2"; do
      assert_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}_001.fastq.gz"
      check_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}_001.fastq.gz" "FASTQ file for ${sample}_${lane}_${orientation}"
    done
  done
done
popd > /dev/null

log_info "Running test 3 with basic options"
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
expected_out_dir="$TMPDIR/test3/out"
"$meta_executable" \
@@ -162,23 +152,26 @@ expected_samples=(
  sample_3
  sample_4
)
tree "$expected_out_dir"
log_info "Inspecting output directory structure:"
find "$expected_out_dir" -name "*.fastq.gz" | head -10

log_info "Checking sample FASTQ files"
for sample in "${expected_samples[@]}"; do
  for orientation in "R1" "R2"; do
    assert_file_exists "$expected_out_dir/DefaultProject/${sample}/${sample}_${orientation}.fastq.gz"
    check_file_exists "$expected_out_dir/DefaultProject/${sample}/${sample}_${orientation}.fastq.gz" "Sample ${sample} ${orientation} FASTQ file"
  done
done
assert_file_exists "$expected_out_dir/Unassigned/Unassigned_R1.fastq.gz"
assert_file_exists "$expected_out_dir/Unassigned/Unassigned_R2.fastq.gz"
check_file_exists "$expected_out_dir/Unassigned/Unassigned_R1.fastq.gz" "Unassigned R1 FASTQ file"
check_file_exists "$expected_out_dir/Unassigned/Unassigned_R2.fastq.gz" "Unassigned R2 FASTQ file"
popd > /dev/null

log_info "Running test 4 with split lanes option"
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
expected_out_dir="$TMPDIR/test4/out"
"$meta_executable" \
  --analysis_directory "$BCL_DIR/20230404-bases2fastq-sim-151-151-9-9" \
  --output_directory "$expected_out_dir" \
  --split_lanes
  --analysis_directory "$BCL_DIR/20230404-bases2fastq-sim-151-151-9-9" \
  --output_directory "$expected_out_dir" \
  --split_lanes

expected_samples=(
  "Unassigned/Unassigned"
@@ -188,13 +181,17 @@ expected_samples=(
  DefaultProject/sample_3/sample_3
  DefaultProject/sample_4/sample_4
)
tree "$expected_out_dir"
log_info "Inspecting split lanes output directory:"
find "$expected_out_dir" -name "*.fastq.gz" | head -10

log_info "Checking split lane FASTQ files"
for sample in "${expected_samples[@]}"; do
  for lane in "L1" "L2"; do
    for orientation in "R1" "R2"; do
      assert_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}.fastq.gz"
      check_file_exists "$expected_out_dir/${sample}_${lane}_${orientation}.fastq.gz" "Split lane FASTQ file ${sample}_${lane}_${orientation}"
    done
  done
done
popd > /dev/null

log_info "All tests completed successfully"

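The test above swaps the inline `assert_*` functions for `check_*` helpers that take a second, descriptive argument. The contents of `test_helpers.sh` are not part of this diff, so the following is only a sketch of what such helpers could look like, modelled on the removed `assert_*` functions; the function bodies and message wording are assumptions.

```bash
#!/usr/bin/env bash

# Hypothetical stand-ins for helpers sourced from test_helpers.sh.
log_info() { echo "[INFO] $*"; }

check_file_exists() {
  # $1: path, $2: human-readable description used in the error message
  [ -f "$1" ] || { echo "Missing $2: '$1'" >&2; return 1; }
}

check_file_not_empty() {
  [ -s "$1" ] || { echo "Empty $2: '$1'" >&2; return 1; }
}

check_dir_exists() {
  [ -d "$1" ] || { echo "Missing $2: '$1'" >&2; return 1; }
}

# Usage against a throwaway directory.
demo_dir=$(mktemp -d)
echo "data" > "$demo_dir/report.html"

check_dir_exists "$demo_dir" "Output directory"
check_file_exists "$demo_dir/report.html" "HTML report"
check_file_not_empty "$demo_dir/report.html" "HTML report (should contain data)"
log_info "All checks passed"

rm -rf "$demo_dir"
```

Returning a non-zero status instead of calling `exit` leaves the decision to stop with the caller; under `set -e` a failed check still ends the test script.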
140
src/bedtools/bedtools_annotate/config.vsh.yaml
Normal file
@@ -0,0 +1,140 @@
name: bedtools_annotate
namespace: bedtools
description: |
  Annotates the depth and breadth of coverage of features from multiple files.

  This tool analyzes how intervals in the input file are covered by features
  from one or more annotation files. It reports either the fraction of each
  interval covered, the count of overlapping features, or both metrics.

  **Default behavior:** Reports fraction of each input interval covered by features
  **Multiple files:** Can process multiple annotation files simultaneously
  **Strand options:** Supports same-strand, opposite-strand, or strand-agnostic analysis

keywords: [Annotate, Coverage, Overlap, BED, GFF, VCF]
links:
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/annotate.html
  repository: https://github.com/arq5x/bedtools2
  homepage: https://bedtools.readthedocs.io/en/latest/
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
  commands: [bedtools]
authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author, maintainer]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        alternatives: [-i]
        type: file
        description: |
          Input file in BED, GFF, or VCF format to be annotated.

          Each interval in this file will be analyzed for coverage by
          features from the annotation files.
        required: true
        example: intervals.bed

      - name: --files
        type: file
        multiple: true
        description: |
          One or more annotation files for coverage analysis.

          **Format:** BED, GFF, or VCF files containing features to analyze
          **Multiple files:** Use space-separated list or multiple --files flags
          **Processing:** Each file analyzed separately with results in columns
        required: true
        example: ["annotations1.bed", "annotations2.bed"]

  - name: Outputs
    arguments:
      - name: --output
        type: file
        direction: output
        description: |
          Output file with annotation results.

          Contains input intervals with additional columns showing coverage
          statistics from each annotation file.
        required: true
        example: annotated_intervals.bed

  - name: Options
    arguments:
      - name: --names
        type: string
        multiple: true
        description: |
          Descriptive names for each annotation file.

          **Usage:** One name per file in same order as --files
          **Header:** Names appear in output header line
          **Format:** Space-separated list or multiple --names flags
        example: ["ChIP-seq_peaks", "DNA_methylation"]

      - name: --counts
        type: boolean_true
        description: |
          Report count of overlapping features instead of coverage fraction.

          **Default output:** Fraction of input interval covered (0.0-1.0)
          **With --counts:** Integer count of overlapping features
          **Use case:** When feature count is more relevant than coverage area

      - name: --both
        type: boolean_true
        description: |
          Report both feature counts and coverage fractions.

          **Output format:** Count followed by fraction for each annotation file
          **Columns:** Doubles the number of result columns
          **Use case:** Comprehensive analysis requiring both metrics

      - name: --strand
        alternatives: [-s]
        type: boolean_true
        description: |
          Require same strandedness for overlap detection.

          Only count overlaps between features on the same strand.
          Features on opposite strands are ignored.

          **Default:** Strand-agnostic analysis

      - name: --different_strand
        alternatives: [-S]
        type: boolean_true
        description: |
          Require different strandedness for overlap detection.

          Only count overlaps between features on opposite strands.
          Features on the same strand are ignored.

          **Default:** Strand-agnostic analysis

resources:
  - type: bash_script
    path: script.sh

test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: /src/_utils/test_helpers.sh

engines:
  - type: docker
    image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
    setup:
      - type: docker
        run:
          - "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"

runners:
  - type: executable
  - type: nextflow
29
src/bedtools/bedtools_annotate/help.txt
Normal file
@@ -0,0 +1,29 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools annotate -h
```

Tool:    bedtools annotate (aka annotateBed)
Version: v2.31.1
Summary: Annotates the depth & breadth of coverage of features from mult. files
         on the intervals in -i.

Usage:   bedtools annotate [OPTIONS] -i <bed/gff/vcf> -files FILE1 FILE2..FILEn

Options:
    -names   A list of names (one / file) to describe each file in -i.
             These names will be printed as a header line.

    -counts  Report the count of features in each file that overlap -i.
             - Default is to report the fraction of -i covered by each file.

    -both    Report the counts followed by the % coverage.
             - Default is to report the fraction of -i covered by each file.

    -s       Require same strandedness. That is, only counts overlaps
             on the _same_ strand.
             - By default, overlaps are counted without respect to strand.

    -S       Require different strandedness. That is, only count overlaps
             on the _opposite_ strand.
             - By default, overlaps are counted without respect to strand.

34
src/bedtools/bedtools_annotate/script.sh
Normal file
@@ -0,0 +1,34 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

# unset flags
[[ "$par_counts" == "false" ]] && unset par_counts
[[ "$par_both" == "false" ]] && unset par_both
[[ "$par_strand" == "false" ]] && unset par_strand
[[ "$par_different_strand" == "false" ]] && unset par_different_strand

# Convert semicolon-separated files to array
IFS=';' read -ra files_array <<< "$par_files"

# Convert semicolon-separated names to array if provided
if [[ -n "${par_names}" ]]; then
  IFS=';' read -ra names_array <<< "$par_names"
fi

# Build command arguments array
cmd_args=(
  -i "$par_input"
  ${par_names:+-names "${names_array[@]}"}
  ${par_counts:+-counts}
  ${par_both:+-both}
  ${par_strand:+-s}
  ${par_different_strand:+-S}
  -files "${files_array[@]}"
)

# Execute bedtools annotate
bedtools annotate "${cmd_args[@]}" > "$par_output"
113
src/bedtools/bedtools_annotate/test.sh
Normal file
@@ -0,0 +1,113 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Initialize test environment
setup_test_env

log "Starting tests for bedtools_annotate"

# Create test data
log "Creating test data..."

# Create input intervals file
cat > "$meta_temp_dir/intervals.bed" << 'EOF'
chr1 100 200 interval1 100 +
chr1 300 400 interval2 200 +
chr2 150 250 interval3 300 -
chr2 500 600 interval4 400 -
EOF

# Create first annotation file (overlaps with intervals 1 and 3)
cat > "$meta_temp_dir/annotation1.bed" << 'EOF'
chr1 120 180 feature1 500 +
chr1 350 450 feature2 600 +
chr2 140 260 feature3 700 -
EOF

# Create second annotation file (overlaps with intervals 2 and 4)
cat > "$meta_temp_dir/annotation2.bed" << 'EOF'
chr1 320 380 feature4 800 +
chr1 390 420 feature5 900 +
chr2 520 580 feature6 1000 -
EOF

# Test 1: Basic annotation with coverage fractions
log "Starting TEST 1: Basic annotation with coverage fractions"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bed" \
  --files "$meta_temp_dir/annotation1.bed;$meta_temp_dir/annotation2.bed" \
  --output "$meta_temp_dir/output1.bed"

check_file_exists "$meta_temp_dir/output1.bed" "basic annotation output"
check_file_not_empty "$meta_temp_dir/output1.bed" "basic annotation output"
check_file_line_count "$meta_temp_dir/output1.bed" 4 "basic annotation line count"

# Check that fractions are present (should contain decimal numbers)
check_file_contains "$meta_temp_dir/output1.bed" "0." "coverage fractions"
log "✅ TEST 1 completed successfully"

# Test 2: Annotation with feature counts
log "Starting TEST 2: Annotation with feature counts"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bed" \
  --files "$meta_temp_dir/annotation1.bed;$meta_temp_dir/annotation2.bed" \
  --output "$meta_temp_dir/output2.bed" \
  --counts

check_file_exists "$meta_temp_dir/output2.bed" "count annotation output"
check_file_not_empty "$meta_temp_dir/output2.bed" "count annotation output"

# Check that counts are present (should contain integers)
check_file_contains "$meta_temp_dir/output2.bed" "1" "feature counts"
log "✅ TEST 2 completed successfully"

# Test 3: Annotation with both counts and fractions
log "Starting TEST 3: Annotation with both counts and fractions"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bed" \
  --files "$meta_temp_dir/annotation1.bed" \
  --output "$meta_temp_dir/output3.bed" \
  --both

check_file_exists "$meta_temp_dir/output3.bed" "both metrics output"
check_file_not_empty "$meta_temp_dir/output3.bed" "both metrics output"

# Check that both counts and fractions are present
check_file_contains "$meta_temp_dir/output3.bed" "1" "feature counts in both output"
check_file_contains "$meta_temp_dir/output3.bed" "0." "coverage fractions in both output"
log "✅ TEST 3 completed successfully"

# Test 4: Annotation with custom names
log "Starting TEST 4: Annotation with custom names"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bed" \
  --files "$meta_temp_dir/annotation1.bed;$meta_temp_dir/annotation2.bed" \
  --names "ChIP_peaks;DNA_meth" \
  --output "$meta_temp_dir/output4.bed"

check_file_exists "$meta_temp_dir/output4.bed" "named annotation output"
check_file_not_empty "$meta_temp_dir/output4.bed" "named annotation output"

# The names should appear somewhere (likely in header or within results)
log "✅ TEST 4 completed successfully"

# Test 5: Strand-specific annotation (same strand)
log "Starting TEST 5: Strand-specific annotation (same strand)"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bed" \
  --files "$meta_temp_dir/annotation1.bed" \
  --output "$meta_temp_dir/output5.bed" \
  --strand

check_file_exists "$meta_temp_dir/output5.bed" "strand-specific annotation output"
check_file_not_empty "$meta_temp_dir/output5.bed" "strand-specific annotation output"
log "✅ TEST 5 completed successfully"

log "All tests completed successfully!"
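The checks in this test only confirm that some fraction or count appears in the output. As a worked illustration, the coverage fractions implied by the test data for `annotation1.bed` can be computed by hand. The helper below is a sketch written for this note (`covered_fraction` is a hypothetical name), it handles a single feature per interval, and it makes no claim about how bedtools formats its numbers. By this arithmetic `feature2` (350-450) also overlaps `interval2` (300-400), so the data in `annotation1.bed` touches intervals 1, 2 and 3.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Fraction of the interval [a_start, a_end) covered by one feature [b_start, b_end).
covered_fraction() {
  local a_start=$1 a_end=$2 b_start=$3 b_end=$4
  local start=$(( a_start > b_start ? a_start : b_start ))
  local end=$(( a_end < b_end ? a_end : b_end ))
  local overlap=$(( end > start ? end - start : 0 ))
  awk -v o="$overlap" -v l="$(( a_end - a_start ))" 'BEGIN { printf "%.2f\n", o / l }'
}

covered_fraction 100 200 120 180   # interval1 vs feature1: 60 of 100 bases
covered_fraction 300 400 350 450   # interval2 vs feature2: 50 of 100 bases
covered_fraction 150 250 140 260   # interval3 vs feature3: fully covered
covered_fraction 500 600 140 260   # interval4 vs feature3: no overlap
```

Values like these could back a stricter assertion than searching for the string `0.`, provided the exact output format of the installed bedtools version is checked first.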
@@ -1,12 +1,15 @@
name: bedtools_bamtobed
namespace: bedtools
description: Converts BAM alignments to BED6 or BEDPE format.
description: |
  Converts BAM alignments to BED6 or BEDPE format.

  This tool converts alignments in BAM format to either BED6 or BEDPE format,
  allowing for flexible downstream analysis of genomic intervals.
keywords: [Converts, BAM, BED, BED6, BEDPE]
links:
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
  repository: https://github.com/arq5x/bedtools2
  homepage: https://bedtools.readthedocs.io/en/latest/#
  issue_tracker: https://github.com/arq5x/bedtools2/issues
  homepage: https://bedtools.readthedocs.io/en/latest/
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT
@@ -14,85 +17,129 @@ requirements:
  commands: [bedtools]
authors:
  - __merge__: /src/_authors/theodoro_gasperin.yaml
    roles: [ author, maintainer ]
    roles: [author]
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author, maintainer]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        alternatives: -i
        alternatives: [-i]
        type: file
        description: Input BAM file.
        description: |
          Input BAM file containing aligned sequences.

          **Requirements:**
          - Must be in SAM/BAM format
          - For paired-end BEDPE output (`--bedpe`), must be grouped or sorted by query name
        required: true
        example: input.bam

  - name: Outputs
    arguments:
      - name: --output
        alternatives: -o
        alternatives: [-o]
        required: true
        type: file
        direction: output
        description: Output BED file.
        description: |
          Output file in BED or BEDPE format.

          **Output formats:**
          - Default: BED6 format (6 columns)
          - With `--bedpe`: BEDPE format for paired-end data
          - With `--bed12`: BED12 format with blocked intervals
        example: output.bed

  - name: Options
    arguments:
      - name: --bedpe
        type: boolean_true
        description: |
          Write BEDPE format. Requires BAM to be grouped or sorted by query.
          Write BEDPE format for paired-end data.

          **Requirements:**
          - BAM must be grouped or sorted by query name
          - Produces paired-end BED format with mate information

      - name: --mate1
        type: boolean_true
        description: |
          When writing BEDPE (-bedpe) format, always report mate one as the first BEDPE "block".
          When writing BEDPE format (`--bedpe`), always report mate one as the first BEDPE block.

          Ensures consistent ordering of paired-end reads in output.

      - name: --bed12
        type: boolean_true
        description: |
          Write "blocked" BED format (aka "BED12"). Forces -split.
          See http://genome-test.cse.ucsc.edu/FAQ/FAQformat#format1
          Write blocked BED format (BED12 format).

          **Features:**
          - Creates 12-column BED format with block information
          - Automatically forces `--split` option
          - Useful for representing spliced alignments

          See [BED12 format specification](http://genome-test.cse.ucsc.edu/FAQ/FAQformat#format1) for details.

      - name: --split
        type: boolean_true
        description: |
          Report "split" BAM alignments as separate BED entries.
          Splits only on N CIGAR operations.
          Report split BAM alignments as separate BED entries.

          **Behavior:**
          - Splits only on **N** CIGAR operations (introns/gaps)
          - Each split becomes a separate BED interval
          - Useful for RNA-seq data with spliced alignments

      - name: --splitD
        type: boolean_true
        description: |
          Split alignments based on N and D CIGAR operators.
          Forces -split.
          Split alignments based on both **N** and **D** CIGAR operators.

          **Features:**
          - Splits on N (gaps/introns) and D (deletions) operations
          - Automatically forces `--split` option
          - More aggressive splitting than `--split` alone

      - name: --edit_distance
        alternatives: -ed
        alternatives: [-ed]
        type: boolean_true
        description: |
          Use BAM edit distance (NM tag) for BED score.
          - Default for BED is to use mapping quality.
          - Default for BEDPE is to use the minimum of
          the two mapping qualities for the pair.
          - When -ed is used with -bedpe, the total edit
          distance from the two mates is reported.
          Use BAM edit distance (NM tag) for BED score instead of mapping quality.

          **Scoring behavior:**
          - **Default BED**: Uses mapping quality as score
          - **Default BEDPE**: Uses minimum of two mapping qualities
          - **With --ed + --bedpe**: Reports total edit distance from both mates

      - name: --tag
        type: string
        description: |
          Use other NUMERIC BAM alignment tag for BED score.
          Default for BED is to use mapping quality. Disallowed with BEDPE output.
          Use other numeric BAM alignment tag for BED score.

          **Usage:**
          - Specify any numeric BAM tag (e.g., `SM`, `AS`, `XS`)
          - Replaces default mapping quality scoring
          - **Not allowed** with BEDPE output format
        example: "SM"

      - name: --color
        type: string
        description: |
          An R,G,B string for the color used with BED12 format.
          Default is (255,0,0).
        example: "250,250,250"
          RGB color string for BED12 format visualization.

          **Format:** R,G,B values (0-255 each)

          **Default:** `255,0,0` (red)
        example: "255,0,0"

      - name: --cigar
        type: boolean_true
        description: |
          Add the CIGAR string to the BED entry as a 7th column.
          Add the CIGAR string as a 7th column in BED output.

          Useful for preserving alignment information in BED format.

resources:
  - type: bash_script
@@ -101,17 +148,16 @@ resources:
test_resources:
  - type: bash_script
    path: test.sh
  - path: test_data
  - type: file
    path: /src/_utils/test_helpers.sh

engines:
  - type: docker
    image: debian:stable-slim
    image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
    setup:
      - type: apt
        packages: [bedtools, procps]
      - type: docker
        run: |
          echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
        run:
          - "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"

runners:
  - type: executable

@@ -1,9 +1,9 @@
```bash
bedtools bamtobed
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bamtobed -h
```

Tool:    bedtools bamtobed (aka bamToBed)
Version: v2.30.0
Version: v2.31.1
Summary: Converts BAM alignments to BED6 or BEDPE format.

Usage:   bedtools bamtobed [OPTIONS] -i <bam>

@@ -5,35 +5,36 @@

set -eo pipefail

# Unset parameters
unset_if_false=(
# unset flags
unset_if_false=(
  par_bedpe
  par_mate1
  par_bed12
  par_split
  par_splitD
  par_edit_distance
  par_tag
  par_color
  par_cigar
)

for par in ${unset_if_false[@]}; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset $par
for par in "${unset_if_false[@]}"; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset "$par"
done

# Execute bedtools sort with the provided arguments
bedtools bamtobed \
  ${par_bedpe:+-bedpe} \
  ${par_mate1:+-mate1} \
  ${par_bed12:+-bed12} \
  ${par_split:+-split} \
  ${par_splitD:+-splitD} \
  ${par_edit_distance:+-ed} \
  ${par_tag:+-tag "$par_tag"} \
  ${par_cigar:+-cigar} \
  ${par_color:+-color "$par_color"} \
  -i "$par_input" \
  > "$par_output"
# Build command arguments array
cmd_args=(
  -i "$par_input"
  ${par_bedpe:+-bedpe}
  ${par_mate1:+-mate1}
  ${par_bed12:+-bed12}
  ${par_split:+-split}
  ${par_splitD:+-splitD}
  ${par_edit_distance:+-ed}
  ${par_tag:+-tag "$par_tag"}
  ${par_color:+-color "$par_color"}
  ${par_cigar:+-cigar}
)

# Execute bedtools bamtobed
bedtools bamtobed "${cmd_args[@]}" > "$par_output"

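The `unset_if_false` loop in this script depends on bash indirect expansion. The following minimal sketch shows the mechanism with made-up values; `printf` stands in for the `bedtools bamtobed` call, and the three `par_*` variables are chosen only for illustration.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical values: boolean flags arrive as the strings "true" or "false".
par_bedpe="false"
par_cigar="true"
par_tag="SM"

unset_if_false=( par_bedpe par_cigar par_tag )

for par in "${unset_if_false[@]}"; do
  # ${!par} reads the variable whose name is stored in $par.
  test_val="${!par}"
  if [[ "$test_val" == "false" ]]; then
    unset "$par"
  fi
done

# Unset variables contribute nothing; set ones contribute their flag.
cmd_args=(
  ${par_bedpe:+-bedpe}
  ${par_tag:+-tag "$par_tag"}
  ${par_cigar:+-cigar}
)

# printf stands in for the real command and prints one argument per line.
printf '%s\n' "${cmd_args[@]}"
```

String-valued parameters such as `par_tag` pass through the loop unchanged unless their value is literally `false`, which is why listing them alongside the boolean flags is harmless in practice.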
@@ -1,183 +1,133 @@
#!/bin/bash

# exit on error
set -eo pipefail
## VIASH START
## VIASH END

# directory of the bam file
test_data="$meta_resources_dir/test_data"
# Source the centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Initialize test environment with strict error handling
setup_test_env

#############################################
# helper functions
assert_file_exists() {
  [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_not_empty() {
  [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_identical_content() {
  diff -a "$2" "$1" \
    || (echo "Files are not identical!" && exit 1)
}
# Test execution with centralized functions
#############################################

echo "Creating Test Data..."
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
function clean_up {
  [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
log "Starting tests for $meta_name"

# Create test directory
test_dir="$meta_temp_dir/test_data"
mkdir -p "$test_dir"

# Create a test SAM file with proper format (based on original test data)
log "Creating test SAM data..."
cat > "$test_dir/test.sam" << 'EOF'
|
||||
@SQ SN:chr2:172936693-172938111 LN:1418
|
||||
@PG ID:bwa PN:bwa VN:0.7.17-r1188
|
||||
my_read/1 99 chr2:172936693-172938111 129 60 100M = 429 400 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 SM:i:85
|
||||
my_read/2 147 chr2:172936693-172938111 429 60 100M = 129 -400 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 SM:i:85
|
||||
EOF
|
||||
|
||||
# Convert SAM to BAM using samtools (if available in container) or use the SAM directly
|
||||
log "Converting SAM to BAM..."
|
||||
if command -v samtools >/dev/null 2>&1; then
|
||||
samtools view -bS "$test_dir/test.sam" > "$test_dir/test.bam"
|
||||
input_file="$test_dir/test.bam"
|
||||
else
|
||||
# bedtools can handle SAM files directly
|
||||
input_file="$test_dir/test.sam"
|
||||
log "Using SAM file directly (samtools not available)"
|
||||
fi
|
||||
|
||||
# --- Test Case 1: Basic BAM to BED conversion ---
|
||||
log "Starting TEST 1: Basic BAM to BED conversion"
|
||||
|
||||
log "Executing $meta_name with basic parameters..."
|
||||
"$meta_executable" \
|
||||
--input "$input_file" \
|
||||
--output "$meta_temp_dir/output1.bed"
|
||||
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.bed" "output BED file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.bed" "output BED file"
|
||||
|
||||
# Check that the BED file contains at least one line
|
||||
line_count=$(wc -l < "$meta_temp_dir/output1.bed")
|
||||
log "Output contains $line_count lines"
|
||||
[ "$line_count" -gt 0 ] || { log_error "Output file is empty"; exit 1; }
|
||||
|
||||
# Check that each line has 6 columns (BED6 format)
|
||||
awk 'NF != 6 { exit 1 }' "$meta_temp_dir/output1.bed" || {
|
||||
log_error "Output is not in BED6 format (expected 6 columns per line)"
|
||||
exit 1
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
# Generate expected files for comparison
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\n" > "$TMPDIR/expected.bed"
|
||||
printf "chr2:172936693-172938111\t128\t228\tchr2:172936693-172938111\t428\t528\tmy_read\t60\t+\t-\n" > "$TMPDIR/expected.bedpe"
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t128\t228\t255,0,0\t1\t100\t0\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t428\t528\t255,0,0\t1\t100\t0\n" > "$TMPDIR/expected.bed12"
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t0\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t0\t-\n" > "$TMPDIR/expected_ed.bed"
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t128\t228\t250,250,250\t1\t100\t0\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t428\t528\t250,250,250\t1\t100\t0\n" > "$TMPDIR/expected_color.bed12"
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t100M\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t100M\n" > "$TMPDIR/expected_cigar.bed"
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t85\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t85\t-\n" > "$TMPDIR/expected_tag.bed"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# --- Test Case 2: BEDPE format ---
|
||||
log "Starting TEST 2: BEDPE format conversion"
|
||||
|
||||
# Test 1:
|
||||
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file"
|
||||
log "Executing $meta_name with --bedpe flag..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output.bed" \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected.bed"
|
||||
echo "- test1 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 2:
|
||||
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file with -bedpe"
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output.bedpe" \
|
||||
--input "$input_file" \
|
||||
--output "$meta_temp_dir/output2.bedpe" \
|
||||
--bedpe
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bedpe"
|
||||
assert_file_not_empty "output.bedpe"
|
||||
assert_identical_content "output.bedpe" "../expected.bedpe"
|
||||
echo "- test2 succeeded -"
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.bedpe" "output BEDPE file"
|
||||
check_file_not_empty "$meta_temp_dir/output2.bedpe" "output BEDPE file"
|
||||
|
||||
popd > /dev/null
|
||||
# Check that BEDPE file has correct number of columns (10 for BEDPE)
|
||||
awk 'NF != 10 { exit 1 }' "$meta_temp_dir/output2.bedpe" || {
|
||||
log_error "Output is not in BEDPE format (expected 10 columns per line)"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Test 3:
|
||||
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file with -bed12"
|
||||
# --- Test Case 3: BED12 format ---
|
||||
log "Starting TEST 3: BED12 format conversion"
|
||||
|
||||
log "Executing $meta_name with --bed12 flag..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output.bed12" \
|
||||
--input "$input_file" \
|
||||
--output "$meta_temp_dir/output3.bed12" \
|
||||
--bed12
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed12"
|
||||
assert_file_not_empty "output.bed12"
|
||||
assert_identical_content "output.bed12" "../expected.bed12"
|
||||
echo "- test3 succeeded -"
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3.bed12" "output BED12 file"
|
||||
check_file_not_empty "$meta_temp_dir/output3.bed12" "output BED12 file"
|
||||
|
||||
popd > /dev/null
|
||||
# Check that BED12 file has correct number of columns (12 for BED12)
|
||||
awk 'NF != 12 { exit 1 }' "$meta_temp_dir/output3.bed12" || {
|
||||
log_error "Output is not in BED12 format (expected 12 columns per line)"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Test 4:
|
||||
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file with -ed"
|
||||
# --- Test Case 4: CIGAR addition ---
|
||||
log "Starting TEST 4: CIGAR string addition"
|
||||
|
||||
log "Executing $meta_name with --cigar flag..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output_ed.bed" \
|
||||
--edit_distance
|
||||
|
||||
# checks
|
||||
assert_file_exists "output_ed.bed"
|
||||
assert_file_not_empty "output_ed.bed"
|
||||
assert_identical_content "output_ed.bed" "../expected_ed.bed"
|
||||
echo "- test4 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 5:
|
||||
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file with -color"
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output_color.bed12" \
|
||||
--bed12 \
|
||||
--color "250,250,250" \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output_color.bed12"
|
||||
assert_file_not_empty "output_color.bed12"
|
||||
assert_identical_content "output_color.bed12" "../expected_color.bed12"
|
||||
echo "- test5 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 6:
|
||||
mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file with -cigar"
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output_cigar.bed" \
|
||||
--input "$input_file" \
|
||||
--output "$meta_temp_dir/output4.bed" \
|
||||
--cigar
|
||||
|
||||
# checks
|
||||
assert_file_exists "output_cigar.bed"
|
||||
assert_file_not_empty "output_cigar.bed"
|
||||
assert_identical_content "output_cigar.bed" "../expected_cigar.bed"
|
||||
echo "- test6 succeeded -"
|
||||
log "Validating TEST 4 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output4.bed" "output BED file with CIGAR"
|
||||
check_file_not_empty "$meta_temp_dir/output4.bed" "output BED file with CIGAR"
|
||||
|
||||
popd > /dev/null
|
||||
# Check that BED file has correct number of columns (7 for BED6 + CIGAR)
|
||||
awk 'NF != 7 { exit 1 }' "$meta_temp_dir/output4.bed" || {
|
||||
log_error "Output is not in BED6+CIGAR format (expected 7 columns per line)"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Test 7:
|
||||
mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null
|
||||
# Check that the 7th column contains CIGAR strings
|
||||
check_file_contains "$meta_temp_dir/output4.bed" "100M" "BED file with CIGAR strings"
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file with -tag"
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output_tag.bed" \
|
||||
--tag "XT"
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output_tag.bed"
|
||||
assert_file_not_empty "output_tag.bed"
|
||||
assert_identical_content "output_tag.bed" "../expected_tag.bed"
|
||||
echo "- test7 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 8:
|
||||
mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null
|
||||
|
||||
echo "> Run bedtools bamtobed on BAM file with other options"
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--output "output.bed" \
|
||||
--bedpe \
|
||||
--mate1 \
|
||||
--split \
|
||||
--splitD \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected.bedpe"
|
||||
echo "- test8 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
echo "---- All tests succeeded! ----"
|
||||
exit 0
|
||||
print_test_summary "All tests completed successfully"
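
Note on the column checks in the new test script: each format is validated by counting fields per line with `awk`. A minimal sketch of that check as a reusable function follows. `check_column_count` is a hypothetical name, not one of the helpers in `test_helpers.sh`, and unlike the script above it splits on tabs only, which is an assumption about how strict the check should be.

```shell
#!/bin/bash

# check_column_count FILE N
# Hypothetical helper: succeeds only if every line of FILE has exactly N
# tab-separated fields.
check_column_count() {
  local file="$1" expected="$2"
  awk -F'\t' -v n="$expected" '
    NF != n { bad = 1; exit }   # stop at the first malformed line
    END     { exit bad }        # exit status 0 when no line was flagged
  ' "$file"
}

# Small demonstration on throwaway files.
demo_dir="$(mktemp -d)"
printf 'chr1\t128\t228\tread/1\t60\t+\n' > "$demo_dir/bed6.bed"
printf 'chr1\t128\t228\tread/1\t60\t+\t100M\n' > "$demo_dir/bed6_cigar.bed"

check_column_count "$demo_dir/bed6.bed" 6 && echo "bed6.bed: 6 columns"
check_column_count "$demo_dir/bed6_cigar.bed" 7 && echo "bed6_cigar.bed: 7 columns"
check_column_count "$demo_dir/bed6_cigar.bed" 6 || echo "bed6_cigar.bed: not BED6"

rm -r "$demo_dir"
```

These checks confirm the shape of the output only. The removed tests compared output byte for byte against expected files, which also caught wrong coordinates or scores.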
|
||||
|
||||
Binary file not shown.
@@ -1,3 +0,0 @@
|
||||
@SQ SN:chr2:172936693-172938111 LN:1418
|
||||
my_read 99 chr2:172936693-172938111 129 60 100M = 429 400 CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
|
||||
my_read 147 chr2:172936693-172938111 429 60 100M = 129 -400 TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
|
||||
@@ -1,13 +1,15 @@
|
||||
name: bedtools_bamtofastq
|
||||
namespace: bedtools
|
||||
description: |
|
||||
Conversion tool for extracting FASTQ records from sequence alignments in BAM format.
|
||||
keywords: [Conversion ,BAM, FASTQ]
|
||||
Convert BAM alignments to FASTQ files.
|
||||
|
||||
This tool extracts FASTQ records from sequence alignments in BAM format,
|
||||
supporting both single-end and paired-end data extraction.
|
||||
keywords: [Conversion, BAM, FASTQ]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/#
|
||||
issue_tracker: https://github.com/arq5x/bedtools2/issues
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
@@ -15,40 +17,62 @@ requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: -i
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: Input BAM file to be converted to FASTQ.
|
||||
description: |
|
||||
Input BAM file to be converted to FASTQ.
|
||||
|
||||
**Requirements:**
|
||||
- Must be in BAM format
|
||||
- For paired-end output, should be sorted by query name
|
||||
required: true
|
||||
example: input.bam
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --fastq
|
||||
alternatives: -fq
|
||||
alternatives: [-fq]
|
||||
direction: output
|
||||
type: file
|
||||
description: Output FASTQ file.
|
||||
description: |
|
||||
Output FASTQ file for single-end data or first mate in paired-end data.
|
||||
|
||||
**Output format:** Standard FASTQ format with sequence and quality scores
|
||||
required: true
|
||||
example: output.fastq
|
||||
|
||||
- name: --fastq2
|
||||
alternatives: -fq2
|
||||
alternatives: [-fq2]
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
FASTQ for second end. Used if BAM contains paired-end data.
|
||||
BAM should be sorted by query name is creating paired FASTQ.
|
||||
Output FASTQ file for second mate in paired-end data.
|
||||
|
||||
**Usage:**
|
||||
- Required only for paired-end BAM files
|
||||
- BAM should be sorted by query name for proper pairing
|
||||
- If omitted, only first mates or single-end reads are extracted
|
||||
example: output_R2.fastq
|
||||
|
||||
- name: Options
|
||||
arguments:
|
||||
- name: --tags
|
||||
type: boolean_true
|
||||
description: |
|
||||
Create FASTQ based on the mate info in the BAM R2 and Q2 tags.
|
||||
Create FASTQ based on mate information in BAM R2 and Q2 tags.
|
||||
|
||||
**Usage:**
|
||||
- Uses R2 tag for second mate sequence
|
||||
- Uses Q2 tag for second mate quality scores
|
||||
- Alternative to requiring a query-name-sorted paired BAM
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
@@ -57,17 +81,16 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: test_data
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
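
Note on the new `run:` entry above: it rewrites the first line of `bedtools --version` into a `name: version` pair. The sketch below applies the same `head` and `sed` stages to a fixed sample string, so it runs without bedtools installed. The sample line `bedtools v2.31.1` is an assumption about the tool's output format, and `format_version_line` is a hypothetical name used only here.

```shell
#!/bin/bash

# Hypothetical helper: same filter stages as the run entry, reading stdin.
format_version_line() {
  head -1 | sed 's/.*bedtools v/bedtools: /'
}

# Assumed sample of what `bedtools --version` prints.
echo "bedtools v2.31.1" | format_version_line
```

The removed entry kept the leading `v` and wrapped the value in double quotes, so the content of `/var/software_versions.txt` changes format with this commit.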
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
```bash
|
||||
bedtools bamtofastq
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bamtofastq -h
|
||||
```
|
||||
|
||||
Tool: bedtools bamtofastq (aka bamToFastq)
|
||||
Version: v2.30.0
|
||||
Version: v2.31.1
|
||||
Summary: Convert BAM alignments to FASTQ files.
|
||||
|
||||
Usage: bamToFastq [OPTIONS] -i <BAM> -fq <FQ>
|
||||
|
||||
@@ -3,17 +3,18 @@
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Exit on error
|
||||
set -eo pipefail
|
||||
|
||||
# Unset parameters
|
||||
# Unset false boolean parameters
|
||||
[[ "$par_tags" == "false" ]] && unset par_tags
|
||||
|
||||
# Execute bedtools bamtofastq with the provided arguments
|
||||
bedtools bamtofastq \
|
||||
${par_tags:+-tags} \
|
||||
${par_fastq2:+-fq2 "$par_fastq2"} \
|
||||
-i "$par_input" \
|
||||
-fq "$par_fastq"
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
-fq "$par_fastq"
|
||||
${par_fastq2:+-fq2 "$par_fastq2"}
|
||||
${par_tags:+-tags}
|
||||
)
|
||||
|
||||
# Execute bedtools bamtofastq
|
||||
bedtools bamtofastq "${cmd_args[@]}"
|
||||
@@ -1,84 +1,92 @@
|
||||
#!/bin/bash
|
||||
|
||||
# exit on error
|
||||
set -eo pipefail
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
test_data="$meta_resources_dir/test_data"
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_identical_content() {
|
||||
diff -a "$2" "$1" \
|
||||
|| (echo "Files are not identical!" && exit 1)
|
||||
}
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
# Test 1: normal conversion
|
||||
mkdir test1
|
||||
cd test1
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
echo "> Run bedtools bamtofastq on BAM file"
|
||||
# Create test directory
|
||||
test_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_dir"
|
||||
|
||||
# Create a test SAM file with proper FASTQ data
|
||||
log "Creating test SAM data..."
|
||||
cat > "$test_dir/test.sam" << 'EOF'
|
||||
@SQ SN:chr1 LN:1000
|
||||
@PG ID:bwa PN:bwa VN:0.7.17
|
||||
read1 0 chr1 100 60 50M * 0 0 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
|
||||
read2 0 chr1 200 60 50M * 0 0 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
|
||||
EOF
|
||||
|
||||
# --- Test Case 1: Basic BAM to FASTQ conversion (single-end) ---
|
||||
log "Starting TEST 1: Basic BAM to FASTQ conversion"
|
||||
|
||||
log "Executing $meta_name with single-end BAM..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--fastq "output.fastq"
|
||||
--input "$test_dir/test.sam" \
|
||||
--fastq "$meta_temp_dir/output1.fastq"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.fastq"
|
||||
assert_file_not_empty "output.fastq"
|
||||
assert_identical_content "output.fastq" "$test_data/expected.fastq"
|
||||
echo "- test1 succeeded -"
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.fastq" "output FASTQ file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.fastq" "output FASTQ file"
|
||||
|
||||
cd ..
|
||||
# Check FASTQ format (should have 4 lines per read: header, sequence, +, quality)
|
||||
total_lines=$(wc -l < "$meta_temp_dir/output1.fastq")
|
||||
log "Output FASTQ contains $total_lines lines"
|
||||
[ $((total_lines % 4)) -eq 0 ] || { log_error "FASTQ format error: line count not divisible by 4"; exit 1; }
|
||||
|
||||
# Test 2: with tags
|
||||
mkdir test2
|
||||
cd test2
|
||||
# Check that FASTQ contains expected patterns
|
||||
check_file_contains "$meta_temp_dir/output1.fastq" "@read1" "FASTQ headers"
|
||||
check_file_contains "$meta_temp_dir/output1.fastq" "AAAAAAAA" "sequence content"
|
||||
check_file_contains "$meta_temp_dir/output1.fastq" "IIIIIIII" "quality scores"
|
||||
|
||||
echo "> Run bedtools bamtofastq on BAM file with tags"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# --- Test Case 2: Test --tags option ---
|
||||
log "Starting TEST 2: BAM to FASTQ with --tags option"
|
||||
|
||||
# For the tags test, we'll just verify the command runs without error
|
||||
# since creating BAM with R2/Q2 tags would be complex
|
||||
log "Executing $meta_name with --tags flag..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--fastq "output.fastq" \
|
||||
--input "$test_dir/test.sam" \
|
||||
--fastq "$meta_temp_dir/output2.fastq" \
|
||||
--tags
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.fastq"
|
||||
assert_file_not_empty "output.fastq"
|
||||
assert_identical_content "output.fastq" "$test_data/expected.fastq"
|
||||
echo "- test2 succeeded -"
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.fastq" "output FASTQ file with tags"
|
||||
|
||||
cd ..
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
# Test 3: with option fq2
|
||||
mkdir test3
|
||||
cd test3
|
||||
# --- Test Case 3: Test with secondary output (without actual paired data) ---
|
||||
log "Starting TEST 3: Test secondary output parameter"
|
||||
|
||||
echo "> Run bedtools bamtofastq on BAM file with output_fq2"
|
||||
# Test that the fastq2 parameter is accepted (even if no paired reads are present)
|
||||
log "Executing $meta_name with --fastq2 parameter..."
|
||||
"$meta_executable" \
|
||||
--input "$test_data/example.bam" \
|
||||
--fastq "output1.fastq" \
|
||||
--fastq2 "output2.fastq"
|
||||
--input "$test_dir/test.sam" \
|
||||
--fastq "$meta_temp_dir/output3_R1.fastq" \
|
||||
--fastq2 "$meta_temp_dir/output3_R2.fastq"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output1.fastq"
|
||||
assert_file_not_empty "output1.fastq"
|
||||
assert_identical_content "output1.fastq" "$test_data/expected_1.fastq"
|
||||
assert_file_exists "output2.fastq"
|
||||
assert_file_not_empty "output2.fastq"
|
||||
assert_identical_content "output2.fastq" "$test_data/expected_2.fastq"
|
||||
echo "- test3 succeeded -"
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3_R1.fastq" "primary FASTQ file"
|
||||
check_file_not_empty "$meta_temp_dir/output3_R1.fastq" "primary FASTQ file"
|
||||
|
||||
cd ..
|
||||
# The R2 file may be empty since we don't have paired reads, but should exist
|
||||
check_file_exists "$meta_temp_dir/output3_R2.fastq" "secondary FASTQ file"
|
||||
|
||||
echo "All tests succeeded"
|
||||
exit 0
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
print_test_summary "All tests completed successfully"
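
Note on the FASTQ check in TEST 1 above: it relies on each record taking exactly four lines (header, sequence, `+`, quality). A minimal sketch of that check as a function that also reports the record count is shown below. `count_fastq_records` is a hypothetical name, not part of `test_helpers.sh`.

```shell
#!/bin/bash

# count_fastq_records FILE
# Hypothetical helper: prints the number of records in FILE, or fails when
# the line count is not a multiple of four.
count_fastq_records() {
  local file="$1" total
  total=$(wc -l < "$file")
  if [ $((total % 4)) -ne 0 ]; then
    echo "FASTQ format error: $total lines is not divisible by 4" >&2
    return 1
  fi
  echo $((total / 4))
}

# Demonstration on a throwaway two-record file.
demo_fastq="$(mktemp)"
printf '@read1\nAAAA\n+\nIIII\n@read2\nTTTT\n+\nJJJJ\n' > "$demo_fastq"
count_fastq_records "$demo_fastq"
rm "$demo_fastq"
```

A line count divisible by four is necessary but not sufficient for a valid file. The sketch does not verify that headers start with `@` or that sequence and quality lengths match.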
|
||||
|
||||
|
||||
|
||||
Binary file not shown.
@@ -1,3 +0,0 @@
|
||||
@SQ SN:chr2:172936693-172938111 LN:1418
|
||||
my_read 99 chr2:172936693-172938111 129 60 100M = 429 400 CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
|
||||
my_read 147 chr2:172936693-172938111 429 60 100M = 129 -400 TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
|
||||
@@ -1,16 +0,0 @@
|
||||
@my_read
|
||||
CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT
|
||||
+
|
||||
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
|
||||
@my_read
|
||||
CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT
|
||||
+
|
||||
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
|
||||
@my_read
|
||||
CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA
|
||||
+
|
||||
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
|
||||
@my_read
|
||||
CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA
|
||||
+
|
||||
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
|
||||
@@ -1,4 +0,0 @@
|
||||
@my_read/1
|
||||
CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT
|
||||
+
|
||||
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
|
||||
@@ -1,4 +0,0 @@
|
||||
@my_read/2
|
||||
CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA
|
||||
+
|
||||
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
|
||||
@@ -1,13 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
# create sam file
|
||||
printf "@SQ\tSN:chr2:172936693-172938111\tLN:1418\n" > example.sam
|
||||
printf "my_read\t99\tchr2:172936693-172938111\t129\t60\t100M\t=\t429\t400\tCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT\t*\tXT:A:U\tNM:i:0\tSM:i:37\tAM:i:37\tX0:i:1\tX1:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tMD:Z:100\n" >> example.sam
|
||||
printf "my_read\t147\tchr2:172936693-172938111\t429\t60\t100M\t=\t129\t-400\tTCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG\t*\tXT:A:U\tNM:i:0\tSM:i:37\tAM:i:37\tX0:i:1\tX1:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tMD:Z:100\n" >> example.sam
|
||||
|
||||
# create bam file
|
||||
# samtools view -b example.sam > example.bam
|
||||
|
||||
# create fastq files
|
||||
# bedtools bamtofastq -i example.bam -fq expected.fastq
|
||||
# bedtools bamtofastq -i example.bam -fq expected_1.fastq -fq2 expected_2.fastq
|
||||
@@ -7,8 +7,7 @@ keywords: [Converts, BED12, BED6]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bed12tobed6.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/#
|
||||
issue_tracker: https://github.com/arq5x/bedtools2/issues
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
@@ -16,33 +15,52 @@ requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: -i
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: Input BED12 file.
|
||||
description: |
|
||||
Input BED12 file containing blocked features.
|
||||
|
||||
**Requirements:**
|
||||
- Must be in BED12 format (12 columns)
|
||||
- Should contain blocked features (e.g., genes with exons)
|
||||
- Blocks are defined by columns 10-12 (blockCount, blockSizes, blockStarts)
|
||||
required: true
|
||||
example: input.bed12
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: -o
|
||||
alternatives: [-o]
|
||||
type: file
|
||||
direction: output
|
||||
description: Output BED6 file to be written.
|
||||
description: |
|
||||
Output BED6 file containing discrete features.
|
||||
|
||||
**Output format:**
|
||||
- Each block from input BED12 becomes a separate BED6 entry
|
||||
- Maintains chromosome, strand, and name information
|
||||
- Coordinates are adjusted to represent individual blocks
|
||||
example: output.bed6
|
||||
|
||||
- name: Options
|
||||
arguments:
|
||||
- name: --n_score
|
||||
alternatives: -n
|
||||
alternatives: [-n]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Force the score to be the (1-based) block number from the BED12.
|
||||
Force the score to be the 1-based block number from the BED12.
|
||||
|
||||
**Default behavior:** Preserves original score from BED12
|
||||
**With --n_score:** Sets score to block number (1, 2, 3, etc.)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
@@ -51,16 +69,16 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
```
|
||||
bedtools bed12tobed6 -h
|
||||
```bash
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bed12tobed6 -h
|
||||
```
|
||||
|
||||
Tool: bedtools bed12tobed6 (aka bed12ToBed6)
|
||||
Version: v2.30.0
|
||||
Version: v2.31.1
|
||||
Summary: Splits BED12 features into discrete BED6 features.
|
||||
|
||||
Usage: bedtools bed12tobed6 [OPTIONS] -i <bed12>
|
||||
|
||||
@@ -5,11 +5,14 @@
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# Unset parameters
|
||||
# Unset false boolean parameters
|
||||
[[ "$par_n_score" == "false" ]] && unset par_n_score
|
||||
|
||||
# Execute bedtools bed12tobed6 conversion
|
||||
bedtools bed12tobed6 \
|
||||
${par_n_score:+-n} \
|
||||
-i "$par_input" \
|
||||
> "$par_output"
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
${par_n_score:+-n}
|
||||
)
|
||||
|
||||
# Execute bedtools bed12tobed6
|
||||
bedtools bed12tobed6 "${cmd_args[@]}" > "$par_output"
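
Note on what the conversion above produces: in BED12, column 10 is the block count, column 11 the comma-separated block sizes, and column 12 the block starts relative to `chromStart`. The sketch below illustrates the coordinate arithmetic with plain `awk`. It is not the bedtools implementation and ignores the `-n` option, and `split_bed12` is a hypothetical name.

```shell
#!/bin/bash

# split_bed12 FILE
# Hypothetical illustration: emit one BED6 line per BED12 block, keeping the
# original name, score, and strand.
split_bed12() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    split($11, sizes, ",")    # blockSizes
    split($12, starts, ",")   # blockStarts, relative to chromStart
    for (i = 1; i <= $10; i++) {
      block_start = $2 + starts[i]
      print $1, block_start, block_start + sizes[i], $4, $5, $6
    }
  }' "$1"
}

# Demonstration using the first record of the test data in this commit.
demo_bed12="$(mktemp)"
printf 'chr1\t100\t600\tgene1\t1000\t+\t100\t600\t255,0,0\t3\t100,150,200\t0,200,300\n' > "$demo_bed12"
split_bed12 "$demo_bed12"
rm "$demo_bed12"
```

For this record the three blocks come out as `100-200`, `300-450`, and `400-600`, which is why the test expects more BED6 lines than BED12 lines.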
|
||||
|
||||
@@ -1,85 +1,119 @@
|
||||
#!/bin/bash
|
||||
|
||||
# exit on error
|
||||
set -eo pipefail
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_identical_content() {
|
||||
diff -a "$2" "$1" \
|
||||
|| (echo "Files are not identical!" && exit 1)
|
||||
}
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
# Create directories for tests
|
||||
echo "Creating Test Data..."
|
||||
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
|
||||
function clean_up {
|
||||
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# Create test directory
|
||||
test_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_dir"
|
||||
|
||||
# Create a test BED12 file
|
||||
log "Creating test BED12 data..."
|
||||
cat > "$test_dir/test.bed12" << 'EOF'
|
||||
chr1 100 600 gene1 1000 + 100 600 255,0,0 3 100,150,200 0,200,300
|
||||
chr2 200 800 gene2 800 - 200 800 0,255,0 2 200,250 0,350
|
||||
chr3 300 500 gene3 500 . 300 500 0,0,255 1 200 0
|
||||
EOF
|
||||
|
||||
# --- Test Case 1: Basic BED12 to BED6 conversion ---
|
||||
log "Starting TEST 1: Basic BED12 to BED6 conversion"
|
||||
|
||||
log "Executing $meta_name with basic parameters..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed12" \
|
||||
--output "$meta_temp_dir/output1.bed6"
|
||||
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.bed6" "output BED6 file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.bed6" "output BED6 file"
|
||||
|
||||
# Check that BED6 file has correct number of columns (6 columns)
|
||||
awk 'NF != 6 { exit 1 }' "$meta_temp_dir/output1.bed6" || {
|
||||
log_error "Output is not in BED6 format (expected 6 columns per line)"
|
||||
exit 1
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
# Create example BED12 file
|
||||
cat <<EOF > "$TMPDIR/example.bed12"
|
||||
chr21 10079666 10120808 uc002yiv.1 0 - 10081686 1 0 1 2 0 6 0 8 0 4 528,91,101,215, 0,1930,39750,40927,
|
||||
chr21 10080031 10081687 uc002yiw.1 0 - 10080031 1 0 0 8 0 0 3 1 0 2 200,91, 0,1565,
|
||||
chr21 10081660 10120796 uc002yix.2 0 - 10081660 1 0 0 8 1 6 6 0 0 3 27,101,223, 0,37756,38913,
|
||||
EOF
|
||||
# Check that we have more BED6 entries than BED12 entries (due to block splitting)
|
||||
bed12_lines=$(wc -l < "$test_dir/test.bed12")
|
||||
bed6_lines=$(wc -l < "$meta_temp_dir/output1.bed6")
|
||||
log "Input BED12: $bed12_lines lines, Output BED6: $bed6_lines lines"
|
||||
|
||||
# Expected output bed6 file
|
||||
cat <<EOF > "$TMPDIR/expected.bed6"
|
||||
chr21 10079666 10120808 uc002yiv.1 0 -
|
||||
chr21 10080031 10081687 uc002yiw.1 0 -
|
||||
chr21 10081660 10120796 uc002yix.2 0 -
|
||||
EOF
|
||||
# Expected output bed6 file with -n option
|
||||
cat <<EOF > "$TMPDIR/expected_n.bed6"
|
||||
chr21 10079666 10120808 uc002yiv.1 1 -
|
||||
chr21 10080031 10081687 uc002yiw.1 1 -
|
||||
chr21 10081660 10120796 uc002yix.2 1 -
|
||||
EOF
|
||||
[ "$bed6_lines" -gt "$bed12_lines" ] || {
|
||||
log_error "Expected more BED6 lines than BED12 lines due to block splitting"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Test 1: Default conversion BED12 to BED6
|
||||
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
|
||||
# Check that gene names are preserved
|
||||
check_file_contains "$meta_temp_dir/output1.bed6" "gene1" "gene names from BED12"
|
||||
check_file_contains "$meta_temp_dir/output1.bed6" "gene2" "gene names from BED12"
|
||||
|
||||
echo "> Run bedtools_bed12tobed6 on BED12 file"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# --- Test Case 2: BED12 to BED6 with --n_score option ---
|
||||
log "Starting TEST 2: BED12 to BED6 with block numbering"
|
||||
|
||||
log "Executing $meta_name with --n_score flag..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed12" \
|
||||
--output "output.bed6"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed6"
|
||||
assert_file_not_empty "output.bed6"
|
||||
assert_identical_content "output.bed6" "../expected.bed6"
|
||||
echo "- test1 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 2: Conversion BED12 to BED6 with -n option
|
||||
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
|
||||
|
||||
echo "> Run bedtools_bed12tobed6 on BED12 file with -n option"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed12" \
|
||||
--output "output.bed6" \
|
||||
--input "$test_dir/test.bed12" \
|
||||
--output "$meta_temp_dir/output2.bed6" \
|
||||
--n_score
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed6"
|
||||
assert_file_not_empty "output.bed6"
|
||||
assert_identical_content "output.bed6" "../expected_n.bed6"
|
||||
echo "- test2 succeeded -"
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.bed6" "output BED6 file with block numbers"
|
||||
check_file_not_empty "$meta_temp_dir/output2.bed6" "output BED6 file with block numbers"
|
||||
|
||||
popd > /dev/null
|
||||
# Check that BED6 file has correct number of columns
|
||||
awk 'NF != 6 { exit 1 }' "$meta_temp_dir/output2.bed6" || {
|
||||
log_error "Output is not in BED6 format (expected 6 columns per line)"
|
||||
exit 1
|
||||
}
|
||||
|
||||
echo "---- All tests succeeded! ----"
|
||||
exit 0
|
||||
# Check that scores are block numbers (should contain "1", "2", "3" for gene1 with 3 blocks)
|
||||
check_file_contains "$meta_temp_dir/output2.bed6" $'\t1\t' "block number 1 in score column"
|
||||
check_file_contains "$meta_temp_dir/output2.bed6" $'\t2\t' "block number 2 in score column"
|
||||
check_file_contains "$meta_temp_dir/output2.bed6" $'\t3\t' "block number 3 in score column"
|
||||
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
# --- Test Case 3: Test with single-block BED12 ---
|
||||
log "Starting TEST 3: Single-block BED12 conversion"
|
||||
|
||||
# Create a simple single-block BED12 (should produce single BED6)
|
||||
cat > "$test_dir/single_block.bed12" << 'EOF'
|
||||
chrX 1000 2000 single_gene 900 + 1000 2000 128,128,128 1 1000 0
|
||||
EOF
|
||||
|
||||
log "Executing $meta_name with single-block BED12..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/single_block.bed12" \
|
||||
--output "$meta_temp_dir/output3.bed6"
|
||||
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3.bed6" "single-block BED6 output"
|
||||
check_file_not_empty "$meta_temp_dir/output3.bed6" "single-block BED6 output"
|
||||
|
||||
# Should have exactly one line (single block)
|
||||
single_lines=$(wc -l < "$meta_temp_dir/output3.bed6")
|
||||
[ "$single_lines" -eq 1 ] || {
|
||||
log_error "Expected exactly 1 line for single-block BED12, got $single_lines"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Check that it contains the expected gene name
|
||||
check_file_contains "$meta_temp_dir/output3.bed6" "single_gene" "single gene name"
|
||||
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
print_test_summary "All tests completed successfully"
|
||||
|
||||
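The tests above rely on `bedtools bed12tobed6` emitting one BED6 line per BED12 block. As a rough illustration of that splitting, here is a sketch in awk. It is not bedtools' implementation; the function name `bed12_blocks_to_bed6` is made up for this example, tab-delimited input is assumed, and the score rewriting done by `-n` is not modelled. Each block starts at `chromStart + blockStart` and ends one block size later.

```bash
#!/bin/bash
# Illustrative only: expand BED12 blocks into BED6 lines with awk.
# Assumes tab-delimited BED12 input; hypothetical helper, not part of biobox.
bed12_blocks_to_bed6() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    split($11, sizes, ",")    # blockSizes
    split($12, starts, ",")   # blockStarts, relative to chromStart
    for (i = 1; i <= $10; i++) {
      block_start = $2 + starts[i]
      print $1, block_start, block_start + sizes[i], $4, $5, $6
    }
  }' "$1"
}

# Demo with the two-block gene2 record from the test data above
demo_bed12=$(mktemp)
printf 'chr2\t200\t800\tgene2\t800\t-\t200\t800\t0,255,0\t2\t200,250\t0,350\n' > "$demo_bed12"
bed12_blocks_to_bed6 "$demo_bed12"
rm -f "$demo_bed12"
```

For this record the sketch prints two lines, covering chr2 200-400 and chr2 550-800, which is why the test expects more BED6 lines than BED12 lines.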
108
src/bedtools/bedtools_bedpetobam/config.vsh.yaml
Normal file
@@ -0,0 +1,108 @@
name: bedtools_bedpetobam
namespace: bedtools
description: |
  Convert BEDPE (paired-end BED) intervals to BAM format.

  This tool converts genomic paired-end interval data into BAM alignment format,
  where each BEDPE record becomes a pair of BAM alignment records representing
  the paired-end reads.

keywords: [genomics, intervals, format conversion, BAM, BEDPE, paired-end]
links:
  homepage: https://bedtools.readthedocs.io/
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bedpetobam.html
  repository: https://github.com/arq5x/bedtools2
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT

requirements:
  commands: [bedtools]

authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author, maintainer]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        alternatives: [-i]
        type: file
        description: |
          Input file in BEDPE format.

          **BEDPE format:** Tab-delimited with 10 columns:
          chrom1, start1, end1, chrom2, start2, end2, name, score, strand1, strand2

          **Requirements:** Represents paired-end genomic intervals
          **Coordinate system:** 0-based coordinates
        required: true
        example: intervals.bedpe

      - name: --genome
        alternatives: [-g]
        type: file
        description: |
          Genome file defining chromosome names and sizes.

          **Format:** Tab-delimited file with chromosome name and size
          **Example line:** chr1 249250621
          **Purpose:** Required for BAM header creation
        required: true
        example: genome.txt

  - name: Outputs
    arguments:
      - name: --output
        type: file
        direction: output
        description: |
          Output BAM file.

          Contains converted BEDPE intervals as paired BAM alignment records
          suitable for visualization and downstream analysis of paired-end data.
        required: true
        example: intervals.bam

  - name: BAM Options
    arguments:
      - name: --mapq
        type: integer
        description: |
          Set the mapping quality for BAM records.

          **Range:** 0-255 (typical values)
          **Default:** 255 (maximum quality)
          **Purpose:** MAPQ field in BAM format
        default: 255
        example: 60

      - name: --ubam
        type: boolean_true
        description: |
          Write uncompressed BAM output.

          **Default:** Compressed BAM output
          **Use case:** When compression is not needed or causes issues
          **File size:** Significantly larger output files

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /src/_utils/test_helpers.sh

engines:
  - type: docker
    image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
    setup:
      - type: docker
        run: |
          bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt

runners:
  - type: executable
  - type: nextflow
19
src/bedtools/bedtools_bedpetobam/help.txt
Normal file
@@ -0,0 +1,19 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bedpetobam -h
```

Tool: bedtools bedpetobam (aka bedpeToBam)
Version: v2.31.1
Summary: Converts feature records to BAM format.

Usage: bedpetobam [OPTIONS] -i <bed/gff/vcf> -g <genome>

Options:
-mapq Set the mappinq quality for the BAM records.
(INT) Default: 255

-ubam Write uncompressed BAM output. Default writes compressed BAM.

Notes:
(1) BED files must be at least BED4 to create BAM (needs name field).
20
src/bedtools/bedtools_bedpetobam/script.sh
Normal file
@@ -0,0 +1,20 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

# unset flags
[[ "$par_ubam" == "false" ]] && unset par_ubam

# Build command arguments array
cmd_args=(
  -i "$par_input"
  -g "$par_genome"
  ${par_mapq:+-mapq "$par_mapq"}
  ${par_ubam:+-ubam}
)

# Execute bedtools bedpetobam
bedtools bedpetobam "${cmd_args[@]}" > "$par_output"
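The script above builds its command line from an array whose optional entries use the `${var:+...}` expansion, so a flag appears only when its parameter is set and non-empty. The following is a minimal standalone sketch of that pattern. The function `build_args` and its positional arguments are hypothetical stand-ins for the Viash-provided `par_*` variables, and it prints the arguments instead of calling bedtools.

```bash
#!/bin/bash
# Sketch of the conditional-argument pattern; build_args is a made-up helper.
build_args() {
  local input="$1" mapq="$2" ubam="$3"

  # A boolean passed as "false" is treated as unset
  if [[ "$ubam" == "false" ]]; then
    ubam=""
  fi

  # ${var:+words} expands to words only when var is set and non-empty
  local args=(
    -i "$input"
    ${mapq:+-mapq "$mapq"}
    ${ubam:+-ubam}
  )

  # Print one argument per line so the resulting word list is visible
  printf '%s\n' "${args[@]}"
}

echo "with mapq and ubam:"
build_args "intervals.bedpe" "60" "true"

echo "with neither:"
build_args "intervals.bedpe" "" "false"
```

Collecting the arguments in an array keeps paths with spaces intact when the array is later expanded as `"${args[@]}"`.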
129
src/bedtools/bedtools_bedpetobam/test.sh
Normal file
@@ -0,0 +1,129 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Initialize test environment
setup_test_env

log "Starting tests for bedtools_bedpetobam"

# Create test data
log "Creating test data..."

# Create genome file
cat > "$meta_temp_dir/genome.txt" << 'EOF'
chr1 249250621
chr2 242193529
chr3 198295559
EOF

# Create BEDPE input file (paired-end BED format)
# Format: chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2
cat > "$meta_temp_dir/intervals.bedpe" << 'EOF'
chr1 100 200 chr1 300 400 pair1 100 + +
chr1 500 600 chr1 700 800 pair2 200 + -
chr2 150 250 chr2 350 450 pair3 300 - -
chr2 1000 1100 chr2 1200 1300 pair4 400 - +
EOF

# Create more detailed BEDPE file
cat > "$meta_temp_dir/detailed.bedpe" << 'EOF'
chr1 1000 2000 chr1 3000 4000 detailed1 500 + +
chr1 5000 6000 chr1 7000 8000 detailed2 600 + -
chr2 1500 2500 chr2 3500 4500 detailed3 700 - -
chr2 9000 10000 chr2 11000 12000 detailed4 800 - +
chr3 2000 3000 chr3 4000 5000 detailed5 900 + +
EOF

# Test 1: Basic BEDPE to BAM conversion
log "Starting TEST 1: Basic BEDPE to BAM conversion"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bedpe" \
  --genome "$meta_temp_dir/genome.txt" \
  --output "$meta_temp_dir/output1.bam"

check_file_exists "$meta_temp_dir/output1.bam" "basic BAM output"
check_file_not_empty "$meta_temp_dir/output1.bam" "basic BAM output"

# BAM files are binary, so basic existence and non-empty checks are sufficient
log "✅ TEST 1 completed successfully"

# Test 2: BAM conversion with custom MAPQ
log "Starting TEST 2: BAM conversion with custom MAPQ"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bedpe" \
  --genome "$meta_temp_dir/genome.txt" \
  --mapq 60 \
  --output "$meta_temp_dir/output2.bam"

check_file_exists "$meta_temp_dir/output2.bam" "MAPQ BAM output"
check_file_not_empty "$meta_temp_dir/output2.bam" "MAPQ BAM output"
log "✅ TEST 2 completed successfully"

# Test 3: Uncompressed BAM output
log "Starting TEST 3: Uncompressed BAM output"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bedpe" \
  --genome "$meta_temp_dir/genome.txt" \
  --ubam \
  --output "$meta_temp_dir/output3.bam"

check_file_exists "$meta_temp_dir/output3.bam" "uncompressed BAM output"
check_file_not_empty "$meta_temp_dir/output3.bam" "uncompressed BAM output"

# Uncompressed BAM should be larger than compressed (typically)
compressed_size=$(stat -c%s "$meta_temp_dir/output1.bam")
uncompressed_size=$(stat -c%s "$meta_temp_dir/output3.bam")
if [ $uncompressed_size -lt $compressed_size ]; then
  log "Warning: Uncompressed BAM is smaller than compressed - may indicate issue or very small dataset"
fi
log "✅ TEST 3 completed successfully"

# Test 4: More detailed BEDPE file conversion
log "Starting TEST 4: Detailed BEDPE file conversion"
"$meta_executable" \
  --input "$meta_temp_dir/detailed.bedpe" \
  --genome "$meta_temp_dir/genome.txt" \
  --output "$meta_temp_dir/output4.bam"

check_file_exists "$meta_temp_dir/output4.bam" "detailed BAM output"
check_file_not_empty "$meta_temp_dir/output4.bam" "detailed BAM output"

# Check file size is reasonable for 5 BEDPE pairs (10 alignments)
detailed_size=$(stat -c%s "$meta_temp_dir/output4.bam")
if [ $detailed_size -lt 200 ]; then
  log_error "BAM file seems too small for 5 BEDPE pairs: $detailed_size bytes"
  exit 1
fi
log "✅ TEST 4 completed successfully"

# Test 5: Verify BAM structure with samtools (if available)
log "Starting TEST 5: BAM structure verification"
if command -v samtools &> /dev/null; then
  # Check BAM header
  if samtools view -H "$meta_temp_dir/output1.bam" | grep -q "@SQ"; then
    log "✓ BAM header contains sequence dictionary"
  else
    log_error "BAM header missing sequence dictionary"
    exit 1
  fi

  # Count alignments (should be double the BEDPE pairs since each pair creates 2 alignments)
  alignment_count=$(samtools view -c "$meta_temp_dir/output1.bam")
  if [ $alignment_count -eq 8 ]; then
    log "✓ BAM contains expected number of alignments: $alignment_count (4 BEDPE pairs = 8 alignments)"
  else
    log "ℹ️ Expected 8 alignments (4 BEDPE pairs), got $alignment_count"
  fi
else
  log "ℹ️ samtools not available, skipping BAM structure verification"
fi
log "✅ TEST 5 completed successfully"

log "🎉 All bedtools_bedpetobam tests completed successfully!"
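The size comparisons in Tests 3 and 4 above use `stat -c%s`, which is the GNU/BusyBox spelling; BSD and macOS `stat` take different options. That is fine inside the Linux container the component is built on. Should the test ever need to run elsewhere, one portable alternative is to count bytes with `wc -c`. The helper name `file_size` below is hypothetical and is not part of `test_helpers.sh`.

```bash
#!/bin/bash
# Hypothetical portable replacement for `stat -c%s FILE`.
file_size() {
  # wc -c reads from stdin so no file name is printed;
  # tr strips the padding some wc implementations add
  wc -c < "$1" | tr -d '[:space:]'
}

# Demo on a throwaway five-byte file
demo_file=$(mktemp)
printf 'abcde' > "$demo_file"
echo "size: $(file_size "$demo_file") bytes"
rm -f "$demo_file"
```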
@@ -1,12 +1,15 @@
|
||||
name: bedtools_bedtobam
|
||||
namespace: bedtools
|
||||
description: Converts feature records (bed/gff/vcf) to BAM format.
|
||||
description: |
|
||||
Converts feature records to BAM format.
|
||||
|
||||
Converts genomic intervals from BED, GFF, or VCF formats into BAM format,
|
||||
creating aligned sequence records that can be used with standard BAM tools.
|
||||
keywords: [Converts, BED, GFF, VCF, BAM]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bedtobam.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/#
|
||||
issue_tracker: https://github.com/arq5x/bedtools2/issues
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
@@ -14,41 +17,65 @@ requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: -i
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: Input file (bed/gff/vcf).
|
||||
description: |
|
||||
Input genomic intervals file in BED, GFF, or VCF format.
|
||||
|
||||
**Requirements:**
|
||||
- BED files must be at least BED4 format (requires name field)
|
||||
- File must contain valid genomic coordinates
|
||||
required: true
|
||||
example: input.bed
|
||||
|
||||
- name: --genome
|
||||
alternatives: -g
|
||||
alternatives: [-g]
|
||||
type: file
|
||||
description: |
|
||||
Input genome file.
|
||||
NOTE: This is not a fasta file. This is a two-column tab-delimited file
|
||||
where the first column is the chromosome name and the second their sizes.
|
||||
Genome file defining chromosome names and sizes.
|
||||
|
||||
**Format:** Two-column tab-delimited file:
|
||||
```
|
||||
chr1 249250621
|
||||
chr2 243199373
|
||||
```
|
||||
|
||||
**Note:** This is NOT a FASTA file. Use `samtools faidx` to create from FASTA if needed.
|
||||
required: true
|
||||
example: hg19.genome
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: -o
|
||||
alternatives: [-o]
|
||||
type: file
|
||||
direction: output
|
||||
description: Output BAM file to be written.
|
||||
description: |
|
||||
Output BAM file containing converted genomic intervals.
|
||||
|
||||
**Format:** Standard BAM format (compressed by default)
|
||||
required: true
|
||||
example: output.bam
|
||||
|
||||
- name: Options
|
||||
arguments:
|
||||
- name: --map_quality
|
||||
alternatives: -mapq
|
||||
alternatives: [-mapq]
|
||||
type: integer
|
||||
description: |
|
||||
Set the mappinq quality for the BAM records.
|
||||
Set the mapping quality for the BAM records.
|
||||
|
||||
**Range:** 0-255 (higher values indicate better quality)
|
||||
|
||||
**Default:** 255 (maximum quality)
|
||||
min: 0
|
||||
max: 255
|
||||
default: 255
|
||||
@@ -56,14 +83,21 @@ argument_groups:
|
||||
- name: --bed12
|
||||
type: boolean_true
|
||||
description: |
|
||||
The BED file is in BED12 format. The BAM CIGAR
|
||||
string will reflect BED "blocks".
|
||||
Process BED file as BED12 format with blocked intervals.
|
||||
|
||||
**Features:**
|
||||
- BAM CIGAR string reflects BED blocks (exons/introns)
|
||||
- Useful for representing spliced alignments
|
||||
- Requires BED12 format input
|
||||
|
||||
- name: --uncompress_bam
|
||||
alternatives: -ubam
|
||||
alternatives: [-ubam]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Write uncompressed BAM output. Default writes compressed BAM.
|
||||
Write uncompressed BAM output.
|
||||
|
||||
**Default behavior:** Writes compressed BAM
|
||||
**Use case:** When downstream tools require uncompressed format
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
@@ -72,19 +106,16 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
test_setup:
|
||||
- type: apt
|
||||
packages: [samtools]
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
|
||||
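The `--genome` description above points to `samtools faidx` as a way to derive the genome file from a FASTA. The `.fai` index that command writes is tab-delimited, with the sequence name and length in its first two columns, so the two-column genome file can be cut from it. A small sketch, assuming an index already exists; the helper name `fai_to_genome` and the demo values are illustrative only.

```bash
#!/bin/bash
# Sketch: derive a bedtools genome file from a FASTA index (.fai).
# fai_to_genome is a made-up helper; the .fai would normally come from
# `samtools faidx reference.fa`.
fai_to_genome() {
  # Columns 1 and 2 of a .fai are sequence name and sequence length
  cut -f1,2 "$1"
}

# Demo with a hand-written two-sequence index
demo_fai=$(mktemp)
printf 'chr1\t249250621\t6\t60\t61\nchr2\t243199373\t253404903\t60\t61\n' > "$demo_fai"
fai_to_genome "$demo_fai"
rm -f "$demo_fai"
```

In practice the output would be redirected to a file and passed as `--genome`.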
@@ -1,21 +1,22 @@
|
||||
```bash
|
||||
bedtools bedtobam
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools bedtobam -h
|
||||
```
|
||||
|
||||
Tool: bedtools bedtobam (aka bedToBam)
|
||||
Version: v2.30.0
|
||||
Version: v2.31.1
|
||||
Summary: Converts feature records to BAM format.
|
||||
|
||||
Usage: bedtools bedtobam [OPTIONS] -i <bed/gff/vcf> -g <genome>
|
||||
|
||||
Options:
|
||||
-mapq Set the mappinq quality for the BAM records.
|
||||
(INT) Default: 255
|
||||
-mapq Set the mappinq quality for the BAM records.
|
||||
(INT) Default: 255
|
||||
|
||||
-bed12 The BED file is in BED12 format. The BAM CIGAR
|
||||
string will reflect BED "blocks".
|
||||
-bed12 The BED file is in BED12 format. The BAM CIGAR
|
||||
string will reflect BED "blocks".
|
||||
|
||||
-ubam Write uncompressed BAM output. Default writes compressed BAM.
|
||||
-ubam Write uncompressed BAM output. Default writes compressed BAM.
|
||||
|
||||
Notes:
|
||||
(1) BED files must be at least BED4 to create BAM (needs name field).
|
||||
(1) BED files must be at least BED4 to create BAM (needs name field).
|
||||
|
||||
|
||||
@@ -5,15 +5,18 @@
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# Unset parameters
|
||||
# Unset false boolean parameters
|
||||
[[ "$par_bed12" == "false" ]] && unset par_bed12
|
||||
[[ "$par_uncompress_bam" == "false" ]] && unset par_uncompress_bam
|
||||
|
||||
# Execute bedtools bed to bam
|
||||
bedtools bedtobam \
|
||||
${par_bed12:+-bed12} \
|
||||
${par_uncompress_bam:+-ubam} \
|
||||
${par_map_quality:+-mapq "$par_map_quality"} \
|
||||
-i "$par_input" \
|
||||
-g "$par_genome" \
|
||||
> "$par_output"
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
-g "$par_genome"
|
||||
${par_map_quality:+-mapq "$par_map_quality"}
|
||||
${par_bed12:+-bed12}
|
||||
${par_uncompress_bam:+-ubam}
|
||||
)
|
||||
|
||||
# Execute bedtools bedtobam
|
||||
bedtools bedtobam "${cmd_args[@]}" > "$par_output"
|
||||
|
||||
@@ -1,188 +1,127 @@
|
||||
#!/bin/bash
|
||||
|
||||
# exit on error
|
||||
set -eo pipefail
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_identical_content() {
|
||||
diff -a "$2" "$1" \
|
||||
|| (echo "Files are not identical!" && exit 1)
|
||||
}
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
# Create directories for tests
|
||||
echo "Creating Test Data..."
|
||||
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
|
||||
function clean_up {
|
||||
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# Create and populate input files
|
||||
printf "chr1\t248956422\nchr3\t242193529\nchr2\t198295559\n" > "$TMPDIR/genome.txt"
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t37\t+\nchr2:172936693-172938111\t428\t528\tmy_read/2\t37\t-\n" > "$TMPDIR/example.bed"
|
||||
printf "chr2:172936693-172938111\t128\t228\tmy_read/1\t60\t+\t128\t228\t255,0,0\t1\t100\t0\nchr2:172936693-172938111\t428\t528\tmy_read/2\t60\t-\t428\t528\t255,0,0\t1\t100\t0\n" > "$TMPDIR/example.bed12"
|
||||
# Create and populate example.gff file
|
||||
printf "##gff-version 3\n" > "$TMPDIR/example.gff"
|
||||
printf "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "$TMPDIR/example.gff"
|
||||
printf "chr3\t.\tmRNA\t1000\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "$TMPDIR/example.gff"
|
||||
printf "chr1\t.\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=transcript1\n" >> "$TMPDIR/example.gff"
|
||||
printf "chr2\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "$TMPDIR/example.gff"
|
||||
printf "chr1\t.\tCDS\t1000\t1200\t.\t+\t0\tID=cds1;Parent=transcript1\n" >> "$TMPDIR/example.gff"
|
||||
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "$TMPDIR/example.gff"
|
||||
# Create test directory
|
||||
test_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_dir"
|
||||
|
||||
# Expected output sam files for each test
|
||||
cat <<EOF > "$TMPDIR/expected.sam"
|
||||
@HD VN:1.0 SO:unsorted
|
||||
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
|
||||
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
|
||||
@SQ SN:chr1 AS:../genome.txt LN:248956422
|
||||
@SQ SN:chr3 AS:../genome.txt LN:242193529
|
||||
@SQ SN:chr2 AS:../genome.txt LN:198295559
|
||||
my_read/1 0 chr1 129 255 100M * 0 0 * *
|
||||
my_read/2 16 chr1 429 255 100M * 0 0 * *
|
||||
EOF
|
||||
cat <<EOF > "$TMPDIR/expected12.sam"
|
||||
@HD VN:1.0 SO:unsorted
|
||||
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
|
||||
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
|
||||
@SQ SN:chr1 AS:../genome.txt LN:248956422
|
||||
@SQ SN:chr3 AS:../genome.txt LN:242193529
|
||||
@SQ SN:chr2 AS:../genome.txt LN:198295559
|
||||
my_read/1 0 chr1 129 255 100M * 0 0 * *
|
||||
my_read/2 16 chr1 429 255 100M * 0 0 * *
|
||||
EOF
|
||||
cat <<EOF > "$TMPDIR/expected_mapquality.sam"
|
||||
@HD VN:1.0 SO:unsorted
|
||||
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
|
||||
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
|
||||
@SQ SN:chr1 AS:../genome.txt LN:248956422
|
||||
@SQ SN:chr3 AS:../genome.txt LN:242193529
|
||||
@SQ SN:chr2 AS:../genome.txt LN:198295559
|
||||
my_read/1 0 chr1 129 10 100M * 0 0 * *
|
||||
my_read/2 16 chr1 429 10 100M * 0 0 * *
|
||||
EOF
|
||||
cat <<EOF > "$TMPDIR/expected_gff.sam"
|
||||
@HD VN:1.0 SO:unsorted
|
||||
@PG ID:BEDTools_bedToBam VN:Vv2.30.0
|
||||
@PG ID:samtools PN:samtools PP:BEDTools_bedToBam VN:1.16.1 CL:samtools view -h output.bam
|
||||
@SQ SN:chr1 AS:../genome.txt LN:248956422
|
||||
@SQ SN:chr3 AS:../genome.txt LN:242193529
|
||||
@SQ SN:chr2 AS:../genome.txt LN:198295559
|
||||
gene 0 chr1 1000 255 1001M * 0 0 * *
|
||||
mRNA 0 chr3 1000 255 1001M * 0 0 * *
|
||||
exon 0 chr1 1000 255 201M * 0 0 * *
|
||||
exon 0 chr2 1500 255 201M * 0 0 * *
|
||||
CDS 0 chr1 1000 255 201M * 0 0 * *
|
||||
CDS 0 chr1 1500 255 201M * 0 0 * *
|
||||
# Create test genome file
|
||||
log "Creating test genome file..."
|
||||
cat > "$test_dir/test.genome" << 'EOF'
|
||||
chr1 248956422
|
||||
chr2 242193529
|
||||
chr3 198295559
|
||||
EOF
|
||||
|
||||
# Test 1: Default conversion BED to BAM
|
||||
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
|
||||
# Create test BED file (BED4 minimum for bedtobam)
|
||||
log "Creating test BED file..."
|
||||
cat > "$test_dir/test.bed" << 'EOF'
|
||||
chr1 1000 2000 gene1 100 +
|
||||
chr2 3000 4000 gene2 200 -
|
||||
chr3 5000 6000 gene3 150 +
|
||||
EOF
|
||||
|
||||
echo "> Run bedtools_bedtobam on BED file"
|
||||
# Create test BED12 file
|
||||
log "Creating test BED12 file..."
|
||||
cat > "$test_dir/test.bed12" << 'EOF'
|
||||
chr1 1000 3000 gene1 100 + 1000 3000 255,0,0 2 500,500 0,1500
|
||||
chr2 2000 5000 gene2 200 - 2000 5000 0,255,0 3 400,300,400 0,1500,2600
|
||||
EOF
|
||||
|
||||
# --- Test Case 1: Basic BED to BAM conversion ---
|
||||
log "Starting TEST 1: Basic BED to BAM conversion"
|
||||
|
||||
log "Executing $meta_name with basic BED file..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bam"
|
||||
--input "$test_dir/test.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output1.bam"
|
||||
|
||||
samtools view -h output.bam > output.sam
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.bam" "output BAM file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.bam" "output BAM file"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bam"
|
||||
assert_file_not_empty "output.bam"
|
||||
assert_identical_content "output.sam" "../expected.sam"
|
||||
echo "- test1 succeeded -"
|
||||
# Check if it's a valid BAM file by reading header
|
||||
if command -v samtools >/dev/null 2>&1; then
|
||||
samtools view -H "$meta_temp_dir/output1.bam" > "$meta_temp_dir/header1.txt" 2>/dev/null || true
|
||||
if [ -s "$meta_temp_dir/header1.txt" ]; then
|
||||
check_file_contains "$meta_temp_dir/header1.txt" "@HD" "BAM header"
|
||||
log "✓ Valid BAM format detected"
|
||||
else
|
||||
log "Note: Cannot validate BAM format (samtools not available or BAM corrupt)"
|
||||
fi
|
||||
else
|
||||
log "Note: samtools not available for BAM validation"
|
||||
fi
|
||||
|
||||
popd > /dev/null
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# Test 2: BED12 file
|
||||
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
|
||||
# --- Test Case 2: BED12 format conversion ---
|
||||
log "Starting TEST 2: BED12 to BAM conversion"
|
||||
|
||||
echo "> Run bedtools_bedtobam on BED12 file"
|
||||
log "Executing $meta_name with BED12 format..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed12" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bam" \
|
||||
--bed12 \
|
||||
--input "$test_dir/test.bed12" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output2.bam" \
|
||||
--bed12
|
||||
|
||||
samtools view -h output.bam > output.sam
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.bam" "BED12 output BAM file"
|
||||
check_file_not_empty "$meta_temp_dir/output2.bam" "BED12 output BAM file"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bam"
|
||||
assert_file_not_empty "output.bam"
|
||||
assert_identical_content "output.sam" "../expected12.sam"
|
||||
echo "- test2 succeeded -"
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
popd > /dev/null
|
||||
# --- Test Case 3: Custom mapping quality ---
|
||||
log "Starting TEST 3: Custom mapping quality"
|
||||
|
||||
# Test 3: Uncompressed BAM file
|
||||
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
|
||||
|
||||
echo "> Run bedtools_bedtobam on BED file with uncompressed BAM output"
|
||||
log "Executing $meta_name with custom mapping quality..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bam" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output3.bam" \
|
||||
--map_quality 30
|
||||
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3.bam" "output BAM with custom MAPQ"
|
||||
check_file_not_empty "$meta_temp_dir/output3.bam" "output BAM with custom MAPQ"
|
||||
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
# --- Test Case 4: Uncompressed BAM ---
|
||||
log "Starting TEST 4: Uncompressed BAM output"
|
||||
|
||||
log "Executing $meta_name with uncompressed BAM..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output4.bam" \
|
||||
--uncompress_bam
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bam"
|
||||
assert_file_not_empty "output.bam"
|
||||
# Cannot assert_identical_content because the uncompress option does not work in this version of bedtools.
|
||||
log "Validating TEST 4 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output4.bam" "uncompressed BAM file"
|
||||
check_file_not_empty "$meta_temp_dir/output4.bam" "uncompressed BAM file"
|
||||
|
||||
echo "- test3 succeeded -"
|
||||
# Uncompressed BAM should generally be larger than compressed
|
||||
compressed_size=$(stat -c%s "$meta_temp_dir/output1.bam")
|
||||
uncompressed_size=$(stat -c%s "$meta_temp_dir/output4.bam")
|
||||
log "Compressed BAM size: $compressed_size bytes"
|
||||
log "Uncompressed BAM size: $uncompressed_size bytes"
|
||||
|
||||
popd > /dev/null
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# Test 4: Map quality
|
||||
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
|
||||
|
||||
echo "> Run bedtools_bedtobam on BED file with map quality"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bam" \
|
||||
--map_quality 10 \
|
||||
|
||||
samtools view -h output.bam > output.sam
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bam"
|
||||
assert_file_not_empty "output.bam"
|
||||
assert_identical_content "output.sam" "../expected_mapquality.sam"
|
||||
echo "- test4 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 5: gff to bam conversion
|
||||
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
|
||||
|
||||
echo "> Run bedtools_bedtobam on GFF file"
|
||||
"$meta_executable" \
|
||||
--input "../example.gff" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bam"
|
||||
|
||||
samtools view -h output.bam > output.sam
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bam"
|
||||
assert_file_not_empty "output.bam"
|
||||
assert_identical_content "output.sam" "../expected_gff.sam"
|
||||
echo "- test5 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
echo "---- All tests succeeded! ----"
|
||||
exit 0
|
||||
print_test_summary "All tests completed successfully"
|
||||
|
||||
221
src/bedtools/bedtools_closest/config.vsh.yaml
Normal file
@@ -0,0 +1,221 @@
|
||||
name: bedtools_closest
|
||||
namespace: bedtools
|
||||
description: |
|
||||
Find the closest feature in file B for each feature in file A.
|
||||
|
||||
For each interval in file A, this tool identifies the nearest feature in
|
||||
file B, regardless of whether they overlap. Useful for associating genomic
|
||||
features with their nearest neighbors, such as finding the closest gene
|
||||
to each SNP or the nearest regulatory element to each promoter.
|
||||
|
||||
**Default behavior:** Reports closest feature regardless of strand or overlap
|
||||
**Distance reporting:** Optional distance calculation with various orientations
|
||||
**Multiple hits:** Configurable handling of ties and k-nearest neighbors
|
||||
|
||||
keywords: [Closest, Nearest, Distance, BED, GFF, VCF, Association]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input_a
|
||||
alternatives: [-a]
|
||||
type: file
|
||||
description: |
|
||||
Query file in BED, GFF, or VCF format.
|
||||
|
||||
For each feature in this file, the closest feature in file B
|
||||
will be identified and reported.
|
||||
required: true
|
||||
example: queries.bed
|
||||
|
||||
- name: --input_b
|
||||
alternatives: [-b]
|
||||
type: file
|
||||
multiple: true
|
||||
description: |
|
||||
Database file(s) in BED, GFF, or VCF format.
|
||||
|
||||
**Single file:** Find closest features in one database
|
||||
**Multiple files:** Find closest features across multiple databases
|
||||
**Format:** Same or different format as input A
|
||||
required: true
|
||||
example: ["database1.bed", "database2.bed"]
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with closest feature results.
|
||||
|
||||
Contains input A features with additional columns showing
|
||||
the closest features from file(s) B, and optionally distance
|
||||
and other metadata.
|
||||
required: true
|
||||
example: closest_features.bed
|
||||
|
||||
- name: Distance Options
|
||||
arguments:
|
||||
- name: --distance
|
||||
alternatives: [-d]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report distance to closest feature as extra column.
|
||||
|
||||
**Distance calculation:** Always non-negative, 0 for overlapping features
|
||||
**Use case:** When you need quantitative proximity measurements
|
||||
|
||||
- name: --distance_mode
|
||||
alternatives: [-D]
|
||||
type: string
|
||||
choices: ["ref", "a", "b"]
|
||||
description: |
|
||||
Report signed distance with orientation awareness.
|
||||
|
||||
**"ref":** Distance relative to reference genome coordinates
|
||||
**"a":** Distance relative to strand of feature A
|
||||
**"b":** Distance relative to strand of feature B
|
||||
|
||||
**Negative values:** Upstream features
|
||||
**Positive values:** Downstream features
|
||||
example: "ref"
|
||||
|
||||
- name: Filtering Options
|
||||
arguments:
|
||||
- name: --ignore_overlaps
|
||||
alternatives: [-io]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Ignore overlapping features in B.
|
||||
|
||||
Only consider features in B that do not overlap with A.
|
||||
Useful for finding nearby but non-overlapping features.
|
||||
|
||||
- name: --ignore_upstream
|
||||
alternatives: [-iu]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Ignore upstream features in B.
|
||||
|
||||
**Requires:** --distance_mode parameter
|
||||
**Effect:** Only consider downstream features
|
||||
**Orientation:** Follows --distance_mode orientation rules
|
||||
|
||||
- name: --ignore_downstream
|
||||
alternatives: [-id]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Ignore downstream features in B.
|
||||
|
||||
**Requires:** --distance_mode parameter
|
||||
**Effect:** Only consider upstream features
|
||||
**Orientation:** Follows --distance_mode orientation rules
|
||||
|
||||
- name: --force_upstream
|
||||
alternatives: [-fu]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Choose first upstream feature when ties exist.
|
||||
|
||||
**Requires:** --distance_mode parameter
|
||||
**Tie handling:** Among equally close features, prefer upstream
|
||||
**Orientation:** Follows --distance_mode orientation rules
|
||||
|
||||
- name: --force_downstream
|
||||
alternatives: [-fd]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Choose first downstream feature when ties exist.
|
||||
|
||||
**Requires:** --distance_mode parameter
|
||||
**Tie handling:** Among equally close features, prefer downstream
|
||||
**Orientation:** Follows --distance_mode orientation rules
|
||||
|
||||
- name: --strand
|
||||
alternatives: [-s]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require same strandedness.
|
||||
|
||||
Only consider features in B that are on the same strand as
|
||||
the corresponding feature in A.
|
||||
|
||||
- name: --different_strand
|
||||
alternatives: [-S]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require different strandedness.
|
||||
|
||||
Only consider features in B that are on the opposite strand
|
||||
from the corresponding feature in A.
|
||||
|
||||
- name: Advanced Options
|
||||
arguments:
|
||||
- name: --k_closest
|
||||
alternatives: [-k]
|
||||
type: integer
|
||||
description: |
|
||||
Report k closest hits for each query.
|
||||
|
||||
**Default:** 1 (single closest feature)
|
||||
**Multiple hits:** Reports multiple closest features per query
|
||||
**Tie handling:** All ties still reported based on --tie_mode
|
||||
default: 1
|
||||
example: 3
|
||||
|
||||
- name: --tie_mode
|
||||
alternatives: [-t]
|
||||
type: string
|
||||
choices: ["all", "first", "last"]
|
||||
description: |
|
||||
How to handle ties for closest features.
|
||||
|
||||
**"all":** Report all equally close features (default)
|
||||
**"first":** Report first tie found in file B
|
||||
**"last":** Report last tie found in file B
|
||||
default: "all"
|
||||
example: "first"
|
||||
|
||||
- name: --different_names
|
||||
alternatives: [-N]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require different names between query and hit.
|
||||
|
||||
For BED files, compares the 4th column (name field).
|
||||
Useful to avoid self-hits in self-comparisons.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: docker
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
131
src/bedtools/bedtools_closest/help.txt
Normal file
@@ -0,0 +1,131 @@
|
||||
```bash
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools closest -h
|
||||
```
|
||||
|
||||
Tool: bedtools closest (aka closestBed)
|
||||
Version: v2.31.1
|
||||
Summary: For each feature in A, finds the closest
|
||||
feature (upstream or downstream) in B.
|
||||
|
||||
Usage: bedtools closest [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
|
||||
|
||||
Options:
|
||||
-d In addition to the closest feature in B,
|
||||
report its distance to A as an extra column.
|
||||
- The reported distance for overlapping features will be 0.
|
||||
|
||||
-D Like -d, report the closest feature in B, and its distance to A
|
||||
as an extra column. Unlike -d, use negative distances to report
|
||||
upstream features.
|
||||
The options for defining which orientation is "upstream" are:
|
||||
- "ref" Report distance with respect to the reference genome.
|
||||
B features with a lower (start, stop) are upstream
|
||||
- "a" Report distance with respect to A.
|
||||
When A is on the - strand, "upstream" means B has a
|
||||
higher (start,stop).
|
||||
- "b" Report distance with respect to B.
|
||||
When B is on the - strand, "upstream" means A has a
|
||||
higher (start,stop).
|
||||
|
||||
-io Ignore features in B that overlap A. That is, we want close,
|
||||
yet not touching features only.
|
||||
|
||||
-iu Ignore features in B that are upstream of features in A.
|
||||
This option requires -D and follows its orientation
|
||||
rules for determining what is "upstream".
|
||||
|
||||
-id Ignore features in B that are downstream of features in A.
|
||||
This option requires -D and follows its orientation
|
||||
rules for determining what is "downstream".
|
||||
|
||||
-fu Choose first from features in B that are upstream of features in A.
|
||||
This option requires -D and follows its orientation
|
||||
rules for determining what is "upstream".
|
||||
|
||||
-fd Choose first from features in B that are downstream of features in A.
|
||||
This option requires -D and follows its orientation
|
||||
rules for determining what is "downstream".
|
||||
|
||||
-t How ties for closest feature are handled. This occurs when two
|
||||
features in B have exactly the same "closeness" with A.
|
||||
By default, all such features in B are reported.
|
||||
Here are all the options:
|
||||
- "all" Report all ties (default).
|
||||
- "first" Report the first tie that occurred in the B file.
|
||||
- "last" Report the last tie that occurred in the B file.
|
||||
|
||||
-mdb How multiple databases are resolved.
|
||||
- "each" Report closest records for each database (default).
|
||||
- "all" Report closest records among all databases.
|
||||
|
||||
-k Report the k closest hits. Default is 1. If tieMode = "all",
|
||||
- all ties will still be reported.
|
||||
|
||||
-N Require that the query and the closest hit have different names.
|
||||
For BED, the 4th column is compared.
|
||||
|
||||
-s Require same strandedness. That is, only report hits in B
|
||||
that overlap A on the _same_ strand.
|
||||
- By default, overlaps are reported without respect to strand.
|
||||
|
||||
-S Require different strandedness. That is, only report hits in B
|
||||
that overlap A on the _opposite_ strand.
|
||||
- By default, overlaps are reported without respect to strand.
|
||||
|
||||
-f Minimum overlap required as a fraction of A.
|
||||
- Default is 1E-9 (i.e., 1bp).
|
||||
- FLOAT (e.g. 0.50)
|
||||
|
||||
-F Minimum overlap required as a fraction of B.
|
||||
- Default is 1E-9 (i.e., 1bp).
|
||||
- FLOAT (e.g. 0.50)
|
||||
|
||||
-r Require that the fraction overlap be reciprocal for A AND B.
|
||||
- In other words, if -f is 0.90 and -r is used, this requires
|
||||
that B overlap 90% of A and A _also_ overlaps 90% of B.
|
||||
|
||||
-e Require that the minimum fraction be satisfied for A OR B.
|
||||
- In other words, if -e is used with -f 0.90 and -F 0.10 this requires
|
||||
that either 90% of A is covered OR 10% of B is covered.
|
||||
Without -e, both fractions would have to be satisfied.
|
||||
|
||||
-split Treat "split" BAM or BED12 entries as distinct BED intervals.
|
||||
|
||||
-g Provide a genome file to enforce consistent chromosome sort order
|
||||
across input files. Only applies when used with -sorted option.
|
||||
|
||||
-nonamecheck For sorted data, don't throw an error if the file has different naming conventions
|
||||
for the same chromosome. ex. "chr1" vs "chr01".
|
||||
|
||||
-names When using multiple databases, provide an alias for each that
|
||||
will appear instead of a fileId when also printing the DB record.
|
||||
|
||||
-filenames When using multiple databases, show each complete filename
|
||||
instead of a fileId when also printing the DB record.
|
||||
|
||||
-sortout When using multiple databases, sort the output DB hits
|
||||
for each record.
|
||||
|
||||
-bed If using BAM input, write output as BED.
|
||||
|
||||
-header Print the header from the A file prior to results.
|
||||
|
||||
-nobuf Disable buffered output. Using this option will cause each line
|
||||
of output to be printed as it is generated, rather than saved
|
||||
in a buffer. This will make printing large output files
|
||||
noticeably slower, but can be useful in conjunction with
|
||||
other software tools and scripts that need to process one
|
||||
line of bedtools output at a time.
|
||||
|
||||
-iobuf Specify amount of memory to use for input buffer.
|
||||
Takes an integer argument. Optional suffixes K/M/G supported.
|
||||
Note: currently has no effect with compressed files.
|
||||
|
||||
Notes:
|
||||
Reports "none" for chrom and "-1" for all other fields when a feature
|
||||
is not found in B on the same chromosome as the feature in A.
|
||||
E.g. none -1 -1
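The sign convention of `-D ref` described in the help text can be illustrated with a small sketch. The helper name `signed_ref_distance` is hypothetical, and the `+ 1` assumes half-open BED coordinates in which book-ended intervals are one base apart; the magnitudes are illustrative and should be checked against actual bedtools output.

```bash
#!/bin/bash
# Hypothetical helper, not part of bedtools or this component: signed distance
# from interval A to interval B in "ref" orientation.
# Assumption: half-open BED coordinates, so book-ended intervals are 1 apart.
signed_ref_distance() {
  local a_start=$1 a_end=$2 b_start=$3 b_end=$4
  if (( b_end <= a_start )); then
    # B has lower coordinates than A: upstream, reported as negative.
    echo $(( -(a_start - b_end + 1) ))
  elif (( b_start >= a_end )); then
    # B has higher coordinates than A: downstream, reported as positive.
    echo $(( b_start - a_end + 1 ))
  else
    # Overlapping intervals are at distance 0.
    echo 0
  fi
}

signed_ref_distance 100 200 500 1000   # prints 301
signed_ref_distance 100 200 50 90      # prints -11
signed_ref_distance 100 200 150 250    # prints 0
```

With `"a"` or `"b"` orientation the sign would additionally depend on the strand of the respective feature, which this sketch does not model.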
|
||||
|
||||
|
||||
|
||||
|
||||
52
src/bedtools/bedtools_closest/script.sh
Normal file
52
src/bedtools/bedtools_closest/script.sh
Normal file
@@ -0,0 +1,52 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# unset flags
|
||||
unset_if_false=(
|
||||
par_distance
|
||||
par_ignore_overlaps
|
||||
par_ignore_upstream
|
||||
par_ignore_downstream
|
||||
par_force_upstream
|
||||
par_force_downstream
|
||||
par_strand
|
||||
par_different_strand
|
||||
par_different_names
|
||||
)
|
||||
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset "$par"
|
||||
done
|
||||
|
||||
# Convert semicolon-separated input_b files to array
|
||||
IFS=';' read -ra input_b_array <<< "$par_input_b"
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-a "$par_input_a"
|
||||
${par_distance:+-d}
|
||||
${par_distance_mode:+-D "$par_distance_mode"}
|
||||
${par_ignore_overlaps:+-io}
|
||||
${par_ignore_upstream:+-iu}
|
||||
${par_ignore_downstream:+-id}
|
||||
${par_force_upstream:+-fu}
|
||||
${par_force_downstream:+-fd}
|
||||
${par_strand:+-s}
|
||||
${par_different_strand:+-S}
|
||||
${par_k_closest:+-k "$par_k_closest"}
|
||||
${par_tie_mode:+-t "$par_tie_mode"}
|
||||
${par_different_names:+-N}
|
||||
)
|
||||
|
||||
# Add multiple input_b files
|
||||
for file in "${input_b_array[@]}"; do
|
||||
cmd_args+=(-b "$file")
|
||||
done
|
||||
|
||||
# Execute bedtools closest
|
||||
bedtools closest "${cmd_args[@]}" > "$par_output"
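The two idioms used in this script, dropping `false` flags so that `${var:+...}` expands to nothing, and splitting a semicolon-separated list into repeated flags, can be tried in isolation. The following is a minimal sketch with a hypothetical helper `build_closest_args` that prints the argument list instead of calling bedtools; it covers only three of the parameters.

```bash
#!/bin/bash
# Hypothetical helper, not part of the component: builds the bedtools
# argument list the way script.sh does, but prints it instead of running it.
build_closest_args() {
  local par_distance="$1" par_distance_mode="$2" par_input_b="$3"

  # boolean_true parameters arrive as "true" or "false"; unsetting the
  # "false" case makes the ${var:+...} expansion below produce nothing.
  [[ "$par_distance" == "false" ]] && unset par_distance

  local args=(
    ${par_distance:+-d}
    ${par_distance_mode:+-D "$par_distance_mode"}
  )

  # Split the semicolon-separated list into one -b flag per file.
  local files file
  IFS=';' read -ra files <<< "$par_input_b"
  for file in "${files[@]}"; do
    args+=(-b "$file")
  done

  printf '%s\n' "${args[*]}"
}

build_closest_args false ref "db1.bed;db2.bed"   # prints: -D ref -b db1.bed -b db2.bed
build_closest_args true "" "db1.bed"             # prints: -d -b db1.bed
```

Because `${var:+...}` treats an empty value like an unset one, optional string parameters that were not supplied also contribute no arguments.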
|
||||
173
src/bedtools/bedtools_closest/test.sh
Normal file
@@ -0,0 +1,173 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment
|
||||
setup_test_env
|
||||
|
||||
log "Starting tests for bedtools_closest"
|
||||
|
||||
# Create test data
|
||||
log "Creating test data..."
|
||||
|
||||
# Create query intervals file
|
||||
cat > "$meta_temp_dir/queries.bed" << 'EOF'
|
||||
chr1 100 200 query1 100 +
|
||||
chr1 400 500 query2 200 +
|
||||
chr1 800 900 query3 300 -
|
||||
chr2 200 300 query4 400 -
|
||||
EOF
|
||||
|
||||
# Create database file with features at various distances
|
||||
cat > "$meta_temp_dir/database.bed" << 'EOF'
|
||||
chr1 250 350 feature1 500 +
|
||||
chr1 450 550 feature2 600 +
|
||||
chr1 700 800 feature3 700 -
|
||||
chr2 150 250 feature4 800 +
|
||||
chr2 600 700 feature5 900 -
|
||||
chr2 950 1050 feature6 1000 +
|
||||
EOF
|
||||
|
||||
# Create second database file for multi-file testing
|
||||
cat > "$meta_temp_dir/database2.bed" << 'EOF'
|
||||
chr1 1050 1150 db2_feature1
|
||||
chr1 1250 1350 db2_feature2
|
||||
chr1 1450 1550 db2_feature3
|
||||
EOF
|
||||
|
||||
# Create distant features for signed distance testing (non-overlapping)
|
||||
cat > "$meta_temp_dir/test_b_distant.bed" << 'EOF'
|
||||
chr1 50 90 upstream1
|
||||
chr1 250 290 downstream1
|
||||
chr1 450 490 upstream2
|
||||
chr1 650 690 downstream2
|
||||
EOF
|
||||
|
||||
# Test 1: Basic closest feature finding
|
||||
log "Starting TEST 1: Basic closest feature finding"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/queries.bed" \
|
||||
--input_b "$meta_temp_dir/database.bed" \
|
||||
--output "$meta_temp_dir/output1.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output1.bed" "basic closest output"
|
||||
check_file_not_empty "$meta_temp_dir/output1.bed" "basic closest output"
|
||||
check_file_line_count "$meta_temp_dir/output1.bed" 4 "basic closest line count"
|
||||
|
||||
# Check that closest features are reported
|
||||
check_file_contains "$meta_temp_dir/output1.bed" "feature" "closest features found"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# Test 2: Closest features with distance reporting
|
||||
log "Starting TEST 2: Closest features with distance reporting"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/queries.bed" \
|
||||
--input_b "$meta_temp_dir/database.bed" \
|
||||
--distance_mode "ref" \
|
||||
--output "$meta_temp_dir/output2.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output2.bed" "distance output"
|
||||
check_file_not_empty "$meta_temp_dir/output2.bed" "distance output"
|
||||
check_file_line_count "$meta_temp_dir/output2.bed" 4 "distance line count"
|
||||
|
||||
# Check that distance column is added (should have more columns than input)
|
||||
input_cols=$(head -1 "$meta_temp_dir/queries.bed" | awk '{print NF}')
|
||||
output_cols=$(head -1 "$meta_temp_dir/output2.bed" | awk '{print NF}')
|
||||
if [ "$output_cols" -le "$input_cols" ]; then
|
||||
error "Expected more columns in output with distance, got $output_cols vs input $input_cols"
|
||||
fi
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
# Test 3: Find closest with strand consideration
|
||||
log "Starting TEST 3: Closest with strand consideration"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/queries.bed" \
|
||||
--input_b "$meta_temp_dir/database.bed" \
|
||||
--strand \
|
||||
--output "$meta_temp_dir/output3.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output3.bed" "strand output"
|
||||
check_file_not_empty "$meta_temp_dir/output3.bed" "strand output"
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
# Test 4: Find k-nearest neighbors (k=2)
|
||||
log "Starting TEST 4: K-nearest neighbors (k=2)"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/queries.bed" \
|
||||
--input_b "$meta_temp_dir/database.bed" \
|
||||
--k_closest 2 \
|
||||
--output "$meta_temp_dir/output4.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output4.bed" "k-nearest output"
|
||||
check_file_not_empty "$meta_temp_dir/output4.bed" "k-nearest output"
|
||||
|
||||
# Should have at least as many lines as the basic test (up to 2 hits per query, plus ties)
|
||||
basic_lines=$(wc -l < "$meta_temp_dir/output1.bed")
|
||||
knearest_lines=$(wc -l < "$meta_temp_dir/output4.bed")
|
||||
if [ "$knearest_lines" -lt "$basic_lines" ]; then
|
||||
error "Expected at least $basic_lines lines for k-nearest, got $knearest_lines"
|
||||
fi
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# Test 5: Signed distance reporting against non-overlapping features
|
||||
log "Starting TEST 5: Distance reporting with signed distance"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/queries.bed" \
|
||||
--input_b "$meta_temp_dir/test_b_distant.bed" \
|
||||
--distance_mode "ref" \
|
||||
--output "$meta_temp_dir/output5.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output5.bed" "signed distance output"
|
||||
check_file_not_empty "$meta_temp_dir/output5.bed" "signed distance output"
|
||||
check_file_line_count "$meta_temp_dir/output5.bed" 4 "signed distance line count"
|
||||
|
||||
# Check that distance column includes negative values (upstream features)
|
||||
if ! grep -q "[-]" "$meta_temp_dir/output5.bed"; then
|
||||
log "Warning: No negative distances found, may not have upstream features"
|
||||
fi
|
||||
log "✅ TEST 5 completed successfully"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "Starting TEST 6: Multiple database files"
|
||||
|
||||
# Create second database file with different features
|
||||
cat > "$meta_temp_dir/database2.bed" << 'EOF'
|
||||
chr1 300 400 enhancer1 10 +
|
||||
chr1 500 600 enhancer2 20 +
|
||||
chr2 150 250 enhancer3 15 -
|
||||
chr2 350 450 enhancer4 25 -
|
||||
EOF
|
||||
|
||||
# Test multiple databases
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/queries.bed" \
|
||||
--input_b "$meta_temp_dir/database.bed;$meta_temp_dir/database2.bed" \
|
||||
--output "$meta_temp_dir/output6.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/output6.bed" "multiple database output"
|
||||
check_file_not_empty "$meta_temp_dir/output6.bed" "multiple database output"
|
||||
|
||||
# Check that we have results from multiple databases (should have database IDs)
|
||||
line_count=$(wc -l < "$meta_temp_dir/output6.bed")
|
||||
if [ "$line_count" -lt 4 ]; then
|
||||
log "❌ Expected at least 4 lines for multiple databases, got $line_count"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for database ID column (7th column should contain database numbers)
|
||||
if ! cut -f7 "$meta_temp_dir/output6.bed" | grep -E "^[12]$" > /dev/null; then
|
||||
log "❌ Expected database IDs (1, 2) in 7th column"
|
||||
log "Actual output:"
|
||||
cat "$meta_temp_dir/output6.bed"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✓ Found multiple database output with database IDs"
|
||||
log "✅ TEST 6 completed successfully"
|
||||
|
||||
log "🎉 All bedtools_closest tests completed successfully!"
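The Test 5 check greps for any hyphen, which also matches `-` in the strand column, so it cannot tell whether a negative distance was actually reported. A stricter variant could inspect only the last column. The sketch below uses a hypothetical helper `has_negative_distance` (not provided by `test_helpers.sh`) and a made-up output line, and assumes tab-separated output with the distance in the final field.

```bash
#!/bin/bash
# Hypothetical helper, not part of test_helpers.sh: succeeds only if the last
# tab-separated field of some line is a negative integer.
has_negative_distance() {
  awk -F'\t' '$NF ~ /^-[0-9]+$/ { found = 1 } END { exit !found }' "$1"
}

# Demo on a made-up line: the strand field is "-" but the distance is positive.
demo_file="$(mktemp)"
printf 'chr1\t800\t900\tquery3\t300\t-\tchr1\t950\t990\tdown\t51\n' > "$demo_file"
if has_negative_distance "$demo_file"; then
  echo "negative distance found"
else
  echo "no negative distance"
fi
rm -f "$demo_file"
```

On this input a plain `grep -q "[-]"` would succeed because of the strand field, whereas the column-based check reports that no negative distance is present.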
|
||||
99
src/bedtools/bedtools_cluster/config.vsh.yaml
Normal file
@@ -0,0 +1,99 @@
|
||||
name: bedtools_cluster
|
||||
namespace: bedtools
|
||||
description: |
|
||||
Cluster overlapping or nearby genomic intervals.
|
||||
|
||||
This tool groups genomic intervals into clusters based on overlap
|
||||
or proximity within a specified distance. Each cluster is assigned
|
||||
a unique cluster ID, making it useful for analyzing genomic feature
|
||||
distributions and relationships.
|
||||
|
||||
keywords: [genomics, intervals, clustering, overlap, proximity, grouping]
|
||||
links:
|
||||
homepage: https://bedtools.readthedocs.io/
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/cluster.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: |
|
||||
Input file in BED, GFF, or VCF format.
|
||||
|
||||
**BED format:** Standard genomic interval format
|
||||
**GFF format:** Gene feature format with annotations
|
||||
**VCF format:** Variant call format
|
||||
**Requirements:** Must be sorted by chromosome and position
|
||||
required: true
|
||||
example: intervals.bed
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with cluster assignments.
|
||||
|
||||
Contains original intervals with an additional column showing
|
||||
the cluster ID for each interval. Intervals in the same cluster
|
||||
have the same cluster ID number.
|
||||
required: true
|
||||
example: clustered.bed
|
||||
|
||||
- name: Clustering Options
|
||||
arguments:
|
||||
- name: --distance
|
||||
alternatives: [-d]
|
||||
type: integer
|
||||
description: |
|
||||
Maximum distance between features for clustering.
|
||||
|
||||
**Default:** 0 (only overlapping and book-ended features clustered)
|
||||
**Positive values:** Cluster features within specified distance
|
||||
**Use case:** Group nearby but non-overlapping features
|
||||
default: 0
|
||||
example: 1000
|
||||
|
||||
- name: --strand
|
||||
alternatives: [-s]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Force strandedness in clustering.
|
||||
|
||||
**Default:** Clustering ignores strand information
|
||||
**When enabled:** Only cluster features on the same strand
|
||||
**Use case:** Strand-specific analysis of genomic features
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
20
src/bedtools/bedtools_cluster/help.txt
Normal file
@@ -0,0 +1,20 @@
|
||||
```bash
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools cluster -h
|
||||
```
|
||||
|
||||
Tool: bedtools cluster
|
||||
Version: v2.31.1
|
||||
Summary: Clusters overlapping/nearby BED/GFF/VCF intervals.
|
||||
|
||||
Usage: bedtools cluster [OPTIONS] -i <bed/gff/vcf>
|
||||
|
||||
Options:
|
||||
-s Force strandedness. That is, only merge features
|
||||
that are the same strand.
|
||||
- By default, merging is done without respect to strand.
|
||||
|
||||
-d Maximum distance between features allowed for features
|
||||
to be merged.
|
||||
- Def. 0. That is, overlapping & book-ended features are merged.
|
||||
- (INTEGER)
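To make the effect of `-d` concrete, the clustering rule can be approximated in a few lines of awk. This is a sketch of the idea only, not bedtools' implementation: the function name `cluster_intervals` is hypothetical, the input is assumed to be tab-separated and sorted by chromosome and start, and the boundary condition (a gap of at most `d` joins the current cluster) is an assumption based on the help text.

```bash
#!/bin/bash
# Hypothetical helper, not bedtools itself: appends a cluster ID to each
# interval read from stdin. Assumes sorted, tab-separated BED input.
cluster_intervals() {
  local max_dist="${1:-0}"
  awk -v d="$max_dist" 'BEGIN { FS = OFS = "\t" }
  {
    # Start a new cluster on a new chromosome, or when the gap to the
    # furthest end seen so far in the current cluster exceeds d.
    if ($1 != chrom || $2 + 0 > max_end + d) {
      id++
      max_end = $3 + 0
    } else if ($3 + 0 > max_end) {
      max_end = $3 + 0
    }
    chrom = $1
    print $0, id
  }'
}

# The first two intervals are 100 bp apart and share a cluster with d=100;
# the third is too far away and starts a new one (IDs: 1, 1, 2).
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t1000\t1100\n' | cluster_intervals 100
```

With the default of 0, only overlapping and book-ended intervals end up in the same cluster, which matches the behaviour described for `-d` above.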
|
||||
|
||||
16
src/bedtools/bedtools_cluster/script.sh
Normal file
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# unset flags
|
||||
[[ "$par_strand" == "false" ]] && unset par_strand
|
||||
|
||||
# Execute bedtools cluster
|
||||
bedtools cluster \
|
||||
-i "$par_input" \
|
||||
${par_distance:+-d "$par_distance"} \
|
||||
${par_strand:+-s} \
|
||||
> "$par_output"
|
||||
154
src/bedtools/bedtools_cluster/test.sh
Normal file
@@ -0,0 +1,154 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Initialize test environment
setup_test_env

log "Starting tests for bedtools_cluster"

# Create test data
log "Creating test data..."

# Create overlapping intervals for basic clustering
cat > "$meta_temp_dir/overlapping.bed" << 'EOF'
chr1	100	200	feature1	100	+
chr1	150	250	feature2	200	+
chr1	180	280	feature3	300	+
chr1	500	600	feature4	400	-
chr1	800	900	feature5	500	+
chr2	100	200	feature6	600	+
chr2	300	400	feature7	700	-
EOF

# Create intervals with different strands
cat > "$meta_temp_dir/stranded.bed" << 'EOF'
chr1	100	200	pos1	100	+
chr1	150	250	neg1	200	-
chr1	180	280	pos2	300	+
chr1	300	400	neg2	400	-
chr1	500	600	pos3	500	+
chr1	550	650	neg3	600	-
EOF

# Create intervals for distance-based clustering
cat > "$meta_temp_dir/nearby.bed" << 'EOF'
chr1	100	200	interval1	100	+
chr1	300	400	interval2	200	+
chr1	450	550	interval3	300	+
chr1	1000	1100	interval4	400	+
chr1	1200	1300	interval5	500	+
chr2	100	200	interval6	600	+
chr2	1000	1100	interval7	700	+
EOF

# Test 1: Basic clustering of overlapping intervals
log "Starting TEST 1: Basic clustering of overlapping intervals"
"$meta_executable" \
  --input "$meta_temp_dir/overlapping.bed" \
  --output "$meta_temp_dir/output1.bed"

check_file_exists "$meta_temp_dir/output1.bed" "basic clustering output"
check_file_not_empty "$meta_temp_dir/output1.bed" "basic clustering output"
check_file_line_count "$meta_temp_dir/output1.bed" 7 "basic clustering line count"

# Check that cluster IDs are added (should have one more column than input)
input_cols=$(head -1 "$meta_temp_dir/overlapping.bed" | awk '{print NF}')
output_cols=$(head -1 "$meta_temp_dir/output1.bed" | awk '{print NF}')
if [ "$output_cols" -ne "$((input_cols + 1))" ]; then
  log_error "Expected $((input_cols + 1)) columns in output, got $output_cols"
  exit 1
fi

# Check that cluster ID 1 is assigned (the overlapping chr1 intervals should share it)
if ! grep -q $'\t1$' "$meta_temp_dir/output1.bed"; then
  log_error "Expected cluster ID 1 in output"
  exit 1
fi
log "✅ TEST 1 completed successfully"

# Test 2: Distance-based clustering
log "Starting TEST 2: Distance-based clustering"
"$meta_executable" \
  --input "$meta_temp_dir/nearby.bed" \
  --distance 100 \
  --output "$meta_temp_dir/output2.bed"

check_file_exists "$meta_temp_dir/output2.bed" "distance clustering output"
check_file_not_empty "$meta_temp_dir/output2.bed" "distance clustering output"
check_file_line_count "$meta_temp_dir/output2.bed" 7 "distance clustering line count"

# With distance 100, intervals at positions 100-200, 300-400, 450-550 should cluster together
# Check that cluster IDs are present
check_file_contains "$meta_temp_dir/output2.bed" "1" "cluster IDs present"
log "✅ TEST 2 completed successfully"

# Test 3: Strand-specific clustering
log "Starting TEST 3: Strand-specific clustering"
"$meta_executable" \
  --input "$meta_temp_dir/stranded.bed" \
  --strand \
  --output "$meta_temp_dir/output3.bed"

check_file_exists "$meta_temp_dir/output3.bed" "strand clustering output"
check_file_not_empty "$meta_temp_dir/output3.bed" "strand clustering output"
check_file_line_count "$meta_temp_dir/output3.bed" 6 "strand clustering line count"

# With strand consideration, + and - strand features should get different cluster IDs
# even if they overlap
pos_cluster=$(grep "pos1" "$meta_temp_dir/output3.bed" | awk '{print $NF}')
neg_cluster=$(grep "neg1" "$meta_temp_dir/output3.bed" | awk '{print $NF}')
if [ "$pos_cluster" = "$neg_cluster" ]; then
  log_error "Expected different cluster IDs for + and - strand overlapping features"
  exit 1
fi
log "✅ TEST 3 completed successfully"

# Test 4: Large distance clustering
log "Starting TEST 4: Large distance clustering"
"$meta_executable" \
  --input "$meta_temp_dir/nearby.bed" \
  --distance 1000 \
  --output "$meta_temp_dir/output4.bed"

check_file_exists "$meta_temp_dir/output4.bed" "large distance clustering output"
check_file_not_empty "$meta_temp_dir/output4.bed" "large distance clustering output"
check_file_line_count "$meta_temp_dir/output4.bed" 7 "large distance clustering line count"

# With distance 1000, most chr1 intervals should cluster together (warning only)
chr1_clusters=$(grep "^chr1" "$meta_temp_dir/output4.bed" | awk '{print $NF}' | sort -u | wc -l)
if [ "$chr1_clusters" -gt 2 ]; then
  log "Warning: Expected few clusters on chr1 with distance 1000, got $chr1_clusters"
fi
log "✅ TEST 4 completed successfully"

# Test 5: Multiple chromosome handling
log "Starting TEST 5: Multiple chromosome handling"
# This test uses the overlapping.bed which has both chr1 and chr2
"$meta_executable" \
  --input "$meta_temp_dir/overlapping.bed" \
  --output "$meta_temp_dir/output5.bed"

check_file_exists "$meta_temp_dir/output5.bed" "multi-chromosome output"
check_file_not_empty "$meta_temp_dir/output5.bed" "multi-chromosome output"

# Check that both chromosomes are present
check_file_contains "$meta_temp_dir/output5.bed" "chr1" "chr1 features present"
check_file_contains "$meta_temp_dir/output5.bed" "chr2" "chr2 features present"

# Informational only: report when chr2 reuses cluster IDs already seen on chr1,
# i.e. when numbering does not simply keep increasing across chromosomes
chr1_max_cluster=$(grep "^chr1" "$meta_temp_dir/output5.bed" | awk '{print $NF}' | sort -n | tail -1)
chr2_min_cluster=$(grep "^chr2" "$meta_temp_dir/output5.bed" | awk '{print $NF}' | sort -n | head -1)
if [ "$chr2_min_cluster" -le "$chr1_max_cluster" ]; then
  log "ℹ️ Note: chr2 cluster IDs overlap the chr1 range (chr1 max=$chr1_max_cluster, chr2 min=$chr2_min_cluster)"
fi
log "✅ TEST 5 completed successfully"


log "🎉 All bedtools_cluster tests completed successfully!"
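As a reading aid for the assertions in this test file, the clustering rule they exercise can be approximated in a few lines of awk. This is only a rough sketch under stated assumptions, not bedtools' own implementation: it assumes input sorted by chromosome and start, ignores strand, and keeps a feature in the current cluster when its start is at most `d` bases past the running maximum end. The function name `cluster_sketch` and the sample data are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: append a cluster ID to each line of a sorted,
# tab-delimited BED stream. $1 is the maximum allowed gap (default 0).
cluster_sketch() {
  awk -v d="${1:-0}" 'BEGIN { FS = OFS = "\t" }
    {
      # New cluster on a chromosome change or when the gap exceeds d
      if ($1 != chrom || $2 > max_end + d) { id++; max_end = $3 }
      else if ($3 > max_end) { max_end = $3 }
      chrom = $1
      print $0, id
    }'
}

# Same coordinates as nearby.bed above (name, score and strand omitted)
sample=$(printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t450\t550\nchr1\t1000\t1100\nchr1\t1200\t1300\nchr2\t100\t200\nchr2\t1000\t1100\n')

# Collect the cluster ID column on one line, separated by spaces
ids_d100=$(printf '%s\n' "$sample" | cluster_sketch 100 | cut -f4 | tr '\n' ' ')
echo "cluster IDs with d=100: $ids_d100"
```

Under this simplified rule the first three chr1 intervals share one ID, which is the situation the TEST 2 comment describes.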
100
src/bedtools/bedtools_complement/config.vsh.yaml
Normal file
@@ -0,0 +1,100 @@
name: bedtools_complement
namespace: bedtools
description: |
  Find genomic intervals that are NOT covered by input intervals.

  This tool returns the complement of genomic intervals - the regions
  of the genome that are NOT covered by the input features. Useful for
  finding gaps, uncovered regions, or background intervals.

keywords: [genomics, intervals, complement, gaps, uncovered, background]
links:
  homepage: https://bedtools.readthedocs.io/
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/complement.html
  repository: https://github.com/arq5x/bedtools2
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT

requirements:
  commands: [bedtools]

authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author, maintainer]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        alternatives: [-i]
        type: file
        description: |
          Input file in BED, GFF, or VCF format.

          **BED format:** Standard genomic interval format
          **GFF format:** Gene feature format with annotations
          **VCF format:** Variant call format
          **Requirements:** Sort by chromosome, then by start position (e.g. `sort -k1,1 -k2,2n`), using the same chromosome order as the genome file
        required: true
        example: covered_regions.bed

      - name: --genome
        alternatives: [-g]
        type: file
        description: |
          Genome file defining chromosome names and sizes.

          **Format:** Tab-delimited file with chromosome name and size
          **Example line:** chr1<TAB>249250621
          **Sources:** Can be created with samtools faidx or UCSC Table Browser
          **Purpose:** Defines the complete genomic space for complement calculation
        required: true
        example: genome.txt

  - name: Outputs
    arguments:
      - name: --output
        type: file
        direction: output
        description: |
          Output file with complement intervals.

          Contains genomic intervals representing the regions NOT covered
          by the input intervals. Output is in BED format with chromosome,
          start, and end coordinates.
        required: true
        example: uncovered_regions.bed

  - name: Options
    arguments:
      - name: --limit_chromosomes
        alternatives: [-L]
        type: boolean_true
        description: |
          Limit output to chromosomes present in input file.

          **Default:** Output includes all chromosomes from genome file
          **When enabled:** Only output complement for chromosomes that have
          records in the input file
          **Use case:** Focus analysis on chromosomes of interest

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /src/_utils/test_helpers.sh

engines:
  - type: docker
    image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
    setup:
      - type: docker
        run: |
          bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt

runners:
  - type: executable
  - type: nextflow
43
src/bedtools/bedtools_complement/help.txt
Normal file
@@ -0,0 +1,43 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools complement -h
```

Tool:    bedtools complement (aka complementBed)
Version: v2.31.1
Summary: Returns the base pair complement of a feature file.

Usage:   bedtools complement [OPTIONS] -i <bed/gff/vcf> -g <genome>

Options:
    -L  Limit output to solely the chromosomes with records in the input file.

Notes:
    (1) The genome file should tab delimited and structured as follows:
        <chromName><TAB><chromSize>

    For example, Human (hg19):
    chr1	249250621
    chr2	243199373
    ...
    chr18_gl000207_random	4262

Tip 1. Use samtools faidx to create a genome file from a FASTA:
    One can the samtools faidx command to index a FASTA file.
    The resulting .fai index is suitable as a genome file,
    as bedtools will only look at the first two, relevant columns
    of the .fai file.

    For example:
    samtools faidx GRCh38.fa
    bedtools complement -i my.bed -g GRCh38.fa.fai

Tip 2. Use UCSC Table Browser to create a genome file:
    One can use the UCSC Genome Browser's MySQL database to extract
    chromosome sizes. For example, H. sapiens:

    mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
    "select chrom, size from hg19.chromInfo" > hg19.genome
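Tip 1 in the help text says that bedtools reads only the first two columns of a `.fai` index. If a standalone two-column genome file is preferred, those columns can be cut out explicitly. The sketch below uses a hand-written stand-in for a `.fai` file, so the contig names, sizes and offsets are illustrative only; a real index would come from `samtools faidx`.

```bash
#!/bin/bash
# Stand-in for a samtools faidx index with its five columns:
# name, length, offset, linebases, linewidth (values are made up)
workdir=$(mktemp -d)
printf 'chr1\t1000\t6\t60\t61\nchr2\t800\t1029\t60\t61\n' > "$workdir/ref.fa.fai"

# Keep only <chromName><TAB><chromSize>
cut -f1,2 "$workdir/ref.fa.fai" > "$workdir/genome.txt"

genome_contents=$(cat "$workdir/genome.txt")
printf '%s\n' "$genome_contents"
rm -rf "$workdir"
```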
16
src/bedtools/bedtools_complement/script.sh
Normal file
@@ -0,0 +1,16 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

# Unset boolean flags that are "false" so that ${var:+...} below expands to nothing
[[ "$par_limit_chromosomes" == "false" ]] && unset par_limit_chromosomes

# Execute bedtools complement
bedtools complement \
  -i "$par_input" \
  -g "$par_genome" \
  ${par_limit_chromosomes:+-L} \
  > "$par_output"
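The script above combines two bash features: unsetting a variable whose value is the string "false", and the `${var:+word}` expansion, which yields `word` only when `var` is set and non-empty. A minimal sketch of that idiom follows; the variable names `par_flag_on` and `par_flag_off` and the `-x` flag are hypothetical stand-ins, not parameters of this component.

```bash
#!/bin/bash
# Hypothetical stand-ins for wrapper-provided boolean parameters
par_flag_on="true"
par_flag_off="false"

# Unset anything that is literally "false" so that ${var:+...} expands to nothing
for par in par_flag_on par_flag_off; do
  [[ "${!par}" == "false" ]] && unset "$par"
done

# Unquoted on purpose: an unset variable contributes no argument at all
args=( ${par_flag_on:+-L} ${par_flag_off:+-x} )
echo "arguments: ${args[*]} (count: ${#args[@]})"
```

Leaving the expansion unquoted matters here: a quoted `"${par_flag_off:+-x}"` would pass an empty string as an argument instead of omitting it.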
149
src/bedtools/bedtools_complement/test.sh
Normal file
@@ -0,0 +1,149 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Initialize test environment
setup_test_env

log "Starting tests for bedtools_complement"

# Create test data
log "Creating test data..."

# Create genome file
cat > "$meta_temp_dir/genome.txt" << 'EOF'
chr1	1000
chr2	800
chr3	500
EOF

# Create simple intervals covering some regions
cat > "$meta_temp_dir/covered.bed" << 'EOF'
chr1	100	200
chr1	300	400
chr1	600	700
chr2	50	150
chr2	300	500
EOF

# Create intervals on only one chromosome
cat > "$meta_temp_dir/chr1_only.bed" << 'EOF'
chr1	100	200
chr1	500	600
chr1	800	900
EOF

# Create overlapping intervals to test merging behavior
cat > "$meta_temp_dir/overlapping.bed" << 'EOF'
chr1	100	300
chr1	250	400
chr1	600	800
chr2	100	200
chr2	150	250
EOF

# Test 1: Basic complement finding
log "Starting TEST 1: Basic complement finding"
"$meta_executable" \
  --input "$meta_temp_dir/covered.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --output "$meta_temp_dir/output1.bed"

check_file_exists "$meta_temp_dir/output1.bed" "basic complement output"
check_file_not_empty "$meta_temp_dir/output1.bed" "basic complement output"

# Should have complement regions for all chromosomes
check_file_contains "$meta_temp_dir/output1.bed" "chr1" "chr1 complement regions"
check_file_contains "$meta_temp_dir/output1.bed" "chr2" "chr2 complement regions"
check_file_contains "$meta_temp_dir/output1.bed" "chr3" "chr3 complement regions (entire chromosome)"

# Chr3 should be completely uncovered (0-500)
check_file_contains "$meta_temp_dir/output1.bed" $'chr3\t0\t500' "complete chr3 complement"
log "✅ TEST 1 completed successfully"

# Test 2: Complement with chromosome limiting
log "Starting TEST 2: Complement with chromosome limiting"
"$meta_executable" \
  --input "$meta_temp_dir/chr1_only.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --limit_chromosomes \
  --output "$meta_temp_dir/output2.bed"

check_file_exists "$meta_temp_dir/output2.bed" "limited complement output"
check_file_not_empty "$meta_temp_dir/output2.bed" "limited complement output"

# Should only contain chr1 complement (no chr2, chr3)
check_file_contains "$meta_temp_dir/output2.bed" "chr1" "chr1 complement regions"
if grep -q "chr2\|chr3" "$meta_temp_dir/output2.bed"; then
  log_error "Expected only chr1 with -L option, but found chr2 or chr3"
  exit 1
fi
log "✅ TEST 2 completed successfully"

# Test 3: Complement of overlapping intervals
log "Starting TEST 3: Complement of overlapping intervals"
"$meta_executable" \
  --input "$meta_temp_dir/overlapping.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --output "$meta_temp_dir/output3.bed"

check_file_exists "$meta_temp_dir/output3.bed" "overlapping complement output"
check_file_not_empty "$meta_temp_dir/output3.bed" "overlapping complement output"

# bedtools complement should handle overlapping input intervals correctly
check_file_contains "$meta_temp_dir/output3.bed" "chr1" "chr1 complement with overlaps"
check_file_contains "$meta_temp_dir/output3.bed" "chr2" "chr2 complement with overlaps"
log "✅ TEST 3 completed successfully"

# Test 4: Verify complement coordinates
log "Starting TEST 4: Verify complement coordinates"
"$meta_executable" \
  --input "$meta_temp_dir/covered.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --output "$meta_temp_dir/output4.bed"

check_file_exists "$meta_temp_dir/output4.bed" "coordinate verification output"

# Check that complement starts at 0 for chr1 (nothing covered at start)
if ! grep -q $'chr1\t0\t100' "$meta_temp_dir/output4.bed"; then
  log_error "Expected chr1 complement to start at position 0"
  exit 1
fi

# Check that complement goes to chromosome end (1000 for chr1)
if ! grep -q $'chr1\t700\t1000' "$meta_temp_dir/output4.bed"; then
  log_error "Expected chr1 complement to end at chromosome end (1000)"
  exit 1
fi
log "✅ TEST 4 completed successfully"

# Test 5: Empty input handling
log "Starting TEST 5: Empty input handling"
# Create empty input file
touch "$meta_temp_dir/empty.bed"

"$meta_executable" \
  --input "$meta_temp_dir/empty.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --output "$meta_temp_dir/output5.bed"

check_file_exists "$meta_temp_dir/output5.bed" "empty input output"
check_file_not_empty "$meta_temp_dir/output5.bed" "empty input output"

# With no input intervals, complement should be entire genome
total_genome_size=$(awk '{sum += $2} END {print sum}' "$meta_temp_dir/genome.txt")
total_complement_size=$(awk '{sum += $3 - $2} END {print sum}' "$meta_temp_dir/output5.bed")

if [ "$total_complement_size" -ne "$total_genome_size" ]; then
  log_error "Expected complement size to equal genome size ($total_genome_size), got $total_complement_size"
  exit 1
fi
log "✅ TEST 5 completed successfully"

log "🎉 All bedtools_complement tests completed successfully!"
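The coordinates asserted in TEST 1 and TEST 4 can be cross-checked by hand. The sketch below derives the expected complement of `covered.bed` against `genome.txt` with plain awk. It is an illustrative re-implementation for sorted, tab-delimited input, not how bedtools computes the result, and the variable names `expected` and `total` are made up for this example.

```bash
#!/bin/bash
workdir=$(mktemp -d)
printf 'chr1\t1000\nchr2\t800\nchr3\t500\n' > "$workdir/genome.txt"
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t600\t700\nchr2\t50\t150\nchr2\t300\t500\n' > "$workdir/covered.bed"

# First file: remember chromosome sizes. Second file: emit the gap before each
# interval, then the uncovered tail of every chromosome at the end.
expected=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR { order[++n] = $1; size[$1] = $2; cursor[$1] = 0; next }
  {
    if ($2 > cursor[$1]) print $1, cursor[$1], $2
    if ($3 > cursor[$1]) cursor[$1] = $3
  }
  END {
    for (i = 1; i <= n; i++) {
      c = order[i]
      if (cursor[c] < size[c]) print c, cursor[c], size[c]
    }
  }' "$workdir/genome.txt" "$workdir/covered.bed" | LC_ALL=C sort -k1,1 -k2,2n)

# Total uncovered bases: genome size minus covered bases
total=$(printf '%s\n' "$expected" | awk '{ sum += $3 - $2 } END { print sum }')
printf '%s\n' "$expected"
echo "uncovered bases: $total"
rm -rf "$workdir"
```

The same subtraction underlies TEST 5: with an empty input nothing is covered, so the uncovered total equals the sum of the chromosome sizes.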
245
src/bedtools/bedtools_coverage/config.vsh.yaml
Normal file
@@ -0,0 +1,245 @@
name: bedtools_coverage
namespace: bedtools
description: |
  Calculate coverage of genomic intervals from one file over intervals in another.

  This tool reports the depth and breadth of coverage of features from file B
  over the intervals in file A. It provides detailed coverage statistics including
  overlap counts, covered bases, and coverage fractions.

keywords: [genomics, intervals, coverage, depth, breadth, overlap, statistics]
links:
  homepage: https://bedtools.readthedocs.io/
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html
  repository: https://github.com/arq5x/bedtools2
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT

requirements:
  commands: [bedtools]

authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author, maintainer]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input_a
        alternatives: [-a]
        type: file
        description: |
          Query intervals file in BED, GFF, or VCF format.

          **Purpose:** Intervals for which coverage will be calculated
          **BED format:** Standard genomic interval format
          **GFF format:** Gene feature format with annotations
          **VCF format:** Variant call format
        required: true
        example: target_regions.bed

      - name: --input_b
        alternatives: [-b]
        type: file
        multiple: true
        description: |
          Coverage source file(s) in BED, GFF, VCF, or BAM format.

          **Purpose:** Features that provide coverage over query intervals
          **Multiple files:** Can specify multiple coverage sources
          **BAM support:** Binary alignment files for sequencing coverage
        required: true
        example: ["alignments.bam", "features.bed"]

  - name: Outputs
    arguments:
      - name: --output
        type: file
        direction: output
        description: |
          Output file with coverage statistics.

          **Default output:** For each interval in A, reports:
          1. Number of overlapping features from B
          2. Number of bases in A with non-zero coverage
          3. Length of interval in A
          4. Fraction of bases in A with non-zero coverage
        required: true
        example: coverage_stats.txt

  - name: Coverage Reporting
    arguments:
      - name: --histogram
        alternatives: [-hist]
        type: boolean_true
        description: |
          Report coverage histogram for each feature and summary.

          **Output format:** depth, bases at depth, feature size, percentage
          **Use case:** Detailed coverage distribution analysis

      - name: --depth_per_position
        alternatives: [-d]
        type: boolean_true
        description: |
          Report depth at each position in each interval.

          **Output:** One-based positions with coverage depth
          **Use case:** Position-specific coverage analysis
          **Note:** Generates large output for long intervals

      - name: --counts_only
        alternatives: [-counts]
        type: boolean_true
        description: |
          Only report overlap counts, no fractions.

          **Simplified output:** Just the number of overlapping features
          **Use case:** When only overlap counts are needed

      - name: --mean_depth
        alternatives: [-mean]
        type: boolean_true
        description: |
          Report mean coverage depth for each interval.

          **Output:** Average depth across all positions in interval
          **Use case:** Summary coverage statistics

  - name: Strand Options
    arguments:
      - name: --same_strand
        alternatives: [-s]
        type: boolean_true
        description: |
          Require same strandedness for overlaps.

          **Default:** Overlaps reported regardless of strand
          **When enabled:** Only count overlaps on same strand

      - name: --different_strand
        alternatives: [-S]
        type: boolean_true
        description: |
          Require different strandedness for overlaps.

          **Default:** Overlaps reported regardless of strand
          **When enabled:** Only count overlaps on opposite strand

  - name: Overlap Requirements
    arguments:
      - name: --min_overlap_a
        alternatives: [-f]
        type: double
        description: |
          Minimum overlap required as fraction of A.

          **Range:** 0.0 to 1.0
          **Default:** 1E-9 (essentially 1bp)
          **Example:** 0.50 requires 50% of A to be overlapped
        example: 0.5

      - name: --min_overlap_b
        alternatives: [-F]
        type: double
        description: |
          Minimum overlap required as fraction of B.

          **Range:** 0.0 to 1.0
          **Default:** 1E-9 (essentially 1bp)
          **Example:** 0.80 requires 80% of B to overlap A
        example: 0.8

      - name: --reciprocal
        alternatives: [-r]
        type: boolean_true
        description: |
          Require that the fraction overlap be reciprocal for A AND B.

          **Behavior:** The -f fraction is applied in both directions, e.g. with
          -f 0.90, B must overlap 90% of A and A must also overlap 90% of B
          **Use case:** Stringent overlap requirements

      - name: --either
        alternatives: [-e]
        type: boolean_true
        description: |
          Require that the minimum fraction be satisfied for A OR B.

          **Default:** Both -f and -F must be satisfied
          **When enabled:** Satisfying either fraction requirement is sufficient

  - name: Format Options
    arguments:
      - name: --split
        type: boolean_true
        description: |
          Treat split BAM/BED12 entries as distinct intervals.

          **BAM:** Handle spliced alignments as separate blocks
          **BED12:** Process each block independently

      - name: --bed_output
        alternatives: [-bed]
        type: boolean_true
        description: |
          Write output in BED format when using BAM input.

          **Default:** BAM input produces BAM-style output
          **When enabled:** Force BED format output

      - name: --header
        type: boolean_true
        description: |
          Print header from input A file before results.

          **Use case:** Preserve metadata from input file

  - name: Performance Options
    arguments:
      - name: --sorted
        type: boolean_true
        description: |
          Use chromsweep algorithm for sorted input.

          **Requirements:** Input must be sorted by chromosome and position
          **Performance:** Faster processing for large files

      - name: --genome
        alternatives: [-g]
        type: file
        description: |
          Genome file for consistent chromosome ordering.

          **Format:** Tab-delimited chromosome names and sizes
          **Use case:** Ensure consistent sort order with -sorted option
        example: genome.txt

      - name: --no_name_check
        alternatives: [-nonamecheck]
        type: boolean_true
        description: |
          Don't error on different chromosome naming conventions.

          **Example:** Allows mixing "chr1" and "chr01"
          **Use case:** Working with files from different sources

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /src/_utils/test_helpers.sh

engines:
  - type: docker
    image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
    setup:
      - type: docker
        run: |
          bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt

runners:
  - type: executable
  - type: nextflow
89
src/bedtools/bedtools_coverage/help.txt
Normal file
@@ -0,0 +1,89 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools coverage -h
```

Tool:    bedtools coverage (aka coverageBed)
Version: v2.31.1
Summary: Returns the depth and breadth of coverage of features from B
         on the intervals in A.

Usage:   bedtools coverage [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

Options:
    -hist   Report a histogram of coverage for each feature in A
            as well as a summary histogram for _all_ features in A.

            Output (tab delimited) after each feature in A:
              1) depth
              2) # bases at depth
              3) size of A
              4) % of A at depth

    -d      Report the depth at each position in each A feature.
            Positions reported are one based. Each position
            and depth follow the complete A feature.

    -counts Only report the count of overlaps, don't compute fraction, etc.

    -mean   Report the mean depth of all positions in each A feature.

    -s      Require same strandedness. That is, only report hits in B
            that overlap A on the _same_ strand.
            - By default, overlaps are reported without respect to strand.

    -S      Require different strandedness. That is, only report hits in B
            that overlap A on the _opposite_ strand.
            - By default, overlaps are reported without respect to strand.

    -f      Minimum overlap required as a fraction of A.
            - Default is 1E-9 (i.e., 1bp).
            - FLOAT (e.g. 0.50)

    -F      Minimum overlap required as a fraction of B.
            - Default is 1E-9 (i.e., 1bp).
            - FLOAT (e.g. 0.50)

    -r      Require that the fraction overlap be reciprocal for A AND B.
            - In other words, if -f is 0.90 and -r is used, this requires
            that B overlap 90% of A and A _also_ overlaps 90% of B.

    -e      Require that the minimum fraction be satisfied for A OR B.
            - In other words, if -e is used with -f 0.90 and -F 0.10 this requires
            that either 90% of A is covered OR 10% of B is covered.
            Without -e, both fractions would have to be satisfied.

    -split  Treat "split" BAM or BED12 entries as distinct BED intervals.

    -g      Provide a genome file to enforce consistent chromosome sort order
            across input files. Only applies when used with -sorted option.

    -nonamecheck
            For sorted data, don't throw an error if the file has different naming conventions
            for the same chromosome. ex. "chr1" vs "chr01".

    -sorted Use the "chromsweep" algorithm for sorted (-k1,1 -k2,2n) input.

    -bed    If using BAM input, write output as BED.

    -header Print the header from the A file prior to results.

    -nobuf  Disable buffered output. Using this option will cause each line
            of output to be printed as it is generated, rather than saved
            in a buffer. This will make printing large output files
            noticeably slower, but can be useful in conjunction with
            other software tools and scripts that need to process one
            line of bedtools output at a time.

    -iobuf  Specify amount of memory to use for input buffer.
            Takes an integer argument. Optional suffixes K/M/G supported.
            Note: currently has no effect with compressed files.

Default Output:
    After each entry in A, reports:
      1) The number of features in B that overlapped the A interval.
      2) The number of bases in A that had non-zero coverage.
      3) The length of the entry in A.
      4) The fraction of bases in A that had non-zero coverage.
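To make the four default output values concrete, here is a small worked example for a single A interval. It is a hand-rolled awk sketch under simple assumptions (0-based, half-open coordinates; one chromosome; strand ignored), not the algorithm bedtools uses, and the variable names are chosen for illustration. The coordinates match `target1` and the two features overlapping it in the component's test data (chr1 100-300 against 150-250 and 200-350).

```bash
#!/bin/bash
# One A interval and two overlapping B intervals (start, end)
a_start=100; a_end=300
b_intervals=$(printf '150\t250\n200\t350\n')

summary=$(printf '%s\n' "$b_intervals" | awk -v s="$a_start" -v e="$a_end" '
  $2 > s && $1 < e {                 # this B interval overlaps A
    n++
    lo = ($1 > s) ? $1 : s           # clip B to the bounds of A
    hi = ($2 < e) ? $2 : e
    for (p = lo; p < hi; p++) seen[p] = 1
  }
  END {
    covered = 0
    for (p in seen) covered++        # bases of A with non-zero coverage
    printf "%d\t%d\t%d\t%.2f\n", n, covered, e - s, covered / (e - s)
  }')
echo "$summary"
```

The covered count is the size of the union of the clipped B intervals, so bases covered twice (200-250 here) are counted once.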
57
src/bedtools/bedtools_coverage/script.sh
Normal file
@@ -0,0 +1,57 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

# unset flags
unset_if_false=(
  par_histogram
  par_depth_per_position
  par_counts_only
  par_mean_depth
  par_same_strand
  par_different_strand
  par_reciprocal
  par_either
  par_split
  par_bed_output
  par_header
  par_sorted
  par_no_name_check
)

for par in "${unset_if_false[@]}"; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset "$par"
done

# Build input B arguments array from semicolon-separated string
input_b_args=()
IFS=';' read -ra input_b_files <<< "$par_input_b"
for file in "${input_b_files[@]}"; do
  input_b_args+=(-b "$file")
done

# Execute bedtools coverage
bedtools coverage \
  -a "$par_input_a" \
  "${input_b_args[@]}" \
  ${par_histogram:+-hist} \
  ${par_depth_per_position:+-d} \
  ${par_counts_only:+-counts} \
  ${par_mean_depth:+-mean} \
  ${par_same_strand:+-s} \
  ${par_different_strand:+-S} \
  ${par_min_overlap_a:+-f "$par_min_overlap_a"} \
  ${par_min_overlap_b:+-F "$par_min_overlap_b"} \
  ${par_reciprocal:+-r} \
  ${par_either:+-e} \
  ${par_split:+-split} \
  ${par_bed_output:+-bed} \
  ${par_header:+-header} \
  ${par_sorted:+-sorted} \
  ${par_genome:+-g "$par_genome"} \
  ${par_no_name_check:+-nonamecheck} \
  > "$par_output"
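The script above turns a semicolon-separated parameter into repeated `-b` arguments. The sketch below isolates that step so the resulting word boundaries are easy to inspect. The value assigned to `par_input_b` is a hypothetical example in the form the script expects, reusing the file names from the config example.

```bash
#!/bin/bash
# Hypothetical multi-value parameter, joined with semicolons
par_input_b="alignments.bam;features.bed"

input_b_args=()
# IFS is changed only for this read, so the rest of the script is unaffected
IFS=';' read -ra input_b_files <<< "$par_input_b"
for file in "${input_b_files[@]}"; do
  input_b_args+=(-b "$file")
done

# Print one argument per line to make the word boundaries visible
printf '%s\n' "${input_b_args[@]}"
```

Building an array and expanding it as `"${input_b_args[@]}"` keeps file names containing spaces intact, which a plain string concatenation would not.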
205
src/bedtools/bedtools_coverage/test.sh
Normal file
@@ -0,0 +1,205 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment
|
||||
setup_test_env
|
||||
|
||||
log "Starting tests for bedtools_coverage"
|
||||
|
||||
# Create test data
|
||||
log "Creating test data..."
|
||||
|
||||
# Create target intervals (query file A)
|
||||
cat > "$meta_temp_dir/targets.bed" << 'EOF'
|
||||
chr1 100 300 target1 100 +
|
||||
chr1 500 800 target2 200 +
|
||||
chr2 200 400 target3 300 -
|
||||
chr2 600 900 target4 400 -
|
||||
EOF
|
||||
|
||||
# Create coverage features (file B) - some overlapping, some not
|
||||
cat > "$meta_temp_dir/features.bed" << 'EOF'
|
||||
chr1 150 250 feature1 500 +
|
||||
chr1 200 350 feature2 600 +
|
||||
chr1 550 750 feature3 700 +
|
||||
chr2 250 350 feature4 800 -
|
||||
chr2 650 850 feature5 900 +
|
||||
chr3 100 200 feature6 1000 +
|
||||
EOF
|
||||
|
||||
# Create additional coverage file for multi-file testing
|
||||
cat > "$meta_temp_dir/features2.bed" << 'EOF'
|
||||
chr1 120 180 extra1 300 +
|
||||
chr1 600 700 extra2 400 +
|
||||
chr2 300 500 extra3 500 -
|
||||
EOF
|
||||
|
||||
# Create strand-specific test data
|
||||
cat > "$meta_temp_dir/stranded_targets.bed" << 'EOF'
|
||||
chr1 100 200 pos_target 100 +
|
||||
chr1 300 400 neg_target 200 -
|
||||
EOF
|
||||
|
||||
cat > "$meta_temp_dir/stranded_features.bed" << 'EOF'
|
||||
chr1 120 180 pos_feature 300 +
|
||||
chr1 320 380 neg_feature 400 -
|
||||
chr1 140 160 pos_feature2 500 +
|
||||
chr1 340 360 neg_feature2 600 -
|
||||
EOF

# Test 1: Basic coverage calculation
log "Starting TEST 1: Basic coverage calculation"
"$meta_executable" \
  --input_a "$meta_temp_dir/targets.bed" \
  --input_b "$meta_temp_dir/features.bed" \
  --output "$meta_temp_dir/output1.txt"

check_file_exists "$meta_temp_dir/output1.txt" "basic coverage output"
check_file_not_empty "$meta_temp_dir/output1.txt" "basic coverage output"
check_file_line_count "$meta_temp_dir/output1.txt" 4 "basic coverage line count"

# Check that coverage statistics are added (should have 4 extra columns)
input_cols=$(head -1 "$meta_temp_dir/targets.bed" | awk '{print NF}')
output_cols=$(head -1 "$meta_temp_dir/output1.txt" | awk '{print NF}')
expected_cols=$((input_cols + 4))
if [ "$output_cols" -ne "$expected_cols" ]; then
  log_error "Expected $expected_cols columns in output, got $output_cols"
  exit 1
fi

# Check that some targets have coverage. The overlap count is the first
# appended column; a plain "non-zero number" grep would also match coordinates.
if ! awk -v col="$((input_cols + 1))" '$col > 0 { found = 1 } END { exit !found }' "$meta_temp_dir/output1.txt"; then
  log_error "Expected some targets to have non-zero coverage counts"
  exit 1
fi
log "✅ TEST 1 completed successfully"

# Test 2: Coverage histogram
log "Starting TEST 2: Coverage histogram"
"$meta_executable" \
  --input_a "$meta_temp_dir/targets.bed" \
  --input_b "$meta_temp_dir/features.bed" \
  --histogram \
  --output "$meta_temp_dir/output2.txt"

check_file_exists "$meta_temp_dir/output2.txt" "histogram output"
check_file_not_empty "$meta_temp_dir/output2.txt" "histogram output"

# Histogram output should have depth information
check_file_contains "$meta_temp_dir/output2.txt" "target1" "target intervals in histogram"
# Should contain histogram data (depth, bases, size, percentage)
if ! grep -q -E "\s[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\.[0-9]+$" "$meta_temp_dir/output2.txt"; then
  log_error "Expected histogram format with depth data"
  exit 1
fi
log "✅ TEST 2 completed successfully"

# Test 3: Counts only
log "Starting TEST 3: Counts only output"
"$meta_executable" \
  --input_a "$meta_temp_dir/targets.bed" \
  --input_b "$meta_temp_dir/features.bed" \
  --counts_only \
  --output "$meta_temp_dir/output3.txt"

check_file_exists "$meta_temp_dir/output3.txt" "counts only output"
check_file_not_empty "$meta_temp_dir/output3.txt" "counts only output"
check_file_line_count "$meta_temp_dir/output3.txt" 4 "counts only line count"

# Counts only should have fewer columns (just original + count)
counts_cols=$(head -1 "$meta_temp_dir/output3.txt" | awk '{print NF}')
if [ "$counts_cols" -ne "$((input_cols + 1))" ]; then
  log_error "Expected $((input_cols + 1)) columns for counts only, got $counts_cols"
  exit 1
fi
log "✅ TEST 3 completed successfully"

# Test 4: Mean depth reporting
log "Starting TEST 4: Mean depth reporting"
"$meta_executable" \
  --input_a "$meta_temp_dir/targets.bed" \
  --input_b "$meta_temp_dir/features.bed" \
  --mean_depth \
  --output "$meta_temp_dir/output4.txt"

check_file_exists "$meta_temp_dir/output4.txt" "mean depth output"
check_file_not_empty "$meta_temp_dir/output4.txt" "mean depth output"

# Should contain mean depth values (floating point numbers)
if ! grep -q -E "\s[0-9]+\.[0-9]+$" "$meta_temp_dir/output4.txt"; then
  log_error "Expected mean depth values (floating point)"
  exit 1
fi
log "✅ TEST 4 completed successfully"

# Test 5: Strand-specific coverage
log "Starting TEST 5: Strand-specific coverage"
"$meta_executable" \
  --input_a "$meta_temp_dir/stranded_targets.bed" \
  --input_b "$meta_temp_dir/stranded_features.bed" \
  --same_strand \
  --output "$meta_temp_dir/output5.txt"

check_file_exists "$meta_temp_dir/output5.txt" "same strand output"
check_file_not_empty "$meta_temp_dir/output5.txt" "same strand output"

# Compare with opposite strand requirement
"$meta_executable" \
  --input_a "$meta_temp_dir/stranded_targets.bed" \
  --input_b "$meta_temp_dir/stranded_features.bed" \
  --different_strand \
  --output "$meta_temp_dir/output5b.txt"

# Results should be different between same and different strand requirements
if diff -q "$meta_temp_dir/output5.txt" "$meta_temp_dir/output5b.txt" >/dev/null; then
  log "Warning: Same and different strand outputs are identical - may not have strand-specific overlaps"
fi
log "✅ TEST 5 completed successfully"

# Test 6: Multiple input files
log "Starting TEST 6: Multiple input files"
"$meta_executable" \
  --input_a "$meta_temp_dir/targets.bed" \
  --input_b "$meta_temp_dir/features.bed" \
  --input_b "$meta_temp_dir/features2.bed" \
  --output "$meta_temp_dir/output6.txt"

check_file_exists "$meta_temp_dir/output6.txt" "multiple files output"
check_file_not_empty "$meta_temp_dir/output6.txt" "multiple files output"
check_file_line_count "$meta_temp_dir/output6.txt" 4 "multiple files line count"

# Log the overlap count (column 7) of the first target for one vs. two B files.
# This is informational only; the values are not asserted.
single_file_coverage=$(awk '{print $7}' "$meta_temp_dir/output1.txt" | head -1)
multi_file_coverage=$(awk '{print $7}' "$meta_temp_dir/output6.txt" | head -1)
log "ℹ️ Single file coverage: $single_file_coverage, Multi-file coverage: $multi_file_coverage"
log "✅ TEST 6 completed successfully"

# Test 7: Minimum overlap fraction
log "Starting TEST 7: Minimum overlap fraction"
"$meta_executable" \
  --input_a "$meta_temp_dir/targets.bed" \
  --input_b "$meta_temp_dir/features.bed" \
  --min_overlap_a 0.5 \
  --output "$meta_temp_dir/output7.txt"

check_file_exists "$meta_temp_dir/output7.txt" "min overlap output"
check_file_not_empty "$meta_temp_dir/output7.txt" "min overlap output"

# Compare with no minimum requirement - the filtered run must not report more overlaps
no_min_overlaps=$(awk '{sum += $7} END {print sum}' "$meta_temp_dir/output1.txt")
min_overlaps=$(awk '{sum += $7} END {print sum}' "$meta_temp_dir/output7.txt")

if [ "$min_overlaps" -gt "$no_min_overlaps" ]; then
  log_error "Expected no more overlaps with minimum fraction requirement than without it"
  exit 1
fi
log "✅ TEST 7 completed successfully"

log "🎉 All bedtools_coverage tests completed successfully!"
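For reference, the four extra columns that TEST 1 counts (overlap count, covered bases, target length, covered fraction) can be reproduced by hand for a single target. The sketch below is illustrative only: `coverage_stats` is a hypothetical helper, not part of this component or of bedtools; it handles one target on one chromosome, and the seven-decimal fraction format is an assumption about how bedtools prints the value.

```bash
# Hypothetical helper: recompute the default coverage columns for ONE target.
# usage: coverage_stats <target_start> <target_end> <start:end> [<start:end> ...]
coverage_stats() {
  local t_start=$1 t_end=$2
  shift 2
  # Feature intervals arrive as "start:end"; sort them by start coordinate.
  printf '%s\n' "$@" | tr ':' '\t' | sort -k1,1n |
    awk -v ts="$t_start" -v te="$t_end" '
      BEGIN { cur_end = ts }
      {
        s = ($1 > ts) ? $1 : ts          # clip the feature to the target
        e = ($2 < te) ? $2 : te
        if (s >= e) next                 # feature does not overlap the target
        count++
        if (s > cur_end)      { covered += e - s;       cur_end = e }  # disjoint block
        else if (e > cur_end) { covered += e - cur_end; cur_end = e }  # extends current block
      }
      END {
        len = te - ts
        printf "%d\t%d\t%d\t%.7f\n", count, covered, len, covered / len
      }'
}

# target1 (chr1:100-300) against the chr1 features from features.bed
coverage_stats 100 300 150:250 200:350 550:750
```

With the test data above, target1 is overlapped by two features covering positions 150 to 300, which gives 150 of 200 bases, or a fraction of 0.75.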
87
src/bedtools/bedtools_expand/config.vsh.yaml
Normal file
@@ -0,0 +1,87 @@
name: bedtools_expand
namespace: bedtools
description: |
  Expand rows by splitting comma-separated values into separate rows.

  This tool replicates lines based on columns containing comma-separated values,
  creating one row for each value. Useful for expanding collapsed data formats
  like BED12 blocks or multi-value annotations into individual entries.

keywords: [genomics, intervals, expand, split, comma-separated, replicate]
links:
  homepage: https://bedtools.readthedocs.io/
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/expand.html
  repository: https://github.com/arq5x/bedtools2
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT

requirements:
  commands: [bedtools]

authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        alternatives: [-i]
        type: file
        description: |
          Input file with comma-separated values to expand.

          **Format:** Tab-delimited file with one or more columns containing
          comma-separated values
          **Example:** BED file with comma-separated scores or annotations
        required: true
        example: collapsed_data.bed

  - name: Outputs
    arguments:
      - name: --output
        type: file
        direction: output
        description: |
          Output file with expanded rows.

          Contains one row for each comma-separated value, with other
          columns replicated across all expanded rows.
        required: true
        example: expanded_data.bed

  - name: Expansion Options
    arguments:
      - name: --columns
        alternatives: [-c]
        type: string
        description: |
          Column(s) to expand (1-based indexing).

          **Single column:** Specify one column number (e.g., "4")
          **Multiple columns:** Comma-separated list (e.g., "4,5")
          **Behavior:** Values in specified columns are split and expanded
          **Requirement:** All specified columns must have the same number of values
        required: true
        example: "4,5"

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /src/_utils/test_helpers.sh

engines:
  - type: docker
    image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
    setup:
      - type: docker
        run: |
          bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt

runners:
  - type: executable
  - type: nextflow
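The `setup` step above records the tool version by rewriting the first line of `bedtools --version`. A minimal sketch of what that pipeline does, run on a sample string instead of the real binary; `format_version` is a hypothetical wrapper name, and the sample input assumes the usual `bedtools vX.Y.Z` banner:

```bash
# Hypothetical wrapper around the same head/sed pipeline used in the setup step.
format_version() {
  head -1 | sed 's/.*bedtools v/bedtools: /'
}

# Sample banner (assumed format); the real step pipes `bedtools --version` instead.
printf 'bedtools v2.31.1\n' | format_version
# prints: bedtools: 2.31.1
```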
34
src/bedtools/bedtools_expand/help.txt
Normal file
@@ -0,0 +1,34 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools expand -h
```

Tool:    bedtools expand
Version: v2.31.1
Summary: Replicate lines in a file based on columns of comma-separated values.

Usage:   bedtools expand -c [COLS]
Options:
	-i	Input file. Assumes "stdin" if omitted.

	-c	Specify the column (1-based) that should be summarized.
		- Required.
Examples:
  $ cat test.txt
  chr1	10	20	1,2,3	10,20,30
  chr1	40	50	4,5,6	40,50,60

  $ bedtools expand test.txt -c 5
  chr1	10	20	1,2,3	10
  chr1	10	20	1,2,3	20
  chr1	10	20	1,2,3	30
  chr1	40	50	4,5,6	40
  chr1	40	50	4,5,6	50
  chr1	40	50	4,5,6	60

  $ bedtools expand test.txt -c 4,5
  chr1	10	20	1	10
  chr1	10	20	2	20
  chr1	10	20	3	30
  chr1	40	50	4	40
  chr1	40	50	5	50
  chr1	40	50	6	60
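The examples in the help text can be mimicked without bedtools, which may help when reasoning about the expected line counts in the tests. The following is a rough awk emulation, not the bedtools implementation: `expand_cols` is a hypothetical name, it assumes tab-delimited input, and it assumes every listed column holds the same number of values (it makes no attempt to reproduce how bedtools treats unequal lists).

```bash
# Hypothetical stand-in for `bedtools expand -c <cols>`, reading from stdin.
expand_cols() {
  local cols=$1   # comma-separated, 1-based column list, e.g. "4,5"
  awk -F'\t' -v OFS='\t' -v cols="$cols" '
    BEGIN { n = split(cols, c, ",") }
    {
      # The number of output rows is taken from the first listed column.
      m = split($(c[1]), first, ",")
      # Split every listed column before any field is overwritten.
      for (j = 1; j <= n; j++) {
        split($(c[j]), tmp, ",")
        for (k = 1; k <= m; k++) vals[j, k] = tmp[k]
      }
      # Emit one row per value; untouched columns are replicated as-is.
      for (k = 1; k <= m; k++) {
        for (j = 1; j <= n; j++) $(c[j]) = vals[j, k]
        print
      }
    }'
}

printf 'chr1\t10\t20\t1,2,3\t10,20,30\n' | expand_cols 4,5
```

Each input row yields as many output rows as it has values in the expanded column, so the three-row `multi_column.bed` test file expands to nine rows.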
15
src/bedtools/bedtools_expand/script.sh
Normal file
@@ -0,0 +1,15 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

# Build command arguments array
cmd_args=(
  -i "$par_input"
  -c "$par_columns"
)

# Execute bedtools expand
bedtools expand "${cmd_args[@]}" > "$par_output"
138
src/bedtools/bedtools_expand/test.sh
Normal file
@@ -0,0 +1,138 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Initialize test environment
setup_test_env

log "Starting tests for bedtools_expand"

# Create test data
log "Creating test data..."

# Create simple test file with comma-separated values in one column
cat > "$meta_temp_dir/simple.bed" << 'EOF'
chr1	100	200	1,2,3
chr1	300	400	4,5,6
chr2	500	600	7,8
EOF

# Create test file with comma-separated values in multiple columns
cat > "$meta_temp_dir/multi_column.bed" << 'EOF'
chr1	10	20	1,2,3	10,20,30
chr1	40	50	4,5,6	40,50,60
chr2	70	80	7,8,9	70,80,90
EOF

# Create BED file with single values (no expansion needed)
cat > "$meta_temp_dir/no_expansion.bed" << 'EOF'
chr1	100	200	single_value
chr2	300	400	another_value
EOF

# Create file with unequal comma-separated lists (should be handled gracefully)
cat > "$meta_temp_dir/unequal.bed" << 'EOF'
chr1	100	200	1,2,3	10,20
chr1	300	400	4,5	40,50,60
EOF

# Test 1: Basic single column expansion
log "Starting TEST 1: Basic single column expansion"
"$meta_executable" \
  --input "$meta_temp_dir/simple.bed" \
  --columns "4" \
  --output "$meta_temp_dir/output1.bed"

check_file_exists "$meta_temp_dir/output1.bed" "single column expansion output"
check_file_not_empty "$meta_temp_dir/output1.bed" "single column expansion output"
check_file_line_count "$meta_temp_dir/output1.bed" 8 "single column expansion line count"

# Check that expansion worked correctly
check_file_contains "$meta_temp_dir/output1.bed" "chr1	100	200	1" "first expanded value"
check_file_contains "$meta_temp_dir/output1.bed" "chr1	100	200	2" "second expanded value"
check_file_contains "$meta_temp_dir/output1.bed" "chr1	100	200	3" "third expanded value"
check_file_contains "$meta_temp_dir/output1.bed" "chr2	500	600	7" "chr2 first value"
check_file_contains "$meta_temp_dir/output1.bed" "chr2	500	600	8" "chr2 second value"
log "✅ TEST 1 completed successfully"

# Test 2: Multi-column expansion
log "Starting TEST 2: Multi-column expansion"
"$meta_executable" \
  --input "$meta_temp_dir/multi_column.bed" \
  --columns "4,5" \
  --output "$meta_temp_dir/output2.bed"

check_file_exists "$meta_temp_dir/output2.bed" "multi-column expansion output"
check_file_not_empty "$meta_temp_dir/output2.bed" "multi-column expansion output"
check_file_line_count "$meta_temp_dir/output2.bed" 9 "multi-column expansion line count"

# Check that paired expansion worked correctly
check_file_contains "$meta_temp_dir/output2.bed" "chr1	10	20	1	10" "first paired expansion"
check_file_contains "$meta_temp_dir/output2.bed" "chr1	10	20	2	20" "second paired expansion"
check_file_contains "$meta_temp_dir/output2.bed" "chr1	10	20	3	30" "third paired expansion"
log "✅ TEST 2 completed successfully"

# Test 3: No expansion needed (single values)
log "Starting TEST 3: Single values (no expansion needed)"
"$meta_executable" \
  --input "$meta_temp_dir/no_expansion.bed" \
  --columns "4" \
  --output "$meta_temp_dir/output3.bed"

check_file_exists "$meta_temp_dir/output3.bed" "no expansion output"
check_file_not_empty "$meta_temp_dir/output3.bed" "no expansion output"
check_file_line_count "$meta_temp_dir/output3.bed" 2 "no expansion line count"

# Should be identical to input since no comma-separated values
check_file_contains "$meta_temp_dir/output3.bed" "single_value" "single value preserved"
check_file_contains "$meta_temp_dir/output3.bed" "another_value" "another value preserved"
log "✅ TEST 3 completed successfully"

# Test 4: Different column positions
log "Starting TEST 4: Different column positions"
"$meta_executable" \
  --input "$meta_temp_dir/multi_column.bed" \
  --columns "5" \
  --output "$meta_temp_dir/output4.bed"

check_file_exists "$meta_temp_dir/output4.bed" "column 5 expansion output"
check_file_not_empty "$meta_temp_dir/output4.bed" "column 5 expansion output"
check_file_line_count "$meta_temp_dir/output4.bed" 9 "column 5 expansion line count"

# Check that only column 5 was expanded, column 4 remains comma-separated
check_file_contains "$meta_temp_dir/output4.bed" "chr1	10	20	1,2,3	10" "column 4 not expanded"
check_file_contains "$meta_temp_dir/output4.bed" "chr1	10	20	1,2,3	20" "column 5 expanded"
log "✅ TEST 4 completed successfully"

# Test 5: Large expansion test
log "Starting TEST 5: Large expansion test"
# Create file with more comma-separated values
cat > "$meta_temp_dir/large.bed" << 'EOF'
chr1	100	200	1,2,3,4,5,6,7,8,9,10
EOF

"$meta_executable" \
  --input "$meta_temp_dir/large.bed" \
  --columns "4" \
  --output "$meta_temp_dir/output5.bed"

check_file_exists "$meta_temp_dir/output5.bed" "large expansion output"
check_file_not_empty "$meta_temp_dir/output5.bed" "large expansion output"
check_file_line_count "$meta_temp_dir/output5.bed" 10 "large expansion line count"

# Check that all values are expanded
for i in {1..10}; do
  if ! grep -q "chr1	100	200	$i$" "$meta_temp_dir/output5.bed"; then
    log_error "Expected value $i not found in large expansion"
    exit 1
  fi
done
log "✅ TEST 5 completed successfully"

log "🎉 All bedtools_expand tests completed successfully!"
234
src/bedtools/bedtools_fisher/config.vsh.yaml
Normal file
@@ -0,0 +1,234 @@
name: bedtools_fisher
namespace: bedtools

description: |
  Calculate Fisher's exact test statistic between two feature files.

  This tool performs Fisher's exact test to assess the statistical significance
  of overlaps between genomic intervals in two files. It calculates the probability
  of observing the given overlap pattern by chance, providing a p-value for
  statistical inference.

keywords: [genomics, intervals, fisher, statistics, overlap, significance, test]
links:
  homepage: https://bedtools.readthedocs.io/
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/fisher.html
  repository: https://github.com/arq5x/bedtools2
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT

requirements:
  commands: [bedtools]

authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [author]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input_a
        alternatives: [-a]
        type: file
        description: |
          First input file for comparison.

          **Format:** BED, GFF, or VCF file with genomic intervals
          **Requirement:** Must be sorted by chromosome, then start position
          **Usage:** File A for Fisher's exact test comparison
        required: true
        example: intervals_a.bed

      - name: --input_b
        alternatives: [-b]
        type: file
        description: |
          Second input file for comparison.

          **Format:** BED, GFF, or VCF file with genomic intervals
          **Requirement:** Must be sorted by chromosome, then start position
          **Usage:** File B for Fisher's exact test comparison
        required: true
        example: intervals_b.bed

      - name: --genome
        alternatives: [-g]
        type: file
        description: |
          Genome file defining chromosome sizes.

          **Format:** Tab-delimited file with chromosome name and size
          **Purpose:** Enforces consistent chromosome sort order
          **Example:** chr1\t249250621
        required: true
        example: genome.txt

  - name: Outputs
    arguments:
      - name: --output
        type: file
        direction: output
        description: |
          Output file with Fisher's exact test results.

          Contains statistical results including p-values for overlap
          significance between input files.
        required: true
        example: fisher_results.txt

  - name: Overlap Options
    arguments:
      - name: --merge_overlaps
        alternatives: [-m]
        type: boolean_true
        description: |
          Merge overlapping intervals before analysis.

          **Effect:** Collapses overlapping intervals in both files
          **Usage:** Prevents double-counting of overlapping features
          **Default:** false (no merging)

      - name: --min_overlap_a
        alternatives: [-f]
        type: double
        description: |
          Minimum overlap required as fraction of A.

          **Range:** 0.0 to 1.0
          **Default:** 1E-9 (effectively 1bp)
          **Example:** 0.50 requires 50% of A to be overlapped
        example: 0.5

      - name: --min_overlap_b
        alternatives: [-F]
        type: double
        description: |
          Minimum overlap required as fraction of B.

          **Range:** 0.0 to 1.0
          **Default:** 1E-9 (effectively 1bp)
          **Example:** 0.50 requires 50% of B to be overlapped
        example: 0.5

      - name: --reciprocal
        alternatives: [-r]
        type: boolean_true
        description: |
          Require reciprocal overlap for both A and B.

          **Effect:** The -f fraction must be satisfied reciprocally, for both A and B
          **Example:** With -f 0.90 -r, requires B overlaps 90% of A AND A overlaps 90% of B
          **Default:** false

      - name: --either
        alternatives: [-e]
        type: boolean_true
        description: |
          Require minimum fraction satisfied for A OR B.

          **Effect:** Only one of -f or -F thresholds needs to be satisfied
          **Alternative:** Without -e, both fractions must be satisfied
          **Default:** false (both required)

  - name: Strand Options
    arguments:
      - name: --same_strand
        alternatives: [-s]
        type: boolean_true
        description: |
          Require same strandedness for overlaps.

          **Effect:** Only report overlaps on the same strand
          **Default:** false (strand-independent)

      - name: --opposite_strand
        alternatives: [-S]
        type: boolean_true
        description: |
          Require different strandedness for overlaps.

          **Effect:** Only report overlaps on opposite strands
          **Default:** false (strand-independent)

  - name: Format Options
    arguments:
      - name: --split
        type: boolean_true
        description: |
          Treat split BAM or BED12 entries as distinct intervals.

          **Effect:** Split multi-block entries into individual intervals
          **Usage:** For BAM alignments with gaps or BED12 entries
          **Default:** false

      - name: --bed_output
        alternatives: [--bed]
        type: boolean_true
        description: |
          Write output in BED format when using BAM input.

          **Effect:** Forces BED output format for BAM inputs
          **Default:** false

      - name: --header
        type: boolean_true
        description: |
          Print header from file A prior to results.

          **Effect:** Includes original header from input file A
          **Default:** false

  - name: Advanced Options
    arguments:
      - name: --no_name_check
        alternatives: [--nonamecheck]
        type: boolean_true
        description: |
          Skip chromosome naming convention checks for sorted data.

          **Effect:** Allows different naming (e.g., "chr1" vs "chr01")
          **Usage:** For files with inconsistent chromosome naming
          **Default:** false (strict checking)

      - name: --no_buffer
        alternatives: [--nobuf]
        type: boolean_true
        description: |
          Disable buffered output.

          **Effect:** Print each line immediately instead of buffering
          **Usage:** For real-time processing or piping
          **Trade-off:** Slower performance but immediate output
          **Default:** false (buffered output)

      - name: --io_buffer
        alternatives: [--iobuf]
        type: string
        description: |
          Specify input buffer memory size.

          **Format:** Integer with optional K/M/G suffix
          **Example:** "128M" for 128 megabytes
          **Note:** No effect with compressed files
        example: "128M"

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /src/_utils/test_helpers.sh

engines:
  - type: docker
    image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
    setup:
      - type: docker
        run: |
          bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt

runners:
  - type: executable
  - type: nextflow
68
src/bedtools/bedtools_fisher/help.txt
Normal file
@@ -0,0 +1,68 @@
```bash
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools fisher -h
```

Tool:    bedtools fisher (aka fisher)
Version: v2.31.1
Summary: Calculate Fisher statistic b/w two feature files.

Usage:   bedtools fisher [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> -g <genome file>

Options:
	-m	Merge overlapping intervals before
		- looking at overlap.

	-s	Require same strandedness. That is, only report hits in B
		that overlap A on the _same_ strand.
		- By default, overlaps are reported without respect to strand.

	-S	Require different strandedness. That is, only report hits in B
		that overlap A on the _opposite_ strand.
		- By default, overlaps are reported without respect to strand.

	-f	Minimum overlap required as a fraction of A.
		- Default is 1E-9 (i.e., 1bp).
		- FLOAT (e.g. 0.50)

	-F	Minimum overlap required as a fraction of B.
		- Default is 1E-9 (i.e., 1bp).
		- FLOAT (e.g. 0.50)

	-r	Require that the fraction overlap be reciprocal for A AND B.
		- In other words, if -f is 0.90 and -r is used, this requires
		that B overlap 90% of A and A _also_ overlaps 90% of B.

	-e	Require that the minimum fraction be satisfied for A OR B.
		- In other words, if -e is used with -f 0.90 and -F 0.10 this requires
		that either 90% of A is covered OR 10% of B is covered.
		Without -e, both fractions would have to be satisfied.

	-split	Treat "split" BAM or BED12 entries as distinct BED intervals.

	-g	Provide a genome file to enforce consistent chromosome sort order
		across input files. Only applies when used with -sorted option.

	-nonamecheck	For sorted data, don't throw an error if the file has different naming conventions
		for the same chromosome. ex. "chr1" vs "chr01".

	-bed	If using BAM input, write output as BED.

	-header	Print the header from the A file prior to results.

	-nobuf	Disable buffered output. Using this option will cause each line
		of output to be printed as it is generated, rather than saved
		in a buffer. This will make printing large output files
		noticeably slower, but can be useful in conjunction with
		other software tools and scripts that need to process one
		line of bedtools output at a time.

	-iobuf	Specify amount of memory to use for input buffer.
		Takes an integer argument. Optional suffixes K/M/G supported.
		Note: currently has no effect with compressed files.

Notes:
	(1) Input files must be sorted by chrom, then start position.
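The interplay of `-f`, `-F`, `-r` and `-e` described in the help text can be worked through on a single pair of intervals. The sketch below only mirrors the wording of the help text; `passes_overlap` and its `mode` argument are hypothetical, and how bedtools fisher applies these thresholds internally when counting overlaps is not shown here.

```bash
# Hypothetical check for one A/B interval pair (0-based, half-open coordinates).
# usage: passes_overlap a_start a_end b_start b_end f F mode
#   mode: "both" (neither -r nor -e), "reciprocal" (-r), or "either" (-e)
passes_overlap() {
  awk -v a_s="$1" -v a_e="$2" -v b_s="$3" -v b_e="$4" \
      -v f="$5" -v F="$6" -v mode="$7" 'BEGIN {
    s = (a_s > b_s) ? a_s : b_s
    e = (a_e < b_e) ? a_e : b_e
    ov = (e > s) ? e - s : 0
    if (ov == 0) exit 1                    # no overlap at all
    frac_a = ov / (a_e - a_s)              # fraction of A that is overlapped
    frac_b = ov / (b_e - b_s)              # fraction of B that is overlapped
    if (mode == "reciprocal") F = f        # -r: the -f fraction applies to B as well
    ok_a = (frac_a >= f)
    ok_b = (frac_b >= F)
    if (mode == "either") exit !(ok_a || ok_b)
    exit !(ok_a && ok_b)
  }'
}

# region1 (chr1:100-200) vs feature1 (chr1:150-250): 50 bp overlap, 50% of each
passes_overlap 100 200 150 250 0.8 0.2 either && echo "counted with -e"
passes_overlap 100 200 150 250 0.8 0.2 both   || echo "rejected without -e"
```

With `-f 0.8 -F 0.2`, the pair passes under `-e` because 50% of B exceeds the 0.2 threshold, but fails without it because only 50% of A is overlapped.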
48
src/bedtools/bedtools_fisher/script.sh
Normal file
@@ -0,0 +1,48 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

# unset flags
unset_if_false=(
  par_merge_overlaps
  par_reciprocal
  par_either
  par_same_strand
  par_opposite_strand
  par_split
  par_bed_output
  par_header
  par_no_name_check
  par_no_buffer
)

for par in "${unset_if_false[@]}"; do
  test_val="${!par}"
  [[ "$test_val" == "false" ]] && unset "$par"
done

# Build command arguments array
cmd_args=(
  -a "$par_input_a"
  -b "$par_input_b"
  -g "$par_genome"
  ${par_merge_overlaps:+-m}
  ${par_min_overlap_a:+-f "$par_min_overlap_a"}
  ${par_min_overlap_b:+-F "$par_min_overlap_b"}
  ${par_reciprocal:+-r}
  ${par_either:+-e}
  ${par_same_strand:+-s}
  ${par_opposite_strand:+-S}
  ${par_split:+-split}
  ${par_bed_output:+-bed}
  ${par_header:+-header}
  ${par_no_name_check:+-nonamecheck}
  ${par_no_buffer:+-nobuf}
  ${par_io_buffer:+-iobuf "$par_io_buffer"}
)

# Execute bedtools fisher
bedtools fisher "${cmd_args[@]}" > "$par_output"
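The script relies on two bash idioms: unsetting any boolean parameter whose value is the string `false`, and then using `${var:+...}` so that a flag is emitted only when its variable is set and non-empty. A small standalone sketch of that pattern follows; the `par_*` values are made up for illustration (in the component they are injected by Viash), and `build_args` is a hypothetical wrapper.

```bash
# Made-up parameter values standing in for what Viash would inject.
par_merge_overlaps="true"
par_reciprocal="false"
par_min_overlap_a="0.5"
par_io_buffer=""

build_args() {
  local par
  # A boolean passed as "false" is unset, so ${var:+...} expands to nothing.
  for par in par_merge_overlaps par_reciprocal; do
    [[ "${!par}" == "false" ]] && unset "$par"
  done
  cmd_args=(
    ${par_merge_overlaps:+-m}
    ${par_reciprocal:+-r}
    ${par_min_overlap_a:+-f "$par_min_overlap_a"}
    ${par_io_buffer:+-iobuf "$par_io_buffer"}
  )
}

build_args
printf '%s\n' "${cmd_args[@]}"
# prints -m, -f and 0.5 on separate lines; -r and -iobuf are omitted
```

Leaving the `${var:+...}` expansions unquoted is deliberate: an unset parameter then contributes no array element at all, whereas a quoted expansion would add an empty string argument.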
121
src/bedtools/bedtools_fisher/test.sh
Normal file
@@ -0,0 +1,121 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# Source centralized test helpers
source "$meta_resources_dir/test_helpers.sh"

# Initialize test environment
setup_test_env

log "Starting tests for bedtools_fisher"

# Create test data
log "Creating test data..."

# Create genome file
cat > "$meta_temp_dir/genome.txt" << 'EOF'
chr1	1000000
chr2	1000000
EOF

# Create file A - sorted intervals
cat > "$meta_temp_dir/intervals_a.bed" << 'EOF'
chr1	100	200	region1	10	+
chr1	300	400	region2	20	+
chr1	500	600	region3	15	-
chr2	100	200	region4	25	+
chr2	400	500	region5	30	-
EOF

# Create file B - sorted intervals with some overlaps
cat > "$meta_temp_dir/intervals_b.bed" << 'EOF'
chr1	150	250	feature1	5	+
chr1	350	450	feature2	8	+
chr1	450	550	feature3	12	-
chr2	50	150	feature4	6	+
chr2	450	550	feature5	9	-
EOF

# Create file C - larger overlap set for significance testing
cat > "$meta_temp_dir/intervals_c.bed" << 'EOF'
chr1	90	210	overlap1	10	+
chr1	290	410	overlap2	15	+
chr1	490	610	overlap3	20	-
chr2	90	210	overlap4	12	+
chr2	390	510	overlap5	18	-
chr2	600	700	overlap6	25	+
EOF

# TEST 1: Basic Fisher's exact test
log "Starting TEST 1: Basic Fisher's exact test"
"$meta_executable" \
  --input_a "$meta_temp_dir/intervals_a.bed" \
  --input_b "$meta_temp_dir/intervals_b.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --output "$meta_temp_dir/fisher_basic.txt"

check_file_exists "$meta_temp_dir/fisher_basic.txt" "basic fisher output"
check_file_not_empty "$meta_temp_dir/fisher_basic.txt" "basic fisher output"
log "✅ TEST 1 completed successfully"

# TEST 2: Fisher test with minimum overlap fraction
log "Starting TEST 2: Fisher test with overlap fractions"
"$meta_executable" \
  --input_a "$meta_temp_dir/intervals_a.bed" \
  --input_b "$meta_temp_dir/intervals_b.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --min_overlap_a 0.5 \
  --min_overlap_b 0.3 \
  --output "$meta_temp_dir/fisher_fractions.txt"

check_file_exists "$meta_temp_dir/fisher_fractions.txt" "fisher with fractions output"
check_file_not_empty "$meta_temp_dir/fisher_fractions.txt" "fisher with fractions output"
log "✅ TEST 2 completed successfully"

# TEST 3: Fisher test with reciprocal overlap
log "Starting TEST 3: Fisher test with reciprocal overlap"
"$meta_executable" \
  --input_a "$meta_temp_dir/intervals_a.bed" \
  --input_b "$meta_temp_dir/intervals_b.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --min_overlap_a 0.4 \
  --reciprocal \
  --output "$meta_temp_dir/fisher_reciprocal.txt"

check_file_exists "$meta_temp_dir/fisher_reciprocal.txt" "fisher reciprocal output"
check_file_not_empty "$meta_temp_dir/fisher_reciprocal.txt" "fisher reciprocal output"
log "✅ TEST 3 completed successfully"

# TEST 4: Fisher test with merged intervals
log "Starting TEST 4: Fisher test with merged overlapping intervals"
"$meta_executable" \
  --input_a "$meta_temp_dir/intervals_a.bed" \
  --input_b "$meta_temp_dir/intervals_c.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --merge_overlaps \
  --output "$meta_temp_dir/fisher_merged.txt"

check_file_exists "$meta_temp_dir/fisher_merged.txt" "fisher merged output"
check_file_not_empty "$meta_temp_dir/fisher_merged.txt" "fisher merged output"
log "✅ TEST 4 completed successfully"

# TEST 5: Fisher test with either overlap condition
log "Starting TEST 5: Fisher test with either overlap condition"
"$meta_executable" \
  --input_a "$meta_temp_dir/intervals_a.bed" \
  --input_b "$meta_temp_dir/intervals_b.bed" \
  --genome "$meta_temp_dir/genome.txt" \
  --min_overlap_a 0.8 \
  --min_overlap_b 0.2 \
  --either \
  --output "$meta_temp_dir/fisher_either.txt"

check_file_exists "$meta_temp_dir/fisher_either.txt" "fisher either condition output"
check_file_not_empty "$meta_temp_dir/fisher_either.txt" "fisher either condition output"
log "✅ TEST 5 completed successfully"

log "All tests completed successfully!"
157
src/bedtools/bedtools_flank/config.vsh.yaml
Normal file
@@ -0,0 +1,157 @@
|
||||
name: bedtools_flank
|
||||
namespace: bedtools
|
||||
|
||||
description: |
|
||||
Create flanking intervals for each genomic feature.
|
||||
|
||||
This tool generates new intervals representing the regions immediately
|
||||
upstream and/or downstream of existing genomic features. Unlike slop which
|
||||
extends existing intervals, flank creates entirely new intervals from the
|
||||
flanking regions.
|
||||
|
||||
keywords: [genomics, intervals, flank, upstream, downstream, flanking, regions]
|
||||
links:
|
||||
homepage: https://bedtools.readthedocs.io/
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/flank.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: |
|
||||
Input file with genomic intervals.
|
||||
|
||||
**Format:** BED, GFF, VCF file with genomic intervals
|
||||
**Usage:** Features for which flanking regions will be created
|
||||
required: true
|
||||
example: intervals.bed
|
||||
|
||||
- name: --genome
|
||||
alternatives: [-g]
|
||||
type: file
|
||||
description: |
|
||||
Genome file defining chromosome sizes.
|
||||
|
||||
**Format:** Tab-delimited file with chromosome name and size
|
||||
**Purpose:** Prevents flanks from extending beyond chromosome boundaries
|
||||
**Example:** chr1\t249250621
|
||||
**Tip:** Can use samtools faidx output (.fai file)
|
||||
required: true
|
||||
example: genome.txt
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with flanking intervals.
|
||||
|
||||
Contains new intervals representing the flanking regions
|
||||
of the input features.
|
||||
required: true
|
||||
example: flanking_regions.bed
|
||||
|
||||
- name: Flanking Options
|
||||
arguments:
|
||||
- name: --both
|
||||
alternatives: [-b]
|
||||
type: string
|
||||
description: |
|
||||
Create flanking intervals using specified distance in both directions.
|
||||
|
||||
**Input:** Integer (base pairs) or Float (if used with --pct)
|
||||
**Effect:** Creates flanks of equal size upstream and downstream
|
||||
**Example:** "1000" creates 1kb flanks on both sides
|
||||
**Mutually exclusive:** Cannot use with --left or --right
|
||||
example: "1000"
|
||||
|
||||
- name: --left
|
||||
alternatives: [-l]
|
||||
type: string
|
||||
description: |
|
||||
Distance for left/upstream flank from original start coordinate.
|
||||
|
||||
**Input:** Integer (base pairs) or Float (if used with --pct)
|
||||
**Strand-aware:** When used with --strand, respects feature orientation
|
||||
**Example:** "500" creates 500bp upstream flank
|
||||
**Requires:** Must be used together with --right
|
||||
example: "500"
|
||||
|
||||
- name: --right
|
||||
alternatives: [-r]
|
||||
type: string
|
||||
description: |
|
||||
Distance for right/downstream flank from original end coordinate.
|
||||
|
||||
**Input:** Integer (base pairs) or Float (if used with --pct)
|
||||
**Strand-aware:** When used with --strand, respects feature orientation
|
||||
**Example:** "300" creates 300bp downstream flank
|
||||
**Requires:** Must be used together with --left
|
||||
example: "300"
|
||||
|
||||
- name: Flanking Behavior
|
||||
arguments:
|
||||
- name: --strand
|
||||
alternatives: [-s]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Define left and right flanks based on strand orientation.
|
||||
|
||||
**Effect:** For negative-strand features, left and right are swapped, so --left is applied past the feature's end coordinate
|
||||
**Example:** -l 500 on minus strand starts flank 500bp downstream
|
||||
**Default:** false (ignore strand)
|
||||
|
||||
- name: --percent
|
||||
alternatives: [-pct]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Define flanking distances as fraction of feature length.
|
||||
|
||||
**Effect:** Distances become proportional to feature size
|
||||
**Example:** -l 0.5 on 1000bp feature creates 500bp upstream flank
|
||||
**Input format:** Use decimals (e.g., "0.1" for 10%)
|
||||
**Default:** false (absolute base pairs)
|
||||
|
||||
- name: Output Options
|
||||
arguments:
|
||||
- name: --header
|
||||
type: boolean_true
|
||||
description: |
|
||||
Print header from input file prior to results.
|
||||
|
||||
**Effect:** Preserves original file header in output
|
||||
**Default:** false
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
66
src/bedtools/bedtools_flank/help.txt
Normal file
@@ -0,0 +1,66 @@
|
||||
```bash
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools flank -h
|
||||
```
|
||||
|
||||
Tool: bedtools flank (aka flankBed)
|
||||
Version: v2.31.1
|
||||
Summary: Creates flanking interval(s) for each BED/GFF/VCF feature.
|
||||
|
||||
Usage: bedtools flank [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)]
|
||||
|
||||
Options:
|
||||
-b Create flanking interval(s) using -b base pairs in each direction.
|
||||
- (Integer) or (Float, e.g. 0.1) if used with -pct.
|
||||
|
||||
-l The number of base pairs that a flank should start from
|
||||
orig. start coordinate.
|
||||
- (Integer) or (Float, e.g. 0.1) if used with -pct.
|
||||
|
||||
-r The number of base pairs that a flank should end from
|
||||
orig. end coordinate.
|
||||
- (Integer) or (Float, e.g. 0.1) if used with -pct.
|
||||
|
||||
-s Define -l and -r based on strand.
|
||||
E.g. if used, -l 500 for a negative-stranded feature,
|
||||
it will start the flank 500 bp downstream. Default = false.
|
||||
|
||||
-pct Define -l and -r as a fraction of the feature's length.
|
||||
E.g. if used on a 1000bp feature, -l 0.50,
|
||||
will add 500 bp "upstream". Default = false.
|
||||
|
||||
-header Print the header from the input file prior to results.
|
||||
|
||||
Notes:
|
||||
(1) Starts will be set to 0 if options would force it below 0.
|
||||
(2) Ends will be set to the chromosome length if requested flank would
|
||||
force it above the max chrom length.
|
||||
(3) In contrast to slop, which _extends_ intervals, bedtools flank
|
||||
creates new intervals from the regions just up- and down-stream
|
||||
of your existing intervals.
|
||||
(4) The genome file should tab delimited and structured as follows:
|
||||
|
||||
<chromName><TAB><chromSize>
|
||||
|
||||
For example, Human (hg19):
|
||||
chr1 249250621
|
||||
chr2 243199373
|
||||
...
|
||||
chr18_gl000207_random 4262
|
||||
|
||||
Tip 1. Use samtools faidx to create a genome file from a FASTA:
|
||||
One can the samtools faidx command to index a FASTA file.
|
||||
The resulting .fai index is suitable as a genome file,
|
||||
as bedtools will only look at the first two, relevant columns
|
||||
of the .fai file.
|
||||
|
||||
For example:
|
||||
samtools faidx GRCh38.fa
|
||||
bedtools flank -i my.bed -g GRCh38.fa.fai
|
||||
|
||||
Tip 2. Use UCSC Table Browser to create a genome file:
|
||||
One can use the UCSC Genome Browser's MySQL database to extract
|
||||
chromosome sizes. For example, H. sapiens:
|
||||
|
||||
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
|
||||
"select chrom, size from hg19.chromInfo" > hg19.genome
|
||||
|
||||
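Notes (1) and (2) in the help text above describe how flank coordinates are clamped to the chromosome. The following is a minimal sketch of that arithmetic for the `-b` case. `flank_both` is a hypothetical helper written for illustration, not bedtools code, and it assumes 0-based, half-open BED coordinates.

```bash
#!/bin/bash
# Hypothetical illustration of the clamping described in Notes (1) and (2).
# Arguments: feature start, feature end, flank size, chromosome length.
flank_both() {
  local start="$1" end="$2" size="$3" chrom_len="$4"

  # Left flank: ends at the feature start; its start is clamped to 0.
  local left_start=$(( start - size ))
  if (( left_start < 0 )); then
    left_start=0
  fi

  # Right flank: begins at the feature end; its end is clamped to the
  # chromosome length.
  local right_end=$(( end + size ))
  if (( right_end > chrom_len )); then
    right_end=$chrom_len
  fi

  # Print one tab-separated "start end" pair per flank.
  printf '%d\t%d\n' "$left_start" "$start"
  printf '%d\t%d\n' "$end" "$right_end"
}
```

For a feature at 10-100 on a 1,000,000 bp chromosome, `flank_both 10 100 1000 1000000` prints `0 10` and `100 1100` (tab-separated): the left flank is cut short at the chromosome start, while the right flank keeps its full 1000 bp.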
37
src/bedtools/bedtools_flank/script.sh
Normal file
@@ -0,0 +1,37 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# unset flags
|
||||
[[ "$par_strand" == "false" ]] && unset par_strand
|
||||
[[ "$par_percent" == "false" ]] && unset par_percent
|
||||
[[ "$par_header" == "false" ]] && unset par_header
|
||||
|
||||
# Validate flanking distance options (mutually exclusive groups)
|
||||
if [ -n "$par_both" ]; then
|
||||
flanking_args=(-b "$par_both")
|
||||
elif [ -n "$par_left" ] && [ -n "$par_right" ]; then
|
||||
flanking_args=(-l "$par_left" -r "$par_right")
|
||||
elif [ -n "$par_left" ] || [ -n "$par_right" ]; then
|
||||
echo "Error: --left and --right must be used together" >&2
|
||||
exit 1
|
||||
else
|
||||
echo "Error: Must specify either --both or both --left and --right" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
-g "$par_genome"
|
||||
"${flanking_args[@]}"
|
||||
${par_strand:+-s}
|
||||
${par_percent:+-pct}
|
||||
${par_header:+-header}
|
||||
)
|
||||
|
||||
# Execute bedtools flank
|
||||
bedtools flank "${cmd_args[@]}" > "$par_output"
|
||||
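The script above relies on two steps for boolean flags: unset any parameter whose value is the string "false", then let `${var:+flag}` expand to the flag only when the variable is still set. A small standalone sketch of that idiom follows. `build_flank_flags` is a hypothetical helper for illustration only and is not part of the component.

```bash
#!/bin/bash
# Hypothetical helper: turns three "true"/"false" strings into the
# optional flags that script.sh would pass to bedtools flank.
build_flank_flags() {
  local par_strand="$1" par_percent="$2" par_header="$3"

  # A variable holding "false" is unset so that ${var:+word} yields nothing.
  [[ "$par_strand" == "false" ]] && unset par_strand
  [[ "$par_percent" == "false" ]] && unset par_percent
  [[ "$par_header" == "false" ]] && unset par_header

  # ${var:+word} expands to word only if var is set and non-empty.
  local flags=(
    ${par_strand:+-s}
    ${par_percent:+-pct}
    ${par_header:+-header}
  )

  # Print the flags space-separated on one line (empty line if none).
  printf '%s\n' "${flags[*]}"
}
```

Calling `build_flank_flags true false true` prints `-s -header`. Collecting the flags in an array, as the script does with `cmd_args`, keeps each argument a separate word without needing `eval` or string concatenation.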
181
src/bedtools/bedtools_flank/test.sh
Normal file
@@ -0,0 +1,181 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment
|
||||
setup_test_env
|
||||
|
||||
log "Starting tests for bedtools_flank"
|
||||
|
||||
# Create test data
|
||||
log "Creating test data..."
|
||||
|
||||
# Create genome file
|
||||
cat > "$meta_temp_dir/genome.txt" << 'EOF'
|
||||
chr1 1000000
|
||||
chr2 1000000
|
||||
chr3 500000
|
||||
EOF
|
||||
|
||||
# Create basic intervals file
|
||||
cat > "$meta_temp_dir/intervals.bed" << 'EOF'
|
||||
chr1 1000 2000 feature1 100 +
|
||||
chr1 5000 6000 feature2 200 -
|
||||
chr2 10000 11000 feature3 150 +
|
||||
chr2 20000 21000 feature4 300 -
|
||||
chr3 100000 101000 feature5 250 +
|
||||
EOF
|
||||
|
||||
# Create intervals near chromosome boundaries
|
||||
cat > "$meta_temp_dir/boundary.bed" << 'EOF'
|
||||
chr1 10 100 start_feature 50 +
|
||||
chr1 999900 999950 end_feature 75 +
|
||||
chr3 490000 495000 near_end 100 +
|
||||
EOF
|
||||
|
||||
# Create variable-sized intervals for percentage testing
|
||||
cat > "$meta_temp_dir/variable.bed" << 'EOF'
|
||||
chr1 10000 12000 small_2kb 10 +
|
||||
chr1 20000 30000 large_10kb 20 +
|
||||
chr1 50000 51000 medium_1kb 15 +
|
||||
EOF
|
||||
|
||||
# TEST 1: Basic flanking with both sides equal
|
||||
log "Starting TEST 1: Basic flanking with both sides"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/intervals.bed" \
|
||||
--genome "$meta_temp_dir/genome.txt" \
|
||||
--both "500" \
|
||||
--output "$meta_temp_dir/both_flanks.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/both_flanks.bed" "both flanks output"
|
||||
check_file_not_empty "$meta_temp_dir/both_flanks.bed" "both flanks output"
|
||||
|
||||
# Should create 10 intervals (5 features × 2 flanks each)
|
||||
line_count=$(wc -l < "$meta_temp_dir/both_flanks.bed")
|
||||
if [ "$line_count" -eq 10 ]; then
|
||||
log "✓ both flanks output has expected line count (10): $meta_temp_dir/both_flanks.bed"
|
||||
else
|
||||
log "✗ both flanks output has unexpected line count ($line_count, expected 10): $meta_temp_dir/both_flanks.bed"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# TEST 2: Asymmetric flanking with left and right
|
||||
log "Starting TEST 2: Asymmetric flanking with left and right"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/intervals.bed" \
|
||||
--genome "$meta_temp_dir/genome.txt" \
|
||||
--left "1000" \
|
||||
--right "300" \
|
||||
--output "$meta_temp_dir/asymmetric_flanks.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/asymmetric_flanks.bed" "asymmetric flanks output"
|
||||
check_file_not_empty "$meta_temp_dir/asymmetric_flanks.bed" "asymmetric flanks output"
|
||||
|
||||
# Check the left flank: chr1:1000-2000 with --left 1000 should yield chr1:0-1000
|
||||
if grep -q "chr1.*0.*1000" "$meta_temp_dir/asymmetric_flanks.bed"; then
|
||||
log "✓ asymmetric flanks contains expected left flank: $meta_temp_dir/asymmetric_flanks.bed"
|
||||
else
|
||||
log "✗ asymmetric flanks missing expected left flank: $meta_temp_dir/asymmetric_flanks.bed"
|
||||
cat "$meta_temp_dir/asymmetric_flanks.bed" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for right flank size (300bp downstream)
|
||||
if grep -q "chr1.*2000.*2300" "$meta_temp_dir/asymmetric_flanks.bed"; then
|
||||
log "✓ asymmetric flanks contains expected right flank: $meta_temp_dir/asymmetric_flanks.bed"
|
||||
else
|
||||
log "✗ asymmetric flanks missing expected right flank: $meta_temp_dir/asymmetric_flanks.bed"
|
||||
cat "$meta_temp_dir/asymmetric_flanks.bed" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
# TEST 3: Strand-aware flanking
|
||||
log "Starting TEST 3: Strand-aware flanking"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/intervals.bed" \
|
||||
--genome "$meta_temp_dir/genome.txt" \
|
||||
--left "800" \
|
||||
--right "400" \
|
||||
--strand \
|
||||
--output "$meta_temp_dir/strand_flanks.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/strand_flanks.bed" "strand-aware flanks output"
|
||||
check_file_not_empty "$meta_temp_dir/strand_flanks.bed" "strand-aware flanks output"
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
# TEST 4: Percentage-based flanking
|
||||
log "Starting TEST 4: Percentage-based flanking"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/variable.bed" \
|
||||
--genome "$meta_temp_dir/genome.txt" \
|
||||
--both "0.5" \
|
||||
--percent \
|
||||
--output "$meta_temp_dir/percent_flanks.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/percent_flanks.bed" "percentage flanks output"
|
||||
check_file_not_empty "$meta_temp_dir/percent_flanks.bed" "percentage flanks output"
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# TEST 5: Boundary handling (near chromosome ends)
|
||||
log "Starting TEST 5: Boundary handling"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/boundary.bed" \
|
||||
--genome "$meta_temp_dir/genome.txt" \
|
||||
--both "1000" \
|
||||
--output "$meta_temp_dir/boundary_flanks.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/boundary_flanks.bed" "boundary flanks output"
|
||||
check_file_not_empty "$meta_temp_dir/boundary_flanks.bed" "boundary flanks output"
|
||||
|
||||
# Check that no start or end coordinate is negative (starts must be clamped to 0)
|
||||
if awk -F'\t' '$2 < 0 || $3 < 0 { found = 1 } END { exit !found }' "$meta_temp_dir/boundary_flanks.bed"; then
|
||||
log "✗ boundary flanks contains negative coordinates: $meta_temp_dir/boundary_flanks.bed"
|
||||
exit 1
|
||||
else
|
||||
log "✓ boundary flanks handles negative coordinates correctly: $meta_temp_dir/boundary_flanks.bed"
|
||||
fi
|
||||
|
||||
log "✅ TEST 5 completed successfully"
|
||||
|
||||
# TEST 6: Header preservation
|
||||
log "Starting TEST 6: Header preservation"
|
||||
|
||||
# Create file with header
|
||||
cat > "$meta_temp_dir/with_header.bed" << 'EOF'
|
||||
track name="test_track" description="Test intervals"
|
||||
chr1 2000 3000 header_test 100 +
|
||||
chr1 8000 9000 header_test2 150 +
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/with_header.bed" \
|
||||
--genome "$meta_temp_dir/genome.txt" \
|
||||
--both "200" \
|
||||
--header \
|
||||
--output "$meta_temp_dir/header_flanks.bed"
|
||||
|
||||
check_file_exists "$meta_temp_dir/header_flanks.bed" "header flanks output"
|
||||
check_file_not_empty "$meta_temp_dir/header_flanks.bed" "header flanks output"
|
||||
|
||||
# Check that header is preserved
|
||||
if grep -q "track name" "$meta_temp_dir/header_flanks.bed"; then
|
||||
log "✓ header flanks preserves header: $meta_temp_dir/header_flanks.bed"
|
||||
else
|
||||
log "✗ header flanks missing header: $meta_temp_dir/header_flanks.bed"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✅ TEST 6 completed successfully"
|
||||
|
||||
log "All tests completed successfully!"
|
||||
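TEST 5 above is meant to confirm that flanks stay within chromosome bounds. One way to check both the lower and the upper bound is sketched below. `check_bounds` is a hypothetical helper (it is not defined in `test_helpers.sh`), and it assumes tab-delimited files with the genome file in the two-column format described in the help text.

```bash
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) only if every interval in the BED
# file lies within [0, chromosome length] according to the genome file.
check_bounds() {
  local genome="$1" bed="$2"
  awk -F'\t' '
    # First file: remember the size of each chromosome.
    NR == FNR { len[$1] = $2; next }
    # Second file: flag negative starts or ends past the chromosome end.
    # A chromosome missing from the genome file is also flagged.
    $2 < 0 || $3 > len[$1] { bad = 1 }
    END { exit bad }
  ' "$genome" "$bed"
}
```

In a test script this could be used as `check_bounds "$meta_temp_dir/genome.txt" "$meta_temp_dir/boundary_flanks.bed" || exit 1`. Comparing fields numerically in awk avoids depending on how a particular grep treats `\t` inside a pattern.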
@@ -2,12 +2,14 @@ name: bedtools_genomecov
|
||||
namespace: bedtools
|
||||
description: |
|
||||
Compute the coverage of a feature file among a genome.
|
||||
keywords: [genome coverage, BED, GFF, VCF, BAM]
|
||||
|
||||
Calculates genome-wide coverage statistics from BED, GFF, VCF, or BAM files.
|
||||
Can produce coverage histograms, per-base depth, or BedGraph format output.
|
||||
keywords: [genome coverage, BED, GFF, VCF, BAM, depth, histogram, bedgraph]
|
||||
links:
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/#
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
issue_tracker: https://github.com/arq5x/bedtools2/issues
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
@@ -15,33 +17,54 @@ requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: -i
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
direction: input
|
||||
description: |
|
||||
The input file (BED/GFF/VCF) to be used.
|
||||
Input genomic intervals file in BED, GFF, or VCF format.
|
||||
|
||||
**Supported formats:**
|
||||
- BED format (standard genomic intervals)
|
||||
- GFF/GTF format (gene annotations)
|
||||
- VCF format (variant calls)
|
||||
|
||||
**Note:** Required when not using `--input_bam`
|
||||
example: input.bed
|
||||
|
||||
- name: --input_bam
|
||||
alternatives: -ibam
|
||||
alternatives: [-ibam]
|
||||
type: file
|
||||
description: |
|
||||
The input file is in BAM format.
|
||||
Note: BAM _must_ be sorted by positions.
|
||||
'--genome' option is ignored if you use '--input_bam' option!
|
||||
Input BAM file for coverage calculation.
|
||||
|
||||
**Requirements:**
|
||||
- BAM file must be sorted by position
|
||||
- When using BAM input, `--genome` option is ignored
|
||||
- Coordinates are determined from BAM header
|
||||
example: input.bam
|
||||
|
||||
- name: --genome
|
||||
alternatives: -g
|
||||
alternatives: [-g]
|
||||
type: file
|
||||
direction: input
|
||||
description: |
|
||||
The genome file to be used.
|
||||
Genome file defining chromosome names and sizes.
|
||||
|
||||
**Format:** Two-column tab-delimited file:
|
||||
```
|
||||
chr1 248956422
|
||||
chr2 242193529
|
||||
```
|
||||
|
||||
**Note:** Required when using `--input`, ignored when using `--input_bam`
|
||||
example: genome.txt
|
||||
|
||||
- name: Outputs
|
||||
@@ -50,44 +73,59 @@ argument_groups:
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
The output BED file.
|
||||
Output file containing coverage information.
|
||||
|
||||
**Output formats depend on options:**
|
||||
- **Default:** Coverage histogram (depth vs count)
|
||||
- **With `--depth`:** Per-base depth (1-based coordinates)
|
||||
- **With `--bed_graph`:** BedGraph format for genome browsers
|
||||
required: true
|
||||
example: output.bed
|
||||
example: coverage.txt
|
||||
|
||||
- name: Options
|
||||
arguments:
|
||||
|
||||
- name: --depth
|
||||
alternatives: -d
|
||||
alternatives: [-d]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report the depth at each genome position (with one-based coordinates).
|
||||
Default behavior is to report a histogram.
|
||||
Report the depth at each genome position with 1-based coordinates.
|
||||
|
||||
**Output format:** `chromosome position depth`
|
||||
|
||||
**Default behavior:** Reports coverage histogram instead
|
||||
|
||||
- name: --depth_zero
|
||||
alternatives: -dz
|
||||
alternatives: [-dz]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report the depth at each genome position (with zero-based coordinates).
|
||||
Reports only non-zero positions.
|
||||
Default behavior is to report a histogram.
|
||||
Report depth at each genome position with 0-based coordinates.
|
||||
|
||||
**Features:**
|
||||
- Only reports positions with non-zero coverage
|
||||
- Uses 0-based coordinate system
|
||||
- More memory efficient than `--depth`
|
||||
|
||||
- name: --bed_graph
|
||||
alternatives: -bg
|
||||
alternatives: [-bg]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report depth in BedGraph format. For details, see:
|
||||
genome.ucsc.edu/goldenPath/help/bedgraph.html
|
||||
Report depth in BedGraph format for genome browser visualization.
|
||||
|
||||
**Output format:** `chromosome start end depth`
|
||||
|
||||
See [BedGraph specification](https://genome.ucsc.edu/goldenPath/help/bedgraph.html) for details.
|
||||
|
||||
- name: --bed_graph_zero_coverage
|
||||
alternatives: -bga
|
||||
alternatives: [-bga]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report depth in BedGraph format, as above (-bg).
|
||||
However with this option, regions with zero
|
||||
coverage are also reported. This allows one to
|
||||
quickly extract all regions of a genome with 0
|
||||
coverage by applying: "grep -w 0$" to the output.
|
||||
Report depth in BedGraph format including zero-coverage regions.
|
||||
|
||||
**Features:**
|
||||
- Same as `--bed_graph` but includes regions with 0 coverage
|
||||
- Useful for finding uncovered regions: `grep -w 0$ output.bg`
|
||||
- Generates larger output files
|
||||
|
||||
- name: --split
|
||||
type: boolean_true
|
||||
@@ -134,13 +172,13 @@ argument_groups:
|
||||
Works for BAM files only
|
||||
|
||||
- name: --five_prime
|
||||
alternatives: -5
|
||||
alternatives: ["-5"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Calculate coverage of 5' positions (instead of entire interval).
|
||||
|
||||
- name: --three_prime
|
||||
alternatives: -3
|
||||
alternatives: ["-3"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Calculate coverage of 3' positions (instead of entire interval).
|
||||
@@ -191,17 +229,16 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: test_data
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
|
||||
@@ -1,17 +1,20 @@
|
||||
```bash
|
||||
bedtools genomecov
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools genomecov -h
|
||||
```
|
||||
|
||||
Tool: bedtools genomecov (aka genomeCoverageBed)
|
||||
Version: v2.30.0
|
||||
Version: v2.31.1
|
||||
Summary: Compute the coverage of a feature file among a genome.
|
||||
|
||||
Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>
|
||||
Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome> OR -ibam <bam/cram>
|
||||
|
||||
Options:
|
||||
-ibam The input file is in BAM format.
|
||||
Note: BAM _must_ be sorted by position
|
||||
|
||||
-g Provide a genome file to define chromosome lengths.
|
||||
Note:Required when not using -ibam option.
|
||||
|
||||
-d Report the depth at each genome position (with one-based coordinates).
|
||||
Default behavior is to report a histogram.
|
||||
|
||||
@@ -92,10 +95,20 @@ Notes:
|
||||
(3) The input BAM (-ibam) file must be sorted by position.
|
||||
A "samtools sort <BAM>" should suffice.
|
||||
|
||||
Tips:
|
||||
Tip 1. Use samtools faidx to create a genome file from a FASTA:
|
||||
One can the samtools faidx command to index a FASTA file.
|
||||
The resulting .fai index is suitable as a genome file,
|
||||
as bedtools will only look at the first two, relevant columns
|
||||
of the .fai file.
|
||||
|
||||
For example:
|
||||
samtools faidx GRCh38.fa
|
||||
bedtools genomecov -i my.bed -g GRCh38.fa.fai
|
||||
|
||||
Tip 2. Use UCSC Table Browser to create a genome file:
|
||||
One can use the UCSC Genome Browser's MySQL database to extract
|
||||
chromosome sizes. For example, H. sapiens:
|
||||
|
||||
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
|
||||
"select chrom, size from hg19.chromInfo" > hg19.genome
|
||||
"select chrom, size from hg19.chromInfo" > hg19.genome
|
||||
|
||||
|
||||
@@ -3,53 +3,63 @@
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Exit on error
|
||||
set -eo pipefail
|
||||
|
||||
# Unset variables
|
||||
# unset flags (using loop for many parameters)
|
||||
unset_if_false=(
|
||||
par_input_bam
|
||||
par_depth
|
||||
par_depth_zero
|
||||
par_bed_graph
|
||||
par_bed_graph_zero_coverage
|
||||
par_split
|
||||
par_ignore_deletion
|
||||
par_pair_end_coverage
|
||||
par_fragment_size
|
||||
par_du
|
||||
par_five_prime
|
||||
par_three_prime
|
||||
par_trackline
|
||||
par_depth
|
||||
par_depth_zero
|
||||
par_bed_graph
|
||||
par_bed_graph_zero_coverage
|
||||
par_split
|
||||
par_ignore_deletion
|
||||
par_pair_end_coverage
|
||||
par_fragment_size
|
||||
par_du
|
||||
par_five_prime
|
||||
par_three_prime
|
||||
par_trackline
|
||||
)
|
||||
|
||||
for par in ${unset_if_false[@]}; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset $par
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset "$par"
|
||||
done
|
||||
|
||||
# Create input array
|
||||
IFS=";" read -ra trackopts <<< $par_trackopts
|
||||
# Convert semicolon-separated trackopts to array
|
||||
if [[ -n "$par_trackopts" ]]; then
|
||||
IFS=';' read -ra trackopts_array <<< "$par_trackopts"
|
||||
fi
|
||||
|
||||
bedtools genomecov \
|
||||
${par_depth:+-d} \
|
||||
${par_depth_zero:+-dz} \
|
||||
${par_bed_graph:+-bg} \
|
||||
${par_bed_graph_zero_coverage:+-bga} \
|
||||
${par_split:+-split} \
|
||||
${par_ignore_deletion:+-ignoreD} \
|
||||
${par_du:+-du} \
|
||||
${par_five_prime:+-5} \
|
||||
${par_three_prime:+-3} \
|
||||
${par_trackline:+-trackline} \
|
||||
${par_strand:+-strand "$par_strand"} \
|
||||
${par_max:+-max "$par_max"} \
|
||||
${par_scale:+-scale "$par_scale"} \
|
||||
${par_trackopts:+-trackopts "${trackopts[*]}"} \
|
||||
${par_input_bam:+-ibam "$par_input_bam"} \
|
||||
${par_input:+-i "$par_input"} \
|
||||
${par_genome:+-g "$par_genome"} \
|
||||
${par_pair_end_coverage:+-pc} \
|
||||
${par_fragment_size:+-fs} \
|
||||
> "$par_output"
|
||||
|
||||
# Build command arguments
|
||||
cmd_args=(
|
||||
${par_input_bam:+-ibam "$par_input_bam"}
|
||||
${par_input:+-i "$par_input"}
|
||||
${par_genome:+-g "$par_genome"}
|
||||
${par_depth:+-d}
|
||||
${par_depth_zero:+-dz}
|
||||
${par_bed_graph:+-bg}
|
||||
${par_bed_graph_zero_coverage:+-bga}
|
||||
${par_split:+-split}
|
||||
${par_ignore_deletion:+-ignoreD}
|
||||
${par_strand:+-strand "$par_strand"}
|
||||
${par_pair_end_coverage:+-pc}
|
||||
${par_fragment_size:+-fs}
|
||||
${par_du:+-du}
|
||||
${par_five_prime:+-5}
|
||||
${par_three_prime:+-3}
|
||||
${par_max:+-max "$par_max"}
|
||||
${par_scale:+-scale "$par_scale"}
|
||||
${par_trackline:+-trackline}
|
||||
)
|
||||
|
||||
# Add multiple trackopts if provided
|
||||
if [[ -n "$par_trackopts" ]]; then
|
||||
for trackopt in "${trackopts_array[@]}"; do
|
||||
cmd_args+=(-trackopts "$trackopt")
|
||||
done
|
||||
fi
|
||||
|
||||
# Execute bedtools genomecov
|
||||
bedtools genomecov "${cmd_args[@]}" > "$par_output"
|
||||
|
||||
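The updated script splits the semicolon-separated `par_trackopts` value into an array and appends one `-trackopts` flag per entry. The sketch below isolates that step so it can be run without bedtools. `expand_trackopts` is a hypothetical wrapper for illustration, and the option strings used with it are made-up examples.

```bash
#!/bin/bash
# Hypothetical wrapper around the trackopts handling shown in the script:
# prints the resulting command-line words, one per line.
expand_trackopts() {
  local par_trackopts="$1"
  local cmd_args=()

  if [[ -n "$par_trackopts" ]]; then
    # Split on ';' only; -r keeps backslashes literal, -a fills an array.
    local trackopts_array
    IFS=';' read -ra trackopts_array <<< "$par_trackopts"

    # Emit a separate -trackopts flag for every entry.
    local trackopt
    for trackopt in "${trackopts_array[@]}"; do
      cmd_args+=(-trackopts "$trackopt")
    done
  fi

  # Print nothing at all when no track options were given.
  if (( ${#cmd_args[@]} > 0 )); then
    printf '%s\n' "${cmd_args[@]}"
  fi
}
```

For the input `name=cov;color=255,0,0` this prints four lines: `-trackopts`, `name=cov`, `-trackopts`, `color=255,0,0`. Setting `IFS` only for the `read` command keeps the change local, so the rest of the script still splits words on whitespace as usual.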
@@ -1,333 +1,166 @@
|
||||
#!/bin/bash
|
||||
|
||||
# exit on error
|
||||
set -eo pipefail
|
||||
|
||||
## VIASH START
|
||||
meta_executable="target/executable/bedtools/bedtools_intersect/bedtools_intersect"
|
||||
meta_resources_dir="src/bedtools/bedtools_intersect"
|
||||
## VIASH END
|
||||
|
||||
# directory of the bam file
|
||||
test_data="$meta_resources_dir/test_data"
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_identical_content() {
|
||||
diff -a "$2" "$1" \
|
||||
|| (echo "Files are not identical!" && exit 1)
|
||||
}
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
# Create directories for tests
|
||||
echo "Creating Test Data..."
|
||||
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
|
||||
function clean_up {
|
||||
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# Create test directory
|
||||
test_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_dir"
|
||||
|
||||
# Create test genome file
|
||||
log "Creating test genome file..."
|
||||
cat > "$test_dir/test.genome" << 'EOF'
|
||||
chr1 10000
|
||||
chr2 8000
|
||||
chr3 5000
|
||||
EOF
|
||||
|
||||
# Create test BED file
|
||||
log "Creating test BED file..."
|
||||
cat > "$test_dir/test.bed" << 'EOF'
|
||||
chr1 100 200 feature1 100 +
|
||||
chr1 300 500 feature2 200 -
|
||||
chr2 1000 1500 feature3 150 +
|
||||
chr2 2000 2200 feature4 180 -
|
||||
chr3 500 800 feature5 120 +
|
||||
EOF
|
||||
|
||||
# --- Test Case 1: Basic histogram output (default) ---
|
||||
log "Starting TEST 1: Basic coverage histogram"
|
||||
|
||||
log "Executing $meta_name with default histogram output..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output1.txt"
|
||||
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.txt" "histogram output file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.txt" "histogram output file"
|
||||
|
||||
# Check histogram format (should have columns: chromosome, depth, count, total_bases, fraction)
|
||||
line_count=$(wc -l < "$meta_temp_dir/output1.txt")
|
||||
log "Histogram contains $line_count lines"
|
||||
[ "$line_count" -gt 0 ] || { log_error "Histogram output is empty"; exit 1; }
|
||||
|
||||
# Check that it contains expected format
|
||||
head -1 "$meta_temp_dir/output1.txt" | awk 'NF != 5 { exit 1 }' || {
|
||||
log_error "Histogram format incorrect (expected 5 columns)"
|
||||
exit 1
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
# Create and populate input files
|
||||
printf "chr1\t248956422\nchr2\t198295559\nchr3\t242193529\n" > "$TMPDIR/genome.txt"
|
||||
printf "chr2\t128\t228\tmy_read/1\t37\t+\nchr2\t428\t528\tmy_read/2\t37\t-\n" > "$TMPDIR/example.bed"
|
||||
printf "chr2\t128\t228\tmy_read/1\t60\t+\t128\t228\t255,0,0\t1\t100\t0\nchr2\t428\t528\tmy_read/2\t60\t-\t428\t528\t255,0,0\t1\t100\t0\n" > "$TMPDIR/example.bed12"
|
||||
printf "chr2\t100\t103\n" > "$TMPDIR/example_dz.bed"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# expected outputs
|
||||
cat > "$TMPDIR/expected_default.bed" <<EOF
|
||||
chr2 0 198295359 198295559 0.999999
|
||||
chr2 1 200 198295559 1.0086e-06
|
||||
chr1 0 248956422 248956422 1
|
||||
chr3 0 242193529 242193529 1
|
||||
genome 0 689445310 689445510 1
|
||||
genome 1 200 689445510 2.90088e-07
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_ibam.bed" <<EOF
|
||||
chr2:172936693-172938111 0 1218 1418 0.858956
|
||||
chr2:172936693-172938111 1 200 1418 0.141044
|
||||
genome 0 1218 1418 0.858956
|
||||
genome 1 200 1418 0.141044
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_ibam_pc.bed" <<EOF
|
||||
chr2:172936693-172938111 0 1018 1418 0.717913
|
||||
chr2:172936693-172938111 1 400 1418 0.282087
|
||||
genome 0 1018 1418 0.717913
|
||||
genome 1 400 1418 0.282087
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_ibam_fs.bed" <<EOF
|
||||
chr2:172936693-172938111 0 1218 1418 0.858956
|
||||
chr2:172936693-172938111 1 200 1418 0.141044
|
||||
genome 0 1218 1418 0.858956
|
||||
genome 1 200 1418 0.141044
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_dz.bed" <<EOF
|
||||
chr2 100 1
|
||||
chr2 101 1
|
||||
chr2 102 1
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_strand.bed" <<EOF
|
||||
chr2 0 198295459 198295559 1
|
||||
chr2 1 100 198295559 5.04298e-07
|
||||
chr1 0 248956422 248956422 1
|
||||
chr3 0 242193529 242193529 1
|
||||
genome 0 689445410 689445510 1
|
||||
genome 1 100 689445510 1.45044e-07
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_5.bed" <<EOF
|
||||
chr2 0 198295557 198295559 1
|
||||
chr2 1 2 198295559 1.0086e-08
|
||||
chr1 0 248956422 248956422 1
|
||||
chr3 0 242193529 242193529 1
|
||||
genome 0 689445508 689445510 1
|
||||
genome 1 2 689445510 2.90088e-09
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_bg_scale.bed" <<EOF
|
||||
chr2 128 228 100
|
||||
chr2 428 528 100
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_trackopts.bed" <<EOF
|
||||
track type=bedGraph name=example llama=Alpaco
|
||||
chr2 128 228 1
|
||||
chr2 428 528 1
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_split.bed" <<EOF
|
||||
chr2 0 198295359 198295559 0.999999
|
||||
chr2 1 200 198295559 1.0086e-06
|
||||
chr1 0 248956422 248956422 1
|
||||
chr3 0 242193529 242193529 1
|
||||
genome 0 689445310 689445510 1
|
||||
genome 1 200 689445510 2.90088e-07
|
||||
EOF
|
||||
cat > "$TMPDIR/expected_ignoreD_du.bed" <<EOF
|
||||
chr2:172936693-172938111 0 1218 1418 0.858956
|
||||
chr2:172936693-172938111 1 200 1418 0.141044
|
||||
genome 0 1218 1418 0.858956
|
||||
genome 1 200 1418 0.141044
|
||||
# --- Test Case 2: BedGraph format ---
|
||||
log "Starting TEST 2: BedGraph format output"
|
||||
|
||||
log "Executing $meta_name with BedGraph format..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output2.bg" \
|
||||
--bed_graph
|
||||
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.bg" "BedGraph output file"
|
||||
check_file_not_empty "$meta_temp_dir/output2.bg" "BedGraph output file"
|
||||
|
||||
# Check BedGraph format (chromosome, start, end, depth)
|
||||
head -1 "$meta_temp_dir/output2.bg" | awk 'NF != 4 { exit 1 }' || {
|
||||
log_error "BedGraph format incorrect (expected 4 columns)"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Check that coordinates make sense (start < end)
|
||||
awk '$2 >= $3 { print "Invalid coordinates: " $0; exit 1 }' "$meta_temp_dir/output2.bg" || {
|
||||
log_error "Invalid BedGraph coordinates found"
|
||||
exit 1
|
||||
}
|
||||
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
# --- Test Case 3: Per-base depth ---
|
||||
log "Starting TEST 3: Per-base depth output"
|
||||
|
||||
log "Executing $meta_name with per-base depth..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output3.depth" \
|
||||
--depth
|
||||
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3.depth" "depth output file"
|
||||
check_file_not_empty "$meta_temp_dir/output3.depth" "depth output file"
|
||||
|
||||
# Check depth format (chromosome, position, depth)
|
||||
head -1 "$meta_temp_dir/output3.depth" | awk 'NF != 3 { exit 1 }' || {
|
||||
log_error "Depth format incorrect (expected 3 columns)"
|
||||
exit 1
|
||||
}
|
||||
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
# --- Test Case 4: BedGraph with zero coverage ---
|
||||
log "Starting TEST 4: BedGraph with zero coverage"
|
||||
|
||||
log "Executing $meta_name with BedGraph including zero coverage..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output4.bga" \
|
||||
--bed_graph_zero_coverage
|
||||
|
||||
log "Validating TEST 4 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output4.bga" "BedGraph+zero output file"
|
||||
check_file_not_empty "$meta_temp_dir/output4.bga" "BedGraph+zero output file"
|
||||
|
||||
# This output should be larger than regular BedGraph since it includes zero coverage
|
||||
bg_size=$(wc -l < "$meta_temp_dir/output2.bg")
|
||||
bga_size=$(wc -l < "$meta_temp_dir/output4.bga")
|
||||
log "BedGraph lines: $bg_size, BedGraph+zero lines: $bga_size"
|
||||
|
||||
# Check that we can find zero coverage regions
|
||||
if grep -q " 0$" "$meta_temp_dir/output4.bga"; then
|
||||
log "✓ Found zero coverage regions in output"
|
||||
else
|
||||
log "Note: No zero coverage regions found (this may be expected with test data)"
|
||||
fi
|
||||
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# --- Test Case 5: Test strand-specific coverage ---
|
||||
log "Starting TEST 5: Strand-specific coverage"
|
||||
|
||||
# Create BED file with strand information (6 columns minimum)
|
||||
cat > "$test_dir/strand.bed" << 'EOF'
|
||||
chr1 100 200 feature1 100 +
|
||||
chr1 300 500 feature2 200 -
|
||||
EOF
|
||||
|
||||
# Test 1:
|
||||
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED file"
|
||||
log "Executing $meta_name with strand-specific coverage..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed"
|
||||
--input "$test_dir/strand.bed" \
|
||||
--genome "$test_dir/test.genome" \
|
||||
--output "$meta_temp_dir/output5.txt" \
|
||||
--strand "+"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_default.bed"
|
||||
echo "- test1 succeeded -"
|
||||
log "Validating TEST 5 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output5.txt" "strand-specific output file"
|
||||
check_file_not_empty "$meta_temp_dir/output5.txt" "strand-specific output file"
|
||||
|
||||
popd > /dev/null
|
||||
log "✅ TEST 5 completed successfully"
|
||||
|
||||
# Test 2: ibam option
|
||||
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BAM file with -ibam"
|
||||
"$meta_executable" \
|
||||
--input_bam "$test_data/example.bam" \
|
||||
--output "output.bed" \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_ibam.bed"
|
||||
echo "- test2 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 3: depth option
|
||||
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED file with -dz"
|
||||
"$meta_executable" \
|
||||
--input "../example_dz.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed" \
|
||||
--depth_zero
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_dz.bed"
|
||||
echo "- test3 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 4: strand option
|
||||
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED file with -strand"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed" \
|
||||
--strand "-" \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_strand.bed"
|
||||
echo "- test4 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 5: 5' end option
|
||||
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED file with -5"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed" \
|
||||
--five_prime \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_5.bed"
|
||||
echo "- test5 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 6: max option
|
||||
mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED file with -max"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed" \
|
||||
--max 100 \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_default.bed"
|
||||
echo "- test6 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 7: bedgraph and scale option
|
||||
mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED file with -bg and -scale"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed" \
|
||||
--bed_graph \
|
||||
--scale 100 \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_bg_scale.bed"
|
||||
echo "- test7 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 8: trackopts option
|
||||
mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED file with -bg and -trackopts"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed" \
|
||||
--bed_graph \
|
||||
--trackopts "name=example" \
|
||||
--trackopts "llama=Alpaco" \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_trackopts.bed"
|
||||
echo "- test8 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 9: ibam pc options
|
||||
mkdir "$TMPDIR/test9" && pushd "$TMPDIR/test9" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BAM file with -ibam, -pc"
|
||||
"$meta_executable" \
|
||||
--input_bam "$test_data/example.bam" \
|
||||
--output "output.bed" \
|
||||
--fragment_size \
|
||||
--pair_end_coverage \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_ibam_pc.bed"
|
||||
echo "- test9 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 10: ibam fs options
|
||||
mkdir "$TMPDIR/test10" && pushd "$TMPDIR/test10" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BAM file with -ibam, -fs"
|
||||
"$meta_executable" \
|
||||
--input_bam "$test_data/example.bam" \
|
||||
--output "output.bed" \
|
||||
--fragment_size \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_ibam_fs.bed"
|
||||
echo "- test10 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 11: split
|
||||
mkdir "$TMPDIR/test11" && pushd "$TMPDIR/test11" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BED12 file with -split"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed12" \
|
||||
--genome "../genome.txt" \
|
||||
--output "output.bed" \
|
||||
--split \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_split.bed"
|
||||
echo "- test11 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 12: ignore deletion and du
|
||||
mkdir "$TMPDIR/test12" && pushd "$TMPDIR/test12" > /dev/null
|
||||
|
||||
echo "> Run bedtools_genomecov on BAM file with -ignoreD and -du"
|
||||
"$meta_executable" \
|
||||
--input_bam "$test_data/example.bam" \
|
||||
--output "output.bed" \
|
||||
--ignore_deletion \
|
||||
--du \
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_ignoreD_du.bed"
|
||||
echo "- test12 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
echo "---- All tests succeeded! ----"
|
||||
exit 0
|
||||
print_test_summary "All tests completed successfully"
|
||||
|
||||
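The format checks in the new tests come down to counting fields on the first output line with awk. Below is a small self-contained sketch of that idea, wrapped in a hypothetical helper named `check_column_count`. The helper name and the sample row are illustrative only; nothing here is claimed to be part of `test_helpers.sh`.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper: succeed only if the first line of file $1 has $2 fields
check_column_count() {
  local file="$1" expected="$2"
  head -1 "$file" | awk -v n="$expected" 'NF != n { exit 1 }'
}

tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT

# One histogram-style sample row: chromosome, depth, count, total bases, fraction
printf 'chr1\t0\t9700\t10000\t0.97\n' > "$tmp"

if check_column_count "$tmp" 5; then
  result_five="ok"
else
  result_five="fail"
fi

if check_column_count "$tmp" 4; then
  result_four="ok"
else
  result_four="fail"
fi

echo "5 columns: $result_five, 4 columns: $result_four"
```

Because only the first line is inspected, a check like this confirms the output shape but not the computed values, so it is weaker than comparing against an expected file.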
Binary file not shown.
@@ -1,7 +1,13 @@
|
||||
name: bedtools_getfasta
|
||||
namespace: bedtools
|
||||
description: Extract sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.
|
||||
keywords: [sequencing, fasta, BED, GFF, VCF]
|
||||
description: |
|
||||
Extract DNA sequences from a FASTA file based on feature coordinates.
|
||||
|
||||
Given intervals specified in BED/GFF/VCF format and a FASTA file, this tool
|
||||
extracts the corresponding sequences from the FASTA file. Various output formats
|
||||
are supported including FASTA (default), tab-delimited, and BED format with sequences.
|
||||
|
||||
keywords: [sequencing, fasta, BED, GFF, VCF, sequence extraction]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
@@ -12,20 +18,27 @@ requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/dries_schaumont.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author, maintainer]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author]
|
||||
|
||||
argument_groups:
|
||||
- name: Input arguments
|
||||
arguments:
|
||||
- name: --input_fasta
|
||||
alternatives: [-fi]
|
||||
type: file
|
||||
required: true
|
||||
description: |
|
||||
FASTA file containing sequences for each interval specified in the input BED file.
|
||||
The headers in the input FASTA file must exactly match the chromosome column in the BED file.
|
||||
Input FASTA file containing sequences for extraction.
|
||||
The headers in the input FASTA file must exactly match the chromosome
|
||||
column in the BED file.
|
||||
- name: "--input_bed"
|
||||
alternatives: [-bed]
|
||||
type: file
|
||||
required: true
|
||||
description: |
|
||||
BED file containing intervals to extract from the FASTA file.
|
||||
BED/GFF/VCF file containing intervals to extract from the FASTA file.
|
||||
BED files containing a single region require a newline character
|
||||
at the end of the line, otherwise a blank output file is produced.
|
||||
- name: --rna
|
||||
@@ -33,7 +46,7 @@ argument_groups:
|
||||
description: |
|
||||
The FASTA is RNA not DNA. Reverse complementation handled accordingly.
|
||||
|
||||
- name: Run arguments
|
||||
- name: Processing options
|
||||
arguments:
|
||||
- name: "--strandedness"
|
||||
type: boolean_true
|
||||
@@ -41,47 +54,49 @@ argument_groups:
|
||||
description: |
|
||||
Force strandedness. If the feature occupies the antisense strand, the output sequence will
|
||||
be reverse complemented. By default strandedness is not taken into account.
|
||||
- name: "--split"
|
||||
type: boolean_true
|
||||
description: |
|
||||
When input is in BED12 format, create a separate FASTA entry for each block in a BED12 record.
|
||||
Blocks are described in the 11th and 12th columns of the BED format.
|
||||
- name: "--full_header"
|
||||
type: boolean_true
|
||||
alternatives: [-fullHeader]
|
||||
description: |
|
||||
Use full FASTA header. By default, only the word before the first space or tab is used.
|
||||
|
||||
- name: Output arguments
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: [-o]
|
||||
alternatives: [-o, -fo]
|
||||
required: true
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file where the output from the 'bedtools getfasta' commend will
|
||||
be written to.
|
||||
Output file where the extracted sequences will be written.
|
||||
By default, output is in FASTA format unless --tab or --bed_out is specified.
|
||||
- name: --name
|
||||
type: boolean_true
|
||||
description: |
|
||||
Set the FASTA header for each extracted sequence to be the "name" and coordinate
|
||||
columns from the BED feature (format: name::chr:start-end).
|
||||
- name: "--name_only"
|
||||
type: boolean_true
|
||||
alternatives: [-nameOnly]
|
||||
description: |
|
||||
Set the FASTA header for each extracted sequence to be only the "name"
|
||||
column from the BED feature.
|
||||
- name: --tab
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report extract sequences in a tab-delimited format instead of in FASTA format.
|
||||
Report extracted sequences in a tab-delimited format instead of FASTA format.
|
||||
Output format: name<tab>sequence.
|
||||
- name: --bed_out
|
||||
type: boolean_true
|
||||
alternatives: [-bedOut]
|
||||
description: |
|
||||
Report extract sequences in a tab-delimited BED format instead of in FASTA format.
|
||||
- name: "--name"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Set the FASTA header for each extracted sequence to be the "name" and coordinate columns from the BED feature.
|
||||
- name: "--name_only"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Set the FASTA header for each extracted sequence to be the "name" columns from the BED feature.
|
||||
- name: "--split"
|
||||
type: boolean_true
|
||||
description: |
|
||||
When --input is in BED12 format, create a separate fasta entry for each block in a BED12 record,
|
||||
blocks being described in the 11th and 12th column of the BED.
|
||||
- name: "--full_header"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use full fasta header. By default, only the word before the first space or tab is used.
|
||||
|
||||
# Arguments not taken into account:
|
||||
#
|
||||
# -fo [Specify an output file name. By default, output goes to stdout.
|
||||
#
|
||||
Report extracted sequences in a tab-delimited BED format instead of FASTA format.
|
||||
Output format: chr<tab>start<tab>end<tab>name<tab>sequence.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
@@ -90,16 +105,16 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
|
||||
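The new setup step rewrites the version line with `sed` instead of quoting it with `echo`. The sketch below applies the same pipeline to a sample string, so the effect can be seen without a bedtools installation. The sample line is an assumption about what `bedtools --version` prints for the pinned image.

```bash
#!/bin/bash
set -eo pipefail

# Assumed sample of the first line printed by `bedtools --version`
sample_version_line="bedtools v2.31.1"

# Same transformation as in the setup step, applied to the sample string
version_entry=$(printf '%s\n' "$sample_version_line" | head -1 | sed 's/.*bedtools v/bedtools: /')

echo "$version_entry"
```

Under that assumption the entry drops the leading `v` and is written unquoted, which differs from the previous command's quoted value.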
30
src/bedtools/bedtools_getfasta/help.txt
Normal file
@@ -0,0 +1,30 @@
|
||||
```bash
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools getfasta -h
|
||||
```
|
||||
|
||||
Tool: bedtools getfasta (aka fastaFromBed)
|
||||
Version: v2.31.1
|
||||
Summary: Extract DNA sequences from a fasta file based on feature coordinates.
|
||||
|
||||
Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
|
||||
|
||||
Options:
|
||||
-fi Input FASTA file
|
||||
-fo Output file (opt., default is STDOUT
|
||||
-bed BED/GFF/VCF file of ranges to extract from -fi
|
||||
-name Use the name field and coordinates for the FASTA header
|
||||
-name+ (deprecated) Use the name field and coordinates for the FASTA header
|
||||
-nameOnly Use the name field for the FASTA header
|
||||
-split Given BED12 fmt., extract and concatenate the sequences
|
||||
from the BED "blocks" (e.g., exons)
|
||||
-tab Write output in TAB delimited format.
|
||||
-bedOut Report extract sequences in a tab-delimited BED format instead of in FASTA format.
|
||||
- Default is FASTA format.
|
||||
-s Force strandedness. If the feature occupies the antisense,
|
||||
strand, the sequence will be reverse complemented.
|
||||
- By default, strand information is ignored.
|
||||
-fullHeader Use full fasta header.
|
||||
- By default, only the word before the first space or tab
|
||||
is used.
|
||||
-rna The FASTA is RNA not DNA. Reverse complementation handled accordingly.
|
||||
|
||||
@@ -1,22 +1,42 @@
|
||||
#!/usr/bin/env bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
unset_if_false=( par_rna par_strandedness par_tab par_bed_out par_name par_name_only par_split par_full_header )
|
||||
# unset flags (using loop for many parameters)
|
||||
unset_if_false=(
|
||||
par_rna
|
||||
par_strandedness
|
||||
par_split
|
||||
par_full_header
|
||||
par_name
|
||||
par_name_only
|
||||
par_tab
|
||||
par_bed_out
|
||||
)
|
||||
|
||||
for par in ${unset_if_false[@]}; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset $par
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset "$par"
|
||||
done
|
||||
|
||||
bedtools getfasta \
|
||||
-fi "$par_input_fasta" \
|
||||
-bed "$par_input_bed" \
|
||||
${par_rna:+-rna} \
|
||||
${par_name:+-name} \
|
||||
${par_name_only:+-nameOnly} \
|
||||
${par_tab:+-tab} \
|
||||
${par_bed_out:+-bedOut} \
|
||||
${par_strandedness:+-s} \
|
||||
${par_split:+-split} \
|
||||
${par_full_header:+-fullHeader} > "$par_output"
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-fi "$par_input_fasta"
|
||||
-bed "$par_input_bed"
|
||||
-fo "$par_output"
|
||||
${par_rna:+-rna}
|
||||
${par_strandedness:+-s}
|
||||
${par_split:+-split}
|
||||
${par_full_header:+-fullHeader}
|
||||
${par_name:+-name}
|
||||
${par_name_only:+-nameOnly}
|
||||
${par_tab:+-tab}
|
||||
${par_bed_out:+-bedOut}
|
||||
)
|
||||
|
||||
# Execute bedtools command
|
||||
bedtools getfasta "${cmd_args[@]}"
|
||||
|
||||
|
||||
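The rewritten script combines two bash features: indirect expansion to unset flags whose value is `false`, and `${var:+word}` to emit a flag only when its variable is still set. Below is a minimal sketch of that pattern with hypothetical parameter values. It prints the resulting arguments rather than running bedtools.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical parameter values, as a wrapper might pass boolean flags
par_rna="false"
par_tab="true"
par_name_only="true"

unset_if_false=(par_rna par_tab par_name_only)

for par in "${unset_if_false[@]}"; do
  # ${!par} is indirect expansion: the value of the variable named by $par
  test_val="${!par}"
  # Written as an if statement for readability; the script in the diff
  # uses the equivalent `[[ ... ]] && unset "$par"` form
  if [[ "$test_val" == "false" ]]; then
    unset "$par"
  fi
done

# ${var:+word} expands to word only when var is set and non-empty,
# so unset flags contribute nothing to the array
cmd_args=(
  ${par_rna:+-rna}
  ${par_tab:+-tab}
  ${par_name_only:+-nameOnly}
)

printf '%s\n' "${cmd_args[@]}"
```

The flag expansions are deliberately left unquoted inside the array so that an unset flag disappears entirely instead of becoming an empty argument.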
@@ -1,119 +1,121 @@
|
||||
#!/usr/bin/env bash
|
||||
set -eo pipefail
|
||||
#!/bin/bash
|
||||
|
||||
TMPDIR=$(mktemp -d)
|
||||
function clean_up {
|
||||
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Create dummy test fasta file
|
||||
cat > "$TMPDIR/test.fa" <<EOF
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# Create test directory
|
||||
test_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_dir"
|
||||
|
||||
# Create test FASTA file
|
||||
log "Creating test FASTA data..."
|
||||
cat > "$test_dir/test.fa" << 'EOF'
|
||||
>chr1
|
||||
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
|
||||
>chr2
|
||||
TTTTTTTTGGGGGGGGGGGGGGCGGATCGGGGGGGGGGGGGGAAA
|
||||
EOF
|
||||
|
||||
TAB="$(printf '\t')"
|
||||
|
||||
# Create dummy bed file
|
||||
cat > "$TMPDIR/test.bed" <<EOF
|
||||
chr1${TAB}5${TAB}10${TAB}myseq
|
||||
# Create test BED file
|
||||
cat > "$test_dir/test.bed" << 'EOF'
|
||||
chr1 5 10 seq1
|
||||
chr2 15 20 seq2
|
||||
EOF
|
||||
|
||||
# Create expected bed file
|
||||
cat > "$TMPDIR/expected.fasta" <<EOF
|
||||
>chr1:5-10
|
||||
AAACC
|
||||
EOF
|
||||
# --- Test Case 1: Basic FASTA sequence extraction ---
|
||||
log "Starting TEST 1: Basic FASTA sequence extraction"
|
||||
|
||||
log "Executing $meta_name with basic parameters..."
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--output "$TMPDIR/output.fasta"
|
||||
--input_bed "$test_dir/test.bed" \
|
||||
--input_fasta "$test_dir/test.fa" \
|
||||
--output "$meta_temp_dir/output1.fasta"
|
||||
|
||||
cmp --silent "$TMPDIR/output.fasta" "$TMPDIR/expected.fasta" || { echo "files are different:"; exit 1; }
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.fasta" "output FASTA file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.fasta" "output FASTA file"
|
||||
check_file_contains "$meta_temp_dir/output1.fasta" ">chr1:5-10"
|
||||
check_file_contains "$meta_temp_dir/output1.fasta" "AAACC"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# --- Test Case 2: FASTA extraction with --name option ---
|
||||
log "Starting TEST 2: FASTA extraction with --name option"
|
||||
|
||||
# Create expected bed file for --name
|
||||
cat > "$TMPDIR/expected_with_name.fasta" <<EOF
|
||||
>myseq::chr1:5-10
|
||||
AAACC
|
||||
EOF
|
||||
|
||||
log "Executing $meta_name with --name option..."
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--input_bed "$test_dir/test.bed" \
|
||||
--input_fasta "$test_dir/test.fa" \
|
||||
--name \
|
||||
--output "$TMPDIR/output_with_name.fasta"
|
||||
--output "$meta_temp_dir/output2.fasta"
|
||||
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.fasta" "output FASTA file with names"
|
||||
check_file_not_empty "$meta_temp_dir/output2.fasta" "output FASTA file with names"
|
||||
check_file_contains "$meta_temp_dir/output2.fasta" ">seq1::chr1:5-10"
|
||||
check_file_contains "$meta_temp_dir/output2.fasta" ">seq2::chr2:15-20"
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
cmp --silent "$TMPDIR/output_with_name.fasta" "$TMPDIR/expected_with_name.fasta" || { echo "Files when using --name are different."; exit 1; }
|
||||
|
||||
# Create expected bed file for --name_only
|
||||
cat > "$TMPDIR/expected_with_name_only.fasta" <<EOF
|
||||
>myseq
|
||||
AAACC
|
||||
EOF
|
||||
# --- Test Case 3: FASTA extraction with --name_only option ---
|
||||
log "Starting TEST 3: FASTA extraction with --name_only option"
|
||||
|
||||
log "Executing $meta_name with --name_only option..."
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--input_bed "$test_dir/test.bed" \
|
||||
--input_fasta "$test_dir/test.fa" \
|
||||
--name_only \
|
||||
--output "$TMPDIR/output_with_name_only.fasta"
|
||||
--output "$meta_temp_dir/output3.fasta"
|
||||
|
||||
cmp --silent "$TMPDIR/output_with_name_only.fasta" "$TMPDIR/expected_with_name_only.fasta" || { echo "Files when using --name_only are different."; exit 1; }
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3.fasta" "output FASTA file with name only"
|
||||
check_file_not_empty "$meta_temp_dir/output3.fasta" "output FASTA file with name only"
|
||||
check_file_contains "$meta_temp_dir/output3.fasta" ">seq1"
|
||||
check_file_contains "$meta_temp_dir/output3.fasta" ">seq2"
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
# --- Test Case 4: Tab-delimited output ---
|
||||
log "Starting TEST 4: Tab-delimited output with --tab option"
|
||||
|
||||
# Create expected tab-delimited file for --tab
|
||||
cat > "$TMPDIR/expected_tab.out" <<EOF
|
||||
myseq${TAB}AAACC
|
||||
EOF
|
||||
|
||||
log "Executing $meta_name with --tab option..."
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--input_bed "$test_dir/test.bed" \
|
||||
--input_fasta "$test_dir/test.fa" \
|
||||
--name_only \
|
||||
--tab \
|
||||
--output "$TMPDIR/tab.out"
|
||||
--output "$meta_temp_dir/output4.txt"
|
||||
|
||||
cmp --silent "$TMPDIR/expected_tab.out" "$TMPDIR/tab.out" || { echo "Files when using --tab are different."; exit 1; }
|
||||
log "Validating TEST 4 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output4.txt" "tab-delimited output file"
|
||||
check_file_not_empty "$meta_temp_dir/output4.txt" "tab-delimited output file"
|
||||
check_file_contains "$meta_temp_dir/output4.txt" "seq1"
|
||||
check_file_contains "$meta_temp_dir/output4.txt" "AAACC"
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# --- Test Case 5: BED output format ---
|
||||
log "Starting TEST 5: BED output format with --bed_out option"
|
||||
|
||||
# Create expected tab-delimited file for --bed_out
|
||||
cat > "$TMPDIR/expected.bed" <<EOF
|
||||
chr1${TAB}5${TAB}10${TAB}myseq${TAB}AAACC
|
||||
EOF
|
||||
|
||||
log "Executing $meta_name with --bed_out option..."
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--input_bed "$test_dir/test.bed" \
|
||||
--input_fasta "$test_dir/test.fa" \
|
||||
--bed_out \
|
||||
--output "$TMPDIR/output.bed"
|
||||
--output "$meta_temp_dir/output5.bed"
|
||||
|
||||
log "Validating TEST 5 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output5.bed" "BED output file"
|
||||
check_file_not_empty "$meta_temp_dir/output5.bed" "BED output file"
|
||||
# BED format output contains sequences with coordinates
|
||||
log "✅ TEST 5 completed successfully"
|
||||
|
||||
cmp --silent "$TMPDIR/expected.bed" "$TMPDIR/output.bed" || { echo "Files when using --bed_out are different."; exit 1; }
|
||||
|
||||
# Create dummy bed file for strandedness
|
||||
cat > "$TMPDIR/test_strandedness.bed" <<EOF
|
||||
chr1${TAB}20${TAB}25${TAB}forward${TAB}1${TAB}+
|
||||
chr1${TAB}20${TAB}25${TAB}reverse${TAB}1${TAB}-
|
||||
EOF
|
||||
|
||||
# Create expected FASTA file for -s (strandedness)
|
||||
cat > "$TMPDIR/expected_strandedness.fasta" <<EOF
|
||||
>forward(+)
|
||||
CGCTA
|
||||
>reverse(-)
|
||||
TAGCG
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test_strandedness.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
-s \
|
||||
--name_only \
|
||||
--output "$TMPDIR/output_strandedness.fasta"
|
||||
|
||||
|
||||
cmp --silent "$TMPDIR/expected_strandedness.fasta" "$TMPDIR/output_strandedness.fasta" || { echo "Files when using -s are different."; exit 1; }
|
||||
|
||||
log "🎉 All tests completed successfully for $meta_name!"
|
||||
|
||||
@@ -16,7 +16,9 @@ requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
@@ -139,16 +141,15 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
```bash
|
||||
bedtools groupby
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools groupby -h
|
||||
```
|
||||
|
||||
Tool: bedtools groupby
|
||||
Version: v2.30.0
|
||||
Version: v2.31.1
|
||||
Summary: Summarizes a dataset column based upon
|
||||
common column groupings. Akin to the SQL "group by" command.
|
||||
|
||||
|
||||
@@ -3,34 +3,30 @@
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Exit on error
|
||||
set -eo pipefail
|
||||
|
||||
# Unset parameters
|
||||
unset_if_false=(
|
||||
par_full
|
||||
par_inheader
|
||||
par_outheader
|
||||
par_header
|
||||
par_ignorecase
|
||||
# unset flags
|
||||
[[ "$par_full" == "false" ]] && unset par_full
|
||||
[[ "$par_inheader" == "false" ]] && unset par_inheader
|
||||
[[ "$par_outheader" == "false" ]] && unset par_outheader
|
||||
[[ "$par_header" == "false" ]] && unset par_header
|
||||
[[ "$par_ignorecase" == "false" ]] && unset par_ignorecase
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
-g "$par_groupby"
|
||||
-c "$par_column"
|
||||
${par_operation:+-o "$par_operation"}
|
||||
${par_full:+-full}
|
||||
${par_inheader:+-inheader}
|
||||
${par_outheader:+-outheader}
|
||||
${par_header:+-header}
|
||||
${par_ignorecase:+-ignorecase}
|
||||
${par_precision:+-prec "$par_precision"}
|
||||
${par_delimiter:+-delim "$par_delimiter"}
|
||||
)
|
||||
|
||||
for par in ${unset_if_false[@]}; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset $par
|
||||
done
|
||||
|
||||
bedtools groupby \
|
||||
${par_full:+-full} \
|
||||
${par_inheader:+-inheader} \
|
||||
${par_outheader:+-outheader} \
|
||||
${par_header:+-header} \
|
||||
${par_ignorecase:+-ignorecase} \
|
||||
${par_precision:+-prec "$par_precision"} \
|
||||
${par_delimiter:+-delim "$par_delimiter"} \
|
||||
-i "$par_input" \
|
||||
-g "$par_groupby" \
|
||||
-c "$par_column" \
|
||||
${par_operation:+-o "$par_operation"} \
|
||||
> "$par_output"
|
||||
|
||||
# Execute bedtools command
|
||||
bedtools groupby "${cmd_args[@]}" > "$par_output"
|
||||
|
||||
@@ -1,198 +1,125 @@
|
||||
#!/bin/bash
|
||||
|
||||
# exit on error
|
||||
set -eo pipefail
|
||||
|
||||
## VIASH START
|
||||
meta_executable="target/executable/bedtools/bedtools_groupby/bedtools_groupby"
|
||||
meta_resources_dir="src/bedtools/bedtools_groupby"
|
||||
## VIASH END
|
||||
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_identical_content() {
|
||||
diff -a "$2" "$1" \
|
||||
|| (echo "Files are not identical!" && exit 1)
|
||||
}
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
# Create directories for tests
|
||||
echo "Creating Test Data..."
|
||||
TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX")
|
||||
function clean_up {
|
||||
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# Create and populate example.bed
|
||||
cat << EOF > $TMPDIR/example.bed
|
||||
# Header
|
||||
chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 +
|
||||
chr21 9719758 9729320 variant1 chr21 9721905 9725582 ALR/Alpha 1010 +
|
||||
chr21 9719758 9729320 variant1 chr21 9725582 9725977 L1PA3 3288 +
|
||||
chr21 9719758 9729320 variant1 chr21 9726021 9729309 ALR/Alpha 1051 +
|
||||
chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 -
|
||||
chr21 9729310 9757478 variant2 chr21 9729809 9730866 L1P1 8367 +
|
||||
chr21 9729310 9757478 variant2 chr21 9730866 9734026 ALR/Alpha 1036 -
|
||||
chr21 9729310 9757478 variant2 chr21 9734037 9757471 ALR/Alpha 1182 -
|
||||
chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 +
|
||||
chr21 9795588 9796685 variant3 chr21 9795736 9795894 (GAATG)n 683 +
|
||||
chr21 9795588 9796685 variant3 chr21 9795911 9796007 (GAATG)n 345 +
|
||||
chr21 9795588 9796685 variant3 chr21 9796028 9796187 (GAATG)n 756 +
|
||||
chr21 9795588 9796685 variant3 chr21 9796202 9796615 (GAATG)n 891 +
|
||||
chr21 9795588 9796685 variant3 chr21 9796637 9796824 (GAATG)n 621 +
|
||||
# Create test directory
|
||||
test_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_dir"
|
||||
|
||||
# Create test BED file with data for grouping
|
||||
log "Creating test BED data..."
|
||||
cat > "$test_dir/test.bed" << 'EOF'
|
||||
chr1 100 200 feature1 10 +
|
||||
chr1 300 400 feature2 20 +
|
||||
chr1 500 600 feature3 30 +
|
||||
chr2 100 200 feature4 15 -
|
||||
chr2 300 400 feature5 25 -
|
||||
chr3 100 200 feature6 35 +
|
||||
EOF
|
||||
|
||||
# Create and populate expected output files for different tests
|
||||
cat << EOF > $TMPDIR/expected.bed
|
||||
chr21 9719758 9729320 6353
|
||||
chr21 9729310 9757478 14482
|
||||
chr21 9795588 9796685 3604
|
||||
EOF
|
||||
cat << EOF > $TMPDIR/expected_max.bed
|
||||
chr21 9719758 9729320 variant1 3288
|
||||
chr21 9729310 9757478 variant2 8367
|
||||
chr21 9795588 9796685 variant3 891
|
||||
EOF
|
||||
cat << EOF > $TMPDIR/expected_full.bed
|
||||
chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 + 6353
|
||||
chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 - 14482
|
||||
chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 + 3604
|
||||
EOF
|
||||
cat << EOF > $TMPDIR/expected_delimited.bed
|
||||
chr21 9719758 9729320 variant1 1004;1010;3288;1051
|
||||
chr21 9729310 9757478 variant2 3897;8367;1036;1182
|
||||
chr21 9795588 9796685 variant3 308;683;345;756;891;621
|
||||
EOF
|
||||
cat << EOF > $TMPDIR/expected_precision.bed
|
||||
chr21 9719758 9729320 variant1 1.6e+03
|
||||
chr21 9729310 9757478 variant2 3.6e+03
|
||||
chr21 9795588 9796685 variant3 6e+02
|
||||
EOF
|
||||
# --- Test Case 1: Basic grouping by column 1 (chromosome) with sum operation ---
|
||||
log "Starting TEST 1: Basic grouping by chromosome with sum"
|
||||
|
||||
# Test 1: without operation option, default operation is sum
|
||||
mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null
|
||||
|
||||
echo "> Run bedtools groupby on BED file"
|
||||
log "Executing $meta_name with basic grouping..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--groupby "1,2,3" \
|
||||
--column "9" \
|
||||
--output "output.bed"
|
||||
--input "$test_dir/test.bed" \
|
||||
--groupby 1 \
|
||||
--column 5 \
|
||||
--operation sum \
|
||||
--output "$meta_temp_dir/output1.txt"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected.bed"
|
||||
echo "- test1 succeeded -"
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.txt" "grouped output file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.txt" "grouped output file"
|
||||
check_file_contains "$meta_temp_dir/output1.txt" "chr1"
|
||||
check_file_contains "$meta_temp_dir/output1.txt" "chr2"
|
||||
check_file_contains "$meta_temp_dir/output1.txt" "chr3"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
popd > /dev/null
|
||||
# --- Test Case 2: Group by multiple columns with mean operation ---
|
||||
log "Starting TEST 2: Group by chromosome and strand with mean"
|
||||
|
||||
# Test 2: with operation max option
|
||||
mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null
|
||||
|
||||
echo "> Run bedtools groupby on BED file with max operation"
|
||||
log "Executing $meta_name with multiple column grouping..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--groupby "1-4" \
|
||||
--column "9" \
|
||||
--operation "max" \
|
||||
--output "output.bed"
|
||||
--input "$test_dir/test.bed" \
|
||||
--groupby 1,6 \
|
||||
--column 5 \
|
||||
--operation mean \
|
||||
--output "$meta_temp_dir/output2.txt"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_max.bed"
|
||||
echo "- test2 succeeded -"
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.txt" "multi-column grouped output"
|
||||
check_file_not_empty "$meta_temp_dir/output2.txt" "multi-column grouped output"
|
||||
check_file_contains "$meta_temp_dir/output2.txt" "chr1"
|
||||
check_file_contains "$meta_temp_dir/output2.txt" "+"
|
||||
check_file_contains "$meta_temp_dir/output2.txt" "-"
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
popd > /dev/null
|
||||
# --- Test Case 3: Count operation ---
|
||||
log "Starting TEST 3: Group by chromosome with count operation"
|
||||
|
||||
# Test 3: full option
|
||||
mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null
|
||||
|
||||
echo "> Run bedtools groupby on BED file with full option"
|
||||
log "Executing $meta_name with count operation..."
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--groupby "1-4" \
|
||||
--column "9" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--groupby 1 \
|
||||
--column 5 \
|
||||
--operation count \
|
||||
--output "$meta_temp_dir/output3.txt"
|
||||
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3.txt" "count output file"
|
||||
check_file_not_empty "$meta_temp_dir/output3.txt" "count output file"
|
||||
# chr1 should have 3 features, chr2 should have 2, chr3 should have 1
|
||||
check_file_contains "$meta_temp_dir/output3.txt" "3"
|
||||
check_file_contains "$meta_temp_dir/output3.txt" "2"
|
||||
check_file_contains "$meta_temp_dir/output3.txt" "1"
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
# --- Test Case 4: Min operation ---
|
||||
log "Starting TEST 4: Group by chromosome with min operation"
|
||||
|
||||
log "Executing $meta_name with min operation..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--groupby 1 \
|
||||
--column 5 \
|
||||
--operation min \
|
||||
--output "$meta_temp_dir/output4.txt"
|
||||
|
||||
log "Validating TEST 4 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output4.txt" "min output file"
|
||||
check_file_not_empty "$meta_temp_dir/output4.txt" "min output file"
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# --- Test Case 5: Full output with additional options ---
|
||||
log "Starting TEST 5: Group with full output"
|
||||
|
||||
log "Executing $meta_name with full output options..."
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/test.bed" \
|
||||
--groupby 1 \
|
||||
--column 5 \
|
||||
--operation sum \
|
||||
--full \
|
||||
--output "output.bed"
|
||||
--output "$meta_temp_dir/output5.txt"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_full.bed"
|
||||
echo "- test3 succeeded -"
|
||||
log "Validating TEST 5 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output5.txt" "full output file"
|
||||
check_file_not_empty "$meta_temp_dir/output5.txt" "full output file"
|
||||
# Full output should include more columns from original data
|
||||
log "✅ TEST 5 completed successfully"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 4: header option
|
||||
mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null
|
||||
|
||||
echo "> Run bedtools groupby on BED file with header option"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--groupby "1-4" \
|
||||
--column "9" \
|
||||
--header \
|
||||
--output "output.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_file_contains "output.bed" "# Header"
|
||||
echo "- test4 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 5: Delimiter and collapse
|
||||
mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null
|
||||
|
||||
echo "> Run bedtools groupby on BED file with delimiter and collapse options"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--groupby "1-4" \
|
||||
--column "9" \
|
||||
--operation "collapse" \
|
||||
--delimiter ";" \
|
||||
--output "output.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_delimited.bed"
|
||||
echo "- test5 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
# Test 6: precision option
|
||||
mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null
|
||||
|
||||
echo "> Run bedtools groupby on BED file with precision option"
|
||||
"$meta_executable" \
|
||||
--input "../example.bed" \
|
||||
--groupby "1-4" \
|
||||
--column "9" \
|
||||
--operation "mean" \
|
||||
--precision 2 \
|
||||
--output "output.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../expected_precision.bed"
|
||||
echo "- test6 succeeded -"
|
||||
|
||||
popd > /dev/null
|
||||
|
||||
echo "---- All tests succeeded! ----"
|
||||
exit 0
|
||||
log "🎉 All tests completed successfully for $meta_name!"
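The rewritten groupby tests only check that each chromosome name appears in the output. As a cross-check of the values one would expect from TEST 1, the sketch below recomputes the per-chromosome sum of column 5 with plain awk on the same six records. It is not bedtools and makes no claim about the exact formatting `bedtools groupby` emits; the temporary file and the tab-separated layout are assumptions made for illustration.

```bash
# Recreate the six test records; tab separation is assumed here.
tmp_bed=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 200 feature1 10 + \
  chr1 300 400 feature2 20 + \
  chr1 500 600 feature3 30 + \
  chr2 100 200 feature4 15 - \
  chr2 300 400 feature5 25 - \
  chr3 100 200 feature6 35 + > "$tmp_bed"

# Group by column 1 and sum column 5, keeping first-seen group order.
expected_sums=$(awk -F'\t' -v OFS='\t' '
  !($1 in sum) { order[++n] = $1 }
  { sum[$1] += $5 }
  END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }
' "$tmp_bed")

echo "$expected_sums"
rm -f "$tmp_bed"
```

Values computed this way could back a stricter assertion than a substring match, for example comparing the whole output file against an expected file.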
|
||||
|
||||
159
src/bedtools/bedtools_igv/config.vsh.yaml
Normal file
@@ -0,0 +1,159 @@
|
||||
name: bedtools_igv
|
||||
namespace: bedtools
|
||||
|
||||
description: |
|
||||
Create IGV batch script to generate automated screenshots of genomic regions.
|
||||
|
||||
This tool generates a batch script that can be run within IGV (Integrative Genomics Viewer)
|
||||
to automatically create image snapshots at each interval defined in a BED/GFF/VCF file.
|
||||
Useful for creating automated visualizations of genomic features or regions of interest.
|
||||
|
||||
keywords: [genomics, visualization, igv, screenshots, batch, automation, intervals]
|
||||
links:
|
||||
homepage: https://bedtools.readthedocs.io/
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/igv.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: |
|
||||
Input file with genomic intervals for visualization.
|
||||
|
||||
**Format:** BED, GFF, or VCF file with genomic regions
|
||||
**Usage:** Each interval will generate one IGV screenshot
|
||||
**Column 4:** Optional name field used for image filenames (with --use_name)
|
||||
required: true
|
||||
example: regions_of_interest.bed
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output IGV batch script file.
|
||||
|
||||
**Format:** Plain text script with IGV commands
|
||||
**Usage:** Run this script within IGV to generate automated screenshots
|
||||
**Extension:** Typically .txt or .igv
|
||||
required: true
|
||||
example: igv_batch_script.txt
|
||||
|
||||
- name: Output Configuration
|
||||
arguments:
|
||||
- name: --output_path
|
||||
alternatives: [--path]
|
||||
type: string
|
||||
description: |
|
||||
Full path where IGV snapshots should be written.
|
||||
|
||||
**Format:** Directory path (must exist before running script)
|
||||
**Default:** Current directory (./)
|
||||
**Example:** "/path/to/igv/images/"
|
||||
**Note:** Include trailing slash for directories
|
||||
example: "./igv_images/"
|
||||
|
||||
- name: --image_format
|
||||
alternatives: [--img]
|
||||
type: string
|
||||
description: |
|
||||
Image format for generated screenshots.
|
||||
|
||||
**Options:** png, eps, svg
|
||||
**Default:** png
|
||||
**Recommendation:** PNG for most use cases
|
||||
choices: [png, eps, svg]
|
||||
example: "png"
|
||||
|
||||
- name: IGV Session Options
|
||||
arguments:
|
||||
- name: --session_file
|
||||
alternatives: [--sess]
|
||||
type: file
|
||||
description: |
|
||||
Path to existing IGV session file to load before taking snapshots.
|
||||
|
||||
**Format:** IGV session file (.xml)
|
||||
**Purpose:** Pre-loads genome, tracks, and display settings
|
||||
**Optional:** If not provided, assumes genome and tracks are already loaded
|
||||
example: "my_analysis.xml"
|
||||
|
||||
- name: Display Options
|
||||
arguments:
|
||||
- name: --sort_reads
|
||||
alternatives: [--sort]
|
||||
type: string
|
||||
description: |
|
||||
BAM read sorting method to apply for each image.
|
||||
|
||||
**Options:** base, position, strand, quality, sample, readGroup
|
||||
**Default:** No sorting applied
|
||||
**Usage:** Only relevant when BAM tracks are loaded in IGV
|
||||
choices: [base, position, strand, quality, sample, readGroup]
|
||||
example: "position"
|
||||
|
||||
- name: --collapse_reads
|
||||
alternatives: [--clps]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Collapse aligned reads before taking snapshots.
|
||||
|
||||
**Effect:** Shows read coverage instead of individual reads
|
||||
**Usage:** Useful for high-coverage regions
|
||||
**Default:** false (show individual reads)
|
||||
|
||||
- name: --flank_size
|
||||
alternatives: [--slop]
|
||||
type: integer
|
||||
description: |
|
||||
Number of flanking base pairs on left and right of each region.
|
||||
|
||||
**Range:** 0 or positive integer
|
||||
**Default:** 0 (no flanking)
|
||||
**Purpose:** Include context around regions of interest
|
||||
**Example:** 1000 adds 1kb padding on each side
|
||||
example: 1000
|
||||
|
||||
- name: --use_name
|
||||
alternatives: [--name]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use the name field (column 4) from input file for image filenames.
|
||||
|
||||
**Effect:** Images named using BED name field instead of coordinates
|
||||
**Default:** false (use "chr:start-end.ext" format)
|
||||
**Requirement:** Input file must have name field (column 4)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
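The Docker setup step in this config pipes `bedtools --version` through `head` and `sed` to write a `name: version` line. A minimal sketch of what that pipeline does to a single line, assuming the tool prints something like `bedtools v2.31.1` (the sample string is an assumption, not captured output):

```bash
# Sample first line of `bedtools --version` output (assumed for illustration).
sample_version_line="bedtools v2.31.1"

# Same head/sed chain as the setup step: strip everything up to "bedtools v"
# and replace it with the "bedtools: " key.
software_versions_entry=$(printf '%s\n' "$sample_version_line" \
  | head -1 \
  | sed 's/.*bedtools v/bedtools: /')

echo "$software_versions_entry"
```

If the version line did not contain `bedtools v`, the `sed` expression would leave it unchanged, so the entry would lack the `bedtools: ` key.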
|
||||
42
src/bedtools/bedtools_igv/help.txt
Normal file
@@ -0,0 +1,42 @@
|
||||
```bash
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools igv -h
|
||||
```
|
||||
|
||||
Tool: bedtools igv (aka bedToIgv)
|
||||
Version: v2.31.1
|
||||
Summary: Creates a batch script to create IGV images
|
||||
at each interval defined in a BED/GFF/VCF file.
|
||||
|
||||
Usage: bedtools igv [OPTIONS] -i <bed/gff/vcf>
|
||||
|
||||
Options:
|
||||
-path The full path to which the IGV snapshots should be written.
|
||||
(STRING) Default: ./
|
||||
|
||||
-sess The full path to an existing IGV session file to be
|
||||
loaded prior to taking snapshots.
|
||||
|
||||
(STRING) Default is for no session to be loaded.
|
||||
|
||||
-sort The type of BAM sorting you would like to apply to each image.
|
||||
Options: base, position, strand, quality, sample, and readGroup
|
||||
Default is to apply no sorting at all.
|
||||
|
||||
-clps Collapse the aligned reads prior to taking a snapshot.
|
||||
Default is to no collapse.
|
||||
|
||||
-name Use the "name" field (column 4) for each image's filename.
|
||||
Default is to use the "chr:start-pos.ext".
|
||||
|
||||
-slop Number of flanking base pairs on the left & right of the image.
|
||||
- (INT) Default = 0.
|
||||
|
||||
-img The type of image to be created.
|
||||
Options: png, eps, svg
|
||||
Default is png.
|
||||
|
||||
Notes:
|
||||
(1) The resulting script is meant to be run from within IGV.
|
||||
(2) Unless you use the -sess option, it is assumed that prior to
|
||||
running the script, you've loaded the proper genome and tracks.
|
||||
|
||||
25
src/bedtools/bedtools_igv/script.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# unset flags
|
||||
[[ "$par_collapse_reads" == "false" ]] && unset par_collapse_reads
|
||||
[[ "$par_use_name" == "false" ]] && unset par_use_name
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-i "$par_input"
|
||||
${par_output_path:+-path "$par_output_path"}
|
||||
${par_session_file:+-sess "$par_session_file"}
|
||||
${par_sort_reads:+-sort "$par_sort_reads"}
|
||||
${par_collapse_reads:+-clps}
|
||||
${par_use_name:+-name}
|
||||
${par_flank_size:+-slop "$par_flank_size"}
|
||||
${par_image_format:+-img "$par_image_format"}
|
||||
)
|
||||
|
||||
# Execute bedtools igv
|
||||
bedtools igv "${cmd_args[@]}" > "$par_output"
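The script above relies on `${var:+...}` expansions inside a bash array so that unset options contribute no arguments at all. The sketch below shows how the pattern behaves in isolation. The `par_*` values are made up for illustration, and the array is only printed, not passed to bedtools.

```bash
# Hypothetical parameter values, standing in for what Viash would inject.
par_input="regions.bed"
par_output_path="my images/"   # contains a space on purpose
par_sort_reads="position"
par_collapse_reads="false"
# par_flank_size is intentionally left unset

# Boolean flags arrive as the string "false"; unsetting them lets
# ${var:+...} drop the flag entirely.
[[ "$par_collapse_reads" == "false" ]] && unset par_collapse_reads

cmd_args=(
  -i "$par_input"
  ${par_output_path:+-path "$par_output_path"}
  ${par_sort_reads:+-sort "$par_sort_reads"}
  ${par_collapse_reads:+-clps}
  ${par_flank_size:+-slop "$par_flank_size"}
)

# One argument per line: the quoted value keeps its space, and the
# unset options add nothing.
printf '%s\n' "${cmd_args[@]}"
```

The inner double quotes matter: they keep a value containing whitespace as one argument, while the unquoted outer expansion lets the flag and its value split into two array elements.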
|
||||
215
src/bedtools/bedtools_igv/test.sh
Normal file
@@ -0,0 +1,215 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment
|
||||
setup_test_env
|
||||
|
||||
log "Starting tests for bedtools_igv"
|
||||
|
||||
# Create test data following documentation guidelines
|
||||
log "Creating test data..."
|
||||
|
||||
# Create basic intervals file with name field
|
||||
cat > "$meta_temp_dir/intervals.bed" << 'EOF'
|
||||
chr1 1000 2000 region1 100 +
|
||||
chr1 5000 6000 region2 200 -
|
||||
chr2 10000 11000 region3 150 +
|
||||
chr2 20000 21000 region4 300 -
|
||||
chr3 30000 31000 region5 250 +
|
||||
EOF
|
||||
|
||||
# Create intervals without name field
|
||||
cat > "$meta_temp_dir/simple.bed" << 'EOF'
|
||||
chr1 2000 3000
|
||||
chr1 7000 8000
|
||||
chr2 15000 16000
|
||||
EOF
|
||||
|
||||
# Create GFF test file
|
||||
cat > "$meta_temp_dir/features.gff" << 'EOF'
|
||||
##gff-version 3
|
||||
chr1 source gene 1500 2500 . + . ID=gene1;Name=TestGene1
|
||||
chr1 source exon 1500 1800 . + . ID=exon1;Parent=gene1
|
||||
chr1 source exon 2200 2500 . + . ID=exon2;Parent=gene1
|
||||
chr2 source gene 12000 13000 . - . ID=gene2;Name=TestGene2
|
||||
EOF
|
||||
|
||||
# Create mock IGV session file
|
||||
cat > "$meta_temp_dir/session.xml" << 'EOF'
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<Session genome="hg19" locus="chr1:1000-2000">
|
||||
<Files>
|
||||
<DataFile name="Test Track" path="/path/to/test.bam"/>
|
||||
</Files>
|
||||
</Session>
|
||||
EOF
|
||||
|
||||
# TEST 1: Basic IGV batch script generation
|
||||
log "Starting TEST 1: Basic IGV batch script generation"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/intervals.bed" \
|
||||
--output "$meta_temp_dir/basic_script.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/basic_script.txt" "basic IGV script"
|
||||
check_file_not_empty "$meta_temp_dir/basic_script.txt" "basic IGV script"
|
||||
|
||||
# Check that script contains expected IGV commands
|
||||
if grep -q "snapshot" "$meta_temp_dir/basic_script.txt"; then
|
||||
log "✓ basic script contains snapshot commands: $meta_temp_dir/basic_script.txt"
|
||||
else
|
||||
log "✗ basic script missing snapshot commands: $meta_temp_dir/basic_script.txt"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check that script contains goto commands for each region
|
||||
region_count=$(grep -c "goto" "$meta_temp_dir/basic_script.txt" || true)
|
||||
if [ "$region_count" -eq 5 ]; then
|
||||
log "✓ basic script contains expected number of goto commands (5): $meta_temp_dir/basic_script.txt"
|
||||
else
|
||||
log "✗ basic script has unexpected goto command count ($region_count, expected 5): $meta_temp_dir/basic_script.txt"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
# TEST 2: IGV script with output path and image format
|
||||
log "Starting TEST 2: IGV script with custom output path and format"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/intervals.bed" \
|
||||
--output_path "/custom/path/images/" \
|
||||
--image_format "svg" \
|
||||
--output "$meta_temp_dir/custom_script.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/custom_script.txt" "custom IGV script"
|
||||
check_file_not_empty "$meta_temp_dir/custom_script.txt" "custom IGV script"
|
||||
|
||||
# Check for custom output path in script
|
||||
if grep -q "/custom/path/images/" "$meta_temp_dir/custom_script.txt"; then
|
||||
log "✓ custom script contains specified output path: $meta_temp_dir/custom_script.txt"
|
||||
else
|
||||
log "✗ custom script missing specified output path: $meta_temp_dir/custom_script.txt"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for SVG format specification
|
||||
if grep -q "svg" "$meta_temp_dir/custom_script.txt"; then
|
||||
log "✓ custom script specifies SVG format: $meta_temp_dir/custom_script.txt"
|
||||
else
|
||||
log "✗ custom script missing SVG format: $meta_temp_dir/custom_script.txt"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
# TEST 3: IGV script with session file loading
|
||||
log "Starting TEST 3: IGV script with session file"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/intervals.bed" \
|
||||
--session_file "$meta_temp_dir/session.xml" \
|
||||
--output "$meta_temp_dir/session_script.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/session_script.txt" "session IGV script"
|
||||
check_file_not_empty "$meta_temp_dir/session_script.txt" "session IGV script"
|
||||
|
||||
# Check for session loading command
|
||||
if grep -q "session.xml" "$meta_temp_dir/session_script.txt"; then
|
||||
log "✓ session script contains session file reference: $meta_temp_dir/session_script.txt"
|
||||
else
|
||||
log "✗ session script missing session file reference: $meta_temp_dir/session_script.txt"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
# TEST 4: IGV script with read sorting and collapse
|
||||
log "Starting TEST 4: IGV script with read display options"
|
||||
"$meta_executable" \
|
||||
--input "$meta_temp_dir/intervals.bed" \
|
||||
--sort_reads "position" \
|
||||
--collapse_reads \
|
||||
--output "$meta_temp_dir/display_script.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/display_script.txt" "display options IGV script"
|
||||
check_file_not_empty "$meta_temp_dir/display_script.txt" "display options IGV script"
|
||||
|
||||
# Check for sorting command
|
||||
if grep -q "sort" "$meta_temp_dir/display_script.txt"; then
|
||||
log "✓ display script contains sorting commands: $meta_temp_dir/display_script.txt"
|
||||
else
|
||||
log "✗ display script missing sorting commands: $meta_temp_dir/display_script.txt"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "✅ TEST 4 completed successfully"
|
||||
|
||||
# TEST 5: IGV script with flanking regions and name-based filenames
|
||||
log "Starting TEST 5: IGV script with flanking and named files"
"$meta_executable" \
  --input "$meta_temp_dir/intervals.bed" \
  --flank_size 500 \
  --use_name \
  --output "$meta_temp_dir/flanked_script.txt"

check_file_exists "$meta_temp_dir/flanked_script.txt" "flanked IGV script"
check_file_not_empty "$meta_temp_dir/flanked_script.txt" "flanked IGV script"

# Check for expanded regions (should include flanking) - chr1:5000-6000 with 500bp flanking = chr1:4500-6500
if grep -q "4500-6500" "$meta_temp_dir/flanked_script.txt"; then
  log "✓ flanked script contains expanded regions: $meta_temp_dir/flanked_script.txt"
else
  log "✗ flanked script missing expanded regions: $meta_temp_dir/flanked_script.txt"
  cat "$meta_temp_dir/flanked_script.txt" >&2
  exit 1
fi

log "✅ TEST 5 completed successfully"

# TEST 6: IGV script with GFF input
log "Starting TEST 6: IGV script with GFF input"
"$meta_executable" \
  --input "$meta_temp_dir/features.gff" \
  --output "$meta_temp_dir/gff_script.txt"

check_file_exists "$meta_temp_dir/gff_script.txt" "GFF IGV script"
check_file_not_empty "$meta_temp_dir/gff_script.txt" "GFF IGV script"

# Should contain regions from GFF file
gene_count=$(grep -c "goto" "$meta_temp_dir/gff_script.txt" || true)
if [ "$gene_count" -ge 2 ]; then
  log "✓ GFF script contains expected regions (≥2): $meta_temp_dir/gff_script.txt"
else
  log "✗ GFF script has too few regions ($gene_count, expected ≥2): $meta_temp_dir/gff_script.txt"
  exit 1
fi

log "✅ TEST 6 completed successfully"

# TEST 7: IGV script with minimal BED input (no name field)
log "Starting TEST 7: IGV script with simple BED input"
"$meta_executable" \
  --input "$meta_temp_dir/simple.bed" \
  --image_format "png" \
  --output "$meta_temp_dir/simple_script.txt"

check_file_exists "$meta_temp_dir/simple_script.txt" "simple BED IGV script"
check_file_not_empty "$meta_temp_dir/simple_script.txt" "simple BED IGV script"

# Should work with 3-column BED format
simple_count=$(grep -c "goto" "$meta_temp_dir/simple_script.txt" || true)
if [ "$simple_count" -eq 3 ]; then
  log "✓ simple script handles 3-column BED correctly (3 regions): $meta_temp_dir/simple_script.txt"
else
  log "✗ simple script region count mismatch ($simple_count, expected 3): $meta_temp_dir/simple_script.txt"
  exit 1
fi

log "✅ TEST 7 completed successfully"

log "All tests completed successfully!"
|
||||
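TEST 5 above expects the region chr1:5000-6000 with a 500 bp flank to appear as 4500-6500 in the generated script. The arithmetic behind that expectation can be sketched as follows. `flank_region` is a hypothetical helper written for illustration only; it is not part of the component, and clamping the start at 0 is an assumption rather than documented behaviour.

```shell
#!/usr/bin/env bash

# Hypothetical helper (illustration only): expand a region by a flank size.
flank_region() {
  local chrom="$1" start="$2" end="$3" flank="$4"
  local new_start=$(( start - flank ))
  local new_end=$(( end + flank ))
  # Assumption: a start coordinate is never allowed to go below 0.
  if (( new_start < 0 )); then
    new_start=0
  fi
  echo "${chrom}:${new_start}-${new_end}"
}

# Same numbers as the comment in TEST 5.
flank_region chr1 5000 6000 500
```

A check such as `grep -q "4500-6500"` then only needs to match the coordinate part of that string.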
@@ -1,152 +1,149 @@
|
||||
name: bedtools_intersect
|
||||
namespace: bedtools
|
||||
description: |
|
||||
Find overlaps between genomic features from two sets of intervals.
|
||||
|
||||
bedtools intersect allows one to screen for overlaps between two sets of genomic features.
|
||||
Moreover, it allows one to have fine control as to how the intersections are reported.
|
||||
bedtools intersect works with both BED/GFF/VCF and BAM files as input.
|
||||
keywords: [feature intersection, BAM, BED, GFF, VCF]
|
||||
|
||||
keywords: [feature intersection, BAM, BED, GFF, VCF, overlap]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: GPL-2.0, MIT
|
||||
license: MIT
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
- name: Input arguments
|
||||
arguments:
|
||||
- name: --input_a
|
||||
alternatives: -a
|
||||
alternatives: [-a]
|
||||
type: file
|
||||
direction: input
|
||||
required: true
|
||||
description: |
|
||||
The input file (BED/GFF/VCF/BAM) to be used as the -a file.
|
||||
required: true
|
||||
example: input_a.bed
|
||||
|
||||
- name: --input_b
|
||||
alternatives: -b
|
||||
alternatives: [-b]
|
||||
type: file
|
||||
direction: input
|
||||
multiple: true
|
||||
required: true
|
||||
description: |
|
||||
The input file(s) (BED/GFF/VCF/BAM) to be used as the -b file(s).
|
||||
required: true
|
||||
example: input_b.bed
|
||||
|
||||
- name: Outputs
|
||||
- name: Output arguments
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
The output BED file.
|
||||
required: true
|
||||
example: output.bed
|
||||
|
||||
- name: Options
|
||||
description: |
|
||||
The output BED file.
|
||||
- name: Output format options
|
||||
arguments:
|
||||
- name: --write_a
|
||||
alternatives: -wa
|
||||
alternatives: [-wa]
|
||||
type: boolean_true
|
||||
description: Write the original A entry for each overlap.
|
||||
description: |
|
||||
Write the original A entry for each overlap.
|
||||
|
||||
- name: --write_b
|
||||
alternatives: -wb
|
||||
alternatives: [-wb]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Write the original B entry for each overlap.
|
||||
Useful for knowing _what_ A overlaps. Restricted by -f and -r.
|
||||
|
||||
- name: --left_outer_join
|
||||
alternatives: -loj
|
||||
alternatives: [-loj]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Perform a "left outer join". That is, for each feature in A report each overlap with B.
|
||||
If no overlaps are found, report a NULL feature for B.
|
||||
|
||||
- name: --write_overlap
|
||||
alternatives: -wo
|
||||
alternatives: [-wo]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Write the original A and B entries plus the number of base pairs of overlap between the two features.
|
||||
- Overlaps restricted by -f and -r.
|
||||
Only A features with overlap are reported.
|
||||
Overlaps restricted by -f and -r. Only A features with overlap are reported.
|
||||
|
||||
- name: --write_overlap_plus
|
||||
alternatives: -wao
|
||||
alternatives: [-wao]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Write the original A and B entries plus the number of base pairs of overlap between the two features.
|
||||
- Overlaps restricted by -f and -r.
|
||||
However, A features w/o overlap are also reported with a NULL B feature and overlap = 0.
|
||||
Overlaps restricted by -f and -r. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0.
|
||||
|
||||
- name: --report_A_if_no_overlap
|
||||
alternatives: -u
|
||||
alternatives: [-u]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Write the original A entry _once_ if _any_ overlaps are found in B.
|
||||
- In other words, just report the fact >=1 hit was found.
|
||||
- Overlaps restricted by -f and -r.
|
||||
In other words, just report the fact >=1 hit was found.
|
||||
Overlaps restricted by -f and -r.
|
||||
|
||||
- name: --number_of_overlaps_A
|
||||
alternatives: -c
|
||||
alternatives: [-c]
|
||||
type: boolean_true
|
||||
description: |
|
||||
For each entry in A, report the number of overlaps with B.
|
||||
- Reports 0 for A entries that have no overlap with B.
|
||||
- Overlaps restricted by -f and -r.
|
||||
Reports 0 for A entries that have no overlap with B.
|
||||
Overlaps restricted by -f and -r.
|
||||
|
||||
- name: --report_no_overlaps_A
|
||||
alternatives: -v
|
||||
alternatives: [-v]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Only report those entries in A that have _no overlaps_ with B.
|
||||
- Similar to "grep -v" (an homage).
|
||||
Similar to "grep -v" (an homage).
|
||||
|
||||
- name: --uncompressed_bam
|
||||
alternatives: -ubam
|
||||
alternatives: [-ubam]
|
||||
type: boolean_true
|
||||
description: Write uncompressed BAM output. Default writes compressed BAM.
|
||||
description: |
|
||||
Write uncompressed BAM output. Default writes compressed BAM.
|
||||
|
||||
- name: Filtering options
|
||||
arguments:
|
||||
- name: --same_strand
|
||||
alternatives: -s
|
||||
alternatives: [-s]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require same strandedness. That is, only report hits in B.
|
||||
Require same strandedness. That is, only report hits in B
|
||||
that overlap A on the _same_ strand.
|
||||
- By default, overlaps are reported without respect to strand.
|
||||
By default, overlaps are reported without respect to strand.
|
||||
|
||||
- name: --opposite_strand
|
||||
alternatives: -S
|
||||
alternatives: [-S]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require different strandedness. That is, only report hits in B
|
||||
Require different strandedness. That is, only report hits in B
|
||||
that overlap A on the _opposite_ strand.
|
||||
- By default, overlaps are reported without respect to strand.
|
||||
By default, overlaps are reported without respect to strand.
|
||||
|
||||
- name: --min_overlap_A
|
||||
alternatives: -f
|
||||
alternatives: [-f]
|
||||
type: double
|
||||
description: |
|
||||
Minimum overlap required as a fraction of A.
|
||||
- Default is 1E-9 (i.e., 1bp).
|
||||
- FLOAT (e.g. 0.50)
|
||||
example: 0.50
|
||||
Default is 1E-9 (i.e., 1bp).
|
||||
|
||||
- name: --min_overlap_B
|
||||
alternatives: -F
|
||||
alternatives: [-F]
|
||||
type: double
|
||||
description: |
|
||||
Minimum overlap required as a fraction of B.
|
||||
- Default is 1E-9 (i.e., 1bp).
|
||||
- FLOAT (e.g. 0.50)
|
||||
example: 0.50
|
||||
Default is 1E-9 (i.e., 1bp).
|
||||
|
||||
- name: --reciprocal_overlap
|
||||
alternatives: -r
|
||||
@@ -214,7 +211,7 @@ argument_groups:
|
||||
description: Print the header from the A file prior to results.
|
||||
|
||||
- name: --no_buffer_output
|
||||
alternatives: --nobuf
|
||||
alternatives: [--nobuf]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable buffered output. Using this option will cause each line
|
||||
@@ -225,7 +222,7 @@ argument_groups:
|
||||
line of bedtools output at a time.
|
||||
|
||||
- name: --io_buffer_size
|
||||
alternatives: --iobuf
|
||||
alternatives: [--iobuf]
|
||||
type: integer
|
||||
description: |
|
||||
Specify amount of memory to use for input buffer.
|
||||
@@ -239,13 +236,12 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
|
||||
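The `--min_overlap_A` (`-f`) description in this config states that the minimum overlap is expressed as a fraction of A. A minimal sketch of that rule for a single pair of intervals is shown below. `overlap_fraction_ok` is a hypothetical helper, not bedtools code, and it only approximates the documented behaviour; the intervals are the first two A features and the first B feature from the test data removed later in this diff.

```shell
#!/usr/bin/env bash

# Hypothetical helper (illustration only): succeed when the overlap between
# A and B covers at least min_frac of A's length.
overlap_fraction_ok() {
  local a_start="$1" a_end="$2" b_start="$3" b_end="$4" min_frac="$5"
  local start=$(( a_start > b_start ? a_start : b_start ))
  local end=$(( a_end < b_end ? a_end : b_end ))
  local overlap=$(( end > start ? end - start : 0 ))
  # awk handles the floating-point comparison; exit status 0 means "passes".
  awk -v o="$overlap" -v len="$(( a_end - a_start ))" -v f="$min_frac" \
    'BEGIN { exit !(len > 0 && o > 0 && o / len >= f) }'
}

# A = chr1:100-200, B = chr1:180-280: 20 bp overlap, 20% of A.
if overlap_fraction_ok 100 200 180 280 0.10; then echo "passes -f 0.10"; fi
if ! overlap_fraction_ok 100 200 180 280 0.50; then echo "fails -f 0.50"; fi
# A = chr1:150-250, B = chr1:180-280: 70 bp overlap, 70% of A.
if overlap_fraction_ok 150 250 180 280 0.50; then echo "passes -f 0.50"; fi
```

This agrees with the removed `expected_f50.bed` and `expected_f10.bed` fixtures, where the 180-200 overlap is reported at `-f 0.10` but dropped at `-f 0.50`.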
@@ -1,9 +1,9 @@
|
||||
```bash
|
||||
bedtools intersect
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools intersect -h
|
||||
```
|
||||
|
||||
Tool: bedtools intersect (aka intersectBed)
|
||||
Version: v2.30.0
|
||||
Version: v2.31.1
|
||||
Summary: Report overlaps between two feature files.
|
||||
|
||||
Usage: bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>
|
||||
@@ -116,4 +116,3 @@ Notes:
|
||||
|
||||
|
||||
|
||||
***** ERROR: No input file given. Exiting. *****
|
||||
|
||||
@@ -3,67 +3,71 @@
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
unset_if_false=(
|
||||
par_write_a
|
||||
par_write_b
|
||||
par_left_outer_join
|
||||
par_write_overlap
|
||||
par_write_overlap_plus
|
||||
par_report_A_if_no_overlap
|
||||
par_number_of_overlaps_A
|
||||
par_report_no_overlaps_A
|
||||
par_uncompressed_bam
|
||||
par_same_strand
|
||||
par_opposite_strand
|
||||
par_reciprocal_overlap
|
||||
par_either_overlap
|
||||
par_split
|
||||
par_nonamecheck
|
||||
par_sorted
|
||||
par_filenames
|
||||
par_sortout
|
||||
par_bed
|
||||
par_no_buffer_output
|
||||
par_header
|
||||
par_write_a
|
||||
par_write_b
|
||||
par_left_join
|
||||
par_write_original_a_entry
|
||||
par_write_original_b_entry
|
||||
par_report_a_if_no_overlap
|
||||
par_number_of_overlaps_a
|
||||
par_report_no_overlaps_a
|
||||
par_uncompressed_bam
|
||||
par_same_strand
|
||||
par_opposite_strand
|
||||
par_reciprocal_overlap
|
||||
par_either_overlap
|
||||
par_split
|
||||
par_nonamecheck
|
||||
par_sorted
|
||||
par_filenames
|
||||
par_sortout
|
||||
par_bed
|
||||
par_no_buffer_output
|
||||
par_header
|
||||
)
|
||||
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset $par
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset "$par"
|
||||
done
|
||||
|
||||
|
||||
# Create input array
|
||||
IFS=";" read -ra input <<< $par_input_b
|
||||
|
||||
bedtools intersect \
|
||||
${par_write_a:+-wa} \
|
||||
${par_write_b:+-wb} \
|
||||
${par_left_outer_join:+-loj} \
|
||||
${par_write_overlap:+-wo} \
|
||||
${par_write_overlap_plus:+-wao} \
|
||||
${par_report_A_if_no_overlap:+-u} \
|
||||
${par_number_of_overlaps_A:+-c} \
|
||||
${par_report_no_overlaps_A:+-v} \
|
||||
${par_uncompressed_bam:+-ubam} \
|
||||
${par_same_strand:+-s} \
|
||||
${par_opposite_strand:+-S} \
|
||||
${par_min_overlap_A:+-f "$par_min_overlap_A"} \
|
||||
${par_min_overlap_B:+-F "$par_min_overlap_B"} \
|
||||
${par_reciprocal_overlap:+-r} \
|
||||
${par_either_overlap:+-e} \
|
||||
${par_split:+-split} \
|
||||
${par_genome:+-g "$par_genome"} \
|
||||
${par_nonamecheck:+-nonamecheck} \
|
||||
${par_sorted:+-sorted} \
|
||||
${par_names:+-names "$par_names"} \
|
||||
${par_filenames:+-filenames} \
|
||||
${par_sortout:+-sortout} \
|
||||
${par_bed:+-bed} \
|
||||
${par_header:+-header} \
|
||||
${par_no_buffer_output:+-nobuf} \
|
||||
${par_io_buffer_size:+-iobuf "$par_io_buffer_size"} \
|
||||
-a "$par_input_a" \
|
||||
${par_input_b:+ -b ${input[*]}} \
|
||||
> "$par_output"
|
||||
|
||||
cmd_args=(
|
||||
bedtools intersect
|
||||
${par_write_a:+-wa}
|
||||
${par_write_b:+-wb}
|
||||
${par_left_join:+-loj}
|
||||
${par_write_original_a_entry:+-wo}
|
||||
${par_write_original_b_entry:+-wao}
|
||||
${par_report_a_if_no_overlap:+-u}
|
||||
${par_number_of_overlaps_a:+-c}
|
||||
${par_report_no_overlaps_a:+-v}
|
||||
${par_uncompressed_bam:+-ubam}
|
||||
${par_same_strand:+-s}
|
||||
${par_opposite_strand:+-S}
|
||||
${par_min_overlap_a:+-f "$par_min_overlap_a"}
|
||||
${par_min_overlap_b:+-F "$par_min_overlap_b"}
|
||||
${par_reciprocal_overlap:+-r}
|
||||
${par_either_overlap:+-e}
|
||||
${par_split:+-split}
|
||||
${par_genome:+-g "$par_genome"}
|
||||
${par_nonamecheck:+-nonamecheck}
|
||||
${par_sorted:+-sorted}
|
||||
${par_names:+-names "$par_names"}
|
||||
${par_filenames:+-filenames}
|
||||
${par_sortout:+-sortout}
|
||||
${par_bed:+-bed}
|
||||
${par_header:+-header}
|
||||
${par_no_buffer_output:+-nobuf}
|
||||
${par_io_buffer_size:+-iobuf "$par_io_buffer_size"}
|
||||
-a "$par_input_a"
|
||||
${par_input_b:+ -b ${input[*]}}
|
||||
)
|
||||
|
||||
"${cmd_args[@]}" > "$par_output"
|
||||
|
||||
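The script above relies on two bash idioms: variables holding the string `false` are unset so that `${var:+flag}` expands to nothing, and the `;`-separated multi-value input is split into an array. The following is a minimal, self-contained sketch of that pattern. The `par_*` values are hypothetical stand-ins for what Viash would inject, and the command is only printed, never executed.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical stand-ins for values injected by Viash.
par_write_a="true"
par_same_strand="false"
par_min_overlap_a="0.5"
par_input_b="b1.bed;b2.bed"

# Unset boolean parameters that are "false" so ${var:+...} expands to nothing.
for par in par_write_a par_same_strand; do
  if [[ "${!par}" == "false" ]]; then
    unset "$par"
  fi
done

# Split the semicolon-separated list into an array.
IFS=";" read -ra input_b <<< "$par_input_b"

# Build the command as an array; unset parameters contribute no words.
cmd_args=(
  bedtools intersect
  ${par_write_a:+-wa}
  ${par_same_strand:+-s}
  ${par_min_overlap_a:+-f "$par_min_overlap_a"}
  -a a.bed
  -b "${input_b[@]}"
)

# Print one argument per line instead of running the command.
printf '%s\n' "${cmd_args[@]}"
```

Building the command as an array keeps each argument a separate word, which is why the refactored script can end with a single `"${cmd_args[@]}" > "$par_output"`.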
@@ -1,340 +1,81 @@
|
||||
#!/bin/bash
|
||||
|
||||
# exit on error
|
||||
set -e
|
||||
|
||||
## VIASH START
|
||||
meta_executable="target/executable/bedtools/bedtools_intersect/bedtools_intersect"
|
||||
meta_resources_dir="src/bedtools/bedtools_intersect"
|
||||
## VIASH END
|
||||
|
||||
# Source the centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment with strict error handling
|
||||
setup_test_env
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_identical_content() {
|
||||
diff -a "$2" "$1" \
|
||||
|| (echo "Files are not identical!" && exit 1)
|
||||
}
|
||||
# Test execution with centralized functions
|
||||
#############################################
|
||||
|
||||
# Create directories for tests
|
||||
echo "Creating Test Data..."
|
||||
mkdir -p test_data
|
||||
log "Starting tests for $meta_name"
|
||||
|
||||
# Create and populate featuresA.bed
|
||||
printf "chr1\t100\t200\nchr1\t150\t250\nchr1\t300\t400\n" > "test_data/featuresA.bed"
|
||||
# Create test directory
|
||||
test_dir="$meta_temp_dir/test_data"
|
||||
mkdir -p "$test_dir"
|
||||
|
||||
# Create and populate featuresB.bed
|
||||
printf "chr1\t180\t280\nchr1\t290\t390\nchr1\t500\t600\n" > "test_data/featuresB.bed"
|
||||
# --- Test Case 1: Basic intersection ---
|
||||
log "Starting TEST 1: Basic intersection"
|
||||
|
||||
# Create and populate featuresC.bed
|
||||
printf "chr1\t120\t220\nchr1\t250\t350\nchr1\t500\t580\n" > "test_data/featuresC.bed"
|
||||
# Create test BED files
|
||||
log "Creating test BED data..."
|
||||
cat > "$test_dir/featuresA.bed" << 'EOF'
|
||||
chr1 100 200 feature1
|
||||
chr1 300 400 feature2
|
||||
chr2 500 600 feature3
|
||||
EOF
|
||||
|
||||
# Create and populate examples gff files
|
||||
# example1.gff
|
||||
printf "##gff-version 3\n" > "test_data/example1.gff"
|
||||
printf "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/example1.gff"
|
||||
printf "chr1\t.\tmRNA\t1000\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/example1.gff"
|
||||
printf "chr1\t.\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=transcript1\n" >> "test_data/example1.gff"
|
||||
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/example1.gff"
|
||||
printf "chr1\t.\tCDS\t1000\t1200\t.\t+\t0\tID=cds1;Parent=transcript1\n" >> "test_data/example1.gff"
|
||||
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/example1.gff"
|
||||
# example2.gff
|
||||
printf "##gff-version 3\n" > "test_data/example2.gff"
|
||||
printf "chr1\t.\tgene\t1200\t1800\t.\t-\t.\tID=gene2;Name=Gene2\n" >> "test_data/example2.gff"
|
||||
printf "chr1\t.\tmRNA\t1400\t2000\t.\t-\t.\tID=transcript2;Parent=gene2\n" >> "test_data/example2.gff"
|
||||
printf "chr1\t.\texon\t1400\t2000\t.\t-\t.\tID=exon3;Parent=transcript2\n" >> "test_data/example2.gff"
|
||||
printf "chr1\t.\texon\t1600\t2000\t.\t-\t.\tID=exon4;Parent=transcript2\n" >> "test_data/example2.gff"
|
||||
printf "chr1\t.\tCDS\t3000\t3200\t.\t-\t1\tID=cds3;Parent=transcript2\n" >> "test_data/example2.gff"
|
||||
printf "chr1\t.\tCDS\t3500\t3700\t.\t-\t0\tID=cds4;Parent=transcript2\n" >> "test_data/example2.gff"
|
||||
cat > "$test_dir/featuresB.bed" << 'EOF'
|
||||
chr1 150 250 overlapping1
|
||||
chr1 350 450 overlapping2
|
||||
chr2 550 650 overlapping3
|
||||
EOF
|
||||
|
||||
# Create and populate expected output files for different tests
|
||||
printf "chr1\t180\t200\nchr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_default.bed"
|
||||
printf "chr1\t100\t200\nchr1\t150\t250\nchr1\t300\t400\n" > "test_data/expected_wa.bed"
|
||||
printf "chr1\t180\t200\tchr1\t180\t280\nchr1\t180\t250\tchr1\t180\t280\nchr1\t300\t390\tchr1\t290\t390\n" > "test_data/expected_wb.bed"
|
||||
printf "chr1\t100\t200\tchr1\t180\t280\nchr1\t150\t250\tchr1\t180\t280\nchr1\t300\t400\tchr1\t290\t390\n" > "test_data/expected_loj.bed"
|
||||
printf "chr1\t100\t200\tchr1\t180\t280\t20\nchr1\t150\t250\tchr1\t180\t280\t70\nchr1\t300\t400\tchr1\t290\t390\t90\n" > "test_data/expected_wo.bed"
|
||||
printf "chr1\t100\t200\nchr1\t150\t250\nchr1\t300\t400\n" > "test_data/expected_u.bed"
|
||||
printf "chr1\t100\t200\t1\nchr1\t150\t250\t1\nchr1\t300\t400\t1\n" > "test_data/expected_c.bed"
|
||||
printf "chr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_f50.bed"
|
||||
printf "chr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_f30.bed"
|
||||
printf "chr1\t180\t200\nchr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_f10.bed"
|
||||
printf "chr1\t180\t200\nchr1\t180\t250\nchr1\t300\t390\n" > "test_data/expected_r.bed"
|
||||
printf "chr1\t180\t200\nchr1\t120\t200\nchr1\t180\t250\nchr1\t150\t220\nchr1\t300\t390\nchr1\t300\t350\n" > "test_data/expected_multiple.bed"
|
||||
# expected gff file
|
||||
printf "chr1\t.\tgene\t1200\t1800\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tgene\t1400\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tgene\t1400\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tgene\t1600\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tmRNA\t1200\t1800\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tmRNA\t1400\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tmRNA\t1400\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tmRNA\t1600\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\texon\t1200\t1200\t.\t+\t.\tID=exon1;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\texon\t1600\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tCDS\t1200\t1200\t.\t+\t0\tID=cds1;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
printf "chr1\t.\tCDS\t1600\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "test_data/expected.gff"
|
||||
|
||||
# Test 1: Default intersect
|
||||
mkdir test1
|
||||
cd test1
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with default intersect"
|
||||
log "Executing $meta_name with basic parameters..."
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed"
|
||||
--input_a "$test_dir/featuresA.bed" \
|
||||
--input_b "$test_dir/featuresB.bed" \
|
||||
--output "$meta_temp_dir/output1.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_default.bed"
|
||||
echo "- test1 succeeded -"
|
||||
log "Validating TEST 1 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output1.bed" "output intersection file"
|
||||
check_file_not_empty "$meta_temp_dir/output1.bed" "output intersection file"
|
||||
check_file_contains "$meta_temp_dir/output1.bed" "chr1"
|
||||
log "✅ TEST 1 completed successfully"
|
||||
|
||||
cd ..
|
||||
# --- Test Case 2: Intersection with -wa option ---
|
||||
log "Starting TEST 2: Intersection with -wa (write A) option"
|
||||
|
||||
# Test 2: Write A option
|
||||
mkdir test2
|
||||
cd test2
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -wa option"
|
||||
log "Executing $meta_name with -wa option..."
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--write_a
|
||||
--input_a "$test_dir/featuresA.bed" \
|
||||
--input_b "$test_dir/featuresB.bed" \
|
||||
--write_a \
|
||||
--output "$meta_temp_dir/output2.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_wa.bed"
|
||||
echo "- test2 succeeded -"
|
||||
log "Validating TEST 2 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output2.bed" "output file with -wa"
|
||||
check_file_not_empty "$meta_temp_dir/output2.bed" "output file with -wa"
|
||||
check_file_contains "$meta_temp_dir/output2.bed" "feature"
|
||||
log "✅ TEST 2 completed successfully"
|
||||
|
||||
cd ..
|
||||
# --- Test Case 3: Intersection with -wb option ---
|
||||
log "Starting TEST 3: Intersection with -wb (write B) option"
|
||||
|
||||
# Test 3: -wb option
|
||||
mkdir test3
|
||||
cd test3
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -wb option"
|
||||
log "Executing $meta_name with -wb option..."
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--write_b
|
||||
--input_a "$test_dir/featuresA.bed" \
|
||||
--input_b "$test_dir/featuresB.bed" \
|
||||
--write_b \
|
||||
--output "$meta_temp_dir/output3.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_wb.bed"
|
||||
echo "- test3 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 4: -loj option
|
||||
mkdir test4
|
||||
cd test4
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -loj option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--left_outer_join
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_loj.bed"
|
||||
echo "- test4 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 5: -wo option
|
||||
mkdir test5
|
||||
cd test5
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -wo option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--write_overlap
|
||||
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_wo.bed"
|
||||
echo "- test5 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 6: -u option
|
||||
mkdir test6
|
||||
cd test6
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -u option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--report_A_if_no_overlap true
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_u.bed"
|
||||
echo "- test6 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 7: -c option
|
||||
mkdir test7
|
||||
cd test7
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -c option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--number_of_overlaps_A true
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_c.bed"
|
||||
echo "- test7 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 8: -f 0.50 option
|
||||
mkdir test8
|
||||
cd test8
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -f 0.50 option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--min_overlap_A 0.50
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_f50.bed"
|
||||
echo "- test8 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 9: -f 0.30 option
|
||||
mkdir test9
|
||||
cd test9
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -f 0.30 option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--min_overlap_A 0.30
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_f30.bed"
|
||||
echo "- test9 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 10: -f 0.10 option
|
||||
mkdir test10
|
||||
cd test10
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -f 0.10 option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--min_overlap_A 0.10
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_f10.bed"
|
||||
echo "- test10 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 11: -r option
|
||||
mkdir test11
|
||||
cd test11
|
||||
|
||||
echo "> Run bedtools_intersect on BED files with -r option"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--output "output.bed" \
|
||||
--reciprocal_overlap true
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_r.bed"
|
||||
echo "- test11 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
|
||||
# Test 12: Multiple files
|
||||
mkdir test12
|
||||
cd test12
|
||||
|
||||
echo "> Run bedtools_intersect on multiple BED files"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/featuresA.bed" \
|
||||
--input_b "../test_data/featuresB.bed" \
|
||||
--input_b "../test_data/featuresC.bed" \
|
||||
--output "output.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected_multiple.bed"
|
||||
echo "- test12 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
# Test 13: GFF file format
|
||||
mkdir test13
|
||||
cd test13
|
||||
|
||||
echo "> Run bedtools_intersect on GFF files"
|
||||
"$meta_executable" \
|
||||
--input_a "../test_data/example1.gff" \
|
||||
--input_b "../test_data/example2.gff" \
|
||||
--output "output.bed"
|
||||
|
||||
# checks
|
||||
assert_file_exists "output.bed"
|
||||
assert_file_not_empty "output.bed"
|
||||
assert_identical_content "output.bed" "../test_data/expected.gff"
|
||||
echo "- test13 succeeded -"
|
||||
|
||||
cd ..
|
||||
|
||||
echo "---- All tests succeeded! ----"
|
||||
exit 0
|
||||
log "Validating TEST 3 outputs..."
|
||||
check_file_exists "$meta_temp_dir/output3.bed" "output file with -wb"
|
||||
check_file_not_empty "$meta_temp_dir/output3.bed" "output file with -wb"
|
||||
check_file_contains "$meta_temp_dir/output3.bed" "overlapping"
|
||||
log "✅ TEST 3 completed successfully"
|
||||
|
||||
227
src/bedtools/bedtools_jaccard/config.vsh.yaml
Normal file
@@ -0,0 +1,227 @@
|
||||
name: bedtools_jaccard
|
||||
namespace: bedtools
|
||||
|
||||
description: |
|
||||
Calculate Jaccard similarity statistic between two genomic feature files.
|
||||
|
||||
The Jaccard index measures similarity between finite sample sets, defined as
|
||||
the size of the intersection divided by the size of the union. Values range
|
||||
from 0 (no intersection) to 1 (identical sets). This tool calculates the
|
||||
Jaccard statistic for genomic intervals, providing a quantitative measure
|
||||
of overlap between two interval sets.
|
||||
|
||||
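The description above defines the Jaccard index as the size of the intersection divided by the size of the union. For two single intervals this can be worked through as below. `jaccard_pair` is a hypothetical helper for illustration only: it handles one interval per side, whereas `bedtools jaccard` operates on whole sorted files, and its output layout here is an assumption modelled on the columns listed in this config.

```shell
#!/usr/bin/env bash

# Hypothetical helper (illustration only): Jaccard index for two single
# intervals on the same chromosome, printed as intersection, union, jaccard.
jaccard_pair() {
  local a_start="$1" a_end="$2" b_start="$3" b_end="$4"
  local start=$(( a_start > b_start ? a_start : b_start ))
  local end=$(( a_end < b_end ? a_end : b_end ))
  local inter=$(( end > start ? end - start : 0 ))
  # Union = length of A + length of B - shared bases.
  local union=$(( (a_end - a_start) + (b_end - b_start) - inter ))
  awk -v i="$inter" -v u="$union" \
    'BEGIN { printf "%d\t%d\t%.4f\n", i, u, (u > 0 ? i / u : 0) }'
}

# A = 100-200 and B = 150-250 share 50 bp out of a 150 bp union.
jaccard_pair 100 200 150 250
```

Identical intervals give 1.0000 and disjoint intervals give 0.0000, matching the 0 to 1 range stated in the description.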
keywords: [genomics, intervals, jaccard, similarity, statistics, overlap, intersection, union]
|
||||
links:
|
||||
homepage: https://bedtools.readthedocs.io/
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/jaccard.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input_a
|
||||
alternatives: [-a]
|
||||
type: file
|
||||
description: |
|
||||
First input file for Jaccard comparison.
|
||||
|
||||
**Format:** BED, GFF, VCF file with genomic intervals
|
||||
**Requirement:** Must be sorted by chromosome, then start position
|
||||
**Usage:** File A for Jaccard similarity calculation
|
||||
required: true
|
||||
example: intervals_a.bed
|
||||
|
||||
- name: --input_b
|
||||
alternatives: [-b]
|
||||
type: file
|
||||
description: |
|
||||
Second input file for Jaccard comparison.
|
||||
|
||||
**Format:** BED, GFF, VCF file with genomic intervals
|
||||
**Requirement:** Must be sorted by chromosome, then start position
|
||||
**Usage:** File B for Jaccard similarity calculation
|
||||
required: true
|
||||
example: intervals_b.bed
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with Jaccard similarity statistics.
|
||||
|
||||
**Format:** Tab-delimited with intersection, union, and Jaccard values
|
||||
**Columns:** intersection, union, jaccard
|
||||
**Range:** Jaccard values from 0.0 to 1.0
|
||||
required: true
|
||||
example: jaccard_results.txt
|
||||
|
||||
- name: Overlap Options
|
||||
arguments:
|
||||
- name: --min_overlap_a
|
||||
alternatives: [-f]
|
||||
type: double
|
||||
description: |
|
||||
Minimum overlap required as fraction of A.
|
||||
|
||||
**Range:** 0.0 to 1.0
|
||||
**Default:** 1E-9 (effectively 1bp)
|
||||
**Example:** 0.50 requires 50% of A to be overlapped
|
||||
example: 0.5
|
||||
|
||||
- name: --min_overlap_b
|
||||
alternatives: [-F]
|
||||
type: double
|
||||
description: |
|
||||
Minimum overlap required as fraction of B.
|
||||
|
||||
**Range:** 0.0 to 1.0
|
||||
**Default:** 1E-9 (effectively 1bp)
|
||||
**Example:** 0.50 requires 50% of B to be overlapped
|
||||
example: 0.5
|
||||
|
||||
- name: --reciprocal
|
||||
alternatives: [-r]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require that the fraction overlap be reciprocal for A and B.
|
||||
|
||||
**Requirement:** Use together with -f (min_overlap_a); the same fraction is then applied to both A and B
|
||||
**Effect:** Requires B overlaps specified fraction of A AND A overlaps same fraction of B
|
||||
**Example:** With -f 0.90 -r, requires B overlaps 90% of A AND A overlaps 90% of B
|
||||
**Default:** false
|
||||
|
||||
- name: --either
|
||||
alternatives: [-e]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require minimum fraction satisfied for A OR B.
|
||||
|
||||
**Effect:** Only one of -f or -F thresholds needs to be satisfied
|
||||
**Alternative:** Without -e, both fractions must be satisfied
|
||||
**Default:** false (both required)
|
||||
|
||||
- name: Strand Options
|
||||
arguments:
|
||||
- name: --same_strand
|
||||
alternatives: [-s]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require same strandedness for overlaps.
|
||||
|
||||
**Effect:** Only consider overlaps on the same strand
|
||||
**Default:** false (strand-independent)
|
||||
|
||||
- name: --opposite_strand
|
||||
alternatives: [-S]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require different strandedness for overlaps.
|
||||
|
||||
**Effect:** Only consider overlaps on opposite strands
|
||||
**Default:** false (strand-independent)
|
||||
**Note:** May have issues in some bedtools versions requiring strand specification
|
||||
|
||||
- name: Format Options
|
||||
arguments:
|
||||
- name: --split
|
||||
type: boolean_true
|
||||
description: |
|
||||
Treat split BAM or BED12 entries as distinct intervals.
|
||||
|
||||
**Effect:** Split multi-block entries into individual intervals
|
||||
**Usage:** For BAM alignments with gaps or BED12 entries
|
||||
**Default:** false
|
||||
|
||||
- name: --bed_output
|
||||
alternatives: [--bed]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Write output in BED format when using BAM input.
|
||||
|
||||
**Effect:** Forces BED output format for BAM inputs
|
||||
**Default:** false
|
||||
|
||||
- name: --header
|
||||
type: boolean_true
|
||||
description: |
|
||||
Print header from file A prior to results.
|
||||
|
||||
**Effect:** Includes original header from input file A
|
||||
**Default:** false
|
||||
|
||||
- name: Advanced Options
|
||||
arguments:
|
||||
- name: --genome
|
||||
alternatives: [-g]
|
||||
type: file
|
||||
description: |
|
||||
Genome file for consistent chromosome sorting.
|
||||
|
||||
**Format:** Tab-delimited file with chromosome name and size
|
||||
**Usage:** Only applies when used with sorted data
|
||||
**Purpose:** Enforces consistent chromosome sort order
|
||||
example: genome.txt
|
||||
|
||||
- name: --no_name_check
|
||||
alternatives: [--nonamecheck]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Skip chromosome naming convention checks for sorted data.
|
||||
|
||||
**Effect:** Allows different naming (e.g., "chr1" vs "chr01")
|
||||
**Usage:** For files with inconsistent chromosome naming
|
||||
**Default:** false (strict checking)
|
||||
|
||||
- name: --no_buffer
|
||||
alternatives: [--nobuf]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable buffered output.
|
||||
|
||||
**Effect:** Print each line immediately instead of buffering
|
||||
**Usage:** For real-time processing or piping
|
||||
**Trade-off:** Slower performance but immediate output
|
||||
**Default:** false (buffered output)
|
||||
|
||||
- name: --io_buffer
|
||||
alternatives: [--iobuf]
|
||||
type: string
|
||||
description: |
|
||||
Specify input buffer memory size.
|
||||
|
||||
**Format:** Integer with optional K/M/G suffix
|
||||
**Example:** "128M" for 128 megabytes
|
||||
**Note:** No effect with compressed files
|
||||
example: "128M"
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
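The Jaccard statistic that this component reports (intersection length divided by union length) can be checked by hand on small inputs. The following is a minimal sketch, not part of the commit and not how bedtools computes the value: `jaccard_per_base` is a hypothetical helper that marks every covered base in two small BED files with awk and counts shared and total bases. It assumes whitespace-delimited BED with zero-based, half-open coordinates, a non-empty first file, and inputs small enough to enumerate base by base.

```bash
#!/bin/bash

# Hypothetical helper (not part of bedtools or this commit):
# prints intersection, union and Jaccard for two small BED files.
jaccard_per_base() {
  awk -v OFS='\t' '
    # First file: mark every base covered by A.
    NR == FNR { for (p = $2; p < $3; p++) a[$1 ":" p] = 1; next }
    # Second file: mark every base covered by B.
    { for (p = $2; p < $3; p++) b[$1 ":" p] = 1 }
    END {
      for (k in a) { union++; if (k in b) inter++ }   # bases in A, shared or not
      for (k in b) if (!(k in a)) union++             # bases only in B
      print inter + 0, union + 0, (union ? inter / union : 0)
    }
  ' "$1" "$2"
}

# Demo using the same intervals as the test data in test.sh.
tmp_dir=$(mktemp -d)
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t500\t600\nchr2\t100\t250\nchr2\t400\t500\n' > "$tmp_dir/a.bed"
printf 'chr1\t150\t250\nchr1\t350\t450\nchr1\t550\t650\nchr2\t150\t300\nchr2\t450\t550\n' > "$tmp_dir/b.bed"
jaccard_per_base "$tmp_dir/a.bed" "$tmp_dir/b.bed"
rm -rf "$tmp_dir"
```

For these intervals the shared bases add up to 300 and the covered bases to 800, so the sketch prints a Jaccard value of 0.375. Enumerating bases is only practical for toy data; it is meant as a cross-check for test expectations, not as a replacement for the tool.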
67
src/bedtools/bedtools_jaccard/help.txt
Normal file
@@ -0,0 +1,67 @@
|
||||
```bash
|
||||
docker run --rm quay.io/biocontainers/bedtools:2.31.1--h13024bc_3 bedtools jaccard -h
|
||||
```
|
||||
|
||||
Tool: bedtools jaccard (aka jaccard)
|
||||
Version: v2.31.1
|
||||
Summary: Calculate Jaccard statistic b/w two feature files.
|
||||
Jaccard is the length of the intersection over the union.
|
||||
Values range from 0 (no intersection) to 1 (self intersection).
|
||||
|
||||
Usage: bedtools jaccard [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
|
||||
|
||||
Options:
|
||||
-s Require same strandedness. That is, only report hits in B
|
||||
that overlap A on the _same_ strand.
|
||||
- By default, overlaps are reported without respect to strand.
|
||||
|
||||
-S Require different strandedness. That is, only report hits in B
|
||||
that overlap A on the _opposite_ strand.
|
||||
- By default, overlaps are reported without respect to strand.
|
||||
|
||||
-f Minimum overlap required as a fraction of A.
|
||||
- Default is 1E-9 (i.e., 1bp).
|
||||
- FLOAT (e.g. 0.50)
|
||||
|
||||
-F Minimum overlap required as a fraction of B.
|
||||
- Default is 1E-9 (i.e., 1bp).
|
||||
- FLOAT (e.g. 0.50)
|
||||
|
||||
-r Require that the fraction overlap be reciprocal for A AND B.
|
||||
- In other words, if -f is 0.90 and -r is used, this requires
|
||||
that B overlap 90% of A and A _also_ overlaps 90% of B.
|
||||
|
||||
-e Require that the minimum fraction be satisfied for A OR B.
|
||||
- In other words, if -e is used with -f 0.90 and -F 0.10 this requires
|
||||
that either 90% of A is covered OR 10% of B is covered.
|
||||
Without -e, both fractions would have to be satisfied.
|
||||
|
||||
-split Treat "split" BAM or BED12 entries as distinct BED intervals.
|
||||
|
||||
-g Provide a genome file to enforce consistent chromosome sort order
|
||||
across input files. Only applies when used with -sorted option.
|
||||
|
||||
-nonamecheck For sorted data, don't throw an error if the file has different naming conventions
|
||||
for the same chromosome. ex. "chr1" vs "chr01".
|
||||
|
||||
-bed If using BAM input, write output as BED.
|
||||
|
||||
-header Print the header from the A file prior to results.
|
||||
|
||||
-nobuf Disable buffered output. Using this option will cause each line
|
||||
of output to be printed as it is generated, rather than saved
|
||||
in a buffer. This will make printing large output files
|
||||
noticeably slower, but can be useful in conjunction with
|
||||
other software tools and scripts that need to process one
|
||||
line of bedtools output at a time.
|
||||
|
||||
-iobuf Specify amount of memory to use for input buffer.
|
||||
Takes an integer argument. Optional suffixes K/M/G supported.
|
||||
Note: currently has no effect with compressed files.
|
||||
|
||||
Notes:
|
||||
(1) Input files must be sorted by chrom, then start position.
|
||||
|
||||
|
||||
|
||||
|
||||
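The interaction of `-f`, `-F`, `-r` and `-e` in the help text above is easier to follow on a single pair of intervals. The sketch below is an illustration of the documented semantics only, not the bedtools implementation: `overlap_passes` is a hypothetical helper that takes one A interval, one B interval, the two minimum fractions, and a mode (`both` by default, `either` to mimic `-e`). Reciprocal overlap (`-r`) corresponds to passing the same fraction for A and B.

```bash
#!/bin/bash

# Hypothetical helper (not part of bedtools): does one A/B interval pair
# satisfy the minimum overlap fractions?
# Usage: overlap_passes A_START A_END B_START B_END FRAC_A FRAC_B [both|either]
overlap_passes() {
  awk -v a_s="$1" -v a_e="$2" -v b_s="$3" -v b_e="$4" \
      -v fa="$5" -v fb="$6" -v mode="${7:-both}" 'BEGIN {
    lo = (a_s > b_s) ? a_s : b_s        # start of the overlapping stretch
    hi = (a_e < b_e) ? a_e : b_e        # end of the overlapping stretch
    ov = hi - lo
    if (ov <= 0) exit 1                 # the intervals do not overlap
    ok_a = (ov / (a_e - a_s) >= fa)     # fraction of A that is covered
    ok_b = (ov / (b_e - b_s) >= fb)     # fraction of B that is covered
    if (mode == "either") exit !(ok_a || ok_b)
    exit !(ok_a && ok_b)
  }'
}

# A = 100-200 and B = 150-250 share 50 bp, i.e. 50% of each interval.
overlap_passes 100 200 150 250 0.5 0.000000001        && echo "-f 0.5: pass"
overlap_passes 100 200 150 250 0.9 0.9                || echo "-f 0.9 -r: fail"
overlap_passes 100 200 150 250 0.9 0.1 either         && echo "-f 0.9 -F 0.1 -e: pass"
```

The default fraction of 1E-9 is written out as `0.000000001` here so the comparison does not depend on how a given awk parses exponent notation.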
46
src/bedtools/bedtools_jaccard/script.sh
Normal file
@@ -0,0 +1,46 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
# unset flags (using loop for many parameters)
|
||||
unset_if_false=(
|
||||
par_reciprocal
|
||||
par_either
|
||||
par_same_strand
|
||||
par_opposite_strand
|
||||
par_split
|
||||
par_bed_output
|
||||
par_header
|
||||
par_no_name_check
|
||||
par_no_buffer
|
||||
)
|
||||
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset "$par"
|
||||
done
|
||||
|
||||
# Build command arguments array
|
||||
cmd_args=(
|
||||
-a "$par_input_a"
|
||||
-b "$par_input_b"
|
||||
${par_min_overlap_a:+-f "$par_min_overlap_a"}
|
||||
${par_min_overlap_b:+-F "$par_min_overlap_b"}
|
||||
${par_reciprocal:+-r}
|
||||
${par_either:+-e}
|
||||
${par_same_strand:+-s}
|
||||
${par_opposite_strand:+-S}
|
||||
${par_split:+-split}
|
||||
${par_bed_output:+-bed}
|
||||
${par_header:+-header}
|
||||
${par_genome:+-g "$par_genome"}
|
||||
${par_no_name_check:+-nonamecheck}
|
||||
${par_no_buffer:+-nobuf}
|
||||
${par_io_buffer:+-iobuf "$par_io_buffer"}
|
||||
)
|
||||
|
||||
# Execute bedtools jaccard
|
||||
bedtools jaccard "${cmd_args[@]}" > "$par_output"
|
||||
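The script above relies on two bash idioms: flags whose value is the string `false` are unset, and `${var:+...}` then expands to nothing for unset or empty variables, so only the options that were actually supplied end up in the argument array. A minimal standalone sketch of that pattern follows; `build_args` and its three positional parameters are hypothetical stand-ins for the `par_*` variables that viash would normally inject.

```bash
#!/bin/bash

# Hypothetical stand-in for the argument assembly in script.sh.
# Usage: build_args INPUT_A RECIPROCAL(true|false) MIN_OVERLAP_A
build_args() {
  local par_input_a="$1" par_reciprocal="$2" par_min_overlap_a="$3"

  # A "false" flag is unset so that the :+ expansion below yields nothing.
  [[ "$par_reciprocal" == "false" ]] && unset par_reciprocal

  local args=(
    -a "$par_input_a"
    ${par_min_overlap_a:+-f "$par_min_overlap_a"}   # option with a value
    ${par_reciprocal:+-r}                           # bare flag
  )

  # Print one argument per line so the result is easy to inspect.
  printf '%s\n' "${args[@]}"
}

build_args intervals_a.bed false 0.5
build_args intervals_a.bed true ""
```

The inner quotes in `${par_min_overlap_a:+-f "$par_min_overlap_a"}` keep a value that contains spaces as a single array element, which is why the outer expansion can stay unquoted.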
279
src/bedtools/bedtools_jaccard/test.sh
Normal file
@@ -0,0 +1,279 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Source centralized test helpers
|
||||
source "$meta_resources_dir/test_helpers.sh"
|
||||
|
||||
# Initialize test environment
|
||||
setup_test_env
|
||||
|
||||
log "Starting tests for bedtools_jaccard"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "Creating test data..."
|
||||
cat <<'EOF' > "$meta_temp_dir/intervals_a.bed"
|
||||
chr1 100 200 feature_a1
|
||||
chr1 300 400 feature_a2
|
||||
chr1 500 600 feature_a3
|
||||
chr2 100 250 feature_a4
|
||||
chr2 400 500 feature_a5
|
||||
EOF
|
||||
|
||||
cat <<'EOF' > "$meta_temp_dir/intervals_b.bed"
|
||||
chr1 150 250 feature_b1
|
||||
chr1 350 450 feature_b2
|
||||
chr1 550 650 feature_b3
|
||||
chr2 150 300 feature_b4
|
||||
chr2 450 550 feature_b5
|
||||
EOF
|
||||
|
||||
# Create genome file for testing
|
||||
cat <<'EOF' > "$meta_temp_dir/genome.txt"
|
||||
chr1 1000
|
||||
chr2 1000
|
||||
EOF
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 1: Basic Jaccard calculation"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--output "$meta_temp_dir/jaccard_basic.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_basic.txt" "basic Jaccard output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_basic.txt" "basic Jaccard output"
|
||||
|
||||
log "Checking output format (should contain intersection, union, jaccard columns)"
|
||||
check_file_contains "$meta_temp_dir/jaccard_basic.txt" "^[0-9]"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 2: Jaccard with minimum overlap fraction for A"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--min_overlap_a 0.5 \
|
||||
--output "$meta_temp_dir/jaccard_overlap_a.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_overlap_a.txt" "overlap A Jaccard output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_overlap_a.txt" "overlap A Jaccard output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 3: Jaccard with minimum overlap fraction for B"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--min_overlap_b 0.5 \
|
||||
--output "$meta_temp_dir/jaccard_overlap_b.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_overlap_b.txt" "overlap B Jaccard output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_overlap_b.txt" "overlap B Jaccard output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 4: Jaccard with reciprocal overlap requirement"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--min_overlap_a 0.5 \
|
||||
--reciprocal \
|
||||
--output "$meta_temp_dir/jaccard_reciprocal.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_reciprocal.txt" "reciprocal Jaccard output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_reciprocal.txt" "reciprocal Jaccard output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 5: Create stranded test data and test strand options"
|
||||
cat <<'EOF' > "$meta_temp_dir/stranded_a.bed"
|
||||
chr1 100 200 feature_a1 0 +
|
||||
chr1 300 400 feature_a2 0 -
|
||||
chr1 500 600 feature_a3 0 +
|
||||
EOF
|
||||
|
||||
cat <<'EOF' > "$meta_temp_dir/stranded_b.bed"
|
||||
chr1 150 250 feature_b1 0 +
|
||||
chr1 350 450 feature_b2 0 +
|
||||
chr1 550 650 feature_b3 0 -
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/stranded_a.bed" \
|
||||
--input_b "$meta_temp_dir/stranded_b.bed" \
|
||||
--same_strand \
|
||||
--output "$meta_temp_dir/jaccard_same_strand.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_same_strand.txt" "strand-specific output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_same_strand.txt" "strand-specific output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 6: Opposite strand requirement (skipped due to a bedtools jaccard -S issue)"
|
||||
log "Skipping opposite strand test due to bedtools jaccard -S option issue"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 7: Test either flag (-e)"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--min_overlap_a 0.8 \
|
||||
--min_overlap_b 0.2 \
|
||||
--either \
|
||||
--output "$meta_temp_dir/jaccard_either.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_either.txt" "either flag output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_either.txt" "either flag output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 8: Test with genome file"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--genome "$meta_temp_dir/genome.txt" \
|
||||
--output "$meta_temp_dir/jaccard_genome.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_genome.txt" "genome file output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_genome.txt" "genome file output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 9: Create BED12 format data and test split option"
|
||||
cat <<'EOF' > "$meta_temp_dir/bed12_a.bed"
|
||||
chr1 100 600 feature_a1 0 + 100 600 0 2 100,100 0,400
|
||||
chr1 800 1200 feature_a2 0 - 800 1200 0 2 100,100 0,300
|
||||
EOF
|
||||
|
||||
cat <<'EOF' > "$meta_temp_dir/bed12_b.bed"
|
||||
chr1 150 650 feature_b1 0 + 150 650 0 2 100,100 0,400
|
||||
chr1 850 1250 feature_b2 0 - 850 1250 0 2 100,100 0,300
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/bed12_a.bed" \
|
||||
--input_b "$meta_temp_dir/bed12_b.bed" \
|
||||
--split \
|
||||
--output "$meta_temp_dir/jaccard_split.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_split.txt" "split output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_split.txt" "split output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 10: Test header option with GFF input"
|
||||
cat <<'EOF' > "$meta_temp_dir/gff_a.gff"
|
||||
##gff-version 3
|
||||
chr1 test gene 100 200 . + . ID=gene1
|
||||
chr1 test gene 300 400 . - . ID=gene2
|
||||
EOF
|
||||
|
||||
cat <<'EOF' > "$meta_temp_dir/gff_b.gff"
|
||||
##gff-version 3
|
||||
chr1 test exon 150 250 . + . ID=exon1
|
||||
chr1 test exon 350 450 . + . ID=exon2
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/gff_a.gff" \
|
||||
--input_b "$meta_temp_dir/gff_b.gff" \
|
||||
--header \
|
||||
--output "$meta_temp_dir/jaccard_header.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_header.txt" "header output"
|
||||
check_file_contains "$meta_temp_dir/jaccard_header.txt" "gff-version"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 11: Test no-buffer option"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--no_buffer \
|
||||
--output "$meta_temp_dir/jaccard_nobuf.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_nobuf.txt" "no-buffer output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_nobuf.txt" "no-buffer output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 12: Test IO buffer option"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--io_buffer "64M" \
|
||||
--output "$meta_temp_dir/jaccard_iobuf.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_iobuf.txt" "IO buffer output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_iobuf.txt" "IO buffer output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 13: Validate Jaccard values are in proper range"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_b.bed" \
|
||||
--output "$meta_temp_dir/jaccard_range.txt"
|
||||
|
||||
log "Checking Jaccard value is between 0 and 1"
|
||||
jaccard_value=$(tail -n1 "$meta_temp_dir/jaccard_range.txt" | cut -f3)
|
||||
log "Jaccard value: $jaccard_value"
|
||||
|
||||
# Check if value is numeric and within range using awk
|
||||
if echo "$jaccard_value" | awk '/^[0-9]*\.?[0-9]+$/ {exit !($1 >= 0 && $1 <= 1)} {exit 1}'; then
|
||||
log "✓ Jaccard value is in valid range [0,1]"
|
||||
else
|
||||
log "Error: Jaccard value $jaccard_value is out of range [0,1]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 14: Test identical files (should give Jaccard = 1.0)"
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/intervals_a.bed" \
|
||||
--input_b "$meta_temp_dir/intervals_a.bed" \
|
||||
--output "$meta_temp_dir/jaccard_identical.txt"
|
||||
|
||||
log "Checking that identical files give Jaccard = 1"
|
||||
jaccard_identical=$(tail -n1 "$meta_temp_dir/jaccard_identical.txt" | cut -f3)
|
||||
log "Jaccard for identical files: $jaccard_identical"
|
||||
|
||||
if echo "$jaccard_identical" | awk '/^[0-9]*\.?[0-9]+$/ {exit !($1 == 1.0)} {exit 1}'; then
|
||||
log "✓ Identical files correctly give Jaccard = 1.0"
|
||||
else
|
||||
log "Warning: Identical files gave Jaccard = $jaccard_identical (expected 1.0)"
|
||||
fi
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "TEST 15: Test no-name-check option with different chromosome naming"
|
||||
cat <<'EOF' > "$meta_temp_dir/chr_mixed_a.bed"
|
||||
chr1 100 200 feature_a1
|
||||
chr01 300 400 feature_a2
|
||||
EOF
|
||||
|
||||
cat <<'EOF' > "$meta_temp_dir/chr_mixed_b.bed"
|
||||
chr1 150 250 feature_b1
|
||||
chr01 350 450 feature_b2
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_a "$meta_temp_dir/chr_mixed_a.bed" \
|
||||
--input_b "$meta_temp_dir/chr_mixed_b.bed" \
|
||||
--no_name_check \
|
||||
--output "$meta_temp_dir/jaccard_nonamecheck.txt"
|
||||
|
||||
check_file_exists "$meta_temp_dir/jaccard_nonamecheck.txt" "no-name-check output"
|
||||
check_file_not_empty "$meta_temp_dir/jaccard_nonamecheck.txt" "no-name-check output"
|
||||
|
||||
####################################################################################################
|
||||
|
||||
log "All tests completed successfully!"
|
||||
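The range checks in TEST 13 and TEST 14 pipe the extracted value through an awk pattern. One thing to watch with that style is that an awk program whose only pattern never matches exits with status 0, so an empty or non-numeric value can look valid. A stricter variant could be sketched as below; `is_valid_jaccard` is a hypothetical helper, not one of the functions in `test_helpers.sh`, and it deliberately accepts only plain decimal notation.

```bash
#!/bin/bash

# Hypothetical helper: succeed only if the argument is a plain decimal
# number between 0 and 1 (inclusive). Empty or non-numeric input fails.
is_valid_jaccard() {
  awk -v v="$1" 'BEGIN {
    exit !(v ~ /^[0-9]*\.?[0-9]+$/ && v + 0 >= 0 && v + 0 <= 1)
  }'
}

for candidate in 0.375 1 1.5 "" nan; do
  if is_valid_jaccard "$candidate"; then
    echo "'$candidate' is a valid Jaccard value"
  else
    echo "'$candidate' is rejected"
  fi
done
```

Because the whole check runs in a `BEGIN` block, the exit status is always set explicitly and does not depend on whether any input line matched.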
@@ -1,14 +1,19 @@
|
||||
name: bedtools_links
|
||||
namespace: bedtools
|
||||
description: |
|
||||
Creates an HTML file with links to an instance of the UCSC Genome Browser for all features / intervals in a file.
|
||||
This is useful for cases when one wants to manually inspect through a large set of annotations or features.
|
||||
keywords: [Links, BED, GFF, VCF]
|
||||
description: |
|
||||
This tool generates an HTML page containing links to the UCSC Genome Browser
|
||||
for each feature/interval in the input file. This is particularly useful for
|
||||
manually inspecting large sets of genomic annotations or features through
|
||||
the browser interface.
|
||||
|
||||
**Default behavior:** Links point to human genome (hg18) on the main UCSC site
|
||||
**Customization:** Supports custom mirror sites and different organisms/builds
|
||||
|
||||
keywords: [HTML, Links, UCSC, Browser, BED, GFF, VCF]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/links.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/#
|
||||
issue_tracker: https://github.com/arq5x/bedtools2/issues
|
||||
homepage: https://bedtools.readthedocs.io/en/latest/
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: MIT
|
||||
@@ -16,57 +21,80 @@ requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
roles: [author]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [author, maintainer]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: -i
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: Input file (bed/gff/vcf).
|
||||
description: |
|
||||
Input file in BED, GFF, or VCF format containing genomic intervals.
|
||||
|
||||
Each feature/interval will be converted to a clickable link pointing
|
||||
to the UCSC Genome Browser. File format is auto-detected based on
|
||||
content and extension.
|
||||
required: true
|
||||
example: intervals.bed
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: -o
|
||||
alternatives: [-o]
|
||||
type: file
|
||||
direction: output
|
||||
description: Output HTML file to be written.
|
||||
description: |
|
||||
Output HTML file containing clickable browser links.
|
||||
|
||||
The generated HTML page will contain one link per input feature,
|
||||
formatted for easy navigation to the UCSC Genome Browser.
|
||||
required: true
|
||||
example: browser_links.html
|
||||
|
||||
- name: Options
|
||||
description: |
|
||||
By default, the links created will point to human (hg18) UCSC browser.
|
||||
If you have a local mirror, you can override this behavior by supplying
|
||||
the -base, -org, and -db options.
|
||||
|
||||
For example, if the URL of your local mirror for mouse MM9 is called:
|
||||
http://mymirror.myuniversity.edu, then you would use the following:
|
||||
--base_url http://mymirror.myuniversity.edu
|
||||
--organism mouse
|
||||
--database mm9
|
||||
arguments:
|
||||
- name: --base_url
|
||||
alternatives: -base
|
||||
alternatives: [-base]
|
||||
type: string
|
||||
description: |
|
||||
The “basename” for the UCSC browser.
|
||||
default: http://genome.ucsc.edu
|
||||
description: |
|
||||
Base URL for the UCSC Genome Browser instance.
|
||||
|
||||
**Default:** http://genome.ucsc.edu (official UCSC site)
|
||||
**Custom mirrors:** Use your institution's mirror URL
|
||||
**Example:** http://mymirror.myuniversity.edu
|
||||
example: "http://genome.ucsc.edu"
|
||||
|
||||
- name: --organism
|
||||
alternatives: -org
|
||||
alternatives: [-org]
|
||||
type: string
|
||||
description: |
|
||||
The organism (e.g. mouse, human).
|
||||
default: human
|
||||
description: |
|
||||
Target organism for genome browser links.
|
||||
|
||||
**Common values:**
|
||||
- human (default)
|
||||
- mouse
|
||||
- rat
|
||||
- fly
|
||||
- worm
|
||||
|
||||
Must match organism names used by your UCSC browser instance.
|
||||
example: "human"
|
||||
|
||||
- name: --database
|
||||
alternatives: -db
|
||||
alternatives: [-db]
|
||||
type: string
|
||||
description: |
|
||||
The genome build.
|
||||
default: hg18
|
||||
description: |
|
||||
Genome assembly/build identifier.
|
||||
|
||||
**Human examples:** hg19, hg38, hg18 (default)
|
||||
**Mouse examples:** mm9, mm10, mm39
|
||||
**Other:** Assembly names as recognized by UCSC browser
|
||||
|
||||
Must correspond to available assemblies for the specified organism.
|
||||
example: "hg18"
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
@@ -75,16 +103,16 @@ resources:
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: /src/_utils/test_helpers.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
image: quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
run:
|
||||
- "bedtools --version 2>&1 | head -1 | sed 's/.*bedtools v/bedtools: /' > /var/software_versions.txt"
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
|
||||
Some files were not shown because too many files have changed in this diff.