Build branch update_busco with version update_busco (1ca6dec)
Build pipeline: viash-hub.biobox.update-busco-bsvnf
Source commit: 1ca6dec48e
Source message: Typo
17
.gitignore
vendored
Normal file
@@ -0,0 +1,17 @@
|
||||
*.DS_Store
|
||||
*__pycache__
|
||||
|
||||
# IDE ignores
|
||||
.idea/
|
||||
|
||||
# R specific ignores
|
||||
.Rhistory
|
||||
.Rproj.user
|
||||
*.Rproj
|
||||
|
||||
# viash specific ignores
|
||||
target/
|
||||
|
||||
# nextflow specific ignores
|
||||
.nextflow*
|
||||
work
|
||||
96
CHANGELOG.md
Normal file
@@ -0,0 +1,96 @@
|
||||
# biobox x.x.x
|
||||
|
||||
## BUG FIXES
|
||||
|
||||
* `pear`: fix component not exiting with the correct exit code when PEAR fails.
|
||||
|
||||
* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
|
||||
|
||||
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
|
||||
|
||||
## MINOR CHANGES
|
||||
|
||||
* `busco` components: update BUSCO to `5.7.1`.
|
||||
|
||||
# biobox 0.1.0
|
||||
|
||||
## BREAKING CHANGES
|
||||
|
||||
* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
|
||||
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
|
||||
splitting up certain file paths.
|
||||
|
||||
|
||||
## NEW FEATURES
|
||||
|
||||
* `arriba`: Detect gene fusions from RNA-seq data (PR #1).
|
||||
|
||||
* `fastp`: An ultra-fast all-in-one FASTQ preprocessor (PR #3).
|
||||
|
||||
* `busco`:
|
||||
- `busco/busco_run`: Assess genome assembly and annotation completeness with single copy orthologs (PR #6).
|
||||
- `busco/busco_list_datasets`: Lists available busco datasets (PR #18).
|
||||
- `busco/busco_download_datasets`: Download busco datasets (PR #19).
|
||||
|
||||
* `cutadapt`: Remove adapter sequences from high-throughput sequencing reads (PR #7).
|
||||
|
||||
* `featurecounts`: Assign sequence reads to genomic features (PR #11).
|
||||
|
||||
* `bgzip`: Add bgzip functionality to compress and decompress files (PR #13).
|
||||
|
||||
* `pear`: Paired-end read merger (PR #10).
|
||||
|
||||
* `lofreq/call`: Call variants from a BAM file (PR #17).
|
||||
|
||||
* `lofreq/indelqual`: Insert indel qualities into BAM file (PR #17).
|
||||
|
||||
* `multiqc`: Aggregate results from bioinformatics analyses across many samples into a single report (PR #42).
|
||||
|
||||
* `star`:
|
||||
- `star/star_align_reads`: Align reads to a reference genome (PR #22).
|
||||
- `star/star_genome_generate`: Generate a genome index for STAR alignment (PR #58).
|
||||
|
||||
* `gffread`: Validate, filter, convert and perform other operations on GFF files (PR #29).
|
||||
|
||||
* `salmon`:
|
||||
- `salmon/salmon_index`: Create a salmon index for the transcriptome to use Salmon in the mapping-based mode (PR #24).
|
||||
- `salmon/salmon_quant`: Transcript quantification from RNA-seq data (PR #24).
|
||||
|
||||
* `samtools`:
|
||||
- `samtools/samtools_flagstat`: Counts the number of alignments in SAM/BAM/CRAM files for each FLAG type (PR #31).
|
||||
- `samtools/samtools_idxstats`: Reports alignment summary statistics for a SAM/BAM/CRAM file (PR #32).
|
||||
- `samtools/samtools_index`: Index SAM/BAM/CRAM files (PR #35).
|
||||
- `samtools/samtools_sort`: Sort SAM/BAM/CRAM files (PR #36).
|
||||
- `samtools/samtools_stats`: Reports alignment summary statistics for a BAM file (PR #39).
|
||||
- `samtools/samtools_faidx`: Indexes FASTA files to enable random access to fasta and fastq files (PR #41).
|
||||
- `samtools/samtools_collate`: Shuffles and groups reads in SAM/BAM/CRAM files together by their names (PR #42).
|
||||
- `samtools/samtools_view`: Views and converts SAM/BAM/CRAM files (PR #48).
|
||||
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
|
||||
|
||||
* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).
|
||||
|
||||
* `bedtools`:
|
||||
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
|
||||
intervals defined in a BED/GFF/VCF file (PR #59).
|
||||
|
||||
## MINOR CHANGES
|
||||
|
||||
* Uniformize component metadata (PR #23).
|
||||
|
||||
* Update to Viash 0.8.5 (PR #25).
|
||||
|
||||
* Update to Viash 0.9.0-RC3 (PR #51).
|
||||
|
||||
* Update to Viash 0.9.0-RC6 (PR #63).
|
||||
|
||||
* Switch to viash-hub/toolbox actions (PR #64).
|
||||
|
||||
## DOCUMENTATION
|
||||
|
||||
* Update README (PR #64).
|
||||
|
||||
## BUG FIXES
|
||||
|
||||
* Add escaping character before leading hashtag in the description field of the config file (PR #50).
|
||||
|
||||
* Format URL in biobase/bcl_convert description (PR #55).
|
||||
383
CONTRIBUTING.md
Normal file
@@ -0,0 +1,383 @@
|
||||
|
||||
# Contributing guidelines
|
||||
|
||||
We encourage contributions from the community. To contribute:
|
||||
|
||||
1. **Fork the Repository**: Start by forking this repository to your account.
|
||||
2. **Develop Your Component**: Create your Viash component, ensuring it aligns with our best practices (detailed below).
|
||||
3. **Submit a Pull Request**: After testing your component, submit a pull request for review.
|
||||
|
||||
## Procedure for adding a component
|
||||
|
||||
### Step 1: Find a component to contribute
|
||||
|
||||
* Find a tool to contribute to this repo.
|
||||
|
||||
* Check whether it is already in the [Project board](https://github.com/orgs/viash-hub/projects/1).
|
||||
|
||||
* Check whether there is a corresponding [Snakemake wrapper](https://github.com/snakemake/snakemake-wrappers/blob/master/bio) or [nf-core module](https://github.com/nf-core/modules/tree/master/modules/nf-core) which we can use as inspiration.
|
||||
|
||||
* Create an issue to show that you are working on this component.
|
||||
|
||||
|
||||
### Step 2: Add config template
|
||||
|
||||
Change all occurrences of `xxx` to the name of the component.
|
||||
|
||||
Create a file at `src/xxx/config.vsh.yaml` with contents:
|
||||
|
||||
```yaml
name: xxx
description: xxx
keywords: [tag1, tag2]
links:
  homepage: yyy
  documentation: yyy
  issue_tracker: yyy
  repository: yyy
references:
  doi: 12345/12345678.yz
license: MIT/Apache-2.0/GPL-3.0/...
argument_groups:
  - name: Inputs
    arguments: <...>
  - name: Outputs
    arguments: <...>
  - name: Arguments
    arguments: <...>
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - <...>
runners:
  - type: executable
  - type: nextflow
```
|
||||
|
||||
### Step 3: Fill in the metadata
|
||||
|
||||
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
|
||||
|
||||
```yaml
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
  homepage: https://arriba.readthedocs.io/en/latest/
  documentation: https://arriba.readthedocs.io/en/latest/
  repository: https://github.com/suhrig/arriba
  issue_tracker: https://github.com/suhrig/arriba/issues
references:
  doi: 10.1101/gr.257246.119
  bibtex: |
    @article{
      ... a bibtex entry in case the doi is not available ...
    }
license: MIT
```
|
||||
|
||||
### Step 4: Find a suitable container
|
||||
|
||||
Google `biocontainer <name of component>` and find the container that is most suitable. Typically the link will be `https://quay.io/repository/biocontainers/xxx?tab=tags`.
|
||||
|
||||
If no such container is found, you can create a custom container in Step 10.
|
||||
|
||||
|
||||
### Step 5: Create help file
|
||||
|
||||
To help develop the component, we store the `--help` output of the tool in a file at `src/xxx/help.txt`.
|
||||
|
||||
````bash
# Quote the heredoc delimiter so the backticks are written literally
# instead of being run as command substitutions.
cat <<'EOF' > src/xxx/help.txt
```sh
xxx --help
```
EOF

docker run quay.io/biocontainers/xxx:tag xxx --help >> src/xxx/help.txt
````
|
||||
|
||||
Notes:
|
||||
|
||||
* This help file has no functional purpose, but it is useful for the developer to see the help output of the tool.
|
||||
|
||||
* Some tools might not have a `--help` argument but instead have a `-h` argument. For example, for `arriba`, the help message is obtained by running `arriba -h`:
|
||||
|
||||
```bash
|
||||
docker run quay.io/biocontainers/arriba:2.4.0--h0033a41_2 arriba -h
|
||||
```
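
Because tools differ in which help flag they accept, the fallback can be scripted. The following is a minimal sketch and not part of this repository: `capture_help` is a hypothetical helper, and it assumes that a tool returns a non-zero exit code when it does not recognise `--help`. Some tools print their help text and still exit with an error, so always inspect the resulting file.

```bash
#!/bin/bash

# Hypothetical helper: write the help output of a command to a file,
# trying `--help` first and falling back to `-h`.
# Usage: capture_help <output_file> <command> [args...]
capture_help() {
  local out_file="$1"
  shift
  # Assumption: the tool returns a non-zero exit code for an unknown flag.
  if "$@" --help > "$out_file" 2>&1; then
    return 0
  fi
  # Second attempt overwrites whatever the failed first attempt printed.
  "$@" -h > "$out_file" 2>&1
}

# Example (requires Docker, so it is left commented out):
# capture_help help_body.txt docker run quay.io/biocontainers/arriba:2.4.0--h0033a41_2 arriba
```

The captured text can then be appended to `src/xxx/help.txt` in the same way as the `docker run` output above.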
|
||||
|
||||
|
||||
### Step 6: Create or fetch test data
|
||||
|
||||
To help develop the component, it is useful to have some test data available. In most cases, we can use the test data from the Snakemake wrappers.
|
||||
|
||||
To make sure we can reproduce the test data in the future, we store the command to fetch the test data in a file at `src/xxx/test_data/script.sh`.
|
||||
|
||||
```bash
cat <<EOF > src/xxx/test_data/script.sh

# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
  git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi

# copy test data
cp -r /tmp/snakemake-wrappers/bio/xxx/test/* src/xxx/test_data
EOF
```
|
||||
|
||||
The test data should be suitable for testing this component. Ensure that the test data is small enough: ideally <1KB, preferably <10KB, if need be <100KB.
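
The size guideline can be checked before committing. This is a minimal sketch under stated assumptions: `list_large_files` is a hypothetical helper, not an existing script in this repository, and its default limit of 102400 bytes corresponds to the 100KB upper bound mentioned above.

```bash
#!/bin/bash

# Hypothetical helper: list files under a directory that exceed a size limit.
# Usage: list_large_files <directory> [max_bytes]
list_large_files() {
  local dir="$1"
  local max_bytes="${2:-102400}"  # default: 100KB
  find "$dir" -type f | while IFS= read -r file; do
    # The arithmetic expansion strips padding that some `wc` versions add.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -gt "$max_bytes" ]; then
      echo "$file: $size bytes"
    fi
  done
}

# Example: list_large_files src/xxx/test_data 10240
```

An empty result means every file is within the limit. File names containing newlines are not handled by this sketch.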
|
||||
|
||||
### Step 7: Add arguments for the input files
|
||||
|
||||
By looking at the help file, we add the input arguments to the config file. Here is an example of the input arguments of an existing component.
|
||||
|
||||
For instance, in the [arriba help file](src/arriba/help.txt), we see the following:
|
||||
|
||||
    Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
                  -g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
                  [-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
                  -o fusions.tsv [-O fusions.discarded.tsv] \
                  [OPTIONS]

     -x FILE  File in SAM/BAM/CRAM format with main alignments as generated by STAR
              (Aligned.out.sam). Arriba extracts candidate reads from this file.
|
||||
|
||||
Based on this information, we can add the following input arguments to the config file.
|
||||
|
||||
```yaml
argument_groups:
  - name: Inputs
    arguments:
      - name: --bam
        alternatives: -x
        type: file
        description: |
          File in SAM/BAM/CRAM format with main alignments as generated by STAR
          (Aligned.out.sam). Arriba extracts candidate reads from this file.
        required: true
        example: Aligned.out.bam
```
|
||||
|
||||
Check the [documentation](https://viash.io/reference/config/functionality/arguments) for more information on the format of input arguments.
|
||||
|
||||
Several notes:
|
||||
|
||||
* Argument names should be formatted in `--snake_case`. This means arguments like `--foo-bar` should be formatted as `--foo_bar`, and short arguments like `-f` should receive a longer name like `--foo`.
|
||||
|
||||
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
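
The renaming rule for long options is mechanical, so it can be sketched in a few lines. `to_snake_case` is a hypothetical helper written for illustration only. It handles long options such as `--foo-bar`; short options like `-f` still need a hand-picked longer name.

```bash
#!/bin/bash

# Hypothetical helper: convert a long option such as `--foo-bar`
# into the `--snake_case` form used for argument names.
to_snake_case() {
  local name="${1#--}"  # strip the leading dashes
  # Replace the remaining dashes with underscores and restore the prefix.
  printf '%s%s\n' '--' "$(printf '%s' "$name" | tr '-' '_')"
}

# Example: to_snake_case --foo-bar
```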
|
||||
|
||||
|
||||
|
||||
### Step 8: Add arguments for the output files
|
||||
|
||||
By looking at the help file, we now also add output arguments to the config file.
|
||||
|
||||
For example, in the [arriba help file](src/arriba/help.txt), we see the following:
|
||||
|
||||
|
||||
    Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
                  -g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
                  [-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
                  -o fusions.tsv [-O fusions.discarded.tsv] \
                  [OPTIONS]

     -o FILE  Output file with fusions that have passed all filters.

     -O FILE  Output file with fusions that were discarded due to filtering.
|
||||
|
||||
Based on this information, we can add the following output arguments to the config file.
|
||||
|
||||
```yaml
argument_groups:
  - name: Outputs
    arguments:
      - name: --fusions
        alternatives: -o
        type: file
        direction: output
        description: |
          Output file with fusions that have passed all filters.
        required: true
        example: fusions.tsv
      - name: --fusions_discarded
        alternatives: -O
        type: file
        direction: output
        description: |
          Output file with fusions that were discarded due to filtering.
        required: false
        example: fusions.discarded.tsv
```
|
||||
|
||||
Note:
|
||||
|
||||
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
|
||||
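
One way to expose the files of a directory-producing tool as separate outputs is to let the tool write into a temporary directory and then move each file to its own argument. The sketch below is illustrative only: the tool `xxx`, its `--outdir` flag, the file names and the `par_bar` / `par_baz` variables are all assumptions, not taken from an existing component.

```bash
#!/bin/bash

# Hypothetical wrapper: run a tool that writes a whole directory, then expose
# each file of interest through its own output argument.
# Assumptions: `xxx` writes bar.txt and baz.txt into the directory given by
# --outdir; $par_bar and $par_baz are output arguments defined in the config.
run_and_split_outputs() {
  local tmp_dir
  tmp_dir="$(mktemp -d)" || return 1

  xxx --outdir "$tmp_dir" || { rm -rf "$tmp_dir"; return 1; }

  # Only move the files the user asked for.
  if [ -n "$par_bar" ]; then mv "$tmp_dir/bar.txt" "$par_bar" || return 1; fi
  if [ -n "$par_baz" ]; then mv "$tmp_dir/baz.txt" "$par_baz" || return 1; fi

  rm -rf "$tmp_dir"
}
```

With this layout an optional output that was not requested is simply discarded together with the temporary directory.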
|
||||
### Step 9: Add arguments for the other arguments
|
||||
|
||||
Finally, add all other arguments to the config file. There are a few exceptions:
|
||||
|
||||
* Arguments related to specifying CPU and memory requirements are handled separately and should not be added to the config file.
|
||||
|
||||
* Arguments related to printing the information such as printing the version (`-v`, `--version`) or printing the help (`-h`, `--help`) should not be added to the config file.
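
Resource settings reach the script through Viash `meta_` variables rather than through `par_` arguments. The sketch below assumes the variables `meta_cpus` and `meta_memory_gb` are set by Viash when the user requests resources (check the Viash documentation for the version in use), and it assumes a tool `xxx` with hypothetical `--threads` and `--memory` flags.

```bash
#!/bin/bash

# Sketch: pass resource limits to the tool only when they are set.
# Assumptions: `xxx`, `--threads` and `--memory` are placeholders; replace
# them with the real tool and the flags listed in its help file.
run_with_resources() {
  xxx \
    --input "$par_input" \
    --output "$par_output" \
    ${meta_cpus:+--threads "$meta_cpus"} \
    ${meta_memory_gb:+--memory "$meta_memory_gb"}
}
```

The `${var:+...}` expansion produces nothing when the variable is empty, so the tool falls back to its own defaults in that case.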
|
||||
|
||||
|
||||
### Step 10: Add a Docker engine
|
||||
|
||||
To ensure reproducibility of components, we require that all components are run in a Docker container.
|
||||
|
||||
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:0.1.0--py_0
```
|
||||
|
||||
The container should have your tool installed, as well as `ps`.
|
||||
|
||||
If you didn't find a suitable container in Step 4, you can create a custom container. For example:
|
||||
|
||||
```yaml
engines:
  - type: docker
    image: python:3.10
    setup:
      - type: python
        packages: numpy
```
|
||||
|
||||
For more information on how to do this, see the [documentation](https://viash.io/guide/component/add-dependencies.html#steps-for-creating-a-custom-docker-platform).
|
||||
|
||||
Here is a list of base containers we can recommend:
|
||||
|
||||
* Bash: [`bash`](https://hub.docker.com/_/bash), [`ubuntu`](https://hub.docker.com/_/ubuntu)
|
||||
* C#: [`ghcr.io/data-intuitive/dotnet-script`](https://github.com/data-intuitive/ghcr-dotnet-script/pkgs/container/dotnet-script)
|
||||
* JavaScript: [`node`](https://hub.docker.com/_/node)
|
||||
* Python: [`python`](https://hub.docker.com/_/python), [`nvcr.io/nvidia/pytorch`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
|
||||
* R: [`eddelbuettel/r2u`](https://hub.docker.com/r/eddelbuettel/r2u), [`rocker/tidyverse`](https://hub.docker.com/r/rocker/tidyverse)
|
||||
* Scala: [`sbtscala/scala-sbt`](https://hub.docker.com/r/sbtscala/scala-sbt)
|
||||
|
||||
### Step 11: Write a runner script
|
||||
|
||||
Next, create a Bash script named `src/xxx/script.sh` that runs the tool with the arguments defined in the config.
|
||||
|
||||
```bash
#!/bin/bash

## VIASH START
## VIASH END

xxx \
  --input "$par_input" \
  --output "$par_output" \
  $([ "$par_option" = "true" ] && echo "--option")
```
|
||||
|
||||
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
|
||||
|
||||
As an example, this is what the Bash script for the `arriba` component looks like:
|
||||
|
||||
```bash
#!/bin/bash

## VIASH START
## VIASH END

arriba \
  -x "$par_bam" \
  -a "$par_genome" \
  -g "$par_gene_annotation" \
  -o "$par_fusions" \
  ${par_known_fusions:+-k "${par_known_fusions}"} \
  ${par_blacklist:+-b "${par_blacklist}"} \
  ${par_structural_variants:+-d "${par_structural_variants}"} \
  $([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
  $([ "$par_extra_information" = "true" ] && echo "-X") \
  $([ "$par_fill_gaps" = "true" ] && echo "-I")
```
|
||||
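
Arguments declared with `multiple: true` reach the script as a single string in which the values are joined by the separator, which the changelog of this repository gives as `;` by default. If the wrapped tool expects a different delimiter, the string has to be converted first. The helper below is a hypothetical sketch that assumes the tool wants a comma-separated list; check the tool's help file for the format it actually accepts.

```bash
#!/bin/bash

# Hypothetical helper: turn a multi-value string into a comma-separated list.
# Assumptions: values are joined by ";" and the wrapped tool expects commas.
join_multiple() {
  printf '%s' "$1" | tr ';' ','
}

# Possible use inside a runner script (not executed here):
# xxx ${par_contigs:+--contigs "$(join_multiple "$par_contigs")"}
```

Here `par_contigs` and `--contigs` are placeholders for an argument of the component being wrapped.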
|
||||
|
||||
### Step 12: Create test script
|
||||
|
||||
|
||||
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
|
||||
|
||||
```yaml
# ...
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
```
|
||||
|
||||
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
|
||||
|
||||
```bash
#!/bin/bash

## VIASH START
## VIASH END

echo "> Run xxx with test data"
"$meta_executable" \
  --input "$meta_resources_dir/test_data/input.txt" \
  --output "output.txt" \
  --option

echo ">> Checking output"
# An if-block keeps the exit status at zero when the file exists; a final
# `[ ! -f ... ] && exit 1` line would leave a non-zero status in that case.
if [ ! -f "output.txt" ]; then
  echo "Output file output.txt does not exist"
  exit 1
fi
```
|
||||
|
||||
|
||||
For example, this is what the test script for the `arriba` component looks like:
|
||||
|
||||
```bash
#!/bin/bash

## VIASH START
## VIASH END

echo "> Run arriba with blacklist"
"$meta_executable" \
  --bam "$meta_resources_dir/test_data/A.bam" \
  --genome "$meta_resources_dir/test_data/genome.fasta" \
  --gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
  --blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
  --fusions "fusions.tsv" \
  --fusions_discarded "fusions_discarded.tsv" \
  --interesting_contigs "1,2"

echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
```
|
||||
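
Output checks of this kind can be wrapped in small helper functions. The helpers below are hypothetical and not part of the repository. They use `if` blocks on purpose: a script whose last line has the form `[ ! -f file ] && exit 1` finishes with a non-zero status when the file does exist, because the failed test becomes the exit status of the script.

```bash
#!/bin/bash

# Hypothetical assertion helpers for a test.sh script.
assert_file_exists() {
  if [ ! -f "$1" ]; then
    echo "Output file $1 does not exist" >&2
    return 1
  fi
}

assert_file_not_empty() {
  if [ ! -s "$1" ]; then
    echo "Output file $1 is empty" >&2
    return 1
  fi
}

# Possible use in a test script (not executed here):
# assert_file_exists "fusions.tsv" || exit 1
# assert_file_not_empty "fusions.tsv" || exit 1
# echo "> Test successful"
```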
|
||||
### Step 13: Create a `/var/software_versions.txt` file
|
||||
|
||||
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
|
||||
|
||||
For now, this is managed by creating a file `/var/software_versions.txt` in the `setup` section of the Docker engine.
|
||||
|
||||
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:0.1.0--py_0
    setup:
      - type: docker
        run: |
          echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```
|
||||
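
Instead of hard-coding the version, the line can be derived from the tool's own version output. This is a sketch under assumptions: `format_version_line` is a hypothetical helper, and it assumes the version output contains a dotted number such as `0.1.0`. Tools that report versions in another format need a different pattern.

```bash
#!/bin/bash

# Hypothetical helper: format one line for /var/software_versions.txt.
# Usage: format_version_line <tool_name> <raw_version_output>
format_version_line() {
  local name="$1"
  local raw="$2"
  local version
  # Take the first dotted number (for example 0.1.0) found in the output.
  version="$(printf '%s\n' "$raw" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"
  [ -n "$version" ] || return 1
  printf '%s: "%s"\n' "$name" "$version"
}

# Possible use in the `run` section of the Docker setup (not executed here):
# format_version_line xxx "$(xxx --version 2>&1)" > /var/software_versions.txt
```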
21
LICENSE
Normal file
@@ -0,0 +1,21 @@
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2024 Data Intuitive
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
72
README.md
Normal file
@@ -0,0 +1,72 @@
|
||||
|
||||
|
||||
# 🌱📦 biobox
|
||||
|
||||
[](https://web.viash-hub.com/packages/biobox)
|
||||
[](https://github.com/viash-hub/biobox)
|
||||
[](https://github.com/viash-hub/biobox/blob/main/LICENSE)
|
||||
[](https://github.com/viash-hub/biobox/issues)
|
||||
[](https://viash.io)
|
||||
|
||||
A collection of bioinformatics tools for working with sequence data.
|
||||
|
||||
## Objectives
|
||||
|
||||
- **Reusability**: Facilitating the use of components across various
|
||||
projects and contexts.
|
||||
- **Reproducibility**: Ensuring that components are reproducible and can
|
||||
be easily shared.
|
||||
- **Best Practices**: Adhering to established standards in software
|
||||
development and bioinformatics.
|
||||
|
||||
## Contributing
|
||||
|
||||
We encourage contributions from the community. To contribute:
|
||||
|
||||
1. **Fork the Repository**: Start by forking this repository to your
|
||||
account.
|
||||
2. **Develop Your Component**: Create your Viash component, ensuring it
|
||||
aligns with our best practices (detailed below).
|
||||
3. **Submit a Pull Request**: After testing your component, submit a
|
||||
pull request for review.
|
||||
|
||||
## Contribution Guidelines
|
||||
|
||||
The contribution guidelines describe which steps you should follow to
|
||||
contribute a component to this repository.
|
||||
|
||||
1. Find a component to contribute
|
||||
2. Add config template
|
||||
3. Fill in the metadata
|
||||
4. Find a suitable container
|
||||
5. Create help file
|
||||
6. Create or fetch test data
|
||||
7. Add arguments for the input files
|
||||
8. Add arguments for the output files
|
||||
9. Add arguments for the other arguments
|
||||
10. Add a Docker engine
|
||||
11. Write a runner script
|
||||
12. Create test script
|
||||
13. Create a `/var/software_versions.txt` file
|
||||
|
||||
See the
|
||||
[CONTRIBUTING](https://github.com/viash-hub/biobox/blob/main/CONTRIBUTING.md)
|
||||
file for more details.
|
||||
|
||||
## Support and Community
|
||||
|
||||
For support, questions, or to join our community:
|
||||
|
||||
- **Issues**: Submit questions or issues via the [GitHub issue
|
||||
tracker](https://github.com/viash-hub/biobox/issues).
|
||||
- **Discussions**: Join our discussions via [GitHub
|
||||
Discussions](https://github.com/viash-hub/biobox/discussions).
|
||||
|
||||
## License
|
||||
|
||||
This repository is licensed under an MIT license. See the
|
||||
[LICENSE](https://github.com/viash-hub/biobox/blob/main/LICENSE) file
|
||||
for details.
|
||||
62
README.qmd
Normal file
@@ -0,0 +1,62 @@
|
||||
---
|
||||
format: gfm
|
||||
---
|
||||
```{r setup, include=FALSE}
|
||||
project <- yaml::read_yaml("_viash.yaml")
|
||||
license <- paste0(project$links$repository, "/blob/main/LICENSE")
|
||||
contributing <- paste0(project$links$repository, "/blob/main/CONTRIBUTING.md")
|
||||
```
|
||||
# 🌱📦 `r project$name`
|
||||
|
||||
[](https://web.viash-hub.com/packages/`r project$name`)
|
||||
[](`r project$links$repository`)
|
||||
[](`r license`)
|
||||
[](`r project$links$issue_tracker`)
|
||||
[`-blue)](https://viash.io)
|
||||
|
||||
`r project$description`
|
||||
|
||||
## Objectives
|
||||
|
||||
- **Reusability**: Facilitating the use of components across various projects and contexts.
|
||||
- **Reproducibility**: Ensuring that components are reproducible and can be easily shared.
|
||||
- **Best Practices**: Adhering to established standards in software development and bioinformatics.
|
||||
|
||||
## Contributing
|
||||
|
||||
We encourage contributions from the community. To contribute:
|
||||
|
||||
1. **Fork the Repository**: Start by forking this repository to your account.
|
||||
2. **Develop Your Component**: Create your Viash component, ensuring it aligns with our best practices (detailed below).
|
||||
3. **Submit a Pull Request**: After testing your component, submit a pull request for review.
|
||||
|
||||
## Contribution Guidelines
|
||||
|
||||
The contribution guidelines describe which steps you should follow to contribute a component to this repository.
|
||||
|
||||
```{r echo=FALSE}
|
||||
lines <- readr::read_lines("CONTRIBUTING.md")
|
||||
|
||||
index_start <- grep("^### Step [0-9]*:", lines)
|
||||
|
||||
index_end <- c(index_start[-1] - 1, length(lines))
|
||||
|
||||
name <- gsub("^### Step [0-9]*: *", "", lines[index_start])
|
||||
|
||||
knitr::asis_output(
|
||||
paste(paste0(" 1. ", name, "\n"), collapse = "")
|
||||
)
|
||||
```
|
||||
|
||||
See the [CONTRIBUTING](`r contributing`) file for more details.
|
||||
|
||||
|
||||
## Support and Community
|
||||
|
||||
For support, questions, or to join our community:
|
||||
|
||||
- **Issues**: Submit questions or issues via the [GitHub issue tracker](`r project$links$issue_tracker`).
|
||||
- **Discussions**: Join our discussions via [GitHub Discussions](`r project$links$repository`/discussions).
|
||||
|
||||
## License
|
||||
This repository is licensed under an MIT license. See the [LICENSE](`r license`) file for details.
|
||||
13
_viash.yaml
Normal file
@@ -0,0 +1,13 @@
|
||||
name: biobox
|
||||
description: |
|
||||
A collection of bioinformatics tools for working with sequence data.
|
||||
license: MIT
|
||||
keywords: [bioinformatics, modules, sequencing]
|
||||
links:
|
||||
issue_tracker: https://github.com/viash-hub/biobox/issues
|
||||
repository: https://github.com/viash-hub/biobox
|
||||
|
||||
viash_version: 0.9.0-RC6
|
||||
|
||||
config_mods: |
|
||||
.requirements.commands := ['ps']
|
||||
3
main.nf
Normal file
@@ -0,0 +1,3 @@
|
||||
workflow {
|
||||
print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
|
||||
}
|
||||
6
nextflow.config
Normal file
@@ -0,0 +1,6 @@
|
||||
manifest {
|
||||
name = "biobox"
|
||||
version = "update_busco"
|
||||
defaultBranch = "main"
|
||||
nextflowVersion = "!>=20.12.1-edge"
|
||||
}
|
||||
385
src/arriba/config.vsh.yaml
Normal file
@@ -0,0 +1,385 @@
|
||||
name: arriba
|
||||
description: Detect gene fusions from RNA-Seq data
|
||||
keywords: [Gene fusion, RNA-Seq]
|
||||
links:
|
||||
homepage: https://arriba.readthedocs.io/en/latest/
|
||||
documentation: https://arriba.readthedocs.io/en/latest/
|
||||
repository: https://github.com/suhrig/arriba
|
||||
references:
|
||||
doi: 10.1101/gr.257246.119
|
||||
license: MIT
|
||||
requirements:
|
||||
cpus: 1
|
||||
commands: [ arriba ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --bam
|
||||
alternatives: -x
|
||||
type: file
|
||||
description: |
|
||||
File in SAM/BAM/CRAM format with main alignments as generated by STAR
|
||||
(Aligned.out.sam). Arriba extracts candidate reads from this file.
|
||||
required: true
|
||||
example: Aligned.out.bam
|
||||
- name: --genome
|
||||
alternatives: -a
|
||||
type: file
|
||||
description: |
|
||||
FastA file with genome sequence (assembly). The file may be gzip-compressed. An
|
||||
index with the file extension .fai must exist only if CRAM files are processed.
|
||||
required: true
|
||||
example: assembly.fa
|
||||
- name: --gene_annotation
|
||||
alternatives: -g
|
||||
type: file
|
||||
description: |
|
||||
GTF file with gene annotation. The file may be gzip-compressed.
|
||||
required: true
|
||||
example: annotation.gtf
|
||||
- name: --known_fusions
|
||||
alternatives: -k
|
||||
type: file
|
||||
description: |
|
||||
File containing known/recurrent fusions. Some cancer entities are often
|
||||
characterized by fusions between the same pair of genes. In order to boost
|
||||
sensitivity, a list of known fusions can be supplied using this parameter. The list
|
||||
must contain two columns with the names of the fused genes, separated by tabs.
|
||||
required: false
|
||||
example: known_fusions.tsv
|
||||
- name: --blacklist
|
||||
alternatives: -b
|
||||
type: file
|
||||
description: |
|
||||
File containing blacklisted events (recurrent artifacts and transcripts
|
||||
observed in healthy tissue).
|
||||
required: false
|
||||
example: blacklist.tsv
|
||||
- name: --structural_variants
|
||||
alternatives: -d
|
||||
type: file
|
||||
description: |
|
||||
Tab-separated file with coordinates of structural variants found using
|
||||
whole-genome sequencing data. These coordinates serve to increase sensitivity
|
||||
towards weakly expressed fusions and to eliminate fusions with low evidence.
|
||||
required: false
|
||||
example: structural_variants_from_WGS.tsv
|
||||
- name: --tags
|
||||
alternatives: -t
|
||||
type: file
|
||||
description: |
|
||||
Tab-separated file containing fusions to annotate with tags in the 'tags' column.
|
||||
The first two columns specify the genes; the third column specifies the tag. The
|
||||
file may be gzip-compressed.
|
||||
required: false
|
||||
example: tags.tsv
|
||||
- name: --protein_domains
|
||||
alternatives: -p
|
||||
type: file
|
||||
description: |
|
||||
File in GFF3 format containing coordinates of the protein domains of genes. The
|
||||
protein domains retained in a fusion are listed in the column
|
||||
'retained_protein_domains'. The file may be gzip-compressed.
|
||||
required: false
|
||||
example: protein_domains.gff3
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --fusions
|
||||
alternatives: -o
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with fusions that have passed all filters.
|
||||
required: true
|
||||
example: fusions.tsv
|
||||
- name: --fusions_discarded
|
||||
alternatives: -O
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file with fusions that were discarded due to filtering.
|
||||
required: false
|
||||
example: fusions.discarded.tsv
|
||||
- name: Arguments
|
||||
arguments:
|
||||
- name: --max_genomic_breakpoint_distance
|
||||
alternatives: -D
|
||||
type: long
|
||||
description: |
|
||||
When a file with genomic breakpoints obtained via
|
||||
whole-genome sequencing is supplied via the --structural_variants
|
||||
parameter, this parameter determines how far a
|
||||
genomic breakpoint may be away from a
|
||||
transcriptomic breakpoint to consider it as a
|
||||
related event. For events inside genes, the
|
||||
distance is added to the end of the gene; for
|
||||
intergenic events, the distance threshold is
|
||||
applied as is. Default: 100000.
|
||||
required: false
|
||||
- name: --strandedness
|
||||
alternatives: -s
|
||||
type: string
|
||||
description: |
|
||||
Whether a strand-specific protocol was used for library preparation,
|
||||
and if so, the type of strandedness (auto/yes/no/reverse). When
|
||||
unstranded data is processed, the strand can sometimes be inferred from
|
||||
splice-patterns. But in unclear situations, stranded data helps
|
||||
resolve ambiguities. Default: auto
|
||||
choices: ["auto", "yes", "no", "reverse"]
|
||||
required: false
|
||||
- name: --interesting_contigs
|
||||
alternatives: -i
|
||||
type: string
|
||||
description: |
|
||||
List of interesting contigs. Fusions between genes
|
||||
on other contigs are ignored. Contigs can be specified with or without the
|
||||
prefix "chr". Asterisks (*) are treated as wild-cards.
|
||||
Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*
|
||||
required: false
|
||||
multiple: true
|
||||
example: ["1", "2", "AC_*", "NC_*"]
|
||||
- name: --viral_contigs
|
||||
alternatives: -v
|
||||
type: string
|
||||
description: |
|
||||
List of viral contigs. Asterisks (*) are treated as
|
||||
wild-cards.
|
||||
Default: AC_* NC_*
|
||||
required: false
|
||||
multiple: true
|
||||
example: ["AC_*", "NC_*"]
|
||||
- name: --disable_filters
|
||||
alternatives: -f
|
||||
type: string
|
||||
description: |
|
||||
List of filters to disable. By default all filters are
|
||||
enabled.
|
||||
choices: [ homologs, low_entropy, isoforms,
|
||||
top_expressed_viral_contigs, viral_contigs, uninteresting_contigs,
|
||||
non_coding_neighbors, mismatches, duplicates, no_genomic_support,
|
||||
genomic_support, intronic, end_to_end, relative_support,
|
||||
low_coverage_viral_contigs, merge_adjacent, mismappers, multimappers,
|
||||
same_gene, long_gap, internal_tandem_duplication, small_insert_size,
|
||||
read_through, inconsistently_clipped, intragenic_exonic,
|
||||
marginal_read_through, spliced, hairpin, blacklist, min_support,
|
||||
select_best, in_vitro, short_anchor, known_fusions, no_coverage,
|
||||
homopolymer, many_spliced ]
|
||||
required: false
|
||||
multiple: true
|
||||
- name: --max_e_value
|
||||
alternatives: -E
|
||||
type: double
|
||||
description: |
|
||||
Arriba estimates the number of fusions with a given number of supporting
|
||||
reads which one would expect to see by random chance. If the expected number
|
||||
of fusions (e-value) is higher than this threshold, the fusion is
|
||||
discarded by the 'relative_support' filter. Note: Increasing this
|
||||
threshold can dramatically increase the number of false positives and may
|
||||
increase the runtime of resource-intensive steps. Fractional values are
|
||||
possible. Default: 0.300000
|
||||
required: false
|
||||
- name: --min_supporting_reads
|
||||
alternatives: -S
|
||||
type: integer
|
||||
description: |
|
||||
The 'min_support' filter discards all fusions with fewer than
|
||||
this many supporting reads (split reads and discordant mates
|
||||
combined). Default: 2
|
||||
required: false
|
||||
example: 2
|
||||
- name: --max_mismappers
|
||||
alternatives: -m
|
||||
type: double
|
||||
description: |
|
||||
When more than this fraction of supporting reads turns out to be
|
||||
mismappers, the 'mismappers' filter discards the fusion. Default:
|
||||
0.800000
|
||||
required: false
|
||||
example: 0.8
|
||||
- name: --max_homolog_identity
|
||||
alternatives: -L
|
||||
type: double
|
||||
description: |
|
||||
Genes with more than the given fraction of sequence identity are
|
||||
considered homologs and removed by the 'homologs' filter.
|
||||
Default: 0.300000
|
||||
required: false
|
||||
example: 0.3
|
||||
- name: --homopolymer_length
|
||||
alternatives: -H
|
||||
type: integer
|
||||
description: |
|
||||
The 'homopolymer' filter removes breakpoints adjacent to
|
||||
homopolymers of the given length or more. Default: 6
|
||||
required: false
|
||||
example: 6
|
||||
- name: --read_through_distance
|
||||
alternatives: -R
|
||||
type: integer
|
||||
description: |
|
||||
The 'read_through' filter removes read-through fusions
|
||||
where the breakpoints are less than the given distance away
|
||||
from each other. Default: 10000
|
||||
required: false
|
||||
example: 10000
|
||||
- name: --min_anchor_length
|
||||
alternatives: -A
|
||||
type: integer
|
||||
description: |
|
||||
Alignment artifacts are often characterized by split reads coming
|
||||
from only one gene and no discordant mates. Moreover, the split
|
||||
reads only align to a short stretch in one of the genes. The
|
||||
'short_anchor' filter removes these fusions. This parameter sets
|
||||
the threshold in bp for what the filter considers short. Default: 23
|
||||
required: false
|
||||
example: 23
|
||||
- name: --many_spliced_events
|
||||
alternatives: -M
|
||||
type: integer
|
||||
description: |
|
||||
The 'many_spliced' filter recovers fusions between genes that
|
||||
have at least this many spliced breakpoints. Default: 4
|
||||
required: false
|
||||
example: 4
|
||||
- name: --max_kmer_content
|
||||
alternatives: -K
|
||||
type: double
|
||||
description: |
|
||||
The 'low_entropy' filter removes reads with repetitive 3-mers. If
|
||||
the 3-mers make up more than the given fraction of the sequence, then
|
||||
the read is discarded. Default: 0.600000
|
||||
required: false
|
||||
example: 0.6
|
||||
- name: --max_mismatch_pvalue
|
||||
alternatives: -V
|
||||
type: double
|
||||
description: |
|
||||
The 'mismatches' filter uses a binomial model to calculate a
|
||||
p-value for observing a given number of mismatches in a read. If
|
||||
the number of mismatches is too high, the read is discarded.
|
||||
Default: 0.010000
|
||||
required: false
|
||||
example: 0.05
|
||||
- name: --fragment_length
|
||||
alternatives: -F
|
||||
type: integer
|
||||
description: |
|
||||
When paired-end data is given, the fragment length is estimated
|
||||
automatically and this parameter has no effect. But when single-end
|
||||
data is given, the mean fragment length should be specified to
|
||||
effectively filter fusions that arise from hairpin structures.
|
||||
Default: 200
|
||||
required: false
|
||||
example: 200
|
||||
- name: --max_reads
|
||||
alternatives: -U
|
||||
type: integer
|
||||
description: |
|
||||
Subsample fusions with more than the given number of supporting reads. This
|
||||
improves performance without compromising sensitivity, as long as the
|
||||
threshold is high. Counting of supporting reads beyond the threshold is
|
||||
inaccurate, obviously. Default: 300
|
||||
required: false
|
||||
example: 300
|
||||
- name: --quantile
|
||||
alternatives: -Q
|
||||
type: double
|
||||
description: |
|
||||
Highly expressed genes are prone to produce artifacts during library
|
||||
preparation. Genes with an expression above the given quantile are eligible
|
||||
for filtering by the 'in_vitro' filter. Default: 0.998000
|
||||
required: false
|
||||
example: 0.998
|
||||
- name: --exonic_fraction
|
||||
alternatives: -e
|
||||
type: double
|
||||
description: |
|
||||
The breakpoints of false-positive predictions of intragenic events
|
||||
are often both in exons. True predictions are more likely to have at
|
||||
least one breakpoint in an intron, because introns are larger. If the
|
||||
fraction of exonic sequence between two breakpoints is smaller than
|
||||
the given fraction, the 'intragenic_exonic' filter discards the
|
||||
event. Default: 0.330000
|
||||
required: false
|
||||
example: 0.33
|
||||
- name: --top_n
|
||||
alternatives: -T
|
||||
type: integer
|
||||
description: |
|
||||
Only report viral integration sites of the top N most highly expressed viral
|
||||
contigs. Default: 5
|
||||
required: false
|
||||
example: 5
|
||||
- name: --covered_fraction
|
||||
alternatives: -C
|
||||
type: double
|
||||
description: |
|
||||
Ignore virally associated events if the virus is not fully
|
||||
expressed, i.e., less than the given fraction of the viral contig is
|
||||
transcribed. Default: 0.050000
|
||||
required: false
|
||||
example: 0.05
|
||||
- name: --max_itd_length
|
||||
alternatives: -l
|
||||
type: integer
|
||||
description: |
|
||||
Maximum length of internal tandem duplications. Note: Increasing
|
||||
this value beyond the default can impair performance and lead to many
|
||||
false positives. Default: 100
|
||||
required: false
|
||||
example: 100
|
||||
- name: --min_itd_allele_fraction
|
||||
alternatives: -z
|
||||
type: double
|
||||
description: |
|
||||
Required fraction of supporting reads to report an internal
|
||||
tandem duplication. Default: 0.070000
|
||||
required: false
|
||||
example: 0.07
|
||||
- name: --min_itd_supporting_reads
|
||||
alternatives: -Z
|
||||
type: integer
|
||||
description: |
|
||||
Required absolute number of supporting reads to report an
|
||||
internal tandem duplication. Default: 10
|
||||
required: false
|
||||
example: 10
|
||||
- name: --skip_duplicate_marking
|
||||
alternatives: -u
|
||||
type: boolean_true
|
||||
description: |
|
||||
Instead of performing duplicate marking itself, Arriba relies on duplicate marking by a
|
||||
preceding program using the BAM_FDUP flag. This makes sense when unique molecular
|
||||
identifiers (UMI) are used.
|
||||
- name: --extra_information
|
||||
alternatives: -X
|
||||
type: boolean_true
|
||||
description: |
|
||||
To reduce the runtime and file size, by default, the columns 'fusion_transcript',
|
||||
'peptide_sequence', and 'read_identifiers' are left empty in the file containing
|
||||
discarded fusion candidates (see parameter -O). When this flag is set, this extra
|
||||
information is reported in the discarded fusions file.
|
||||
- name: --fill_gaps
|
||||
alternatives: -I
|
||||
type: boolean_true
|
||||
description: |
|
||||
If assembly of the fusion transcript sequence from the supporting reads is incomplete
|
||||
(denoted as '...'), fill the gaps using the assembly sequence wherever possible.
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/arriba:2.4.0--h0033a41_2
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
arriba -h 2>&1 | grep 'Version:' | sed 's/Version:\s\(.*\)/arriba: "\1"/' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
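The Docker setup step in the config above derives a version entry from the `arriba -h` output. A minimal sketch of what that `sed` expression does, assuming GNU sed (the `\s` class is a GNU extension) and using a hard-coded stand-in line instead of a real `arriba` call; the variable names `help_line` and `version_entry` are illustrative only:

```bash
#!/bin/bash

# Stand-in for the relevant line of the `arriba -h` output (assumption: no arriba binary needed here).
help_line='Version: 2.4.0'

# Same substitution as in the setup step: capture everything after "Version:" plus one
# whitespace character and wrap it as a YAML-style key/value pair.
version_entry=$(printf '%s\n' "$help_line" | sed 's/Version:\s\(.*\)/arriba: "\1"/')

echo "$version_entry"
```

Lines that do not contain `Version:` would pass through `sed` unchanged, which is why the setup step filters with `grep` first.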
198
src/arriba/help.txt
Normal file
@@ -0,0 +1,198 @@
|
||||
```bash
|
||||
arriba -h
|
||||
```
|
||||
|
||||
Arriba gene fusion detector
|
||||
---------------------------
|
||||
Version: 2.4.0
|
||||
|
||||
Arriba is a fast tool to search for aberrant transcripts such as gene fusions.
|
||||
It is based on chimeric alignments found by the STAR RNA-Seq aligner.
|
||||
|
||||
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
|
||||
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
|
||||
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
|
||||
-o fusions.tsv [-O fusions.discarded.tsv] \
|
||||
[OPTIONS]
|
||||
|
||||
-c FILE File in SAM/BAM/CRAM format with chimeric alignments as generated by STAR
|
||||
(Chimeric.out.sam). This parameter is only required, if STAR was run with the
|
||||
parameter '--chimOutType SeparateSAMold'. When STAR was run with the parameter
|
||||
'--chimOutType WithinBAM', it suffices to pass the parameter -x to Arriba and -c
|
||||
can be omitted.
|
||||
|
||||
-x FILE File in SAM/BAM/CRAM format with main alignments as generated by STAR
|
||||
(Aligned.out.sam). Arriba extracts candidate reads from this file.
|
||||
|
||||
-g FILE GTF file with gene annotation. The file may be gzip-compressed.
|
||||
|
||||
-G GTF_FEATURES Comma-/space-separated list of names of GTF features.
|
||||
Default: gene_name=gene_name|gene_id gene_id=gene_id
|
||||
transcript_id=transcript_id feature_exon=exon feature_CDS=CDS
|
||||
|
||||
-a FILE FastA file with genome sequence (assembly). The file may be gzip-compressed. An
|
||||
index with the file extension .fai must exist only if CRAM files are processed.
|
||||
|
||||
-b FILE File containing blacklisted events (recurrent artifacts and transcripts
|
||||
observed in healthy tissue).
|
||||
|
||||
-k FILE File containing known/recurrent fusions. Some cancer entities are often
|
||||
characterized by fusions between the same pair of genes. In order to boost
|
||||
sensitivity, a list of known fusions can be supplied using this parameter. The list
|
||||
must contain two columns with the names of the fused genes, separated by tabs.
|
||||
|
||||
-o FILE Output file with fusions that have passed all filters.
|
||||
|
||||
-O FILE Output file with fusions that were discarded due to filtering.
|
||||
|
||||
-t FILE Tab-separated file containing fusions to annotate with tags in the 'tags' column.
|
||||
The first two columns specify the genes; the third column specifies the tag. The
|
||||
file may be gzip-compressed.
|
||||
|
||||
-p FILE File in GFF3 format containing coordinates of the protein domains of genes. The
|
||||
protein domains retained in a fusion are listed in the column
|
||||
'retained_protein_domains'. The file may be gzip-compressed.
|
||||
|
||||
-d FILE Tab-separated file with coordinates of structural variants found using
|
||||
whole-genome sequencing data. These coordinates serve to increase sensitivity
|
||||
towards weakly expressed fusions and to eliminate fusions with low evidence.
|
||||
|
||||
-D MAX_GENOMIC_BREAKPOINT_DISTANCE When a file with genomic breakpoints obtained via
|
||||
whole-genome sequencing is supplied via the -d
|
||||
parameter, this parameter determines how far a
|
||||
genomic breakpoint may be away from a
|
||||
transcriptomic breakpoint to consider it as a
|
||||
related event. For events inside genes, the
|
||||
distance is added to the end of the gene; for
|
||||
intergenic events, the distance threshold is
|
||||
applied as is. Default: 100000
|
||||
|
||||
-s STRANDEDNESS Whether a strand-specific protocol was used for library preparation,
|
||||
and if so, the type of strandedness (auto/yes/no/reverse). When
|
||||
unstranded data is processed, the strand can sometimes be inferred from
|
||||
splice-patterns. But in unclear situations, stranded data helps
|
||||
resolve ambiguities. Default: auto
|
||||
|
||||
-i CONTIGS Comma-/space-separated list of interesting contigs. Fusions between genes
|
||||
on other contigs are ignored. Contigs can be specified with or without the
|
||||
prefix "chr". Asterisks (*) are treated as wild-cards.
|
||||
Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*
|
||||
|
||||
-v CONTIGS Comma-/space-separated list of viral contigs. Asterisks (*) are treated as
|
||||
wild-cards.
|
||||
Default: AC_* NC_*
|
||||
|
||||
-f FILTERS Comma-/space-separated list of filters to disable. By default all filters are
|
||||
enabled. Valid values: homologs, low_entropy, isoforms,
|
||||
top_expressed_viral_contigs, viral_contigs, uninteresting_contigs,
|
||||
non_coding_neighbors, mismatches, duplicates, no_genomic_support,
|
||||
genomic_support, intronic, end_to_end, relative_support,
|
||||
low_coverage_viral_contigs, merge_adjacent, mismappers, multimappers,
|
||||
same_gene, long_gap, internal_tandem_duplication, small_insert_size,
|
||||
read_through, inconsistently_clipped, intragenic_exonic,
|
||||
marginal_read_through, spliced, hairpin, blacklist, min_support,
|
||||
select_best, in_vitro, short_anchor, known_fusions, no_coverage,
|
||||
homopolymer, many_spliced
|
||||
|
||||
-E MAX_E-VALUE Arriba estimates the number of fusions with a given number of supporting
|
||||
reads which one would expect to see by random chance. If the expected number
|
||||
of fusions (e-value) is higher than this threshold, the fusion is
|
||||
discarded by the 'relative_support' filter. Note: Increasing this
|
||||
threshold can dramatically increase the number of false positives and may
|
||||
increase the runtime of resource-intensive steps. Fractional values are
|
||||
possible. Default: 0.300000
|
||||
|
||||
-S MIN_SUPPORTING_READS The 'min_support' filter discards all fusions with fewer than
|
||||
this many supporting reads (split reads and discordant mates
|
||||
combined). Default: 2
|
||||
|
||||
-m MAX_MISMAPPERS When more than this fraction of supporting reads turns out to be
|
||||
mismappers, the 'mismappers' filter discards the fusion. Default:
|
||||
0.800000
|
||||
|
||||
-L MAX_HOMOLOG_IDENTITY Genes with more than the given fraction of sequence identity are
|
||||
considered homologs and removed by the 'homologs' filter.
|
||||
Default: 0.300000
|
||||
|
||||
-H HOMOPOLYMER_LENGTH The 'homopolymer' filter removes breakpoints adjacent to
|
||||
homopolymers of the given length or more. Default: 6
|
||||
|
||||
-R READ_THROUGH_DISTANCE The 'read_through' filter removes read-through fusions
|
||||
where the breakpoints are less than the given distance away
|
||||
from each other. Default: 10000
|
||||
|
||||
-A MIN_ANCHOR_LENGTH Alignment artifacts are often characterized by split reads coming
|
||||
from only one gene and no discordant mates. Moreover, the split
|
||||
reads only align to a short stretch in one of the genes. The
|
||||
'short_anchor' filter removes these fusions. This parameter sets
|
||||
the threshold in bp for what the filter considers short. Default: 23
|
||||
|
||||
-M MANY_SPLICED_EVENTS The 'many_spliced' filter recovers fusions between genes that
|
||||
have at least this many spliced breakpoints. Default: 4
|
||||
|
||||
-K MAX_KMER_CONTENT The 'low_entropy' filter removes reads with repetitive 3-mers. If
|
||||
the 3-mers make up more than the given fraction of the sequence, then
|
||||
the read is discarded. Default: 0.600000
|
||||
|
||||
-V MAX_MISMATCH_PVALUE The 'mismatches' filter uses a binomial model to calculate a
|
||||
p-value for observing a given number of mismatches in a read. If
|
||||
the number of mismatches is too high, the read is discarded.
|
||||
Default: 0.010000
|
||||
|
||||
-F FRAGMENT_LENGTH When paired-end data is given, the fragment length is estimated
|
||||
automatically and this parameter has no effect. But when single-end
|
||||
data is given, the mean fragment length should be specified to
|
||||
effectively filter fusions that arise from hairpin structures.
|
||||
Default: 200
|
||||
|
||||
-U MAX_READS Subsample fusions with more than the given number of supporting reads. This
|
||||
improves performance without compromising sensitivity, as long as the
|
||||
threshold is high. Counting of supporting reads beyond the threshold is
|
||||
inaccurate, obviously. Default: 300
|
||||
|
||||
-Q QUANTILE Highly expressed genes are prone to produce artifacts during library
|
||||
preparation. Genes with an expression above the given quantile are eligible
|
||||
for filtering by the 'in_vitro' filter. Default: 0.998000
|
||||
|
||||
-e EXONIC_FRACTION The breakpoints of false-positive predictions of intragenic events
|
||||
are often both in exons. True predictions are more likely to have at
|
||||
least one breakpoint in an intron, because introns are larger. If the
|
||||
fraction of exonic sequence between two breakpoints is smaller than
|
||||
the given fraction, the 'intragenic_exonic' filter discards the
|
||||
event. Default: 0.330000
|
||||
|
||||
-T TOP_N Only report viral integration sites of the top N most highly expressed viral
|
||||
contigs. Default: 5
|
||||
|
||||
-C COVERED_FRACTION Ignore virally associated events if the virus is not fully
|
||||
expressed, i.e., less than the given fraction of the viral contig is
|
||||
transcribed. Default: 0.050000
|
||||
|
||||
-l MAX_ITD_LENGTH Maximum length of internal tandem duplications. Note: Increasing
|
||||
this value beyond the default can impair performance and lead to many
|
||||
false positives. Default: 100
|
||||
|
||||
-z MIN_ITD_ALLELE_FRACTION Required fraction of supporting reads to report an internal
|
||||
tandem duplication. Default: 0.070000
|
||||
|
||||
-Z MIN_ITD_SUPPORTING_READS Required absolute number of supporting reads to report an
|
||||
internal tandem duplication. Default: 10
|
||||
|
||||
-u Instead of performing duplicate marking itself, Arriba relies on duplicate marking by a
|
||||
preceding program using the BAM_FDUP flag. This makes sense when unique molecular
|
||||
identifiers (UMI) are used.
|
||||
|
||||
-X To reduce the runtime and file size, by default, the columns 'fusion_transcript',
|
||||
'peptide_sequence', and 'read_identifiers' are left empty in the file containing
|
||||
discarded fusion candidates (see parameter -O). When this flag is set, this extra
|
||||
information is reported in the discarded fusions file.
|
||||
|
||||
-I If assembly of the fusion transcript sequence from the supporting reads is incomplete
|
||||
(denoted as '...'), fill the gaps using the assembly sequence wherever possible.
|
||||
|
||||
-h Print help and exit.
|
||||
|
||||
Code repository: https://github.com/suhrig/arriba
|
||||
Get help/report bugs: https://github.com/suhrig/arriba/issues
|
||||
User manual: https://arriba.readthedocs.io/
|
||||
Please cite: https://doi.org/10.1101/gr.257246.119
|
||||
54
src/arriba/script.sh
Normal file
@@ -0,0 +1,54 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# unset flags
|
||||
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
|
||||
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
|
||||
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
|
||||
|
||||
# replace ';' with ','
|
||||
par_interesting_contigs=$(echo "$par_interesting_contigs" | tr ';' ',')
|
||||
par_viral_contigs=$(echo "$par_viral_contigs" | tr ';' ',')
|
||||
par_disable_filters=$(echo "$par_disable_filters" | tr ';' ',')
|
||||
|
||||
# run arriba
|
||||
arriba \
|
||||
-x "$par_bam" \
|
||||
-a "$par_genome" \
|
||||
-g "$par_gene_annotation" \
|
||||
-o "$par_fusions" \
|
||||
${par_known_fusions:+-k "${par_known_fusions}"} \
|
||||
${par_blacklist:+-b "${par_blacklist}"} \
|
||||
${par_structural_variants:+-d "${par_structural_variants}"} \
|
||||
${par_tags:+-t "${par_tags}"} \
|
||||
${par_protein_domains:+-p "${par_protein_domains}"} \
|
||||
${par_fusions_discarded:+-O "${par_fusions_discarded}"} \
|
||||
${par_max_genomic_breakpoint_distance:+-D "${par_max_genomic_breakpoint_distance}"} \
|
||||
${par_strandedness:+-s "${par_strandedness}"} \
|
||||
${par_interesting_contigs:+-i "${par_interesting_contigs}"} \
|
||||
${par_viral_contigs:+-v "${par_viral_contigs}"} \
|
||||
${par_disable_filters:+-f "${par_disable_filters}"} \
|
||||
${par_max_e_value:+-E "${par_max_e_value}"} \
|
||||
${par_min_supporting_reads:+-S "${par_min_supporting_reads}"} \
|
||||
${par_max_mismappers:+-m "${par_max_mismappers}"} \
|
||||
${par_max_homolog_identity:+-L "${par_max_homolog_identity}"} \
|
||||
${par_homopolymer_length:+-H "${par_homopolymer_length}"} \
|
||||
${par_read_through_distance:+-R "${par_read_through_distance}"} \
|
||||
${par_min_anchor_length:+-A "${par_min_anchor_length}"} \
|
||||
${par_many_spliced_events:+-M "${par_many_spliced_events}"} \
|
||||
${par_max_kmer_content:+-K "${par_max_kmer_content}"} \
|
||||
${par_max_mismatch_pvalue:+-V "${par_max_mismatch_pvalue}"} \
|
||||
${par_fragment_length:+-F "${par_fragment_length}"} \
|
||||
${par_max_reads:+-U "${par_max_reads}"} \
|
||||
${par_quantile:+-Q "${par_quantile}"} \
|
||||
${par_exonic_fraction:+-e "${par_exonic_fraction}"} \
|
||||
${par_top_n:+-T "${par_top_n}"} \
|
||||
${par_covered_fraction:+-C "${par_covered_fraction}"} \
|
||||
${par_max_itd_length:+-l "${par_max_itd_length}"} \
|
||||
${par_min_itd_allele_fraction:+-z "${par_min_itd_allele_fraction}"} \
|
||||
${par_min_itd_supporting_reads:+-Z "${par_min_itd_supporting_reads}"} \
|
||||
${par_skip_duplicate_marking:+-u} \
|
||||
${par_extra_information:+-X} \
|
||||
${par_fill_gaps:+-I}
|
||||
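The script above relies on two shell idioms: converting the `;`-separated viash list into the comma-separated form Arriba accepts, and `${var:+...}` expansions that contribute a flag only when the parameter is set. A minimal sketch of both, independent of Arriba; the helpers `to_comma_list` and `count_args` are hypothetical and exist only for this illustration:

```bash
#!/bin/bash

# Hypothetical helper: turn a ';'-separated list into a ','-separated one.
# Quoting "$1" keeps wildcards such as AC_* from being glob-expanded by the shell.
to_comma_list() {
  printf '%s' "$1" | tr ';' ','
}

# Hypothetical helper: report how many words it received.
count_args() {
  echo "$#"
}

par_blacklist=""                 # unset/empty optional parameter
par_viral_contigs='AC_*;NC_*'    # multiple values as passed with multiple_sep ';'

viral=$(to_comma_list "$par_viral_contigs")

# An empty parameter expands to nothing, so no flag is passed at all.
n_empty=$(count_args ${par_blacklist:+-b "${par_blacklist}"})
# A non-empty parameter expands to exactly two words: the flag and its quoted value.
n_set=$(count_args ${viral:+-v "${viral}"})

echo "$viral"            # prints AC_*,NC_*
echo "$n_empty $n_set"   # prints 0 2
```

The inner double quotes in `${var:+-v "${var}"}` protect the value from word splitting and globbing even though the outer expansion is unquoted.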
45
src/arriba/test.sh
Normal file
@@ -0,0 +1,45 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
dir_in="$meta_resources_dir/test_data"
|
||||
|
||||
echo "> Run arriba with blacklist"
|
||||
"$meta_executable" \
|
||||
--bam "$dir_in/A.bam" \
|
||||
--genome "$dir_in/genome.fasta" \
|
||||
--gene_annotation "$dir_in/annotation.gtf" \
|
||||
--blacklist "$dir_in/blacklist.tsv" \
|
||||
--fusions "fusions.tsv" \
|
||||
--fusions_discarded "fusions_discarded.tsv" \
|
||||
--interesting_contigs "1,2"
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
|
||||
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
|
||||
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
|
||||
|
||||
rm fusions.tsv fusions_discarded.tsv
|
||||
|
||||
echo "> Run arriba without blacklist"
|
||||
"$meta_executable" \
|
||||
--bam "$dir_in/A.bam" \
|
||||
--genome "$dir_in/genome.fasta" \
|
||||
--gene_annotation "$dir_in/annotation.gtf" \
|
||||
--fusions "fusions.tsv" \
|
||||
--fusions_discarded "fusions_discarded.tsv" \
|
||||
--interesting_contigs "1,2" \
|
||||
--disable_filters blacklist
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
|
||||
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
|
||||
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
|
||||
|
||||
echo "> Test successful"
|
||||
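The test above repeats the same existence and non-emptiness checks for each output file in both runs. One possible way to factor them out is sketched below; `assert_file_not_empty` is a hypothetical helper and is not part of the committed `test.sh`:

```bash
#!/bin/bash

# Hypothetical helper: succeed only if the file exists and has a size greater than zero.
assert_file_not_empty() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "Output file $file does not exist"
    return 1
  fi
  if [ ! -s "$file" ]; then
    echo "Output file $file is empty"
    return 1
  fi
}
```

In a script running under `set -e`, a plain call such as `assert_file_not_empty "fusions.tsv"` would abort the test on failure, mirroring the `&& exit 1` pattern used above.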
BIN
src/arriba/test_data/A.bam
Normal file
Binary file not shown.
6
src/arriba/test_data/annotation.gtf
Normal file
@@ -0,0 +1,6 @@
|
||||
1 havana gene 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; gene_name "A"; gene_source "havana"; gene_biotype "gene";
|
||||
1 havana transcript 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
|
||||
1 havana exon 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; exon_number "1"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000000"; exon_version "1"; tag "basic"; transcript_support_level "1";
|
||||
2 havana gene 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; gene_name "B"; gene_source "havana"; gene_biotype "gene";
|
||||
2 havana transcript 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
|
||||
2 havana exon 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; exon_number "1"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000001"; exon_version "1"; tag "basic"; transcript_support_level "1";
|
||||
0
src/arriba/test_data/blacklist.tsv
Normal file
|
|
4
src/arriba/test_data/genome.fasta
Normal file
@@ -0,0 +1,4 @@
|
||||
>1
|
||||
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
|
||||
>2
|
||||
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
|
||||
10
src/arriba/test_data/script.sh
Executable file
@@ -0,0 +1,10 @@
|
||||
# arriba test data
|
||||
|
||||
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/arriba/test
|
||||
|
||||
if [ ! -d /tmp/snakemake-wrappers ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
|
||||
fi
|
||||
|
||||
cp -r /tmp/snakemake-wrappers/bio/arriba/test/* src/arriba/test_data
|
||||
|
||||
159
src/bcl_convert/config.vsh.yaml
Normal file
@@ -0,0 +1,159 @@
|
||||
name: bcl_convert
|
||||
description: |
|
||||
Convert bcl files to fastq files using bcl-convert.
|
||||
Information about upgrading from bcl2fastq via
|
||||
[Upgrading from bcl2fastq to BCL Convert](https://emea.support.illumina.com/bulletins/2020/10/upgrading-from-bcl2fastq-to-bcl-convert.html)
|
||||
and [BCL Convert Compatible Products](https://support.illumina.com/sequencing/sequencing_software/bcl-convert/compatibility.html)
|
||||
argument_groups:
|
||||
- name: Input arguments
|
||||
arguments:
|
||||
- name: "--bcl_input_directory"
|
||||
alternatives: ["-i"]
|
||||
type: file
|
||||
required: true
|
||||
description: Input run directory
|
||||
example: bcl_dir
|
||||
- name: "--sample_sheet"
|
||||
alternatives: ["-s"]
|
||||
type: file
|
||||
description: Path to SampleSheet.csv file (default searched for in --bcl_input_directory)
|
||||
example: bcl_dir/sample_sheet.csv
|
||||
- name: --run_info
|
||||
type: file
|
||||
description: Path to RunInfo.xml file (default root of BCL input directory)
|
||||
example: bcl_dir/RunInfo.xml
|
||||
|
||||
- name: Lane and tile settings
|
||||
arguments:
|
||||
- name: "--bcl_only_lane"
|
||||
type: integer
|
||||
description: Convert only specified lane number (default all lanes)
|
||||
example: 1
|
||||
- name: --first_tile_only
|
||||
type: boolean
|
||||
description: Only convert first tile of input (for testing & debugging)
|
||||
example: true
|
||||
- name: --tiles
|
||||
type: string
|
||||
description: Process only a subset of tiles by a regular expression
|
||||
example: "s_[0-9]+_1"
|
||||
- name: --exclude_tiles
|
||||
type: string
|
||||
description: Exclude set of tiles by a regular expression
|
||||
example: "s_[0-9]+_1"
|
||||
|
||||
- name: Resource arguments
|
||||
arguments:
|
||||
- name: --shared_thread_odirect_output
|
||||
type: boolean
|
||||
description: Use linux native asynchronous io (io_submit) for file output (Default=false)
|
||||
example: true
|
||||
- name: --bcl_num_parallel_tiles
|
||||
type: integer
|
||||
description: "\\# of tiles to process in parallel (default 1)"
|
||||
example: 1
|
||||
- name: --bcl_num_conversion_threads
|
||||
type: integer
|
||||
description: "\\# of threads for conversion (per tile, default # cpu threads)"
|
||||
example: 1
|
||||
- name: --bcl_num_compression_threads
|
||||
type: integer
|
||||
description: "\\# of threads for fastq.gz output compression (per tile, default # cpu threads, or HW+12)"
|
||||
example: 1
|
||||
- name: --bcl_num_decompression_threads
|
||||
type: integer
|
||||
description:
|
||||
"\\# of threads for bcl/cbcl input decompression (per tile, default half # cpu threads, or HW+8).
|
||||
Only applies when preloading files"
|
||||
example: 1
|
||||
|
||||
- name: Run arguments
|
||||
arguments:
|
||||
- name: --bcl_only_matched_reads
|
||||
type: boolean
|
||||
description: For pure BCL conversion, do not output files for 'Undetermined' [unmatched] reads (output by default)
|
||||
example: true
|
||||
- name: --no_lane_splitting
|
||||
type: boolean
|
||||
description: Do not split FASTQ file by lane (false by default)
|
||||
example: true
|
||||
- name: --num_unknown_barcodes_reported
|
||||
type: integer
|
||||
description: "\\# of Top Unknown Barcodes to output (1000 by default)"
|
||||
example: 1000
|
||||
- name: --bcl_validate_sample_sheet_only
|
||||
type: boolean
|
||||
description: Only validate RunInfo.xml & SampleSheet files (produce no FASTQ files)
|
||||
example: true
|
||||
- name: --strict_mode
|
||||
type: boolean
|
||||
description: Abort if any files are missing (false by default)
|
||||
example: true
|
||||
- name: --sample_name_column_enabled
|
||||
type: boolean
|
||||
description: Use sample sheet 'Sample_Name' column when naming fastq files & subdirectories
|
||||
example: true
|
||||
|
||||
- name: Output arguments
|
||||
arguments:
|
||||
- name: "--output_directory"
|
||||
alternatives: ["-o"]
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: Output directory containing fastq files
|
||||
example: fastq_dir
|
||||
- name: --bcl_sampleproject_subdirectories
|
||||
type: boolean
|
||||
description: Output to subdirectories based upon sample sheet 'Sample_Project' column
|
||||
example: true
|
||||
- name: --fastq_gzip_compression_level
|
||||
type: integer
|
||||
description: Set fastq output compression level 0-9 (default 1)
|
||||
example: 1
|
||||
- name: "--reports"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
description: Reports directory
|
||||
example: reports_dir
|
||||
- name: "--logs"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
description: Logs directory
|
||||
example: logs_dir
|
||||
|
||||
# bcl-convert arguments not taken into account
|
||||
# --force
|
||||
# --output-legacy-stats arg Also output stats in legacy (bcl2fastq2) format (false by default)
|
||||
# --no-sample-sheet arg Enable legacy no-sample-sheet operation (No demux or trimming. No settings
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:trixie-slim
|
||||
# https://support.illumina.com/sequencing/sequencing_software/bcl-convert/downloads.html
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [wget, gdb, which, hostname, alien, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
wget https://s3.amazonaws.com/webdata.illumina.com/downloads/software/bcl-convert/bcl-convert-4.2.7-2.el8.x86_64.rpm -O /tmp/bcl-convert.rpm && \
|
||||
alien -i /tmp/bcl-convert.rpm && \
|
||||
rm -rf /var/lib/apt/lists/* && \
|
||||
rm /tmp/bcl-convert.rpm
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bcl-convert: \"$(bcl-convert -V 2>&1 >/dev/null | sed -n '/Version/ s/^bcl-convert\ Version //p')\"" > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
38
src/bcl_convert/help.txt
Normal file
@@ -0,0 +1,38 @@
|
||||
bcl-convert Version 00.000.000.4.2.7
|
||||
Copyright (c) 2014-2022 Illumina, Inc.
|
||||
|
||||
Run BCL Conversion (BCL directory to *.fastq.gz)
|
||||
bcl-convert --bcl-input-directory <BCL_ROOT_DIR> --output-directory <PATH> [options]
|
||||
|
||||
Options:
|
||||
-h [ --help ] Print this help message
|
||||
-V [ --version ] Print the version and exit
|
||||
--output-directory arg Output BCL directory for BCL conversion (must be specified)
|
||||
-f [ --force ] Force: allow destination diretory to already exist
|
||||
--bcl-input-directory arg Input BCL directory for BCL conversion (must be specified)
|
||||
--sample-sheet arg Path to SampleSheet.csv file (default searched for in --bcl-input-directory)
|
||||
--bcl-only-lane arg Convert only specified lane number (default all lanes)
|
||||
--strict-mode arg Abort if any files are missing (false by default)
|
||||
--first-tile-only arg Only convert first tile of input (for testing & debugging)
|
||||
--tiles arg Process only a subset of tiles by a regular expression
|
||||
--exclude-tiles arg Exclude set of tiles by a regular expression
|
||||
--bcl-sampleproject-subdirectories arg Output to subdirectories based upon sample sheet 'Sample_Project' column
|
||||
--sample-name-column-enabled arg Use sample sheet 'Sample_Name' column when naming fastq files & subdirectories
|
||||
--fastq-gzip-compression-level arg Set fastq output compression level 0-9 (default 1)
|
||||
--shared-thread-odirect-output arg Use linux native asynchronous io (io_submit) for file output (Default=false)
|
||||
--bcl-num-parallel-tiles arg # of tiles to process in parallel (default 1)
|
||||
--bcl-num-conversion-threads arg # of threads for conversion (per tile, default # cpu threads)
|
||||
--bcl-num-compression-threads arg # of threads for fastq.gz output compression (per tile, default # cpu threads,
|
||||
or HW+12)
|
||||
--bcl-num-decompression-threads arg # of threads for bcl/cbcl input decompression (per tile, default half # cpu
|
||||
threads, or HW+8. Only applies when preloading files)
|
||||
--bcl-only-matched-reads arg For pure BCL conversion, do not output files for 'Undetermined' [unmatched]
|
||||
reads (output by default)
|
||||
--run-info arg Path to RunInfo.xml file (default root of BCL input directory)
|
||||
--no-lane-splitting arg Do not split FASTQ file by lane (false by default)
|
||||
--num-unknown-barcodes-reported arg # of Top Unknown Barcodes to output (1000 by default)
|
||||
--bcl-validate-sample-sheet-only arg Only validate RunInfo.xml & SampleSheet files (produce no FASTQ files)
|
||||
--output-legacy-stats arg Also output stats in legacy (bcl2fastq2) format (false by default)
|
||||
--no-sample-sheet arg Enable legacy no-sample-sheet operation (No demux or trimming. No settings
|
||||
supported. False by default, not recommended
|
||||
|
||||
40
src/bcl_convert/script.sh
Normal file
@@ -0,0 +1,40 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
$(which bcl-convert) \
|
||||
--bcl-input-directory "$par_bcl_input_directory" \
|
||||
--output-directory "$par_output_directory" \
|
||||
${par_sample_sheet:+ --sample-sheet "$par_sample_sheet"} \
|
||||
${par_run_info:+ --run-info "$par_run_info"} \
|
||||
${par_bcl_only_lane:+ --bcl-only-lane "$par_bcl_only_lane"} \
|
||||
${par_first_tile_only:+ --first-tile-only "$par_first_tile_only"} \
|
||||
${par_tiles:+ --tiles "$par_tiles"} \
|
||||
${par_exclude_tiles:+ --exclude-tiles "$par_exclude_tiles"} \
|
||||
${par_shared_thread_odirect_output:+ --shared-thread-odirect-output "$par_shared_thread_odirect_output"} \
|
||||
${par_bcl_num_parallel_tiles:+ --bcl-num-parallel-tiles "$par_bcl_num_parallel_tiles"} \
|
||||
${par_bcl_num_conversion_threads:+ --bcl-num-conversion-threads "$par_bcl_num_conversion_threads"} \
|
||||
${par_bcl_num_compression_threads:+ --bcl-num-compression-threads "$par_bcl_num_compression_threads"} \
|
||||
${par_bcl_num_decompression_threads:+ --bcl-num-decompression-threads "$par_bcl_num_decompression_threads"} \
|
||||
${par_bcl_only_matched_reads:+ --bcl-only-matched-reads "$par_bcl_only_matched_reads"} \
|
||||
${par_no_lane_splitting:+ --no-lane-splitting "$par_no_lane_splitting"} \
|
||||
${par_num_unknown_barcodes_reported:+ --num-unknown-barcodes-reported "$par_num_unknown_barcodes_reported"} \
|
||||
${par_bcl_validate_sample_sheet_only:+ --bcl-validate-sample-sheet-only "$par_bcl_validate_sample_sheet_only"} \
|
||||
${par_strict_mode:+ --strict-mode "$par_strict_mode"} \
|
||||
${par_sample_name_column_enabled:+ --sample-name-column-enabled "$par_sample_name_column_enabled"} \
|
||||
${par_bcl_sampleproject_subdirectories:+ --bcl-sampleproject-subdirectories "$par_bcl_sampleproject_subdirectories"} \
|
||||
${par_fastq_gzip_compression_level:+ --fastq-gzip-compression-level "$par_fastq_gzip_compression_level"}
|
||||
|
||||
if [ ! -z "$par_reports" ]; then
|
||||
echo "Moving reports to their own location"
|
||||
mv "${par_output_directory}/Reports" "$par_reports"
|
||||
else
|
||||
echo "Leaving reports alone"
|
||||
fi
|
||||
|
||||
if [ ! -z "$par_logs" ]; then
|
||||
echo "Moving logs to their own location"
|
||||
mv "${par_output_directory}/Logs" "$par_logs"
|
||||
else
|
||||
echo "Leaving logs alone"
|
||||
fi
|
||||
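The wrapper script above relies on the `${var:+word}` expansion so that an optional flag is only passed when its parameter is set and non-empty. A minimal sketch of that idiom follows; `show_args` is a hypothetical helper introduced purely for illustration, and the parameter values are made up rather than taken from a real run.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper (not part of the component): print each received
# argument on its own line, wrapped in brackets.
show_args() {
  printf '[%s]\n' "$@"
}

par_tiles="s_[0-9]+_1"   # set: the flag and its value are both emitted
par_run_info=""          # empty: the whole expansion disappears

# Same pattern as in the wrapper: the expansion is left unquoted so that it
# can vanish entirely, while the value inside it stays quoted.
optional_args_output="$(show_args \
  ${par_tiles:+ --tiles "$par_tiles"} \
  ${par_run_info:+ --run-info "$par_run_info"})"

echo "$optional_args_output"
```

Only `--tiles` and its value reach the helper here. Because the inner `"$par_tiles"` is quoted, the value is passed as a single argument even if it contains spaces or glob characters.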
70
src/bcl_convert/test.sh
Normal file
@@ -0,0 +1,70 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Tests are sourced from:
|
||||
# https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/inputs/cr-direct-demultiplexing-bcl-convert
|
||||
# Test input files are fetched from:
|
||||
# https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/iseq-DI.tar.gz
|
||||
# https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/bcl_convert_samplesheet.csv
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
echo ">> Fetching and preparing test data"
|
||||
data_src="https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/iseq-DI.tar.gz"
|
||||
sample_sheet_src="https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/bcl_convert_samplesheet.csv"
|
||||
test_data_dir="test_data"
|
||||
|
||||
mkdir $test_data_dir
|
||||
wget -q $data_src -O $test_data_dir/data.tar.gz
|
||||
wget -q $sample_sheet_src -O $test_data_dir/sample_sheet.csv
|
||||
tar xzf $test_data_dir/data.tar.gz -C $test_data_dir
|
||||
rm $test_data_dir/data.tar.gz
|
||||
|
||||
echo ">> Execute and verify output"
|
||||
|
||||
$meta_executable \
|
||||
--bcl_input_directory "$test_data_dir/iseq-DI" \
|
||||
--sample_sheet "$test_data_dir/sample_sheet.csv" \
|
||||
--output_directory fastq \
|
||||
--reports reports \
|
||||
--logs logs
|
||||
|
||||
echo ">>> Checking whether the output dir exists"
|
||||
[[ ! -d fastq ]] && echo "Output dir could not be found!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output fastq files are created"
|
||||
[[ ! -f fastq/Undetermined_S0_L001_R1_001.fastq.gz ]] && echo "Output fastq files could not be found!" && exit 1
|
||||
[[ ! -f fastq/iseq-DI_S1_L001_R1_001.fastq.gz ]] && echo "Output fastq files could not be found!" && exit 1
|
||||
|
||||
echo ">>> Checking whether the report dir exists"
|
||||
[[ ! -d reports ]] && echo "Reports dir could not be found!" && exit 1
|
||||
|
||||
echo ">>> Checking whether the log dir exists"
|
||||
[[ ! -d logs ]] && echo "Logs dir could not be found!" && exit 1
|
||||
|
||||
# print final message
|
||||
echo ">>> Test finished successfully"
|
||||
|
||||
echo ">> Execute with additional arguments and verify output"
|
||||
|
||||
$meta_executable \
|
||||
--bcl_input_directory "$test_data_dir/iseq-DI" \
|
||||
--sample_sheet "$test_data_dir/sample_sheet.csv" \
|
||||
--output_directory fastq1 \
|
||||
--bcl_only_matched_reads true \
|
||||
--bcl_num_compression_threads 1 \
|
||||
--no_lane_splitting false \
|
||||
--fastq_gzip_compression_level 9
|
||||
|
||||
echo ">> Checking whether the output dir exists"
|
||||
[[ ! -d fastq1 ]] && echo "Output dir could not be found!" && exit 1
|
||||
|
||||
echo ">> Checking whether output fastq files are created"
|
||||
[[ -f fastq1/Undetermined_S0_L001_R1_001.fastq.gz ]] && echo "Undetermined should not be generated!" && exit 1
|
||||
[[ ! -f fastq1/iseq-DI_S1_L001_R1_001.fastq.gz ]] && echo "Output fastq files could not be found!" && exit 1
|
||||
|
||||
# print final message
|
||||
echo ">> Test finished successfully"
|
||||
|
||||
# do not remove this
|
||||
# as otherwise your test might exit with a different exit code
|
||||
exit 0
|
||||
103
src/bedtools/bedtools_getfasta/config.vsh.yaml
Normal file
@@ -0,0 +1,103 @@
|
||||
name: bedtools_getfasta
|
||||
namespace: bedtools
|
||||
description: Extract sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.
|
||||
keywords: [sequencing, fasta, BED, GFF, VCF]
|
||||
links:
|
||||
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
|
||||
repository: https://github.com/arq5x/bedtools2
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btq033
|
||||
license: GPL-2.0
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
|
||||
argument_groups:
|
||||
- name: Input arguments
|
||||
arguments:
|
||||
- name: --input_fasta
|
||||
type: file
|
||||
description: |
|
||||
FASTA file containing sequences for each interval specified in the input BED file.
|
||||
The headers in the input FASTA file must exactly match the chromosome column in the BED file.
|
||||
- name: "--input_bed"
|
||||
type: file
|
||||
description: |
|
||||
BED file containing intervals to extract from the FASTA file.
|
||||
BED files containing a single region require a newline character
|
||||
at the end of the line, otherwise a blank output file is produced.
|
||||
- name: --rna
|
||||
type: boolean_true
|
||||
description: |
|
||||
The FASTA is RNA not DNA. Reverse complementation handled accordingly.
|
||||
|
||||
- name: Run arguments
|
||||
arguments:
|
||||
- name: "--strandedness"
|
||||
type: boolean_true
|
||||
alternatives: ["-s"]
|
||||
description: |
|
||||
Force strandedness. If the feature occupies the antisense strand, the output sequence will
|
||||
be reverse complemented. By default strandedness is not taken into account.
|
||||
|
||||
- name: Output arguments
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: [-o]
|
||||
required: true
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Output file where the output from the 'bedtools getfasta' command will
|
||||
be written to.
|
||||
- name: --tab
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report extracted sequences in a tab-delimited format instead of in FASTA format.
|
||||
- name: --bed_out
|
||||
type: boolean_true
|
||||
description: |
|
||||
Report extracted sequences in a tab-delimited BED format instead of in FASTA format.
|
||||
- name: "--name"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Set the FASTA header for each extracted sequence to be the "name" and coordinate columns from the BED feature.
|
||||
- name: "--name_only"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Set the FASTA header for each extracted sequence to be the "name" column from the BED feature.
|
||||
- name: "--split"
|
||||
type: boolean_true
|
||||
description: |
|
||||
When --input_bed is in BED12 format, create a separate fasta entry for each block in a BED12 record,
|
||||
blocks being described in the 11th and 12th column of the BED.
|
||||
- name: "--full_header"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use full fasta header. By default, only the word before the first space or tab is used.
|
||||
|
||||
# Arguments not taken into account:
|
||||
#
|
||||
# -fo [Specify an output file name. By default, output goes to stdout.
|
||||
#
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:stable-slim
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [bedtools, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
22
src/bedtools/bedtools_getfasta/script.sh
Normal file
@@ -0,0 +1,22 @@
|
||||
#!/usr/bin/env bash
|
||||
set -eo pipefail
|
||||
|
||||
unset_if_false=( par_rna par_strandedness par_tab par_bed_out par_name par_name_only par_split par_full_header )
|
||||
|
||||
for par in "${unset_if_false[@]}"; do
|
||||
test_val="${!par}"
|
||||
[[ "$test_val" == "false" ]] && unset "$par"
|
||||
done
|
||||
|
||||
bedtools getfasta \
|
||||
-fi "$par_input_fasta" \
|
||||
-bed "$par_input_bed" \
|
||||
${par_rna:+-rna} \
|
||||
${par_name:+-name} \
|
||||
${par_name_only:+-nameOnly} \
|
||||
${par_tab:+-tab} \
|
||||
${par_bed_out:+-bedOut} \
|
||||
${par_strandedness:+-s} \
|
||||
${par_split:+-split} \
|
||||
${par_full_header:+-fullHeader} > "$par_output"
|
||||
|
||||
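The script above first unsets every boolean parameter whose value is the string `false`, so that a later `${par:+-flag}` expansion emits nothing for it. The sketch below walks through that two-step pattern with two made-up parameter values; it uses an explicit `if` instead of the `&&` short-circuit, which is a stylistic choice for the example rather than a change to the component.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Illustrative values only: one flag switched off, one switched on.
par_rna="false"
par_tab="true"

unset_if_false=( par_rna par_tab )

for par in "${unset_if_false[@]}"; do
  # ${!par} is bash indirect expansion: the value of the variable named by $par.
  test_val="${!par}"
  if [[ "$test_val" == "false" ]]; then
    unset "$par"
  fi
done

# Unset variables expand to nothing, so only -tab survives.
flags=( ${par_rna:+-rna} ${par_tab:+-tab} )

echo "flags: ${flags[*]}"
```

Without the unset step, the string `false` would still count as non-empty and `-rna` would be passed to the tool.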
119
src/bedtools/bedtools_getfasta/test.sh
Normal file
@@ -0,0 +1,119 @@
|
||||
#!/usr/bin/env bash
|
||||
set -eo pipefail
|
||||
|
||||
TMPDIR=$(mktemp -d)
|
||||
function clean_up {
|
||||
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
# Create dummy test fasta file
|
||||
cat > "$TMPDIR/test.fa" <<EOF
|
||||
>chr1
|
||||
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
|
||||
EOF
|
||||
|
||||
TAB="$(printf '\t')"
|
||||
|
||||
# Create dummy bed file
|
||||
cat > "$TMPDIR/test.bed" <<EOF
|
||||
chr1${TAB}5${TAB}10${TAB}myseq
|
||||
EOF
|
||||
|
||||
# Create expected FASTA file
|
||||
cat > "$TMPDIR/expected.fasta" <<EOF
|
||||
>chr1:5-10
|
||||
AAACC
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--output "$TMPDIR/output.fasta"
|
||||
|
||||
cmp --silent "$TMPDIR/output.fasta" "$TMPDIR/expected.fasta" || { echo "files are different:"; exit 1; }
|
||||
|
||||
|
||||
# Create expected FASTA file for --name
|
||||
cat > "$TMPDIR/expected_with_name.fasta" <<EOF
|
||||
>myseq::chr1:5-10
|
||||
AAACC
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--name \
|
||||
--output "$TMPDIR/output_with_name.fasta"
|
||||
|
||||
|
||||
cmp --silent "$TMPDIR/output_with_name.fasta" "$TMPDIR/expected_with_name.fasta" || { echo "Files when using --name are different."; exit 1; }
|
||||
|
||||
# Create expected FASTA file for --name_only
|
||||
cat > "$TMPDIR/expected_with_name_only.fasta" <<EOF
|
||||
>myseq
|
||||
AAACC
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--name_only \
|
||||
--output "$TMPDIR/output_with_name_only.fasta"
|
||||
|
||||
cmp --silent "$TMPDIR/output_with_name_only.fasta" "$TMPDIR/expected_with_name_only.fasta" || { echo "Files when using --name_only are different."; exit 1; }
|
||||
|
||||
|
||||
# Create expected tab-delimited file for --tab
|
||||
cat > "$TMPDIR/expected_tab.out" <<EOF
|
||||
myseq${TAB}AAACC
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--name_only \
|
||||
--tab \
|
||||
--output "$TMPDIR/tab.out"
|
||||
|
||||
cmp --silent "$TMPDIR/expected_tab.out" "$TMPDIR/tab.out" || { echo "Files when using --tab are different."; exit 1; }
|
||||
|
||||
|
||||
# Create expected tab-delimited file for --bed_out
|
||||
cat > "$TMPDIR/expected.bed" <<EOF
|
||||
chr1${TAB}5${TAB}10${TAB}myseq${TAB}AAACC
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
--bed_out \
|
||||
--output "$TMPDIR/output.bed"
|
||||
|
||||
|
||||
cmp --silent "$TMPDIR/expected.bed" "$TMPDIR/output.bed" || { echo "Files when using --bed_out are different."; exit 1; }
|
||||
|
||||
# Create dummy bed file for strandedness
|
||||
cat > "$TMPDIR/test_strandedness.bed" <<EOF
|
||||
chr1${TAB}20${TAB}25${TAB}forward${TAB}1${TAB}+
|
||||
chr1${TAB}20${TAB}25${TAB}reverse${TAB}1${TAB}-
|
||||
EOF
|
||||
|
||||
# Create expected FASTA file for -s (strandedness)
|
||||
cat > "$TMPDIR/expected_strandedness.fasta" <<EOF
|
||||
>forward(+)
|
||||
CGCTA
|
||||
>reverse(-)
|
||||
TAGCG
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--input_bed "$TMPDIR/test_strandedness.bed" \
|
||||
--input_fasta "$TMPDIR/test.fa" \
|
||||
-s \
|
||||
--name_only \
|
||||
--output "$TMPDIR/output_strandedness.fasta"
|
||||
|
||||
|
||||
cmp --silent "$TMPDIR/expected_strandedness.fasta" "$TMPDIR/output_strandedness.fasta" || { echo "Files when using -s are different."; exit 1; }
|
||||
|
||||
47
src/busco/busco_download_datasets/config.vsh.yaml
Normal file
@@ -0,0 +1,47 @@
|
||||
name: busco_download_datasets
|
||||
namespace: busco
|
||||
description: Downloads available busco datasets
|
||||
keywords: [lineage datasets]
|
||||
links:
|
||||
homepage: https://busco.ezlab.org/
|
||||
documentation: https://busco.ezlab.org/busco_userguide.html
|
||||
repository: https://gitlab.com/ezlab/busco
|
||||
references:
|
||||
doi: 10.1007/978-1-4939-9173-0_14
|
||||
license: MIT
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --download
|
||||
type: string
|
||||
description: |
|
||||
Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus".
|
||||
The full list of available datasets can be viewed [here](https://busco-data.ezlab.org/v5/data/lineages/) or by running the busco/busco_list_datasets component.
|
||||
required: true
|
||||
example: stramenopiles_odb10
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --download_path
|
||||
direction: output
|
||||
type: file
|
||||
description: |
|
||||
Local filepath for storing BUSCO dataset downloads
|
||||
required: false
|
||||
default: busco_downloads
|
||||
example: busco_downloads
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/busco:5.7.1--pyhdfd78af_0
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
busco --version | sed 's/BUSCO\s\(.*\)/busco: "\1"/' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
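The Docker setup step in this config records the tool version by rewriting the output of `busco --version` with `sed`. The sketch below replays that substitution on a simulated line so it can be tried without BUSCO installed. It assumes the command prints a single line of the form `BUSCO <version>`, and it needs GNU sed because `\s` is a GNU extension.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Assumption: `busco --version` prints one line such as "BUSCO 5.7.1".
# The line is simulated here instead of calling the real tool.
simulated_version_line="BUSCO 5.7.1"

# Same substitution as in the setup step: capture everything after
# "BUSCO" plus one whitespace character and wrap it as a quoted YAML value.
version_entry="$(echo "$simulated_version_line" | sed 's/BUSCO\s\(.*\)/busco: "\1"/')"

echo "$version_entry"
```

In the component the result is redirected to `/var/software_versions.txt` rather than printed.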
14
src/busco/busco_download_datasets/script.sh
Normal file
@@ -0,0 +1,14 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
|
||||
if [ ! -d "$par_download_path" ]; then
|
||||
mkdir -p "$par_download_path"
|
||||
fi
|
||||
|
||||
busco \
|
||||
--download_path "$par_download_path" \
|
||||
--download "$par_download"
|
||||
|
||||
15
src/busco/busco_download_datasets/test.sh
Normal file
@@ -0,0 +1,15 @@
|
||||
echo "> Downloading busco stramenopiles_odb10 dataset"
|
||||
|
||||
"$meta_executable" \
|
||||
--download stramenopiles_odb10 \
|
||||
--download_path downloads
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "downloads/file_versions.tsv" ] && echo "file_versions.tsv does not exist" && exit 1
|
||||
[ ! -f "downloads/lineages/stramenopiles_odb10/dataset.cfg" ] && echo "dataset.cfg does not exist" && exit 1
|
||||
|
||||
echo ">> Checking if output is empty"
|
||||
[ ! -s "downloads/file_versions.tsv" ] && echo "file_versions.tsv is empty" && exit 1
|
||||
[ ! -s "downloads/lineages/stramenopiles_odb10/dataset.cfg" ] && echo "dataset.cfg is empty" && exit 1
|
||||
|
||||
rm -r downloads
|
||||
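The test above checks each expected file twice, once for existence (`-f`) and once for non-zero size (`-s`). One possible way to fold both checks into a single call is sketched below. `assert_file_not_empty` is a hypothetical helper that does not exist in the repository, and the files are created in a temporary directory purely for demonstration.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper (not part of the original test script):
# returns 1 with a message if the file is missing or empty.
assert_file_not_empty() {
  local path="$1"
  if [ ! -f "$path" ]; then
    echo "$path does not exist"
    return 1
  fi
  if [ ! -s "$path" ]; then
    echo "$path is empty"
    return 1
  fi
}

tmp_dir="$(mktemp -d)"
printf 'dataset\tversion\n' > "$tmp_dir/file_versions.tsv"   # non-empty file
: > "$tmp_dir/empty.tsv"                                     # zero-byte file

filled_status=0
assert_file_not_empty "$tmp_dir/file_versions.tsv" || filled_status=$?
empty_status=0
assert_file_not_empty "$tmp_dir/empty.tsv" || empty_status=$?
missing_status=0
assert_file_not_empty "$tmp_dir/missing.tsv" || missing_status=$?

rm -rf "$tmp_dir"
```

In a real test script the helper's non-zero return could be followed by `exit 1`, matching the behaviour of the existing one-line checks.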
39
src/busco/busco_list_datasets/config.vsh.yaml
Normal file
@@ -0,0 +1,39 @@
|
||||
name: busco_list_datasets
|
||||
namespace: busco
|
||||
description: Lists the available busco datasets
|
||||
keywords: [lineage datasets]
|
||||
links:
|
||||
homepage: https://busco.ezlab.org/
|
||||
documentation: https://busco.ezlab.org/busco_userguide.html
|
||||
repository: https://gitlab.com/ezlab/busco
|
||||
references:
|
||||
doi: 10.1007/978-1-4939-9173-0_14
|
||||
license: MIT
|
||||
argument_groups:
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: ["-o"]
|
||||
direction: output
|
||||
type: file
|
||||
description: |
|
||||
Output file of the available busco datasets
|
||||
required: false
|
||||
default: busco_dataset_list.txt
|
||||
example: file.txt
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/busco:5.7.1--pyhdfd78af_0
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
busco --version | sed 's/BUSCO\s\(.*\)/busco: "\1"/' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
6
src/busco/busco_list_datasets/script.sh
Normal file
@@ -0,0 +1,6 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
busco --list-datasets | awk '/^#{40}/{flag=1; next} flag{print}' > "$par_output"
|
||||
15
src/busco/busco_list_datasets/test.sh
Normal file
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
"$meta_executable" \
|
||||
--output datasets.txt
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "datasets.txt" ] && echo "datasets.txt does not exist" && exit 1
|
||||
|
||||
echo ">> Checking if output is empty"
|
||||
[ ! -s "datasets.txt" ] && echo "datasets.txt is empty" && exit 1
|
||||
|
||||
rm datasets.txt
|
||||
218
src/busco/busco_run/config.vsh.yaml
Normal file
@@ -0,0 +1,218 @@
|
||||
name: busco_run
|
||||
namespace: busco
|
||||
description: Assessment of genome assembly and annotation completeness with single copy orthologs
|
||||
keywords: [Genome assembly, quality control]
|
||||
links:
|
||||
homepage: https://busco.ezlab.org/
|
||||
documentation: https://busco.ezlab.org/busco_userguide.html
|
||||
repository: https://gitlab.com/ezlab/busco
|
||||
references:
|
||||
doi: 10.1007/978-1-4939-9173-0_14
|
||||
license: MIT
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: ["-i"]
|
||||
type: file
|
||||
description: |
|
||||
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
|
||||
required: true
|
||||
example: file.fasta
|
||||
- name: --mode
|
||||
alternatives: ["-m"]
|
||||
type: string
|
||||
choices: [genome, geno, transcriptome, tran, proteins, prot]
|
||||
required: true
|
||||
description: |
|
||||
Specify which BUSCO analysis mode to run. There are three valid modes:
|
||||
- geno or genome, for genome assemblies (DNA)
|
||||
- tran or transcriptome, for transcriptome assemblies (DNA)
|
||||
- prot or proteins, for annotated gene sets (protein)
|
||||
example: proteins
|
||||
- name: --lineage_dataset
|
||||
alternatives: ["-l"]
|
||||
type: string
|
||||
required: false
|
||||
description: |
|
||||
Specify a BUSCO lineage dataset that is most closely related to the assembly or gene set being assessed.
|
||||
The full list of available datasets can be viewed [here](https://busco-data.ezlab.org/v5/data/lineages/) or by running the busco/busco_list_datasets component.
|
||||
When unsure, the "--auto_lineage" flag can be set to automatically find the optimal lineage path.
|
||||
BUSCO will automatically download the requested dataset if it is not already present in the download folder.
|
||||
You can optionally provide a path to a local dataset instead of a name, e.g. path/to/dataset.
|
||||
Datasets can be downloaded using the busco/busco_download_datasets component.
|
||||
example: stramenopiles_odb10
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --short_summary_json
|
||||
required: false
|
||||
direction: output
|
||||
type: file
|
||||
example: short_summary.json
|
||||
description: |
|
||||
Output file for short summary in JSON format.
|
||||
- name: --short_summary_txt
|
||||
required: false
|
||||
direction: output
|
||||
type: file
|
||||
example: short_summary.txt
|
||||
description: |
|
||||
Output file for short summary in TXT format.
|
||||
- name: --full_table
|
||||
required: false
|
||||
direction: output
|
||||
type: file
|
||||
example: full_table.tsv
|
||||
description: |
|
||||
Full table output in TSV format.
|
||||
- name: --missing_busco_list
|
||||
required: false
|
||||
direction: output
|
||||
type: file
|
||||
example: missing_busco_list.tsv
|
||||
description: |
|
||||
Missing list output in TSV format.
|
||||
- name: --output_dir
|
||||
required: false
|
||||
direction: output
|
||||
type: file
|
||||
example: output_dir/
|
||||
description: |
|
||||
The full output directory, if so desired.
|
||||
|
||||
- name: Resource and Run Settings
|
||||
arguments:
|
||||
- name: --force
|
||||
type: boolean_true
|
||||
description: |
|
||||
Force rewriting of existing files. Must be used when output files with the provided name already exist.
|
||||
- name: --quiet
|
||||
alternatives: ["-q"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable the info logs, displays only errors.
|
||||
- name: --restart
|
||||
alternatives: ["-r"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Continue a run that had already partially completed. Restarting skips calls to tools that have completed but performs all pre- and post-processing steps.
|
||||
- name: --tar
|
||||
type: boolean_true
|
||||
description: |
|
||||
Compress some subdirectories with many files to save space.
|
||||
|
||||
- name: Lineage Dataset Settings
|
||||
arguments:
|
||||
- name: --auto_lineage
|
||||
type: boolean_true
|
||||
description: |
|
||||
Run auto-lineage pipeline to automatically determine the BUSCO lineage dataset that is most closely related to the assembly or gene set being assessed.
|
||||
- name: --auto_lineage_euk
|
||||
type: boolean_true
|
||||
description: |
|
||||
Run auto-placement just on eukaryota tree to find optimal lineage path.
|
||||
- name: --auto_lineage_prok
|
||||
type: boolean_true
|
||||
description: |
|
||||
Run auto_lineage just on prokaryota trees to find optimum lineage path.
|
||||
- name: --datasets_version
|
||||
type: string
|
||||
required: false
|
||||
description: |
|
||||
Specify the version of BUSCO datasets
|
||||
example: odb10
|
||||
|
||||
- name: Augustus Settings
|
||||
arguments:
|
||||
- name: --augustus
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use augustus gene predictor for eukaryote runs.
|
||||
- name: --augustus_parameters
|
||||
type: string
|
||||
required: false
|
||||
description: |
|
||||
Additional parameters to be passed to Augustus (see Augustus documentation: https://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md).
|
||||
Parameters should be contained within a single string, without whitespace and separated by commas.
|
||||
example: "--PARAM1=VALUE1,--PARAM2=VALUE2"
|
||||
- name: --augustus_species
|
||||
type: string
|
||||
required: false
|
||||
description: |
|
||||
Specify the augustus species
|
||||
- name: --long
|
||||
type: boolean_true
|
||||
description: |
|
||||
Optimize Augustus self-training mode. This adds considerably to the run time, but can improve results for some non-model organisms.
|
||||
|
||||
- name: BBTools Settings
|
||||
arguments:
|
||||
- name: --contig_break
|
||||
type: integer
|
||||
required: false
|
||||
description: |
|
||||
Number of contiguous Ns to signify a break between contigs in BBTools analysis.
|
||||
- name: --limit
|
||||
type: integer
|
||||
required: false
|
||||
description: |
|
||||
Number of candidate regions (contig or transcript) from the BLAST output to consider per BUSCO.
|
||||
This option is only effective in pipelines using BLAST, i.e. the genome pipeline (see --augustus) or the prokaryota transcriptome pipeline.
|
||||
- name: --scaffold_composition
|
||||
type: boolean_true
|
||||
description: |
|
||||
Writes ACGTN content per scaffold to a file scaffold_composition.txt.
|
||||
|
||||
- name: BLAST Settings
|
||||
arguments:
|
||||
- name: --e_value
|
||||
type: double
|
||||
required: false
|
||||
description: |
|
||||
E-value cutoff for BLAST searches.
|
||||
|
||||
- name: Protein Gene Prediction settings
|
||||
arguments:
|
||||
- name: --miniprot
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use Miniprot gene predictor.
|
||||
|
||||
- name: MetaEuk Settings
|
||||
arguments:
|
||||
- name: --metaeuk
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use Metaeuk gene predictor.
|
||||
- name: --metaeuk_parameters
|
||||
type: string
|
||||
description: |
|
||||
Pass additional arguments to Metaeuk for the first run (see Metaeuk documentation https://github.com/soedinglab/metaeuk).
|
||||
All parameters should be contained within a single string with no white space, with each parameter separated by a comma.
|
||||
example: "--max-overlap=15,--min-exon-aa=15"
|
||||
- name: --metaeuk_rerun_parameters
|
||||
type: string
|
||||
description: |
|
||||
Pass additional arguments to Metaeuk for the second run (see Metaeuk documentation https://github.com/soedinglab/metaeuk).
|
||||
All parameters should be contained within a single string with no white space, with each parameter separated by a comma.
|
||||
example: "--max-overlap=15,--min-exon-aa=15"
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/busco:5.7.1--pyhdfd78af_0
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
busco --version | sed 's/BUSCO\s\(.*\)/busco: "\1"/' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
63
src/busco/busco_run/help.txt
Normal file
@@ -0,0 +1,63 @@
```bash
busco -h
```

usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]

Welcome to BUSCO 5.7.1: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide. Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO

optional arguments:
  -i SEQUENCE_FILE, --in SEQUENCE_FILE
                        Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
  -o OUTPUT, --out OUTPUT
                        Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with --out_path.
  -m MODE, --mode MODE  Specify which BUSCO analysis mode to run.
                        There are three valid modes:
                        - geno or genome, for genome assemblies (DNA)
                        - tran or transcriptome, for transcriptome assemblies (DNA)
                        - prot or proteins, for annotated gene sets (protein)
  -l LINEAGE, --lineage_dataset LINEAGE
                        Specify the name of the BUSCO lineage to be used.
  --augustus            Use augustus gene predictor for eukaryote runs
  --augustus_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
                        Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --augustus_species AUGUSTUS_SPECIES
                        Specify a species for Augustus training.
  --auto-lineage        Run auto-lineage to find optimum lineage path
  --auto-lineage-euk    Run auto-placement just on eukaryote tree to find optimum lineage path
  --auto-lineage-prok   Run auto-lineage just on non-eukaryote trees to find optimum lineage path
  -c N, --cpu N         Specify the number (N=integer) of threads/cores to use.
  --config CONFIG_FILE  Provide a config file
  --contig_break n      Number of contiguous Ns to signify a break between contigs. Default is n=10.
  --datasets_version DATASETS_VERSION
                        Specify the version of BUSCO datasets, e.g. odb10
  --download [dataset [dataset ...]]
                        Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus". If used together with other command line arguments, make sure to place this last.
  --download_base_url DOWNLOAD_BASE_URL
                        Set the url to the remote BUSCO dataset location
  --download_path DOWNLOAD_PATH
                        Specify local filepath for storing BUSCO dataset downloads
  -e N, --evalue N      E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
  -f, --force           Force rewriting of existing files. Must be used when output files with the provided name already exist.
  -h, --help            Show this help message and exit
  --limit N             How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
  --list-datasets       Print the list of available BUSCO datasets
  --long                Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
  --metaeuk             Use Metaeuk gene predictor
  --metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
                        Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --metaeuk_rerun_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
                        Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --miniprot            Use Miniprot gene predictor
  --skip_bbtools        Skip BBTools for assembly statistics
  --offline             To indicate that BUSCO cannot attempt to download files
  --opt-out-run-stats   Opt out of data collection. Information on the data collected is available in the user guide.
  --out_path OUTPUT_PATH
                        Optional location for results folder, excluding results folder name. Default is current working directory.
  -q, --quiet           Disable the info logs, displays only errors
  -r, --restart         Continue a run that had already partially completed.
  --scaffold_composition
                        Writes ACGTN content per scaffold to a file scaffold_composition.txt
  --tar                 Compress some subdirectories with many files to save space
  -v, --version         Show this version and exit
72
src/busco/busco_run/script.sh
Normal file
@@ -0,0 +1,72 @@
#!/bin/bash

## VIASH START
## VIASH END


[[ "$par_tar" == "false" ]] && unset par_tar
[[ "$par_force" == "false" ]] && unset par_force
[[ "$par_quiet" == "false" ]] && unset par_quiet
[[ "$par_restart" == "false" ]] && unset par_restart
[[ "$par_auto_lineage" == "false" ]] && unset par_auto_lineage
[[ "$par_auto_lineage_euk" == "false" ]] && unset par_auto_lineage_euk
[[ "$par_auto_lineage_prok" == "false" ]] && unset par_auto_lineage_prok
[[ "$par_augustus" == "false" ]] && unset par_augustus
[[ "$par_long" == "false" ]] && unset par_long
[[ "$par_scaffold_composition" == "false" ]] && unset par_scaffold_composition
[[ "$par_miniprot" == "false" ]] && unset par_miniprot
[[ "$par_metaeuk" == "false" ]] && unset par_metaeuk

tmp_dir=$(mktemp -d -p "$meta_temp_dir" busco_XXXXXXXXX)
prefix=$(openssl rand -hex 8)

busco \
  --in "$par_input" \
  --mode "$par_mode" \
  --out "$prefix" \
  --out_path "$tmp_dir" \
  --opt-out-run-stats \
  ${meta_cpus:+--cpu "${meta_cpus}"} \
  ${par_lineage_dataset:+--lineage_dataset "$par_lineage_dataset"} \
  ${par_augustus:+--augustus} \
  ${par_augustus_parameters:+--augustus_parameters "$par_augustus_parameters"} \
  ${par_augustus_species:+--augustus_species "$par_augustus_species"} \
  ${par_auto_lineage:+--auto-lineage} \
  ${par_auto_lineage_euk:+--auto-lineage-euk} \
  ${par_auto_lineage_prok:+--auto-lineage-prok} \
  ${par_contig_break:+--contig_break "$par_contig_break"} \
  ${par_datasets_version:+--datasets_version "$par_datasets_version"} \
  ${par_e_value:+--evalue "$par_e_value"} \
  ${par_force:+--force} \
  ${par_limit:+--limit "$par_limit"} \
  ${par_long:+--long} \
  ${par_metaeuk:+--metaeuk} \
  ${par_metaeuk_parameters:+--metaeuk_parameters "$par_metaeuk_parameters"} \
  ${par_metaeuk_rerun_parameters:+--metaeuk_rerun_parameters "$par_metaeuk_rerun_parameters"} \
  ${par_miniprot:+--miniprot} \
  ${par_quiet:+--quiet} \
  ${par_restart:+--restart} \
  ${par_scaffold_composition:+--scaffold_composition} \
  ${par_tar:+--tar}


out_dir=$(find "$tmp_dir/$prefix" -maxdepth 1 -name 'run_*')

if [[ -n "$par_short_summary_json" ]]; then
    cp "$out_dir/short_summary.json" "$par_short_summary_json"
fi
if [[ -n "$par_short_summary_txt" ]]; then
    cp "$out_dir/short_summary.txt" "$par_short_summary_txt"
fi
if [[ -n "$par_full_table" ]]; then
    cp "$out_dir/full_table.tsv" "$par_full_table"
fi
if [[ -n "$par_missing_busco_list" ]]; then
    cp "$out_dir/missing_busco_list.tsv" "$par_missing_busco_list"
fi
if [[ -n "$par_output_dir" ]]; then
    if [[ -d "$par_output_dir" ]]; then
        rm -r "$par_output_dir"
    fi
    cp -r -L "$out_dir" "$par_output_dir"
fi

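The script relies on two shell idioms: unsetting boolean parameters whose value is the string `false`, and the `${var:+...}` expansion, which produces its right-hand side only when the variable is set and non-empty. The sketch below isolates that pattern. `build_flags` is a hypothetical function written for illustration, and it assumes (as the script's own checks suggest) that an unselected boolean flag arrives as the literal string `false`.

```bash
#!/bin/bash

# build_flags: hypothetical demo of the pattern used in script.sh.
#   $1 mimics a boolean parameter ("true" or "false")
#   $2 mimics an optional value parameter (may be empty)
# The function body runs in a subshell so its variables do not leak.
build_flags() (
  par_force="$1"
  par_limit="$2"

  # Assumption: an unselected boolean flag is passed as "false".
  # Unsetting it makes ${par_force:+...} expand to nothing.
  [ "$par_force" = "false" ] && unset par_force

  # Each expansion contributes zero words when the variable is unset or empty.
  set -- ${par_force:+--force} ${par_limit:+--limit "$par_limit"}
  printf '%s\n' "$*"
)

build_flags true 3
build_flags false ""
```

Because empty expansions vanish entirely, the command line never receives empty-string arguments, which is why the busco call can list every optional flag unconditionally.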
88
src/busco/busco_run/test.sh
Normal file
@@ -0,0 +1,88 @@
test_dir="$meta_resources_dir/test_data"

mkdir "run_prot_stramenopiles"
cd "run_prot_stramenopiles"

echo "> Running busco with lineage dataset"

"$meta_executable" \
    --input "$test_dir/protein.fasta" \
    --mode proteins \
    --lineage_dataset stramenopiles_odb10 \
    --output_dir output \
    --short_summary_json short_summary.json \
    --short_summary_txt short_summary.txt \
    --full_table full_table.tsv \
    --missing_busco_list missing_busco_list.tsv

echo ">> Checking output"
[ ! -f "output/full_table.tsv" ] && echo "full_table.tsv does not exist" && exit 1
[ ! -f "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist" && exit 1
[ ! -f "output/short_summary.json" ] && echo "short_summary.json does not exist" && exit 1
[ ! -f "output/short_summary.txt" ] && echo "short_summary.txt does not exist" && exit 1
[ ! -f "full_table.tsv" ] && echo "full_table.tsv does not exist" && exit 1
[ ! -f "missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist" && exit 1
[ ! -f "short_summary.json" ] && echo "short_summary.json does not exist" && exit 1
[ ! -f "short_summary.txt" ] && echo "short_summary.txt does not exist" && exit 1

echo ">> Checking if output is empty"
[ ! -s "output/full_table.tsv" ] && echo "full_table.tsv is empty" && exit 1
[ ! -s "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv is empty" && exit 1
[ ! -s "output/short_summary.json" ] && echo "short_summary.json is empty" && exit 1
[ ! -s "output/short_summary.txt" ] && echo "short_summary.txt is empty" && exit 1
[ ! -s "full_table.tsv" ] && echo "full_table.tsv is empty" && exit 1
[ ! -s "missing_busco_list.tsv" ] && echo "missing_busco_list.tsv is empty" && exit 1
[ ! -s "short_summary.json" ] && echo "short_summary.json is empty" && exit 1
[ ! -s "short_summary.txt" ] && echo "short_summary.txt is empty" && exit 1

cd ..
mkdir "run_prot_autolineage"
cd "run_prot_autolineage"

echo "> Running busco with auto lineage"

"$meta_executable" \
    --input "$test_dir/protein.fasta" \
    --mode proteins \
    --auto_lineage \
    --output_dir output

echo ">> Checking output"
[ ! -f "output/full_table.tsv" ] && echo "full_table.tsv does not exist in output folder" && exit 1
[ ! -f "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist in output folder" && exit 1
[ ! -f "output/short_summary.json" ] && echo "short_summary.json does not exist in output folder" && exit 1
[ ! -f "output/short_summary.txt" ] && echo "short_summary.txt does not exist in output folder" && exit 1

echo ">> Checking if output is empty"
[ ! -s "output/full_table.tsv" ] && echo "full_table.tsv in output folder is empty" && exit 1
[ ! -s "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv in output folder is empty" && exit 1
[ ! -s "output/short_summary.json" ] && echo "short_summary.json in output folder is empty" && exit 1
[ ! -s "output/short_summary.txt" ] && echo "short_summary.txt in output folder is empty" && exit 1

rm -r output/

cd ..
mkdir "run_genome"
cd "run_genome"

echo "> Running busco with genome data"

"$meta_executable" \
    --input "$test_dir/genome.fna" \
    --mode genome \
    --lineage_dataset saccharomycetes_odb10 \
    --output_dir output

echo ">> Checking output"
[ ! -f "output/full_table.tsv" ] && echo "full_table.tsv does not exist in output folder" && exit 1
[ ! -f "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist in output folder" && exit 1
[ ! -f "output/short_summary.json" ] && echo "short_summary.json does not exist in output folder" && exit 1
[ ! -f "output/short_summary.txt" ] && echo "short_summary.txt does not exist in output folder" && exit 1

echo ">> Checking if output is empty"
[ ! -s "output/full_table.tsv" ] && echo "full_table.tsv in output folder is empty" && exit 1
[ ! -s "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv in output folder is empty" && exit 1
[ ! -s "output/short_summary.json" ] && echo "short_summary.json in output folder is empty" && exit 1
[ ! -s "output/short_summary.txt" ] && echo "short_summary.txt in output folder is empty" && exit 1

rm -r output/
10000
src/busco/busco_run/test_data/genome.fna
Normal file
File diff suppressed because it is too large
64
src/busco/busco_run/test_data/protein.fasta
Normal file
@@ -0,0 +1,64 @@
>341721at2759_1001832_1:000010
MASRPVKKRKLTPPGDDEASSRKSGGKIQKAFLKNAANWDLEQDYETRARKGKKKEKESTRLPLKLPGGRVQHVSAPDNDFQAIESDEDWLDGAEDVSEDEESKDKKAPEEPEKPEHEQILEAKEELAKIALMLNESPDENTGAFKALAKIGQSRIITIKKLALATQLTVYKDVIPGYRIRPVAEDGPEEKLSKDVRKLRTYETCLISGYQAYVKELTKHAKTGHANGLASVAITCACNLLTAVPHFNFRSDLVKILVGKLSTRRVDDDFNKCLQALETLFEEDEEGRPSMEAVSLLSKMMKAREYQVNESVVNLFLHLRLLSDFSGKGSKDSVDRMDDGPSKKPKSKREFRTKRERKQIKEQKALQKDMAQADALVQHEERDRMEGETLKLVFGTYFRVLKMRVPHLMGAVLEGLSKYAHLINQNFFGDLLEALKDLIRHSDASEKDDAEEKEDEEADDDAPVRNPSREALLCTTTAFALLAGQDAHNARADLHLDLSFFTTHLYQSLFPLSLHPDLELGARSLHLPDPDKPSQNRKSNSSNKVNLQTTTVLLIRCLTAVLLPPWNVRSVPPVRLAAFAKQLMTAALHVPEKSAQALLALLADVAGTHGRRIAALWNTEERKGDGAFNPLAESAEASNPFAATVWEGEILRRHYCPAVRRGVGIVEKSLSLAER
>296129at2759_1069680_1:000010
MMKKKQIDSRIPTLIKNGVQEKKRTLFVIVGDRGRDQIVNLHWLLSQTRIASRPSVLWMYKKDLLGFTSHRKKREAKIKKEIKKGIRDPNEATTPFELFISVTNIRYTYYKESEKILGQTFGMLVLQDFEAITPNLLARTIETVEGGGIIVILFKTMENLKQLYTMTMDIHSRYRTEAHQDVVARFNGRFILSLGHCSSCLFVDDELNVLPISEAKKVKPLPKPQLEEPKKELEELKQKYEDKQLLRSLIDVAKTVDQARALITFVEAISEKTLRSTVALTAARGRGKSAALGLAISAAVAYGYSNIFITSPNPENLKTLFEFTFKGFNSLKYEEHIDYDIIQSLNPSFNKSIVRVNIFRNHRQTIQYIHPSDAYVLGQAELLVIDEAAAIPMPLVKKLLGPYLTFMASTVNGYEGTGRSLSLKLIQQLREQSRGFAHENTKSGNSEKSMINRSEKLNKESGINSIGGRKLREITLEEPIRYSYGDPVEEWLNKLLCLDINISLKQFLEQGCPHPSQCELYYVNRDTLFSYHPVSESFLQMMMSLYVASHYKNSPNDLQLMADAPAHQLFVLLPPVKEDDNKLPEPLCVIQVALEGEISRESVVNNLTRGYRTGGDLIPWVITEQFQDDKFASLSGARIVRIATNPEYIRMGYGSHALKLLENFYEGKYLNLSEETISESNENIKIINNNLESSLLTDDIKIKDLKIMPPLLLKLSEKKPGLIHYLGVSYGLTPQLYKFWKRAEFIPVYLRQTPNDLTGEHTCLMLKLLQDKSETWLNEFSNDFRKRFLSLLSFSFRSFPTILCLNIIESINNDLIQKDNVHVITKSEIDINLSPFDLKRLESYANNMLDYHTIIDMLPYIADLYFKGRFGKDLKMTGVQSAILLALGLQKRLLEDIEKELNLPSNQVLAMLVKILRKLSSFFKDIYYKAIDNTLPIERKNLKNQLQTHADENDNFRGFIPLKATLKEELDHLSSEMEDSIKEKQRELINSLDLQKYIIKGQEEDWDKAEQHIKNGIYSGKSSVVSIQSHSLKREHESLTDIPHIKKKHQKKHKRKV
>1217666at2759_1073089_1:000010
MPINQPSNQIKFTNVSVVRLKKGKKRFELACYKNKLLEYRSGAEKDLDNVLQVPTIFLSVSKAQTAPSAELTKAFGANIPADEIRQEILRKGEVQVGERERKEISERVEKELLDIVSGRLVDPTTKRVYTPGMISKALDQLSSASGQMQQTQGEGSGATDEKGAAQPRKPMWTGVAPNKSAKSQALDAMKALIAWQPIPVMRARMRLRVTCPVSILKHSVKAPSGGGASKEKEAPSGNSKSNKGKKGPKSRAARQQDSDAEDGKSDAEAAPKTPSNVKDKILGYIESIESQEVIGGDEWEVVGFAEPGAYKGLNEFVGNETRGRGRVEVLDMTVTHEE
>513979at2759_1159556_1:000010
MAVVDIQARFSPHHPLEPDLLYEIQSILRLHGLSVDDLFFKWDAYCIRMDLDAQAALSLANVRSLKQSIQDDLEKSHRSTTQVRSERKVAAAPKAVSGGDVYGMLDGLVPSTPAAGGKRSRGVAAGGGGSGLKKKMDSLKMNSSPAGMKEQLSAFNGLPATSFAERANAGDVVEILNAQLPPCEAPLAPFPEPRIKLTAASDQKKMAYKPLAVKLSEASEVLDDRIDEFAALVQDYHGLEDSAFGSAASQGTTEVVAVGRIASDAMEGKLNAAALVLETSRRTGMGLRVPLKMHKVPSWSFFPGQVVALRGTNATGGEFVVEQVLDVPLLPSAASTPSALEAHRARMSGVPPGGGAAAATTDSDAAAPAPAPAPLTILYAAGPYTADDNLDYEPLHALCSQAADALADALVLAGPFLDIDHPLVAAGDFDLPPEDEAALDPDTATMSAVFRHLVAPALNRACAANPHLTVVLVPSVRDVLARHVSWPQDAIARKELGLAKAARIVSNPMTLSMNEVVVGVSSQDVLHELRNEECSRACPPGDLMGRLCRYLVEQRHYFPLFPPTDRARLPRTGTQSGLATGAVLDPSYLRLGEMVNVRPDVMVVPSSLPPFAKASSVVESVLAINPGPLSKRKGAGTFARMTLHAPPVGGGSEMTSHRVFDRARVEIVRI
>543764at2759_1165861_1:000010
MALGRAARPVGWTDCCAAVEKKPNYKSGMTQPARTITAGDNLLLKLPSGQTRTIKNVTSDSSISLGKFGKFQTNELIDQPFGLTFDILEDGKLVRNEQINLALELNPMLDELNSFESIKGMANGISNVEDIEATNEMIKESDGAQKLTNVEIEELKKSGLSGREIILRQIQQHSAFELKSEFSKAKYIKRKEKKFLKMFTCIDPTIHNMSQYLFENHNFAIKGLRPDTLSQMLSLSNVRPGWKGIVVDDIGGLLVAAVLIRMGGEGTIFVLNNADSPPDLHLLELFNLPKSVLGPLKSLNWAQTEADWTTSDIEELLLLHRDPPQPLPILDSTLPDPQLKQLSQRTKKQPNNRSKSMRKFERVQELLSMRQEFLDTQFEGLLTCSEYEPESIVTKLVNKLSGSSTIVIYSCHLRPLSDLQTLLKKSSMPSTSSSSLGGSSSLVEQNELTKRMKENKTEFIQITISEPWLRAYQVLVGRTHPEMAGTHHGGFVFSAIKVFNSCS
>1558822at2759_1266660_1:000010
MSIAEILPLEIIDKTVGQPVLVMLTSHREFSGTLVGYDDFVNVVLEEVVEYDHDQEIKRHAGKMLLSGNNIAMLVPGGKRVQ
>1287094at2759_1291522_0:000010
MGNILVKKNRVTITEADRAILTLRTQRRKMEEHRRRVEALMERETTVARTLVAKQQRPAALLALKKKRLHETQLEGLDNCLLTLEETLTQVESAQRTARLMAALKQGADVLSALQRAMPLESVEQLMEQGAESREYEMRLQALLGESLGEDQSAAAERELDEMEAQLIEEDVLDLPKVPSHAVARPASARAIGQAASERQLEPEIAA
>83779at2759_1296121_1:000010
MCGLTLTIRPLSLSLSSPSVSDCSSSDSTEDADLALLDSFRSTNAQRGPDSQRTFKHTVTLDDDDNGVTTTTTTKSTTKSKVEICLTATVLGLRGDLTAQPLVGNRGVLGWNGQVFEGIDIGTEENDTRKIFERLEKGERVEDVLSGVEGPFAFIYLDLENDILHYQLDPLSRRSLLIHPAEVAVDSNPSVTRHFILSSSRSTLAREHGVDMRALLGGEGGTIDLRRIKVVQNQGFLTMDMSDALKHRHTLSPDQDASCSSSSGSWTKVAPINTALPPDNLPLDNPKIKEEVPKFIEQLKESVKRRVENIPNPEKGCSRVAVLFSGGIDCTFLAYLIHLCLPPEDPIDLINVAFSPAPKLSSLSSNGADKGKGKSPALPAAPTYDVPDRLSGRDALVELKQVCPDREWRFVEIDVPYDEARAHRQNVLDLMYPSSTEMDHSLALPLYFASRGYGSVRKEGSNHSEPYRVKAKVYISGLGADEQLGGYARHRHAYQREGWQGLISETQMDIARLPTRNLSRDDRMLSSHARDARYPYLSLSFISYLSSLPVHLKCDPRLGEGQGDKILLRKAVESVGLVRASGRVKRAMQFGTRSSKLGGRGSGVKGPKAGERQVE
>1057950at2759_1314783_1:000010
MSSRQATHADSWYVGDGRRLDSELSKNLAAVEGDANYSPPIKGCKAVIAPHAGYSYSGRAAAWAYKSIDTTGIKRIFILGPSHHVYLDGCALSKCEKYETPLGELPIDLDTVKELRATGEFQDMDIQTDEDEHSIEMHLPYVRKVFEGLDIAIVPILIGAINLNKENKFGTVLAPYLAKDDTFFVISSDFCHWGTRFQYTFYYPRPPPTSTPAIRLSKADPNPSTLATHPIHASISAIDHEAMDLMTMPPQTAQQAHIDFAEYLRTTKNTICGRHPIGVLLGALAVLQSQGRVPHLKFVRYEQSSQCQTVRDSSVSYASAYITV
>453044at2759_1330018_1:000010
MPAAPQDPFFKSIGSAAADTEALREQPDEQDEQETDLEPIDEDRPLQEVESLCMSCGEQGVTRMLLTSIPYFREVIVMSFRCEHCGNQNNEIQSASTIREHGAMYTVKILNQGDLNRQLVKSEAATVTIPEFELTIPPLRGQLTTVEGTLRDTIQDLAADQPLRRIQDPPTFDKIEALLAKLKEVVPDDEDEAAPTMKERHPEDPVRPFTVILDDPTGNSFIEFSGSMSDPKWSLREYARSMDQNITLGLSQPEDEEKEKVTQKGGPFTEEDEDGLPAEEVFIFPGICSSCGHPVDTRMKKVNIPYFKDIIIMSTNCSACGYRDNEVKSGGAISDKGKRITLKVEDAEDLSRDILKSETCGLEIPEIDLALHAGTLGGRFTTVEGILTQVYDELSEKVFRGDSVGSANSKDNQEFETFLGSMKEVMTAARPFTLILDDPLANSYLQNLYAPDPDPNMEIVTYDRTFDQNEDLGLNDMKVEGYEAPS
>1323575at2759_1392248_1:000010
MSQPQPPPLRYIRYEPSREDEYVAAMRQLISKDLSEPYSIYVYRYFLYQWGDLCFMTVDDSRPEDPIVGVVVSKLEPHRGGPMRGYIAMLAVREEYRGRGIATKLVRMAIDAMIARDADEIALETEITNTAAMKLYERLGFLRSKRLHRYYLNGNSAYRLVLYLKEGVGNMRTSFDPYAAPAEARPEMSGAAAVPAAPAPPPLLQGNGR
>160593at2759_139723_0:000010
MADAELAKALKDLPNRVLNVPVEERPELFQNVIAVLPNPGINATIVRGICKVIGTTLTKYKDPESQTLVKELLVAVLKQHPDLTYEHFNAVLKALLAKDLAGAPPIKAAQASALALGWANLIALHADHETAVGKKEFPKLLEVQAGLYQLSLTSGIQKISDKAYSFLRDFFASDESLAQRYFDKLLAMEPSSGVIVMLCTIVRYLHQEQGTVELLDQHKPKLLDHLVKGLITVKTKPHASDIVACSILLKAITKDELRTIIVPALQRSMLRSAEVILRAVGAIVNEIELDVSDYALDLGKPLVQNLASKEETVRQEAVESLKQVALKCGTPNAIETLLKEVFAVLNGSGGKITVAELRINLLQGAGNLSYNKIPSQKIQTILPAACDHFTKVIEAEIQEKVVCHALEMFGLWTVNHRGEIPAKIVQLFKKGLDAKAQTIRTSYLQWFLSCLHDGKLPNGIDFTTTLSKIVERAAQSPTQTPVVSEGVGAACILLLTNPSVSEKLKDFWNIVLDTNKSPFLSERFLSTTNAETRCYVMVICEQLLIKHRNELKGSSTTDPLIRAATVCVMSAQAKVRRYCLPLVTKIVNSEDGVSLAKFLLAELTRYVECTKILSEGEPAEEGIAPAQALVDAVCTVCNVEKVANPDAQSLALSALLCSHHPAAVSVRGDLWESILERYGLYGKQFIALNTAQIEEVFFNSYKATAMYENTLATLSRISPELILSVLVKNVTDQLNNSRMSNVTDEEYFTYLTPDGELYDKSVIPNTDEQVQTAHLKRENKAYSYKEQLEELQLRRELEEKRRKEGKWKPPQLTPKQKEVIDKQREKENAIKARLQALHDTITTLISQIEGAAKGTPKQLPLFFPALLPAILRVFSSPLAAPAMVKLYYRLKDICFGEERVELGRDIAIATIRLSKPHCDLEESWCTANLVELVSDILVALYDETIDMYNVHREEEASKRYLLDAPAFSYTFEFLKRALTLPEAKKDESLLINGVQIIAYHAQLKGDTVDGKDLGDVYHPLYMPRLEMIRLLLRLIQQHRGRVQTQAVAALLDVAESCSGREYTTRAEQREIEALLVALQEELDAVRDVALRALAIMIDVLPSIADDYEFGLRLTRRLWVAKHDLSADIKQLATGIWQDGAYEVPIVMADELMKDIIHPELCVQKAAAAALVSILVEDSSTIDGVVEQLLEIYREKVVMIPAKLDQFDREVEPAIDPWGPRRGVAITLGSISPFLTPELVKSVIQFMVRSGLRDRQEIVHKEMLAASLAIVEHHGKDSVTYLLPTFEYFLDKAPSKGAYDNIRQAVVILMGSLARHLDREDERIQPIIDRLLAALETPSQQVQEAVANCIPHLIPSVKDKAPEIVKKLLQQLVKSEKYGVRRGAAYGIAGVVKGLGILSLKQLDIMSKLTHYIQDKKNYKSREGALFAFEMLCSTLGRLFEPYIVHVLPHLLQCFGDSSVYVRQAADECAKTVMAKLSAHGVKLVLPSLLNALDEDSWRTKTASVELLGSMAFCAPKQLSSCLPSIVPKLMEVLGDSHIKVQEAGANALRVIGSVIKNPEIQAIVPVLLTALEDPSSKTSACLQSLLETKFVHFIDAPSLALIMPVVQRAFMDRSTETRKMAAQIIGNMYSLTDQKDLTPYLPNIIPGLKTSLLDPVPEVRGVSARALGAMVRGMGESSFEDLLPWLMQTLTSESSSVDRSGAAQGLSEVVGGLGVEKLHKLMPEIIATAERTDIAPHVKDGYIMMFIYMPSAFPNDFTPYIGQIINPILKALADENEYVRDTALKAGQRIVNLYAESAITLLLPELEKGLFDDNWRIRYSSVQLLGDLLYKISGVSGKMTTQTASEDDNFGTEQSHKAIIRSLGADRRNRVLAGLYMGRSDVSLMVRQAALHVWKVVVTNTPRTLREILPTLFSLLLGCLASTSYDKRQVAARTLGDLVRKLGERVLPEIIPILERGLSSDQADQRQGVCIGLSEIMASTSRDMVLTFVNSLVPTVRKAL
ADPLPEVRHAAAKTFDSLHTTVGARALEDILPSMLESLADPDPDVAEWTLDGLRQVMAIKSRVVLPYLIPQLTAKPVNTKALSILASVAGEALTKYLPKILPALLAALAAAQGTPEEVQELEYCQAVILSVSDEVGIRTIMDTVMESTKSEIPETRRAAATLLCAFCTHSPGDYSQYVPQLLRGLLWLLSDGDREVLQRSWDALNAVTKTLDSAQQIAHVTDVRQAVKFASSDLPKGGELPGFCLPKGITPLLPVFREAILNGLPEEKENAAQGLGEVIKLTSPASLQPSVVHITGPLIRILGDRFNAGVKAAVLETLAILLHKVGIMLKQFLPQLQTTFLKALHDPSRTVRIKAGHALAELIVIHTRPDPLFVEMHNGIKSADDSAVRETMLQALRGIVTPAGDKMTEPLRKQIYATLAGMLAHPEDVSRAAAAGCFGALCRWLTPEQVDDALTSHMLNEDYGDDATLRHGRTAALFVALKEHPGGIVTTKYEPKICKVITGALVSDKISVAMNGVRAGGYLLQYGMTDGTAKLSTAVIGPFVKSMNHSSNEVKQLLAKTCTYLARVVPAERIAPEYLKLAIPMLVNGTKEKNGYVRSNSEIALVHVLRLRDGEEFHQRCITLLEPGARESLSEVVSKVLRKVAMQAVGKEEELDDTILT
>1346432at2759_1447883_1:000010
MSSMRNAVQRRVHRERAQPANREKWGILEKHKDYSLRARDYSVKKAKLQRLREKADTRNPDEFAFGMMSGKSRTQGKHGARDTESAALSLETVKLLKTQDAGYLRVVGERIRRQMMAVDEEVRVQEGISGVSANGAAAGGGGGGGRKVVFVDSVEEQRERALEDEGKSDDDEEQGDFDEVDEEEQRQQKTQPKSKKQLEAEKLAQKEMLKARKLKIKAAEARSKKLQALTDQHKNIVAAEQELDWQRGKMENSVGGVNKHGLRWKVRERKR
>761109at2759_198730_1:000010
MAMTFTEDSIKELRLRLEDAVVKCSERCLYQSAKWAAEMLNSLVSTDGNDTDAESPMETDLQPTVNPFSLQSDPTEATLELQEAHKYLLAKSYFDTREYDRCAAVFLPPTIPPVPLSTVSPNVKSRASLTPQKGKRKSFIRPGLKSGQALPRNPYPNLSQKSLFLALYAKYLAGEKRRDEETEMVLGPADGGMTVNRELPDLARGLEGWFEERRERGLQDQGQGWLEYLYAVILIKGKNEEEAKKWLVRSVHLFPFHWGAWQELNDLLPSVDDLKQVAETLPQNIMSFIFQVHCSQELYQATDETHQTLNGLESIFPTSAFLKTERALLYYHSRDFEDASAIFADILIDSPHRLDSLDHYSNILYVMGARPQLAFVAQLATATDKFRPETCCVVGNYYSLKSEHEKAVMYFRRALTLDRNFLSAWTLMGHEYIEMKNTHAAIESYRRAVDVNRKDYRAWYGLGQAYEVLDMCFYALYYYQRTAALKPYDPKMWQAVGTCYAKMNQIPQSIKAMKRALVAGAYYEQRADAATADHPAAGRKILDPDLLHQIALLYEKMNNEDEAAAYMELTLQQESGEIERTETDSDDDDGDDNSDDGTTQRRSRRQRRRQKSRDDDNEIEAVGGTGVTATTSKARLWLARWALKHGDLNRADQLAGELCQDGVEVEEAKALMRDVRARREGGGG
>1617752at2759_2004952_1:000010
MPSSFVTPGQQRYLRACMVCSIVMTYSRFRDEGCPNCDEFLHLAGSQDQIESCTSQVFEGLITLANPAKSWIAKWQRLDGYVGGVYAIKVSGQLPDEIRTTLEDEYRIQYIPRDGTQTEADA
>1588798at2759_215358_0:000010
MTLPPTQQEPHTPEAFSLFVSFNHREPQNDDVMADLGIKAGDKVMMVWTQPSAPEGLKQHAEELAAIVGADGKVSVENLERLLLSSHSASSFDCVLSCLLADSSPVHTSETLEELARVLKPGGKLVLDEAVTGAETSQVRTAEKLISALKLSGFMSVTEVSKAELTAEALSALRTATGYQGNTLSRVRVSASKPNFEVGSSSQIKLSFGKKTPKPAEKPALDPNTVKMWTLSANDMGDDDVDLVDSDALLDEEDLKKPDPASLKVSCRDSGKKKACKNCSCGLAEELEQESTGKQKTNLPKSACGSCYLGDAFRCASCPYAGMPAFKPGEKIVLDKKTLTDA
>1275837at2759_28005_1:000010
MSSRDKASPSSPKETKGEHHLNEESDNDNNERRDEQQVTASAYLPSASRVDVHPLVLLSLVDHFARMNTKVRQKKRVVGLLLGRYKTDAAGTQVLDINNSFAVPFDEDPHNSDVWFFDTNYAEEMFVMHRRVHPKTKIVGWYASGPTVQQNDMLLHLLVADRFCANPVYCVVNTDPSHKGVPVLAYTTVQGREGARSLEFRNIPTHVGAEEAEEIGVEHLLRDLTDSTVTTLSSQLEERERSLEHMARVLVQIEEYLSDVASGALPASEDVLEALQELISLQPETYLKKKSLELNRFTNDRTIATFLGSIARCIGGLHEVILNRRVLARELKEIKARRAEAEEQRMDNEKNKIAEASPERKQ
>1264469at2759_29058_0:000010
MRPPLAIVRTYCTTAAPKSSNFIDEMKRNFIATNTFQKTLLSCGSAAISLLNPHRGDMIACLGEVTGESAIKYMRQKMTETEEGTEILKEKPRINSGTVSFDKLSQMPDNTLGRVYADFMTENNITADSRLPVQFIEDPELAYVMQRYREVHDLVHATLFMRTSMLGEVTVKWVEGIQTRLPMCISGGIWGAARLKPKHRQMYLKYYLPWAIKTGNNAKFMQGIYFEKRWDQDIDDFHKEMNIVRLVKK
>673132at2759_326594_0:000010
MTLLTVFKQFKKFQDAGKSVARSLSIKDDQESKKTCLYDLHIENNGKMVNFSGWLLPIQYRDSITASHQHTRTHASLFDVGHMLQSHVSGCDSGEFLESLTTADLQNLAQGGAALTVFTNKSGGILDDLIITKDRNDRFFVVSNAGRRNEDIELMLGRQAEMKSQGKNVTIEFLDPLEQGLIALQGPSAATTLQTLVKIDLTKLKFMNSVETKINQKSVRISRCGYTGEDGFEISVNGKDARTISEMILEVPDIKLAGLGARDSLRLEAGFCLYGHDINESITPVEASLQWLIAKRRREAANFPGAEFILEQIKNGPKKKRVGLILGQGPPARENATILTSAGERVGIVTSGGPSPTLGKPIAMGYVPLEHVHTGTPVLTEIRGKTYKALITKMPFVKPHYYSDKR
>887370at2759_331117_1:000010
MVVRSFLPLLSLLIALATFTSAASDYHEALVLQPLPQSSLLASFNFRGNTSQEAFDQRHFRYFPRALGQILQHTHTKELHIRFTTGRWDAESWGTRPWNGTKEGNTGVELWAWIDAPDSESAFARWISLTQSLSGLFCASLNFIDSTRTTRPVVSFEPIGDHSPSSDLHLLHGTLPGEVVCTENLTPFLKLLPCKGKAGVSSLLDGHKLFDASWQSMSVDVRPVCPQGGECLMQIEQTVDIVLDIERSKRPRDNPIPRPVPNDQLNCDNSKPYHSDDTCYPLERGSGKGWSLNEIFGRTLNGVCSLDEGQRPGEEAICLRVPHEQGVYTTSGVEETKRPDGYTRCFTLQPSGTFDLVIPEQSHTSLAPRDEPVLSAERTIVGHGQERGGMRIIFDNPSDAHPVDFIYFETLPWFLRPYVHTLRATITGRDGATRSVPVSHIVKETFYRPAIDRERGTQLELALSVPAASIVTLTYDFEKAILRYTEYPPDANRGFNVAPAVIKLSSANGNTIAHDTPIYMRTTSLLLPLPTPDFSMPYNVIILTSTVIALAFGSIFNLLVRRFVAADQAAALTAQTLKGRLLGKIVALRDRISGKRSKVE
>166920at2759_38123_0:000010
MAFLDFVFPLSKDELLERSDSQYYVRDQVTTSELPEKLKGCFESLHDDGPLFILENFDTLYGLLAHFKSVDFNQLHKVYTKLLIKSITEFIPILENYFSKETPDDELQNKYLNVIKMTVYILTEFIISFESRLQKEYQKVVIDVRARKVKVRAAIKHKEKYNWDWDFHLSNGLNSIHQLLKAKINKLWDPPVVEEEFVNTIANCCYKIIEDPCIASVKHKELRIFIFQVIGYLIKKYNHGISCTVKIVQLLKNCDHLVSPLAQAVTMFIRNHGCKSLVREIVREISEMDDGNEAAGQGQDNSKMVAAFLNEIAAEGPEYVIPAMDELLLNLEKESYMMRNCTLTILTELLLQVYKKENLSSEAKDQRDEYLNSLMEHIYDVHTFVRTKVLQLFQKLVIEKALPLAFTLQLVDRAIGRLMDKSSNVVKYAVQLLRTMIVSNPFAAKLGVEELKKKLAEAKATLTELEKNLPETSAQLSLVDEWNNIHYPVLLKIIREILEDGMYGCFLFYFL
>1275837at2759_402676_1:000010
MESMNDMFKKINAREKLVGWYHTGPQLRSSDLEINNLFKKYIPNPVLVIIDVQSKAVGLPTSAYFAVDEIKDDGTKSSLTFVHLPSSIEAEEAEEIGVEHLLRDTRDITAGTLATRVTEQVQSLRALEQRLDEIAVYLRKVVDGQLPINHTILGELQGVFNLLPNIFKTSNENDPLGLENGDERSFNINSNDQLMTVYLSSIVRSVIALHDLLDSLAASKAAEQEQDKLDLKQESTDSEKRATTAAVDEDPFMPN
>1284731at2759_42254_0:000010
MAEAGAVAAEYPSGGRARAARTLLDQVVLPGEELLLPEQEDADGPGGAGERPLQARDPYLKWGVRRACCEIPYVPVRGDHVIGIVTAKSGDTFKVDVGGSEPASLSYLAFEGATKRNRPNVQVGDLIYGQFVVANKDMEPEMVCIDGCGRANGMGVIGQDGLLFKVTLGLIRKLLAPDCEIIQELGKLYPLEIVFGMNGRIWVKAKTIQQTLILANILEACEHMTTDQRKQIFSRLAES
>1228942at2759_45354_1:000010
MNHDPFQWGRPRDEIYGHYDHKIAQASTSEFPSMHTQQPIITGTSVLGLKFDTGVVIAADHMGSYGSLLRFNNLERLICVGSETIVGVSGDISDFQHIERLLHELETEEEVYDTDGGHNLRAPNIHEYLSRVLYNRRLKMDPLWNAILVAGFNDDRTPFIRYVDLLGVTYGALALATGFGAHLAIPLLRKLVPYDLDYVKVKEADAREAVVNAMRVLYYRDARASDKYTLAVLSFKDGKVDVHFDQELKVTNQSWKFAEKVIGYGSKQQ
>759498at2759_502779_1:000010
MDGSRGSRKRKAVTRDLGEEPGVVSGNELHLDSADGSLADHSEDLDGSSDSEIELADDLNSDDDEEEEEEEEEDEDEINSDEVPSDIEPKVVGKKSGPGGEVDIIVRGDDTASDDDDDDDDDFESDDRPNYRVVKDANGNERYVYDEINPDDNSDYSETDENANTIGNIPLSFYDQYPHIGYNINGKKIMRPAKGQALDALLDSIELPKGFTGLTDPATGKPLELTQDELELLRKVQMNEITEEGYDPYQPTIEYFTSKLEVMPLSAAPEPKRRFVPSKHEAKRVMKLVKAIREGRILPYKQPAEEDEAEEGVQTYDIWANETPRADHPMHIPAPKLPPPGYEESYHPPPEYLPDEKEKSAWLNTDPEDRETEYLPTDHDALRKVPGYESFVKEKFERCLDLYLAPRVRRSKLNIDPESLLPKLPSPEELKPFPSTCATLFRGHQGRVRTLAIDPTGVWLASGGDDGTVRVWDILTGRQFWSVALSGDDAINVVRWRPGKDAVVLAAAAGDSIFLMVPPVLDPEMEKASFEVVDAGWGYAKTSPSTFTSTDSTKTSPVQWTRPSSSLLDSGVQAVISLGYVAKSLSWHRRGDYFVTVCPGTSTPVSLAIAIHTLSKHLTQQPFRRRLKGGGPPQTAHFHPSKPILFVANQRTIRAYDLSRQTLVKILQPGARWISSFDIHPTSSSTSGGDNLIVGSYDRRLLWHDVDLSPRPYKTLRYHQKAIRAVRYHANYPLFADASDDGSLQIFHGSVTGDLLSNASIVPLKVLRGHKVTGELGVLDLDWHPKEAWCVSAGADGTCRLWM
>375960at2759_51337_0:000010
MFFREHIFNIIGAFDIPRFVYNSERKKFLPLLMTNHPAPNLLGTAKDKAELYRERYTLLHQRTHRHELFTPPVIGSYPNESGSKFQLKTIETLLGSTTKIGDVIVLGMITQLKEGKFFLEDPTRTVQLDLSQAQFHSGLYTEACFVLAEGKAYYGSINFFGGPSNTSVKTSTKLKQLEEENKDAMFVFVSDVWLDRAEVLEKLHIMFSGYSPAPPSCFILCGNFSSAPYGKNQIQALKDSLKTLADIICEYPNIHQSSRFVFVPGPKDPGFGSILPRPPLAESITSEFRQKIPFSVFTTNPCRIQYCTEEIIIFREDIVNKMCRNCVRFPSSNLDIPNHFVKTILSQGHLTPLPLYVCPVYWARFPSSNLDIPNHGSFPRSGFSFKVFYPSSKTVEDSKLQGF
>919955at2759_5643_1:000010
MAAPMAVDKAKAPKIDVDEFLTLAISETPAELHPFFESFRSLYSRKLWHQLTNKLFEFFDHPLSKPYRVDVFNKFVRDFGLRLNQLRLVEMGVKVSKEIDNPVTHLQFLTDLLERVNIEKSPEAHVLLLSSLAHAKLLYGDHEGTKNDIDAAWKVLDELSSVDPSVNAAYYGVAADYYKSKAEYAPYYKNSLLYLACIDPAKDLTAEERLLRAHDLGIAAFLGDTIYNFGELPILQENYPFLRQKICLMALIESVFKRGSYDRTMSFQTIAEETHLPLDEVEHLVMKALSLKLIKGSLDQVDQKAQITWVQPRVLSREQIGQLAQRLAAWNSKLHQVEERIAPEVLVNS
>817008at2759_5849_1:000010
MDKLKTIYIDSALSIIKGALCVILQIPTGRTTESIKKKQNNVGIITVKSIFKEPTISQYNDIKQLIKTKIEENCPFYNYQINRTIAEKIYGDTIYDNYGLSKEINEVNLIILEEWNINCNRNRVLKHSGLIKNIEINKFKYLNNKESLEVHFLVNPKYTFEELNTIYKNEEELNNFLLSPIIKVTNKKIYEIEDKKSEFSYLYEEDILPKNKVLPPSGIENVNYESSKVVTPWDVNIGEEGINYNKLIKEFGCSKISDEHIRKIEKLTNRKAHHFIRRGIFFSHRDLDFLLNYYEQNGYFYIYTGRGPSSLSMHLGHLIPFYFCKYLQDAFNVPLIIQLSDDEKFLFNQNYSLDDINRFTKENVKDIIAVGFNPELTFIFKNTEYANHLYPTVLAIHKKTTLNQSMNVFGFNNSDNIGKISYPSFQIAPCFSQCFPNFLKKNIPCLVPQGIDQDPYFRLSRDIAVKLALYKPVVIHSVFMPGLQGVNTKMSSTKKKDNKNMDSKQDINNSVIFLTDSPEQIKNKINKYAFSGGGATIAEHKEKGADLEKDISYQYLRYFLVDDEKLNEIGEKYKKGEMLSGEIKKILIDILTDLVQKHQEKRNSLTDEDILYFFNDNKSSLKKFKDM
>1426075at2759_61621_0:000010
MTASQPNPQLPQSLPALKTSGTCARLPSTGRKLHLRIARAHPRVSRELFRRSGCGCGAGLSSAETDIAFLFSASGYRSHILKTMSGSFYFVIVGHHDNPVLKWSFXPAGKAESKDDHRHLNQFIAHAALDLVDENMWLSNNMYLKTVDKFNEWFVSAFVTAGHMRFIMLHDIRQEDGIKNFFTDVYDLYIKFSMNPFYEPNSPIRSSAFDRKVQFLGKKHLLS
>655400at2759_688394_1:000010
MAASRSPRLSSLLLRTTPLSRPTWQRTLSTRGFATAISNKLDNVYDMVIVGGGIAGTALACSLATNPSMKDYRIALIEAMDLSNTNNWAPATGRYSNRVVSLTPASMQFFEKIGVADELYRDRIQPYNCMKVSDGVTNASIEFDTNLLSSSTNPDDLPIAYMIENVHLQHSILKTLQTSKGKGATVDILQKARVASIRMQEQDAKETKDTLDLSDWPIIEMENGQSLQARLLVGADGVNSPVRSFAKIESLGWDYNMHGVVATFKTDPSRKNDTAYQRFLPTGPIAMLPLGDGHASMVWSMPPDMAHKVKKIPAQAFCTLVNSAFRLSMEDLDYLRSKIDPTTFEPLCDFDSEYNWRQGVAKHGLGDMEMMERELAFPPIVESVDETSRASFPLRMRNSQQYFADRVVLVGDAAHTVHPLAGQGLNQGILDVACLSDILQRGASEGQDIGNLHLLREYASVRYLRNLLMISACDKLHRLYSTDFAPITWIRSLGLSSVNQLDFVKAEIMKYAMGIEQ
>946128at2759_765440_1:000010
MPTTVCTAKASYKKTPGQLELTETHLQWFADGKKAPSVRVLYAEAASLFCSKEGAAQIRLKLGLVGDDTGHNFTFTSPQSVAYKERETFKKELTNIISRNRSVPNVTTPRPPLNTSISSTTPAISNAPTPRSVVPPSRASTSRAPSVSSDGRTPIVPGSDPTSDFRLRKQVLVSNPELGALHRDLVMSGQITEAEFWEGREHLLLAQTATESQKRGRPGQLVDPRPETVEGGEVKIVITPQLVHDIFEEYPVVAKAYNDNVPNKLSEAEFWKRYFQSKLFNAHRASIRSSAAQHVVKDDKIFDKYLEKDDDELEPRRQRDEGINLFVNLGATREDHGETGNEQDITMQAGRQRGALPLIRKFNEHSERLLNSALGDEPTAKRRRIDAGKEDAYSQIDLDDLHDPEASAGIILEMQDRQRYFEGQMASAASAEAAAGKNLDIRAILGETKVNLHDWETNLAQLKINKKSGDAALLSMTENVSARLEIKMKKNDIPPELFSQMTTCQTAANEFLRQFWLSMYPPAADHQVLAPATPAQKAAKAAKMIGYLGKTHEKVDALIRTAQVEAVDAAKVEIVRAVCFVYIITVNFNANLQAMKPILDAVDRALAFYRSRKPPK
>1287401at2759_870435_1:000010
MSSSIVGSLTRGCRTPSVNINPHPFFRCRTSLYHGIGKPPSWLHSRTQLWRTIGTSSSKHTPPSSASVSARRPTAIPSYNASREQMYKTRNRNLLMYTSAVVILGVGITYAAVPLYRMFCSATGFAGTPSVVSTSSGRFDPSRLTPDTDARRIRVHFNADRAEALPWKFFPQQKYVEVLPGESSLAFYKARNESKKDIIGIATYNVTPDRVAPYFSKVECFCFEEQKLLAGEEVDMPLLFFIDKDILDDPSCRGVNDVVLSYTFFKARRNAQGHLEPDAEEDVVQRSLGFEGYEHSPRAETKKVEGSKANS
12
src/busco/busco_run/test_data/script.sh
Normal file
@@ -0,0 +1,12 @@
# busco test data

# Test data from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/busco/test

if [ ! -d /tmp/snakemake-wrappers ]; then
  git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi

cp -r /tmp/snakemake-wrappers/bio/busco/test/protein.fasta src/busco/test_data

# Test data from busco test data at https://gitlab.com/ezlab/busco/-/tree/master/test_data?ref_type=heads
wget -O src/busco/test_data/genome.fna "https://gitlab.com/ezlab/busco/-/raw/master/test_data/eukaryota/genome.fna?ref_type=heads&inline=false"
481
src/cutadapt/config.vsh.yaml
Normal file
@@ -0,0 +1,481 @@
name: cutadapt
description: |
  Cutadapt removes adapter sequences from high-throughput sequencing reads.
keywords: [RNA-seq, scRNA-seq, high-throughput]
links:
  homepage: https://cutadapt.readthedocs.io
  documentation: https://cutadapt.readthedocs.io
  repository: https://github.com/marcelm/cutadapt
references:
  doi: 10.14806/ej.17.1.200
license: MIT
argument_groups:
  ####################################################################
  - name: Specify Adapters for R1
    arguments:
      - name: --adapter
        alternatives: [-a]
        type: string
        multiple: true
        description: |
          Sequence of an adapter ligated to the 3' end (paired data:
          of the first read). The adapter and subsequent bases are
          trimmed. If a '$' character is appended ('anchoring'), the
          adapter is only found if it is a suffix of the read.
        required: false
      - name: --front
        alternatives: [-g]
        type: string
        multiple: true
        description: |
          Sequence of an adapter ligated to the 5' end (paired data:
          of the first read). The adapter and any preceding bases
          are trimmed. Partial matches at the 5' end are allowed. If
          a '^' character is prepended ('anchoring'), the adapter is
          only found if it is a prefix of the read.
        required: false
      - name: --anywhere
        alternatives: [-b]
        type: string
        multiple: true
        description: |
          Sequence of an adapter that may be ligated to the 5' or 3'
          end (paired data: of the first read). Both types of
          matches as described under -a and -g are allowed. If the
          first base of the read is part of the match, the behavior
          is as with -g, otherwise as with -a. This option is mostly
          for rescuing failed library preparations - do not use if
          you know which end your adapter was ligated to!
        required: false

  ####################################################################
  - name: Specify Adapters using Fasta files for R1
    arguments:
      - name: --adapter_fasta
        type: file
        multiple: true
        description: |
          Fasta file containing sequences of an adapter ligated to the 3' end (paired data:
          of the first read). The adapter and subsequent bases are
          trimmed. If a '$' character is appended ('anchoring'), the
          adapter is only found if it is a suffix of the read.
        required: false
      - name: --front_fasta
        type: file
        description: |
          Fasta file containing sequences of an adapter ligated to the 5' end (paired data:
          of the first read). The adapter and any preceding bases
          are trimmed. Partial matches at the 5' end are allowed. If
          a '^' character is prepended ('anchoring'), the adapter is
          only found if it is a prefix of the read.
        required: false
      - name: --anywhere_fasta
        type: file
        description: |
          Fasta file containing sequences of an adapter that may be ligated to the 5' or 3'
          end (paired data: of the first read). Both types of
          matches as described under -a and -g are allowed. If the
          first base of the read is part of the match, the behavior
          is as with -g, otherwise as with -a. This option is mostly
          for rescuing failed library preparations - do not use if
          you know which end your adapter was ligated to!
        required: false

  ####################################################################
  - name: Specify Adapters for R2
    arguments:
      - name: --adapter_r2
        alternatives: [-A]
        type: string
        multiple: true
        description: |
          Sequence of an adapter ligated to the 3' end (paired data:
          of the second read). The adapter and subsequent bases are
          trimmed. If a '$' character is appended ('anchoring'), the
          adapter is only found if it is a suffix of the read.
        required: false
      - name: --front_r2
        alternatives: [-G]
        type: string
        multiple: true
        description: |
          Sequence of an adapter ligated to the 5' end (paired data:
          of the second read). The adapter and any preceding bases
          are trimmed. Partial matches at the 5' end are allowed. If
          a '^' character is prepended ('anchoring'), the adapter is
          only found if it is a prefix of the read.
        required: false
      - name: --anywhere_r2
        alternatives: [-B]
        type: string
        multiple: true
        description: |
          Sequence of an adapter that may be ligated to the 5' or 3'
          end (paired data: of the second read). Both types of
          matches as described under -a and -g are allowed. If the
          first base of the read is part of the match, the behavior
          is as with -g, otherwise as with -a. This option is mostly
          for rescuing failed library preparations - do not use if
          you know which end your adapter was ligated to!
        required: false

  ####################################################################
  - name: Specify Adapters using Fasta files for R2
    arguments:
      - name: --adapter_r2_fasta
        type: file
        description: |
          Fasta file containing sequences of an adapter ligated to the 3' end (paired data:
          of the second read). The adapter and subsequent bases are
          trimmed. If a '$' character is appended ('anchoring'), the
          adapter is only found if it is a suffix of the read.
        required: false
      - name: --front_r2_fasta
        type: file
        description: |
          Fasta file containing sequences of an adapter ligated to the 5' end (paired data:
          of the second read). The adapter and any preceding bases
          are trimmed. Partial matches at the 5' end are allowed. If
          a '^' character is prepended ('anchoring'), the adapter is
          only found if it is a prefix of the read.
        required: false
      - name: --anywhere_r2_fasta
        type: file
        description: |
          Fasta file containing sequences of an adapter that may be ligated to the 5' or 3'
          end (paired data: of the second read). Both types of
          matches as described under -a and -g are allowed. If the
          first base of the read is part of the match, the behavior
          is as with -g, otherwise as with -a. This option is mostly
          for rescuing failed library preparations - do not use if
          you know which end your adapter was ligated to!
        required: false

  ####################################################################
  - name: Paired-end options
    arguments:
      - name: --pair_adapters
        type: boolean_true
        description: |
          Treat adapters given with -a/-A etc. as pairs. Either both
          or none are removed from each read pair.
      - name: --pair_filter
        type: string
        choices: [any, both, first]
        description: |
          Which of the reads in a paired-end read have to match the
          filtering criterion in order for the pair to be filtered.
      - name: --interleaved
        type: boolean_true
        description: |
          Read and/or write interleaved paired-end reads.

  ####################################################################
  - name: Input parameters
    arguments:
      - name: --input
        type: file
        required: true
        description: |
          Input fastq file for single-end reads or R1 for paired-end reads.
      - name: --input_r2
        type: file
        required: false
        description: |
          Input fastq file for R2 in the case of paired-end reads.
      - name: --error_rate
        alternatives: [-E, --errors]
        type: double
        description: |
          Maximum allowed error rate (if 0 <= E < 1), or absolute
          number of errors for full-length adapter match (if E is an
          integer >= 1). Error rate = no. of errors divided by
          length of matching region. Default: 0.1 (10%).
        example: 0.1
      - name: --no_indels
        type: boolean_false
        description: |
          Allow only mismatches in alignments.

      - name: --times
        type: integer
        alternatives: [-n]
        description: |
          Remove up to COUNT adapters from each read. Default: 1.
        example: 1
      - name: --overlap
        alternatives: [-O]
        type: integer
        description: |
          Require MINLENGTH overlap between read and adapter for an
          adapter to be found. The default is 3.
        example: 3
      - name: --match_read_wildcards
        type: boolean_true
        description: |
          Interpret IUPAC wildcards in reads.
      - name: --no_match_adapter_wildcards
        type: boolean_false
        description: |
          Do not interpret IUPAC wildcards in adapters.
      - name: --action
        type: string
        choices:
          - trim
          - retain
          - mask
          - lowercase
          - none
        description: |
          What to do if a match was found. trim: trim adapter and
          up- or downstream sequence; retain: trim, but retain
          adapter; mask: replace with 'N' characters; lowercase:
          convert to lowercase; none: leave unchanged.
          The default is trim.
        example: trim
      - name: --revcomp
        alternatives: [--rc]
        type: boolean_true
        description: |
          Check both the read and its reverse complement for adapter
          matches. If match is on reverse-complemented version,
          output that one.

  ####################################################################
  - name: "Demultiplexing options"
    arguments:
      - name: "--demultiplex_mode"
        type: string
        choices: ["single", "unique_dual", "combinatorial_dual"]
        required: false
        description: |
          Enable demultiplexing and set the mode for it.
          With mode 'unique_dual', adapters from the first and second read are used,
          and the indexes from the reads are only used in pairs. This implies
          --pair_adapters.
          Enabling mode 'combinatorial_dual' allows all combinations of the sets of indexes
          on R1 and R2. It is necessary to write each read pair to an output
          file depending on the adapters found on both R1 and R2.
          Mode 'single' uses indexes or barcodes located at the 5'
          end of the R1 read (single).

  ####################################################################
  - name: Read modifications
    arguments:
      - name: --cut
        alternatives: [-u]
        type: integer
        multiple: true
        description: |
          Remove LEN bases from each read (or R1 if paired; use --cut_r2
          option for R2). If LEN is positive, remove bases from the
          beginning. If LEN is negative, remove bases from the end.
          Can be used twice if LENs have different signs. Applied
          *before* adapter trimming.
      - name: --cut_r2
        type: integer
        multiple: true
        description: |
          Remove LEN bases from each read (for R2). If LEN is positive, remove bases from the
          beginning. If LEN is negative, remove bases from the end.
          Can be used twice if LENs have different signs. Applied
          *before* adapter trimming.
      - name: --nextseq_trim
        type: string
        description: |
          NextSeq-specific quality trimming (each read). Trims also
          dark cycles appearing as high-quality G bases.
      - name: --quality_cutoff
        alternatives: [-q]
        type: string
        description: |
          Trim low-quality bases from 5' and/or 3' ends of each read
          before adapter removal. Applied to both reads if data is
          paired. If one value is given, only the 3' end is trimmed.
          If two comma-separated cutoffs are given, the 5' end is
          trimmed with the first cutoff, the 3' end with the second.
      - name: --quality_cutoff_r2
        alternatives: [-Q]
        type: string
        description: |
          Quality-trimming cutoff for R2. Default: same as for R1
      - name: --quality_base
        type: integer
        description: |
          Assume that quality values in FASTQ are encoded as
          ascii(quality + N). This needs to be set to 64 for some
          old Illumina FASTQ files. The default is 33.
        example: 33
      - name: --poly_a
        type: boolean_true
        description: Trim poly-A tails
      - name: --length
        alternatives: [-l]
        type: integer
        description: |
          Shorten reads to LENGTH. Positive values remove bases at
          the end while negative ones remove bases at the beginning.
          This and the following modifications are applied after
          adapter trimming.
      - name: --trim_n
        type: boolean_true
        description: Trim N's on ends of reads.
      - name: --length_tag
        type: string
        description: |
          Search for TAG followed by a decimal number in the
          description field of the read. Replace the decimal number
          with the correct length of the trimmed read. For example,
          use --length_tag 'length=' to correct fields like
          'length=123'.
        example: "length="
      - name: --strip_suffix
        type: string
        description: |
          Remove this suffix from read names if present. Can be
          given multiple times.
      - name: --prefix
        alternatives: [-x]
        type: string
        description: |
          Add this prefix to read names. Use {name} to insert the
          name of the matching adapter.
      - name: --suffix
        alternatives: [-y]
        type: string
        description: |
          Add this suffix to read names; can also include {name}
      - name: --rename
        type: string
        description: |
          Rename reads using TEMPLATE containing variables such as
          {id}, {adapter_name} etc. (see documentation)
      - name: --zero_cap
        alternatives: [-z]
        type: boolean_true
        description: Change negative quality values to zero.

  ####################################################################
  - name: Filtering of processed reads
    description: |
      Filters are applied after above read modifications. Paired-end reads are
      always discarded pairwise (see also --pair_filter).
    arguments:
      - name: --minimum_length
        alternatives: [-m]
        type: string
        description: |
          Discard reads shorter than LEN. Default is 0.
          When trimming paired-end reads, the minimum lengths for R1 and R2 can be specified separately by separating them with a colon (:).
          If the colon syntax is not used, the same minimum length applies to both reads.
          Also, one of the values can be omitted to impose no restrictions.
          For example, with -m 17:, the length of R1 must be at least 17, but the length of R2 is ignored.
        example: "0"
      - name: --maximum_length
        alternatives: [-M]
        type: string
        description: |
          Discard reads longer than LEN. Default: no limit.
          For paired reads, see the remark for --minimum_length
      - name: --max_n
        type: string
        description: |
          Discard reads with more than COUNT 'N' bases. If COUNT is
          a number between 0 and 1, it is interpreted as a fraction
          of the read length.
      - name: --max_expected_errors
        alternatives: [--max_ee]
        type: double
        description: |
          Discard reads whose expected number of errors (computed
          from quality values) exceeds ERRORS.
      - name: --max_average_error_rate
        alternatives: [--max_aer]
        type: double
        description: |
          As --max_expected_errors (see above), but divided by
          length to account for reads of varying length.
      - name: --discard_trimmed
        alternatives: [--discard]
        type: boolean_true
        description: |
          Discard reads that contain an adapter. Use also -O to
          avoid discarding too many randomly matching reads.
      - name: --discard_untrimmed
        alternatives: [--trimmed_only]
        type: boolean_true
        description: |
          Discard reads that do not contain an adapter.
      - name: --discard_casava
        type: boolean_true
        description: |
          Discard reads that did not pass CASAVA filtering (header
          has :Y:).

  ####################################################################
  - name: Output parameters
    arguments:
      - name: --report
        type: string
        choices: [full, minimal]
        description: |
          Which type of report to print: 'full' (default) or 'minimal'.
        example: full
      - name: --json
        type: boolean_true
        description: |
          Write report in JSON format to this file.
      - name: --output
        type: file
        description: |
          Glob pattern for matching the expected output files.
          Should include `$output_dir`.
        example: "fastq/*_001.fast[a,q]"
        direction: output
        required: true
        must_exist: true
        multiple: true
      - name: --fasta
        type: boolean_true
        description: |
          Output FASTA to standard output even on FASTQ input.
      - name: --info_file
        type: boolean_true
        description: |
          Write information about each read and its adapter matches
          into info.txt in the output directory.
          See the documentation for the file format.
      # - name: -Z
      # - name: --rest_file
      # - name: --wildcard-file
      # - name: --too_short_output
      # - name: --too_long_output
      # - name: --untrimmed_output
      # - name: --untrimmed_paired_output
      # - name: too_short_paired_output
      # - name: too_long_paired_output
  - name: Debug
    arguments:
      - type: boolean_true
        name: --debug
        description: Print debug information
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh

engines:
  - type: docker
    image: python:3.12
    setup:
      - type: python
        pip:
          - cutadapt
      - type: docker
        run: |
          cutadapt --version | sed 's/\(.*\)/cutadapt: "\1"/' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
|
||||
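Several arguments in the config above are declared with `multiple: true`. Judging by the `par_adapter` sample value in `src/cutadapt/script.sh` and the `multiple_sep` entry in the CHANGELOG, such values appear to reach the script as one `;`-separated string. The sketch below shows one way to turn such a string into repeated flags. The helper name `expand_multi` and the two short adapter sequences are made up for illustration and are not part of the component.

```bash
#!/bin/bash

# Hypothetical helper (not part of the component): print "<flag> <value>" once
# for every ';'-separated item in the first argument.
# The body runs in a subshell "( ... )" so the IFS and globbing changes below
# do not leak into the caller.
expand_multi() (
  values=$1
  flag=$2
  out=""

  # Nothing to expand for an empty value.
  [ -n "$values" ] || exit 0

  IFS=';'   # split on ';' only
  set -f    # keep characters such as '*' literal while splitting
  for item in $values; do
    out="${out:+$out }$flag $item"
  done
  printf '%s\n' "$out"
)

# Assumed input: two made-up adapter sequences joined by ';'.
expand_multi 'ACGT;TTGCA' '--adapter'
```

Splitting on `;` alone means values that contain spaces or colons stay in one piece, which seems to be the motivation the CHANGELOG gives for moving away from `:` as the separator.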
218
src/cutadapt/help.txt
Normal file
@@ -0,0 +1,218 @@
cutadapt version 4.6

Copyright (C) 2010 Marcel Martin <marcel.martin@scilifelab.se> and contributors

Cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. All reads from input.fastq will be written to
output.fastq with the adapter sequence removed. Adapter matching is
error-tolerant. Multiple adapter sequences can be given (use further -a
options), but only the best-matching adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.

Citation:

Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200

Run "cutadapt --help" to see all command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.

Options:
  -h, --help            Show this help message and exit
  --version             Show version number and exit
  --debug               Print debug log. Use twice to also print DP matrices
  -j CORES, --cores CORES
                        Number of CPU cores to use. Use 0 to auto-detect. Default:
                        1

Finding adapters:
  Parameters -a, -g, -b specify adapters to be removed from each read (or from
  R1 if data is paired-end. If specified multiple times, only the best matching
  adapter is trimmed (but see the --times option). Use notation 'file:FILE' to
  read adapter sequences from a FASTA file.

  -a ADAPTER, --adapter ADAPTER
                        Sequence of an adapter ligated to the 3' end (paired data:
                        of the first read). The adapter and subsequent bases are
                        trimmed. If a '$' character is appended ('anchoring'), the
                        adapter is only found if it is a suffix of the read.
  -g ADAPTER, --front ADAPTER
                        Sequence of an adapter ligated to the 5' end (paired data:
                        of the first read). The adapter and any preceding bases
                        are trimmed. Partial matches at the 5' end are allowed. If
                        a '^' character is prepended ('anchoring'), the adapter is
                        only found if it is a prefix of the read.
  -b ADAPTER, --anywhere ADAPTER
                        Sequence of an adapter that may be ligated to the 5' or 3'
                        end (paired data: of the first read). Both types of
                        matches as described under -a and -g are allowed. If the
                        first base of the read is part of the match, the behavior
                        is as with -g, otherwise as with -a. This option is mostly
                        for rescuing failed library preparations - do not use if
                        you know which end your adapter was ligated to!
  -e E, --error-rate E, --errors E
                        Maximum allowed error rate (if 0 <= E < 1), or absolute
                        number of errors for full-length adapter match (if E is an
                        integer >= 1). Error rate = no. of errors divided by
                        length of matching region. Default: 0.1 (10%)
  --no-indels           Allow only mismatches in alignments. Default: allow both
                        mismatches and indels
  -n COUNT, --times COUNT
                        Remove up to COUNT adapters from each read. Default: 1
  -O MINLENGTH, --overlap MINLENGTH
                        Require MINLENGTH overlap between read and adapter for an
                        adapter to be found. Default: 3
  --match-read-wildcards
                        Interpret IUPAC wildcards in reads. Default: False
  -N, --no-match-adapter-wildcards
                        Do not interpret IUPAC wildcards in adapters.
  --action {trim,retain,mask,lowercase,none}
                        What to do if a match was found. trim: trim adapter and
                        up- or downstream sequence; retain: trim, but retain
                        adapter; mask: replace with 'N' characters; lowercase:
                        convert to lowercase; none: leave unchanged. Default: trim
  --rc, --revcomp       Check both the read and its reverse complement for adapter
                        matches. If match is on reverse-complemented version,
                        output that one. Default: check only read

Additional read modifications:
  -u LEN, --cut LEN     Remove LEN bases from each read (or R1 if paired; use -U
                        option for R2). If LEN is positive, remove bases from the
                        beginning. If LEN is negative, remove bases from the end.
                        Can be used twice if LENs have different signs. Applied
                        *before* adapter trimming.
  --nextseq-trim 3'CUTOFF
                        NextSeq-specific quality trimming (each read). Trims also
                        dark cycles appearing as high-quality G bases.
  -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff [5'CUTOFF,]3'CUTOFF
                        Trim low-quality bases from 5' and/or 3' ends of each read
                        before adapter removal. Applied to both reads if data is
                        paired. If one value is given, only the 3' end is trimmed.
                        If two comma-separated cutoffs are given, the 5' end is
                        trimmed with the first cutoff, the 3' end with the second.
  --quality-base N      Assume that quality values in FASTQ are encoded as
                        ascii(quality + N). This needs to be set to 64 for some
                        old Illumina FASTQ files. Default: 33
  --poly-a              Trim poly-A tails
  --length LENGTH, -l LENGTH
                        Shorten reads to LENGTH. Positive values remove bases at
                        the end while negative ones remove bases at the beginning.
                        This and the following modifications are applied after
                        adapter trimming.
  --trim-n              Trim N's on ends of reads.
  --length-tag TAG      Search for TAG followed by a decimal number in the
                        description field of the read. Replace the decimal number
                        with the correct length of the trimmed read. For example,
                        use --length-tag 'length=' to correct fields like
                        'length=123'.
  --strip-suffix STRIP_SUFFIX
                        Remove this suffix from read names if present. Can be
                        given multiple times.
  -x PREFIX, --prefix PREFIX
                        Add this prefix to read names. Use {name} to insert the
                        name of the matching adapter.
  -y SUFFIX, --suffix SUFFIX
                        Add this suffix to read names; can also include {name}
  --rename TEMPLATE     Rename reads using TEMPLATE containing variables such as
                        {id}, {adapter_name} etc. (see documentation)
  --zero-cap, -z        Change negative quality values to zero.

Filtering of processed reads:
  Filters are applied after above read modifications. Paired-end reads are
  always discarded pairwise (see also --pair-filter).

  -m LEN[:LEN2], --minimum-length LEN[:LEN2]
                        Discard reads shorter than LEN. Default: 0
  -M LEN[:LEN2], --maximum-length LEN[:LEN2]
                        Discard reads longer than LEN. Default: no limit
  --max-n COUNT         Discard reads with more than COUNT 'N' bases. If COUNT is
                        a number between 0 and 1, it is interpreted as a fraction
                        of the read length.
  --max-expected-errors ERRORS, --max-ee ERRORS
                        Discard reads whose expected number of errors (computed
                        from quality values) exceeds ERRORS.
  --max-average-error-rate ERROR_RATE, --max-aer ERROR_RATE
                        as --max-expected-errors (see above), but divided by
                        length to account for reads of varying length.
  --discard-trimmed, --discard
                        Discard reads that contain an adapter. Use also -O to
                        avoid discarding too many randomly matching reads.
  --discard-untrimmed, --trimmed-only
                        Discard reads that do not contain an adapter.
  --discard-casava      Discard reads that did not pass CASAVA filtering (header
                        has :Y:).

Output:
  --quiet               Print only error messages.
  --report {full,minimal}
                        Which type of report to print: 'full' or 'minimal'.
                        Default: full
  --json FILE           Dump report in JSON format to FILE
  -o FILE, --output FILE
                        Write trimmed reads to FILE. FASTQ or FASTA format is
                        chosen depending on input. Summary report is sent to
                        standard output. Use '{name}' for demultiplexing (see
                        docs). Default: write to standard output
  --fasta               Output FASTA to standard output even on FASTQ input.
  -Z                    Use compression level 1 for gzipped output files (faster,
                        but uses more space)
  --info-file FILE      Write information about each read and its adapter matches
                        into FILE. See the documentation for the file format.
  -r FILE, --rest-file FILE
                        When the adapter matches in the middle of a read, write
                        the rest (after the adapter) to FILE.
  --wildcard-file FILE  When the adapter has N wildcard bases, write adapter bases
                        matching wildcard positions to FILE. (Inaccurate with
                        indels.)
  --too-short-output FILE
                        Write reads that are too short (according to length
                        specified by -m) to FILE. Default: discard reads
  --too-long-output FILE
                        Write reads that are too long (according to length
                        specified by -M) to FILE. Default: discard reads
  --untrimmed-output FILE
                        Write reads that do not contain any adapter to FILE.
                        Default: output to same file as trimmed reads

Paired-end options:
  The -A/-G/-B/-U/-Q options work like their lowercase counterparts, but are
  applied to R2 (second read in pair)

  -A ADAPTER            3' adapter to be removed from R2
  -G ADAPTER            5' adapter to be removed from R2
  -B ADAPTER            5'/3 adapter to be removed from R2
  -U LENGTH             Remove LENGTH bases from R2
  -Q [5'CUTOFF,]3'CUTOFF
                        Quality-trimming cutoff for R2. Default: same as for R1
  -p FILE, --paired-output FILE
                        Write R2 to FILE.
  --pair-adapters       Treat adapters given with -a/-A etc. as pairs. Either both
                        or none are removed from each read pair.
  --pair-filter {any,both,first}
                        Which of the reads in a paired-end read have to match the
                        filtering criterion in order for the pair to be filtered.
                        Default: any
  --interleaved         Read and/or write interleaved paired-end reads.
  --untrimmed-paired-output FILE
                        Write second read in a pair to this FILE when no adapter
                        was found. Use with --untrimmed-output. Default: output to
                        same file as trimmed reads
  --too-short-paired-output FILE
                        Write second read in a pair to this file if pair is too
                        short.
  --too-long-paired-output FILE
                        Write second read in a pair to this file if pair is too
                        long.
|
||||
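The `-e` entry in the help text above defines the error rate as the number of errors divided by the length of the matching region. As a rough illustration of what that means for short matches, the sketch below computes how many errors a given rate would tolerate. It assumes the count is rounded down, which the help text does not state, and the helper name `allowed_errors` is made up for this example.

```bash
#!/bin/bash

# Hypothetical helper (not part of cutadapt): number of errors tolerated for a
# match of $1 bases at an error rate of $2 percent, assuming the result is
# rounded down. The rate is passed as a whole-number percentage so that plain
# shell integer arithmetic is enough.
allowed_errors() {
  echo $(( $1 * $2 / 100 ))
}

allowed_errors 9 10    # 9-base match at the default rate of 10%
allowed_errors 20 10   # 20-base match at the default rate of 10%
```

Under that assumption a 9-base match tolerates no errors at the default rate, while a 20-base match tolerates two, which is why short partial matches are effectively exact.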
258
src/cutadapt/script.sh
Normal file
@@ -0,0 +1,258 @@
#!/bin/bash

## VIASH START
par_adapter='AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;GGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
par_input='src/cutadapt/test_data/se/a.fastq'
par_report='full'
par_json='false'
par_fasta='false'
par_info_file='false'
par_debug='true'
## VIASH END

function debug {
  [[ "$par_debug" == "true" ]] && echo "DEBUG: $*"
}

output_dir=$(dirname "$par_output")
[[ ! -d "$output_dir" ]] && mkdir -p "$output_dir"

# Init
###########################################################

echo ">> Paired-end data or not?"

mode=""
if [[ -z $par_input_r2 ]]; then
  mode="se"
  echo " Single end"
  input="$par_input"
else
  echo " Paired end"
  mode="pe"
  input="$par_input $par_input_r2"
fi

# Adapter arguments
# - paired and single-end
# - string and fasta
###########################################################

function add_flags {
  local arg=$1
  local flag=$2
  local prefix=$3
  [[ -z $prefix ]] && prefix=""

  # This function should not be called if the input is empty
  # but check for it just in case
  if [[ -z $arg ]]; then
    return
  fi

  local output=""
  IFS=';' read -r -a array <<< "$arg"
  for a in "${array[@]}"; do
    output="$output $flag $prefix$a"
  done
  echo $output
}

debug ">> Parsing arguments dealing with adapters"
# The *_r2_fasta variable names follow the argument names declared in
# config.vsh.yaml (--adapter_r2_fasta, --front_r2_fasta, --anywhere_r2_fasta).
adapter_args=$(echo \
  ${par_adapter:+$(add_flags "$par_adapter" "--adapter")} \
  ${par_adapter_fasta:+$(add_flags "$par_adapter_fasta" "--adapter" "file:")} \
  ${par_front:+$(add_flags "$par_front" "--front")} \
  ${par_front_fasta:+$(add_flags "$par_front_fasta" "--front" "file:")} \
  ${par_anywhere:+$(add_flags "$par_anywhere" "--anywhere")} \
  ${par_anywhere_fasta:+$(add_flags "$par_anywhere_fasta" "--anywhere" "file:")} \
  ${par_adapter_r2:+$(add_flags "$par_adapter_r2" "-A")} \
  ${par_adapter_r2_fasta:+$(add_flags "$par_adapter_r2_fasta" "-A" "file:")} \
  ${par_front_r2:+$(add_flags "$par_front_r2" "-G")} \
  ${par_front_r2_fasta:+$(add_flags "$par_front_r2_fasta" "-G" "file:")} \
  ${par_anywhere_r2:+$(add_flags "$par_anywhere_r2" "-B")} \
  ${par_anywhere_r2_fasta:+$(add_flags "$par_anywhere_r2_fasta" "-B" "file:")} \
)

debug "Arguments to cutadapt:"
debug "$adapter_args"
debug

# Paired-end options
###########################################################
echo ">> Parsing arguments for paired-end reads"
[[ "$par_pair_adapters" == "false" ]] && unset par_pair_adapters
[[ "$par_interleaved" == "false" ]] && unset par_interleaved

paired_args=$(echo \
  ${par_pair_adapters:+--pair-adapters} \
  ${par_pair_filter:+--pair-filter "${par_pair_filter}"} \
  ${par_interleaved:+--interleaved}
)
debug "Arguments to cutadapt:"
debug "$paired_args"
debug

# Input arguments
###########################################################
echo ">> Parsing input arguments"
[[ "$par_no_indels" == "true" ]] && unset par_no_indels
[[ "$par_match_read_wildcards" == "false" ]] && unset par_match_read_wildcards
[[ "$par_no_match_adapter_wildcards" == "true" ]] && unset par_no_match_adapter_wildcards
|
||||
[[ "$par_revcomp" == "false" ]] && unset par_revcomp
|
||||
|
||||
input_args=$(echo \
|
||||
${par_error_rate:+--error-rate "${par_error_rate}"} \
|
||||
${par_no_indels:+--no-indels} \
|
||||
${par_times:+--times "${par_times}"} \
|
||||
${par_overlap:+--overlap "${par_overlap}"} \
|
||||
${par_match_read_wildcards:+--match-read-wildcards} \
|
||||
${par_no_match_adapter_wildcards:+--no-match-adapter-wildcards} \
|
||||
${par_action:+--action "${par_action}"} \
|
||||
${par_revcomp:+--revcomp} \
|
||||
)
|
||||
debug "Arguments to cutadapt:"
|
||||
debug $input_args
|
||||
debug
|
||||
|
||||
# Read modifications
|
||||
###########################################################
|
||||
echo ">> Parsing read modification arguments"
|
||||
[[ "$par_poly_a" == "false" ]] && unset par_poly_a
|
||||
[[ "$par_trim_n" == "false" ]] && unset par_trim_n
|
||||
[[ "$par_zero_cap" == "false" ]] && unset par_zero_cap
|
||||
|
||||
mod_args=$(echo \
|
||||
${par_cut:+--cut "${par_cut}"} \
|
||||
${par_cut_r2:+-U "${par_cut_r2}"} \
|
||||
${par_nextseq_trim:+--nextseq-trim "${par_nextseq_trim}"} \
|
||||
${par_quality_cutoff:+--quality-cutoff "${par_quality_cutoff}"} \
|
||||
${par_quality_cutoff_r2:+-Q "${par_quality_cutoff_r2}"} \
|
||||
${par_quality_base:+--quality-base "${par_quality_base}"} \
|
||||
${par_poly_a:+--poly-a} \
|
||||
${par_length:+--length "${par_length}"} \
|
||||
${par_trim_n:+--trim-n} \
|
||||
${par_length_tag:+--length-tag "${par_length_tag}"} \
|
||||
${par_strip_suffix:+--strip-suffix "${par_strip_suffix}"} \
|
||||
${par_prefix:+--prefix "${par_prefix}"} \
|
||||
${par_suffix:+--suffix "${par_suffix}"} \
|
||||
${par_rename:+--rename "${par_rename}"} \
|
||||
${par_zero_cap:+--zero-cap} \
|
||||
)
|
||||
debug "Arguments to cutadapt:"
|
||||
debug $mod_args
|
||||
debug
|
||||
|
||||
# Filtering of processed reads arguments
|
||||
###########################################################
|
||||
echo ">> Filtering of processed reads arguments"
|
||||
[[ "$par_discard_trimmed" == "false" ]] && unset par_discard_trimmed
|
||||
[[ "$par_discard_untrimmed" == "false" ]] && unset par_discard_untrimmed
|
||||
[[ "$par_discard_casava" == "false" ]] && unset par_discard_casava
|
||||
|
||||
# Parse and transform the minimum and maximum length arguments
|
||||
[[ -z $par_minimum_length ]]
|
||||
|
||||
filter_args=$(echo \
|
||||
${par_minimum_length:+--minimum-length "${par_minimum_length}"} \
|
||||
${par_maximum_length:+--maximum-length "${par_maximum_length}"} \
|
||||
${par_max_n:+--max-n "${par_max_n}"} \
|
||||
${par_max_expected_errors:+--max-expected-errors "${par_max_expected_errors}"} \
|
||||
${par_max_average_error_rate:+--max-average-error-rate "${par_max_average_error_rate}"} \
|
||||
${par_discard_trimmed:+--discard-trimmed} \
|
||||
${par_discard_untrimmed:+--discard-untrimmed} \
|
||||
${par_discard_casava:+--discard-casava} \
|
||||
)
|
||||
debug "Arguments to cutadapt:"
|
||||
debug $filter_args
|
||||
debug
|
||||
|
||||
# Optional output arguments
|
||||
###########################################################
|
||||
echo ">> Optional arguments"
|
||||
[[ "$par_json" == "false" ]] && unset par_json
|
||||
[[ "$par_fasta" == "false" ]] && unset par_fasta
|
||||
[[ "$par_info_file" == "false" ]] && unset par_info_file
|
||||
|
||||
optional_output_args=$(echo \
|
||||
${par_report:+--report "${par_report}"} \
|
||||
${par_json:+--json "report.json"} \
|
||||
${par_fasta:+--fasta} \
|
||||
${par_info_file:+--info-file "info.txt"} \
|
||||
)
|
||||
|
||||
debug "Arguments to cutadapt:"
|
||||
debug $optional_output_args
|
||||
debug
|
||||
|
||||
# Output arguments
|
||||
# We write the output to a directory rather than
|
||||
# individual files.
|
||||
###########################################################
|
||||
|
||||
if [[ -z $par_fasta ]]; then
|
||||
ext="fastq"
|
||||
else
|
||||
ext="fasta"
|
||||
fi
|
||||
|
||||
demultiplex_mode="$par_demultiplex_mode"
|
||||
if [[ $mode == "se" ]]; then
|
||||
if [[ "$demultiplex_mode" == "unique_dual" ]] || [[ "$demultiplex_mode" == "combinatorial_dual" ]]; then
|
||||
echo "Demultiplexing dual indexes is not possible with single-end data."
|
||||
exit 1
|
||||
fi
|
||||
prefix="trimmed_"
|
||||
if [[ ! -z "$demultiplex_mode" ]]; then
|
||||
prefix="{name}_"
|
||||
fi
|
||||
output_args=$(echo \
|
||||
--output "$output_dir/${prefix}001.$ext" \
|
||||
)
|
||||
else
|
||||
demultiplex_indicator_r1='{name}_'
|
||||
demultiplex_indicator_r2=$demultiplex_indicator_r1
|
||||
if [[ "$demultiplex_mode" == "combinatorial_dual" ]]; then
|
||||
demultiplex_indicator_r1='{name1}_{name2}_'
|
||||
demultiplex_indicator_r2='{name1}_{name2}_'
|
||||
fi
|
||||
prefix_r1="trimmed_"
|
||||
prefix_r2="trimmed_"
|
||||
if [[ ! -z "$demultiplex_mode" ]]; then
|
||||
prefix_r1=$demultiplex_indicator_r1
|
||||
prefix_r2=$demultiplex_indicator_r2
|
||||
fi
|
||||
output_args=$(echo \
|
||||
--output "$output_dir/${prefix_r1}R1_001.$ext" \
|
||||
--paired-output "$output_dir/${prefix_r2}R2_001.$ext" \
|
||||
)
|
||||
fi
|
||||
|
||||
debug "Arguments to cutadapt:"
|
||||
debug $output_args
|
||||
debug
|
||||
|
||||
# Full CLI
|
||||
# Set the --cores argument to 0 unless meta_cpus is set
|
||||
###########################################################
|
||||
echo ">> Running cutadapt"
|
||||
par_cpus=0
|
||||
[[ ! -z $meta_cpus ]] && par_cpus=$meta_cpus
|
||||
|
||||
cli=$(echo \
|
||||
$input \
|
||||
$adapter_args \
|
||||
$paired_args \
|
||||
$input_args \
|
||||
$mod_args \
|
||||
$filter_args \
|
||||
$optional_output_args \
|
||||
$output_args \
|
||||
--cores $par_cpus
|
||||
)
|
||||
|
||||
debug ">> Full CLI to be run:"
|
||||
debug cutadapt $cli | sed -e 's/--/\r\n --/g'
|
||||
debug
|
||||
|
||||
cutadapt $cli
|
||||
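The `add_flags` helper in the script above turns one `;`-separated Viash value into a repeated cutadapt flag, optionally prefixing each element (for example with `file:`). The following is a minimal standalone sketch of that pattern, not the component's exact code: it uses `printf` instead of an unquoted `echo`, and the adapter strings and file names are made-up illustrations.

```shell
#!/bin/bash
# Sketch: split a ';'-separated value and repeat a flag for every element.
# Usage: add_flags VALUE FLAG [PREFIX]
add_flags() {
  local arg=$1 flag=$2 prefix=${3:-}
  local output="" a
  local -a array=()

  # Nothing to do for an empty value.
  [[ -z $arg ]] && return 0

  # IFS is only changed for the duration of this read.
  IFS=';' read -r -a array <<< "$arg"
  for a in "${array[@]}"; do
    output="$output $flag $prefix$a"
  done

  # Drop the single leading space added by the loop.
  printf '%s\n' "${output# }"
}

# Illustrative values (hypothetical, not taken from the test data):
add_flags "AAAAA;CCCCC" "--adapter"
add_flags "adapters1.fasta;adapters2.fasta" "--adapter" "file:"
```

Because the result is later expanded unquoted into the cutadapt command line, this approach assumes that adapter strings and paths contain no whitespace.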
261
src/cutadapt/test.sh
Normal file
@@ -0,0 +1,261 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
|
||||
}
|
||||
assert_file_doesnt_exist() {
|
||||
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
|
||||
}
|
||||
assert_file_empty() {
|
||||
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
|
||||
}
|
||||
assert_file_not_contains() {
|
||||
! grep -q "$2" "$1" || (echo "File '$1' contains '$2' but shouldn't" && exit 1)
|
||||
}
|
||||
|
||||
#############################################
|
||||
mkdir test_multiple_output
|
||||
cd test_multiple_output
|
||||
|
||||
echo "#############################################"
|
||||
echo "> Run cutadapt with multiple outputs"
|
||||
|
||||
cat > example.fa <<'EOF'
|
||||
>read1
|
||||
MYSEQUENCEADAPTER
|
||||
>read2
|
||||
MYSEQUENCEADAP
|
||||
>read3
|
||||
MYSEQUENCEADAPTERSOMETHINGELSE
|
||||
>read4
|
||||
MYSEQUENCEADABTER
|
||||
>read5
|
||||
MYSEQUENCEADAPTR
|
||||
>read6
|
||||
MYSEQUENCEADAPPTER
|
||||
>read7
|
||||
ADAPTERMYSEQUENCE
|
||||
>read8
|
||||
PTERMYSEQUENCE
|
||||
>read9
|
||||
SOMETHINGADAPTERMYSEQUENCE
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--report minimal \
|
||||
--output "out_test/*.fasta" \
|
||||
--adapter ADAPTER \
|
||||
--input example.fa \
|
||||
--fasta \
|
||||
--demultiplex_mode single \
|
||||
--no_match_adapter_wildcards \
|
||||
--json
|
||||
|
||||
echo ">> Checking output"
|
||||
assert_file_exists "report.json"
|
||||
assert_file_exists "out_test/1_001.fasta"
|
||||
assert_file_exists "out_test/unknown_001.fasta"
|
||||
|
||||
cd ..
|
||||
echo
|
||||
|
||||
#############################################
|
||||
mkdir test_simple_single_end
|
||||
cd test_simple_single_end
|
||||
|
||||
echo "#############################################"
|
||||
echo "> Run cutadapt on single-end data"
|
||||
|
||||
cat > example.fa <<'EOF'
|
||||
>read1
|
||||
MYSEQUENCEADAPTER
|
||||
>read2
|
||||
MYSEQUENCEADAP
|
||||
>read3
|
||||
MYSEQUENCEADAPTERSOMETHINGELSE
|
||||
>read4
|
||||
MYSEQUENCEADABTER
|
||||
>read5
|
||||
MYSEQUENCEADAPTR
|
||||
>read6
|
||||
MYSEQUENCEADAPPTER
|
||||
>read7
|
||||
ADAPTERMYSEQUENCE
|
||||
>read8
|
||||
PTERMYSEQUENCE
|
||||
>read9
|
||||
SOMETHINGADAPTERMYSEQUENCE
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--report minimal \
|
||||
--output "out_test1/*.fasta" \
|
||||
--adapter ADAPTER \
|
||||
--input example.fa \
|
||||
--demultiplex_mode single \
|
||||
--fasta \
|
||||
--no_match_adapter_wildcards \
|
||||
--json
|
||||
|
||||
echo ">> Checking output"
|
||||
assert_file_exists "report.json"
|
||||
assert_file_exists "out_test1/1_001.fasta"
|
||||
assert_file_exists "out_test1/unknown_001.fasta"
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
assert_file_not_empty "report.json"
|
||||
assert_file_not_empty "out_test1/1_001.fasta"
|
||||
assert_file_not_empty "out_test1/unknown_001.fasta"
|
||||
|
||||
echo ">> Check contents"
|
||||
for i in 1 2 3 7 9; do
|
||||
assert_file_contains "out_test1/1_001.fasta" ">read$i"
|
||||
done
|
||||
for i in 4 5 6 8; do
|
||||
assert_file_contains "out_test1/unknown_001.fasta" ">read$i"
|
||||
done
|
||||
|
||||
cd ..
|
||||
echo
|
||||
|
||||
#############################################
|
||||
mkdir test_multiple_single_end
|
||||
cd test_multiple_single_end
|
||||
|
||||
echo "#############################################"
|
||||
echo "> Run with a combination of inputs"
|
||||
|
||||
cat > example.fa <<'EOF'
|
||||
>read1
|
||||
ACGTACGTACGTAAAAA
|
||||
>read2
|
||||
ACGTACGTACGTCCCCC
|
||||
>read3
|
||||
ACGTACGTACGTGGGGG
|
||||
>read4
|
||||
ACGTACGTACGTTTTTT
|
||||
EOF
|
||||
|
||||
cat > adapters1.fasta <<'EOF'
|
||||
>adapter1
|
||||
CCCCC
|
||||
EOF
|
||||
|
||||
cat > adapters2.fasta <<'EOF'
|
||||
>adapter2
|
||||
GGGGG
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--report minimal \
|
||||
--output "out_test2/*.fasta" \
|
||||
--adapter AAAAA \
|
||||
--adapter_fasta adapters1.fasta \
|
||||
--adapter_fasta adapters2.fasta \
|
||||
--demultiplex_mode single \
|
||||
--input example.fa \
|
||||
--fasta \
|
||||
--json
|
||||
|
||||
echo ">> Checking output"
|
||||
assert_file_exists "report.json"
|
||||
assert_file_exists "out_test2/1_001.fasta"
|
||||
assert_file_exists "out_test2/adapter1_001.fasta"
|
||||
assert_file_exists "out_test2/adapter2_001.fasta"
|
||||
assert_file_exists "out_test2/unknown_001.fasta"
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
assert_file_not_empty "report.json"
|
||||
assert_file_not_empty "out_test2/1_001.fasta"
|
||||
assert_file_not_empty "out_test2/adapter1_001.fasta"
|
||||
assert_file_not_empty "out_test2/adapter2_001.fasta"
|
||||
assert_file_not_empty "out_test2/unknown_001.fasta"
|
||||
|
||||
echo ">> Check contents"
|
||||
assert_file_contains "out_test2/1_001.fasta" ">read1"
|
||||
assert_file_contains "out_test2/adapter1_001.fasta" ">read2"
|
||||
assert_file_contains "out_test2/adapter2_001.fasta" ">read3"
|
||||
assert_file_contains "out_test2/unknown_001.fasta" ">read4"
|
||||
|
||||
cd ..
|
||||
echo
|
||||
|
||||
#############################################
|
||||
mkdir test_simple_paired_end
|
||||
cd test_simple_paired_end
|
||||
|
||||
echo "#############################################"
|
||||
echo "> Run cutadapt on paired-end data"
|
||||
|
||||
cat > example_R1.fastq <<'EOF'
|
||||
@read1
|
||||
ACGTACGTACGTAAAAA
|
||||
+
|
||||
IIIIIIIIIIIIIIIII
|
||||
@read2
|
||||
ACGTACGTACGTCCCCC
|
||||
+
|
||||
IIIIIIIIIIIIIIIII
|
||||
EOF
|
||||
|
||||
cat > example_R2.fastq <<'EOF'
|
||||
@read1
|
||||
ACGTACGTACGTGGGGG
|
||||
+
|
||||
IIIIIIIIIIIIIIIII
|
||||
@read2
|
||||
ACGTACGTACGTTTTTT
|
||||
+
|
||||
IIIIIIIIIIIIIIIII
|
||||
EOF
|
||||
|
||||
"$meta_executable" \
|
||||
--report minimal \
|
||||
--output "out_test3/*.fastq" \
|
||||
--adapter AAAAA \
|
||||
--adapter_r2 GGGGG \
|
||||
--input example_R1.fastq \
|
||||
--input_r2 example_R2.fastq \
|
||||
--quality_cutoff 20 \
|
||||
--demultiplex_mode unique_dual \
|
||||
--json \
|
||||
---cpus 1
|
||||
|
||||
echo ">> Checking output"
|
||||
assert_file_exists "report.json"
|
||||
assert_file_exists "out_test3/1_R1_001.fastq"
|
||||
assert_file_exists "out_test3/1_R2_001.fastq"
|
||||
assert_file_exists "out_test3/unknown_R1_001.fastq"
|
||||
assert_file_exists "out_test3/unknown_R2_001.fastq"
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
assert_file_not_empty "report.json"
|
||||
assert_file_not_empty "out_test3/1_R1_001.fastq"
|
||||
assert_file_not_empty "out_test3/1_R2_001.fastq"
|
||||
assert_file_not_empty "out_test3/unknown_R1_001.fastq"
|
||||
|
||||
echo ">> Check contents"
|
||||
assert_file_contains "out_test3/1_R1_001.fastq" "@read1"
|
||||
assert_file_contains "out_test3/1_R2_001.fastq" "@read1"
|
||||
assert_file_contains "out_test3/unknown_R1_001.fastq" "@read2"
|
||||
assert_file_contains "out_test3/unknown_R2_001.fastq" "@read2"
|
||||
|
||||
cd ..
|
||||
echo
|
||||
|
||||
#############################################
|
||||
|
||||
echo "#############################################"
|
||||
echo "> Test successful"
|
||||
|
||||
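A note on the assertion helpers in this test script: under `set -e`, a helper must return 0 when its check passes. For a negative check such as `assert_file_not_contains`, that means inverting `grep` before the `||` fallback. A small sketch of the idea, using a temporary file with made-up contents (the helper here uses `return 1` in a brace group rather than a subshell; either works for this purpose):

```shell
#!/bin/bash
set -eo pipefail

# Returns 0 when the pattern is absent; prints a message and returns 1
# when the pattern is present.
assert_file_not_contains() {
  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't"; return 1; }
}

# Hypothetical example file.
tmp=$(mktemp)
printf '>read1\nACGT\n' > "$tmp"

# Passes silently, so the script keeps running under 'set -e'.
assert_file_not_contains "$tmp" ">read2"
echo "ok: >read2 is absent"

rm -f "$tmp"
```

Written as `grep -q ... && (echo ...; exit 1)` instead, the helper would return a non-zero status whenever the pattern is absent, and `set -e` would abort the test on a passing check.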
196
src/falco/config.vsh.yaml
Normal file
@@ -0,0 +1,196 @@
|
||||
name: falco
|
||||
description: A C++ drop-in replacement of FastQC to assess the quality of sequence read data
|
||||
keywords: [qc, fastqc, sequencing]
|
||||
links:
|
||||
documentation: https://falco.readthedocs.io/en/latest/
|
||||
repository: https://github.com/smithlabcode/falco
|
||||
references:
|
||||
doi: 10.12688/f1000research.21142.2
|
||||
license: GPL-3.0
|
||||
requirements:
|
||||
commands: [falco]
|
||||
|
||||
# Notes:
|
||||
# - falco has single-dash long options such as -subsample; we expose those as --subsample
|
||||
# - The outdir argument is not required
|
||||
# - The input argument in falco is positional but we changed this to --input
|
||||
argument_groups:
|
||||
- name: Input arguments
|
||||
arguments:
|
||||
- name: --input
|
||||
required: true
|
||||
type: file
|
||||
multiple: true
|
||||
description: input fastq files
|
||||
example: input1.fastq;input2.fastq
|
||||
|
||||
- name: Run arguments
|
||||
arguments:
|
||||
- name: --nogroup
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable grouping of bases for reads >50bp.
|
||||
All reports will show data for every base in
|
||||
the read. WARNING: When using this option,
|
||||
your plots may end up a ridiculous size. You
|
||||
have been warned!
|
||||
- name: --contaminants
|
||||
type: file
|
||||
description: |
|
||||
Specifies a non-default file which contains
|
||||
the list of contaminants to screen
|
||||
overrepresented sequences against. The file
|
||||
must contain sets of named contaminants in
|
||||
the form name[tab]sequence. Lines prefixed
|
||||
with a hash will be ignored. Default:
|
||||
https://github.com/smithlabcode/falco/blob/v1.2.2/Configuration/contaminant_list.txt
|
||||
- name: --adapters
|
||||
type: file
|
||||
description: |
|
||||
Specifies a non-default file which contains
|
||||
the list of adapter sequences which will be
|
||||
explicitly searched against the library. The
|
||||
file must contain sets of named adapters in
|
||||
the form name[tab]sequence. Lines prefixed
|
||||
with a hash will be ignored. Default:
|
||||
https://github.com/smithlabcode/falco/blob/v1.2.2/Configuration/adapter_list.txt
|
||||
- name: --limits
|
||||
type: file
|
||||
description: |
|
||||
Specifies a non-default file which contains
|
||||
a set of criteria which will be used to
|
||||
determine the warn/error limits for the
|
||||
various modules. This file can also be used
|
||||
to selectively remove some modules from the
|
||||
output all together. The format needs to
|
||||
mirror the default limits.txt file found in
|
||||
the Configuration folder. Default:
|
||||
https://github.com/smithlabcode/falco/blob/v1.2.2/Configuration/limits.txt
|
||||
- name: --subsample
|
||||
alternatives: [-s]
|
||||
type: integer
|
||||
example: 10
|
||||
description: |
|
||||
[Falco only] makes falco faster (but
|
||||
possibly less accurate) by only processing
|
||||
reads that are a multiple of this value (using
|
||||
0-based indexing to number reads).
|
||||
- name: --bisulfite
|
||||
alternatives: [-b]
|
||||
type: boolean_true
|
||||
description: |
|
||||
[Falco only] reads are whole genome
|
||||
bisulfite sequencing, and more Ts and fewer
|
||||
Cs are therefore expected and will be
|
||||
accounted for in base content.
|
||||
- name: --reverse_complement
|
||||
alternatives: [-r]
|
||||
type: boolean_true
|
||||
description: |
|
||||
[Falco only] The input is a
|
||||
reverse-complement. All modules will be
|
||||
tested by swapping A/T and C/G
|
||||
|
||||
- name: Output arguments
|
||||
arguments:
|
||||
- name: --outdir
|
||||
alternatives: [-o]
|
||||
required: true
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Create all output files in the specified
|
||||
output directory. FALCO-SPECIFIC: If the
|
||||
directory does not exist, the program will
|
||||
create it.
|
||||
example: output
|
||||
- name: --format
|
||||
type: string
|
||||
choices: [bam, sam, bam_mapped, sam_mapped, fastq, fq, fastq.gz, fq.gz]
|
||||
alternatives: ["-f"]
|
||||
description: |
|
||||
Bypasses the normal sequence file format
|
||||
detection and forces the program to use the
|
||||
specified format. Valid formats are bam, sam,
|
||||
bam_mapped, sam_mapped, fastq, fq, fastq.gz
|
||||
or fq.gz.
|
||||
- name: --data_filename
|
||||
alternatives: [-D]
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
[Falco only] Specify filename for FastQC
data output (TXT). If not specified, it will
be called fastqc_data.txt in either the input
file's directory or the one specified in the
--outdir flag. Only available when running
falco with a single input.
|
||||
- name: --report_filename
|
||||
alternatives: [-R]
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
[Falco only] Specify filename for FastQC
report output (HTML). If not specified, it
will be called fastqc_report.html in either
the input file's directory or the one
specified in the --outdir flag. Only
available when running falco with a single
input.
|
||||
- name: --summary_filename
|
||||
alternatives: [-S]
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
[Falco only] Specify filename for the short
summary output (TXT). If not specified, it
will be called summary.txt in either
the input file's directory or the one
specified in the --outdir flag. Only
available when running falco with a single
input.
|
||||
|
||||
# Arguments not taken into account:
|
||||
#
|
||||
# -skip-data [Falco only] Do not create FastQC data text
|
||||
# file.
|
||||
# -skip-report [Falco only] Do not create FastQC report
|
||||
# HTML file.
|
||||
# -skip-summary [Falco only] Do not create FastQC summary
|
||||
# file
|
||||
# -K, -add-call [Falco only] add the command call to
|
||||
# FastQC data output and FastQC report HTML
|
||||
# (this may break the parse of fastqc_data.txt
|
||||
# in programs that are very strict about the
|
||||
# FastQC output format).
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:trixie-slim
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [wget, build-essential, g++, zlib1g-dev, procps]
|
||||
- type: docker
|
||||
run: |
|
||||
wget https://github.com/smithlabcode/falco/releases/download/v1.2.2/falco-1.2.2.tar.gz -O /tmp/falco.tar.gz && \
|
||||
cd /tmp && \
|
||||
tar xvf falco.tar.gz && \
|
||||
cd falco-1.2.2 && \
|
||||
./configure && \
|
||||
make all && \
|
||||
make install
|
||||
- type: docker
|
||||
run: |
|
||||
echo "falco: \"$(falco -v | sed -n 's/^falco //p')\"" > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
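The `--contaminants` and `--adapters` descriptions above expect a text file with one `name[tab]sequence` entry per line, where lines starting with a hash are ignored. The sketch below builds such a file and counts its usable entries. The entry names are made up for illustration, and the parsing rule is a reading of the description, not falco's actual parser.

```shell
#!/bin/bash
# Count usable entries in a name[tab]sequence list: skip '#' comment lines
# and lines without a non-empty sequence field.
count_entries() {
  awk -F'\t' '!/^#/ && NF >= 2 && $NF != "" { n++ } END { print n + 0 }' "$1"
}

# Hypothetical custom adapter list.
list=$(mktemp)
printf '# hypothetical custom adapter list\n' > "$list"
printf 'Example Adapter A\tAGATCGGAAGAGC\n' >> "$list"
printf 'Example Adapter B\tCTGTCTCTTATA\n' >> "$list"

echo "entries: $(count_entries "$list")"
rm -f "$list"
```

A check like this can be a cheap sanity test before passing a custom list to the component.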
156
src/falco/help.txt
Normal file
@@ -0,0 +1,156 @@
|
||||
Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...
|
||||
|
||||
Options:
|
||||
-h, --help Print this help file and exit
|
||||
-v, --version Print the version of the program and exit
|
||||
-o, --outdir Create all output files in the specified
|
||||
output directory. FALCO-SPECIFIC: If the
|
||||
directory does not exists, the program will
|
||||
create it. If this option is not set then
|
||||
the output file for each sequence file is
|
||||
created in the same directory as the
|
||||
sequence file which was processed.
|
||||
--casava [IGNORED BY FALCO] Files come from raw
|
||||
casava output. Files in the same sample
|
||||
group (differing only by the group number)
|
||||
will be analysed as a set rather than
|
||||
individually. Sequences with the filter flag
|
||||
set in the header will be excluded from the
|
||||
analysis. Files must have the same names
|
||||
given to them by casava (including being
|
||||
gzipped and ending with .gz) otherwise they
|
||||
won't be grouped together correctly.
|
||||
--nano [IGNORED BY FALCO] Files come from nanopore
|
||||
sequences and are in fast5 format. In this
|
||||
mode you can pass in directories to process
|
||||
and the program will take in all fast5 files
|
||||
within those directories and produce a
|
||||
single output file from the sequences found
|
||||
in all files.
|
||||
--nofilter [IGNORED BY FALCO] If running with --casava
|
||||
then don't remove read flagged by casava as
|
||||
poor quality when performing the QC
|
||||
analysis.
|
||||
--extract [ALWAYS ON IN FALCO] If set then the zipped
|
||||
output file will be uncompressed in the same
|
||||
directory after it has been created. By
|
||||
default this option will be set if fastqc is
|
||||
run in non-interactive mode.
|
||||
-j, --java [IGNORED BY FALCO] Provides the full path to
|
||||
the java binary you want to use to launch
|
||||
fastqc. If not supplied then java is assumed
|
||||
to be in your path.
|
||||
--noextract [IGNORED BY FALCO] Do not uncompress the
|
||||
output file after creating it. You should
|
||||
set this option if you do not wish to
|
||||
uncompress the output when running in
|
||||
non-interactive mode.
|
||||
--nogroup Disable grouping of bases for reads >50bp.
|
||||
All reports will show data for every base in
|
||||
the read. WARNING: When using this option,
|
||||
your plots may end up a ridiculous size. You
|
||||
have been warned!
|
||||
--min_length [NOT YET IMPLEMENTED IN FALCO] Sets an
|
||||
artificial lower limit on the length of the
|
||||
sequence to be shown in the report. As long
|
||||
as you set this to a value greater or equal
|
||||
to your longest read length then this will
|
||||
be the sequence length used to create your
|
||||
read groups. This can be useful for making
|
||||
directly comaparable statistics from
|
||||
datasets with somewhat variable read
|
||||
lengths.
|
||||
-f, --format Bypasses the normal sequence file format
|
||||
detection and forces the program to use the
|
||||
specified format. Validformats are bam, sam,
|
||||
bam_mapped, sam_mapped, fastq, fq, fastq.gz
|
||||
or fq.gz.
|
||||
-t, --threads [NOT YET IMPLEMENTED IN FALCO] Specifies the
|
||||
number of files which can be processed
|
||||
simultaneously. Each thread will be
|
||||
allocated 250MB of memory so you shouldn't
|
||||
run more threads than your available memory
|
||||
will cope with, and not more than 6 threads
|
||||
on a 32 bit machine [1]
|
||||
-c, --contaminants Specifies a non-default file which contains
|
||||
the list of contaminants to screen
|
||||
overrepresented sequences against. The file
|
||||
must contain sets of named contaminants in
|
||||
the form name[tab]sequence. Lines prefixed
|
||||
with a hash will be ignored. Default:
|
||||
/tmp/falco-1.2.2/Configuration/contaminant_list.txt
|
||||
-a, --adapters Specifies a non-default file which contains
|
||||
the list of adapter sequences which will be
|
||||
explicity searched against the library. The
|
||||
file must contain sets of named adapters in
|
||||
the form name[tab]sequence. Lines prefixed
|
||||
with a hash will be ignored. Default:
|
||||
/tmp/falco-1.2.2/Configuration/adapter_list.txt
|
||||
-l, --limits Specifies a non-default file which contains
|
||||
a set of criteria which will be used to
|
||||
determine the warn/error limits for the
|
||||
various modules. This file can also be used
|
||||
to selectively remove some modules from the
|
||||
output all together. The format needs to
|
||||
mirror the default limits.txt file found in
|
||||
the Configuration folder. Default:
|
||||
/tmp/falco-1.2.2/Configuration/limits.txt
|
||||
-k, --kmers [IGNORED BY FALCO AND ALWAYS SET TO 7]
|
||||
Specifies the length of Kmer to look for in
|
||||
the Kmer content module. Specified Kmer
|
||||
length must be between 2 and 10. Default
|
||||
length is 7 if not specified.
|
||||
-q, --quiet Supress all progress messages on stdout and
|
||||
only report errors.
|
||||
-d, --dir [IGNORED: FALCO DOES NOT CREATE TMP FILES]
|
||||
Selects a directory to be used for temporary
|
||||
files written when generating report images.
|
||||
Defaults to system temp directory if not
|
||||
specified.
|
||||
-s, -subsample [Falco only] makes falco faster (but
|
||||
possibly less accurate) by only processing
|
||||
reads that are multiple of this value (using
|
||||
0-based indexing to number reads). [1]
|
||||
-b, -bisulfite [Falco only] reads are whole genome
|
||||
bisulfite sequencing, and more Ts and fewer
|
||||
Cs are therefore expected and will be
|
||||
accounted for in base content.
|
||||
-r, -reverse-complement [Falco only] The input is a
|
||||
reverse-complement. All modules will be
|
||||
tested by swapping A/T and C/G
|
||||
-skip-data [Falco only] Do not create FastQC data text
|
||||
file.
|
||||
-skip-report [Falco only] Do not create FastQC report
|
||||
HTML file.
|
||||
-skip-summary [Falco only] Do not create FastQC summary
|
||||
file
|
||||
-D, -data-filename [Falco only] Specify filename for FastQC
|
||||
data output (TXT). If not specified, it will
|
||||
be called fastq_data.txt in either the input
|
||||
file's directory or the one specified in the
|
||||
--output flag. Only available when running
|
||||
falco with a single input.
|
||||
-R, -report-filename [Falco only] Specify filename for FastQC
|
||||
report output (HTML). If not specified, it
|
||||
will be called fastq_report.html in either
|
||||
the input file's directory or the one
|
||||
specified in the --output flag. Only
|
||||
available when running falco with a single
|
||||
input.
|
||||
-S, -summary-filename [Falco only] Specify filename for the short
|
||||
summary output (TXT). If not specified, it
|
||||
will be called fastq_report.html in either
|
||||
the input file's directory or the one
|
||||
specified in the --output flag. Only
|
||||
available when running falco with a single
|
||||
input.
|
||||
-K, -add-call [Falco only] add the command call call to
|
||||
FastQC data output and FastQC report HTML
|
||||
(this may break the parse of fastqc_data.txt
|
||||
in programs that are very strict about the
|
||||
FastQC output format).
|
||||
|
||||
Help options:
|
||||
-?, -help print this help message
|
||||
-about print about message
|
||||
|
||||
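The `-subsample` help text says that only reads whose 0-based index is a multiple of the given value are processed. The sketch below illustrates that selection rule with `awk` on a tiny generated FASTQ file. It is an illustration of the rule as worded in the help text, not falco's implementation, and it assumes plain four-line FASTQ records.

```shell
#!/bin/bash
# Print FASTQ records whose 0-based record index is a multiple of N.
# Usage: subsample_fastq N FILE
subsample_fastq() {
  awk -v n="$1" '{ rec = int((NR - 1) / 4) } rec % n == 0' "$2"
}

# Generate five small records named read0..read4 (made-up data).
fq=$(mktemp)
for i in 0 1 2 3 4; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$fq"

# With N=2 this keeps read0, read2 and read4.
subsample_fastq 2 "$fq" | grep '^@read'
rm -f "$fq"
```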
24
src/falco/script.sh
Normal file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
[[ "$par_nogroup" == "false" ]] && unset par_nogroup
[[ "$par_bisulfite" == "false" ]] && unset par_bisulfite
[[ "$par_reverse_complement" == "false" ]] && unset par_reverse_complement

# Split the ';'-separated list of input files into an array
IFS=";" read -ra input <<< "$par_input"

falco \
  ${par_nogroup:+--nogroup} \
  ${par_contaminants:+--contaminants "$par_contaminants"} \
  ${par_adapters:+--adapters "$par_adapters"} \
  ${par_limits:+--limits "$par_limits"} \
  ${par_subsample:+-subsample "$par_subsample"} \
  ${par_bisulfite:+-bisulfite} \
  ${par_reverse_complement:+-reverse-complement} \
  ${par_outdir:+--outdir "$par_outdir"} \
  ${par_format:+--format "$par_format"} \
  ${par_data_filename:+-data-filename "$par_data_filename"} \
  ${par_report_filename:+-report-filename "$par_report_filename"} \
  ${par_summary_filename:+-summary-filename "$par_summary_filename"} \
  "${input[@]}"
|
||||
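The script above relies on `${par_x:+--flag}` expansions to pass optional flags. An alternative that is more robust to values containing spaces is to collect the arguments in a bash array. The sketch below shows that variant for a few of the parameters. The function name `build_falco_args` and the parameter values are hypothetical, and it only prints the argument list rather than calling falco.

```shell
#!/bin/bash
# Collect optional flags in an array so values with spaces stay intact.
build_falco_args() {
  local -a args=()
  local -a input=()

  if [[ "$par_nogroup" == "true" ]]; then args+=(--nogroup); fi
  if [[ -n "$par_subsample" ]]; then args+=(-subsample "$par_subsample"); fi
  if [[ -n "$par_outdir" ]]; then args+=(--outdir "$par_outdir"); fi

  # Inputs arrive as a single ';'-separated string.
  IFS=';' read -r -a input <<< "$par_input"
  args+=("${input[@]}")

  # One argument per line, to make the word boundaries visible.
  printf '%s\n' "${args[@]}"
}

# Hypothetical parameter values.
par_nogroup=true
par_subsample=10
par_outdir="my output"
par_input="R1.fastq.gz;R2.fastq.gz"
build_falco_args
```

In a real script the last line of the function would be `falco "${args[@]}"` instead of the `printf`.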
79
src/falco/test.sh
Normal file
@@ -0,0 +1,79 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "> Prepare test data"
|
||||
|
||||
# We use data from this repo: https://github.com/hartwigmedical/testData
|
||||
echo ">> Fetching and preparing test data"
|
||||
fastq1="https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz"
|
||||
fastq2="https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R2_001.fastq.gz"
|
||||
TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXX")
|
||||
function clean_up {
|
||||
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
test_data_dir="$TMPDIR/test_data"
|
||||
|
||||
mkdir "$test_data_dir"
wget -q "$fastq1" -O "$test_data_dir/R1.fastq.gz"
wget -q "$fastq2" -O "$test_data_dir/R2.fastq.gz"
|
||||
|
||||
echo ">> Run falco on test data, output to dir"
|
||||
echo ">>> Run falco"
|
||||
$meta_executable \
|
||||
--input "$test_data_dir/R1.fastq.gz;$test_data_dir/R2.fastq.gz" \
|
||||
--outdir "$TMPDIR/output1"
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -d "$TMPDIR/output1" ] && echo "Output directory not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output1/R1.fastq.gz_fastqc_report.html" ] && echo "Report not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output1/R1.fastq.gz_summary.txt" ] && echo "Summary not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output1/R1.fastq.gz_fastqc_data.txt" ] && echo "fastqc_data not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output1/R2.fastq.gz_fastqc_report.html" ] && echo "Report not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output1/R2.fastq.gz_summary.txt" ] && echo "Summary not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output1/R2.fastq.gz_fastqc_data.txt" ] && echo "fastqc_data not created" && exit 1
|
||||
|
||||
echo ">>> cleanup"
|
||||
rm -rf "$TMPDIR/output1"
|
||||
|
||||
echo ">> Run falco on test data, output to individual files"
|
||||
echo ">>> Please note this is only possible for 1 input fastq file!"
|
||||
echo ">>> Run falco"
|
||||
$meta_executable \
|
||||
--input "$test_data_dir/R1.fastq.gz" \
|
||||
--data_filename "$TMPDIR/output2/data.txt" \
|
||||
--report_filename "$TMPDIR/output2/report.html" \
|
||||
--summary_filename "$TMPDIR/output2/summary.txt" \
|
||||
--outdir "$TMPDIR/output2/"
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -d "$TMPDIR/output2" ] && echo "Output directory not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output2/report.html" ] && echo "Report not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output2/summary.txt" ] && echo "Summary not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output2/data.txt" ] && echo "fastqc_data not created" && exit 1
|
||||
|
||||
echo ">>> cleanup"
|
||||
rm -rf "$TMPDIR/output2/"
|
||||
|
||||
echo ">> Run falco on test data, subsample"
|
||||
echo ">>> Run falco"
|
||||
$meta_executable \
|
||||
--input "$test_data_dir/R1.fastq.gz" \
|
||||
--data_filename "$TMPDIR/output3/data.txt" \
|
||||
--report_filename "$TMPDIR/output3/report.html" \
|
||||
--summary_filename "$TMPDIR/output3/summary.txt" \
|
||||
--subsample 100 \
|
||||
--outdir "$TMPDIR/output3"
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -d "$TMPDIR/output3" ] && echo "Output directory not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output3/report.html" ] && echo "Report not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output3/summary.txt" ] && echo "Summary not created" && exit 1
|
||||
[ ! -f "$TMPDIR/output3/data.txt" ] && echo "fastqc_data not created" && exit 1
|
||||
|
||||
echo ">>> cleanup"
|
||||
rm -rf "$TMPDIR/output3/"
|
||||
|
||||
echo "All tests succeeded!"
|
||||
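The repeated `[ ! -f ... ] && echo ... && exit 1` checks in the test above could be factored into small helpers. The sketch below is one hedged way to do that. `assert_dir_exists` and `assert_file_exists` are hypothetical names that do not appear in the repository, and the demonstration uses a throwaway directory instead of real falco output.

```bash
#!/bin/bash
set -e

# Hypothetical helpers; not part of the original test script.
assert_dir_exists() {
  [ -d "$1" ] || { echo "Directory '$1' does not exist" >&2; exit 1; }
}

assert_file_exists() {
  [ -f "$1" ] || { echo "File '$1' does not exist" >&2; exit 1; }
}

# Demonstration on a temporary directory standing in for an output dir.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
touch "$demo_dir/report.html" "$demo_dir/summary.txt" "$demo_dir/data.txt"

assert_dir_exists "$demo_dir"
for f in report.html summary.txt data.txt; do
  assert_file_exists "$demo_dir/$f"
done

echo "All checks passed"
```

Each helper prints the offending path to stderr, so a failing test reports which file is missing rather than a generic message.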
576
src/fastp/config.vsh.yaml
Normal file
@@ -0,0 +1,576 @@
|
||||
name: fastp
|
||||
description: |
|
||||
An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...).
|
||||
|
||||
Features:
|
||||
|
||||
- comprehensive quality profiling for both before and after filtering data (quality curves, base contents, KMER, Q20/Q30, GC Ratio, duplication, adapter contents...)
|
||||
- filter out bad reads (too low quality, too short, or too many N...)
|
||||
- cut low quality bases at the 5' and 3' ends of each read by evaluating the mean quality in a sliding window (like Trimmomatic but faster).
|
||||
- trim all reads in front and tail
|
||||
- cut adapters. Adapter sequences can be automatically detected, which means you don't have to input the adapter sequences to trim them.
|
||||
- correct mismatched base pairs in overlapped regions of paired-end reads, when one base has high quality while the other has ultra low quality
|
||||
- trim polyG in 3' ends, which is commonly seen in NovaSeq/NextSeq data. Trim polyX in 3' ends to remove unwanted polyX tailing (e.g. polyA tailing for mRNA-Seq data)
|
||||
- preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name.
|
||||
- report JSON format result for further interpreting.
|
||||
- visualize quality control and filtering results on a single HTML page (like FASTQC but faster and more informative).
|
||||
- split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing. Two modes can be used: limiting the total number of split files, or limiting the number of lines in each split file.
|
||||
- support long reads (data from PacBio / Nanopore devices).
|
||||
- support reading from STDIN and writing to STDOUT
|
||||
- support interleaved input
|
||||
- support ultra-fast FASTQ-level deduplication
|
||||
keywords: [RNA-Seq, Trimming, Quality control]
|
||||
links:
|
||||
repository: https://github.com/OpenGene/fastp
|
||||
documentation: https://github.com/OpenGene/fastp/blob/master/README.md
|
||||
references:
|
||||
doi: "10.1093/bioinformatics/bty560"
|
||||
license: MIT
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
description: |
|
||||
`fastp` supports both single-end (SE) and paired-end (PE) input.
|
||||
|
||||
- for SE data, you only have to specify read1 input by `-i` or `--in1`.
|
||||
- for PE data, you should also specify read2 input by `-I` or `--in2`.
|
||||
arguments:
|
||||
- name: --in1
|
||||
alternatives: [-i]
|
||||
type: file
|
||||
description: Input FastQ file. Must be single-end or paired-end R1. Can be gzipped.
|
||||
required: true
|
||||
example: in.R1.fq.gz
|
||||
- name: --in2
|
||||
alternatives: [-I]
|
||||
type: file
|
||||
description: Input FastQ file. Must be paired-end R2. Can be gzipped.
|
||||
required: false
|
||||
example: in.R2.fq.gz
|
||||
- name: Outputs
|
||||
description: |
|
||||
|
||||
- for SE data, you only have to specify read1 output by `-o` or `--out1`.
|
||||
- for PE data, you should also specify read2 output by `-O` or `--out2`.
|
||||
- if you don't specify the output file names, no output files will be written, but the QC will still be done for both data before and after filtering.
|
||||
- the output will be gzip-compressed if its file name ends with `.gz`
|
||||
arguments:
|
||||
- name: --out1
|
||||
alternatives: [-o]
|
||||
type: file
|
||||
description: The single-end or paired-end R1 reads that pass QC. Will be gzipped if its file name ends with `.gz`.
|
||||
required: true
|
||||
example: out.R1.fq.gz
|
||||
direction: output
|
||||
- name: --out2
|
||||
alternatives: [-O]
|
||||
type: file
|
||||
description: The paired-end R2 reads that pass QC. Will be gzipped if its file name ends with `.gz`.
|
||||
required: false
|
||||
example: out.R2.fq.gz
|
||||
direction: output
|
||||
- name: --unpaired1
|
||||
type: file
|
||||
description: Store the reads that `read1` passes filters but its paired `read2` doesn't.
|
||||
required: false
|
||||
example: unpaired.R1.fq.gz
|
||||
direction: output
|
||||
- name: --unpaired2
|
||||
type: file
|
||||
description: Store the reads that `read2` passes filters but its paired `read1` doesn't.
|
||||
required: false
|
||||
example: unpaired.R2.fq.gz
|
||||
direction: output
|
||||
- name: --failed_out
|
||||
type: file
|
||||
description: |
|
||||
Store the reads that fail filters.
|
||||
|
||||
If one read failed and is written to --failed_out, its failure reason will be appended to its read name. For example, failed_quality_filter, failed_too_short etc.
|
||||
For PE data, if unpaired reads are not stored (by giving --unpaired1 or --unpaired2), the failed pair of reads will be put together. If one read passes the filters but its pair doesn't, the failure reason will be paired_read_is_failing.
|
||||
required: false
|
||||
example: failed.fq.gz
|
||||
direction: output
|
||||
- name: --overlapped_out
|
||||
type: file
|
||||
description: |
|
||||
For each read pair, output the overlapped region if it contains no mismatched bases.
|
||||
direction: output
|
||||
- name: Report output arguments
|
||||
arguments:
|
||||
- name: --json
|
||||
alternatives: [-j]
|
||||
type: file
|
||||
description: |
|
||||
The json format report file name
|
||||
example: out.json
|
||||
direction: output
|
||||
- name: --html
|
||||
type: file
|
||||
description: |
|
||||
The html format report file name
|
||||
example: out.html
|
||||
direction: output
|
||||
- name: --report_title
|
||||
type: string
|
||||
description: |
|
||||
The title of the html report, default is "fastp report".
|
||||
example: fastp report
|
||||
- name: Adapter trimming
|
||||
description: |
|
||||
Adapter trimming is enabled by default, but you can disable it by `-A` or `--disable_adapter_trimming`. Adapter sequences can be automatically detected for both PE/SE data.
|
||||
|
||||
- For SE data, the adapters are evaluated by analyzing the tails of the first ~1M reads. This evaluation may be inaccurate, and you can specify the adapter sequence with the `-a` or `--adapter_sequence` option. If an adapter sequence is specified, the auto-detection for SE data will be disabled.
|
||||
- For PE data, the adapters can be detected by per-read overlap analysis, which seeks the overlap of each pair of reads. This method is robust and fast, so normally you don't have to input the adapter sequence even if you know it. But you can still specify the adapter sequences for read1 by `--adapter_sequence`, and for read2 by `--adapter_sequence_r2`. If `fastp` fails to find an overlap (e.g. due to low quality bases), it will use these sequences to trim adapters for read1 and read2 respectively.
|
||||
- For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify `--detect_adapter_for_pe` to enable it.
|
||||
- For PE data, `fastp` will run a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
|
||||
- The most widely used adapters are the Illumina TruSeq adapters. If your data is from a TruSeq library, you can add `--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT` to your command line, or enable auto-detection for PE data by specifying `--detect_adapter_for_pe`.
|
||||
- `fastp` contains some built-in known adapter sequences for better auto-detection. If you want to make some adapters to be a part of the built-in adapters, please file an issue.
|
||||
|
||||
You can also specify `--adapter_fasta` to give a FASTA file containing multiple adapters for `fastp` to trim. Here is a sample of such an adapter FASTA file:
|
||||
|
||||
```
|
||||
>Illumina TruSeq Adapter Read 1
|
||||
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
|
||||
>Illumina TruSeq Adapter Read 2
|
||||
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
|
||||
>polyA
|
||||
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
|
||||
```
|
||||
|
||||
Each adapter sequence in this file should be at least 6bp long, otherwise it will be skipped. You can give any sequence you want to trim, not only regular sequencing adapters (e.g. polyA).
|
||||
|
||||
`fastp` first trims the auto-detected adapter or the adapter sequences given by `--adapter_sequence | --adapter_sequence_r2`, then trims the adapters given by `--adapter_fasta` one by one.
|
||||
|
||||
The sequence distribution of trimmed adapters can be found in the HTML/JSON reports.
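
As an illustration only, the sketch below writes the sample adapter FASTA shown above to a temporary file, counts sequences shorter than 6bp, and assembles a paired-end command that uses `--adapter_fasta`. The input and output file names are placeholders, and `fastp` is only invoked when it is installed and the inputs exist; otherwise the command is printed.

```bash
#!/bin/bash
set -eo pipefail

# Temporary location for the adapter FASTA (placeholder path).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
adapter_fasta="$workdir/adapters.fa"

cat > "$adapter_fasta" <<'EOF'
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>polyA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
EOF

# Count sequence lines shorter than 6bp; such adapters would be skipped.
too_short=$(awk '!/^>/ && length($0) < 6 { n++ } END { print n + 0 }' "$adapter_fasta")
echo "Adapters shorter than 6bp: $too_short"

# Placeholder input/output names; adjust to your data.
cmd=(fastp
  --in1 in.R1.fq.gz --in2 in.R2.fq.gz
  --out1 out.R1.fq.gz --out2 out.R2.fq.gz
  --adapter_fasta "$adapter_fasta")

# Run only when fastp and both inputs are available; otherwise print the command.
if command -v fastp >/dev/null 2>&1 && [[ -f in.R1.fq.gz && -f in.R2.fq.gz ]]; then
  "${cmd[@]}"
else
  printf '%q ' "${cmd[@]}"
  echo
fi
```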
|
||||
arguments:
|
||||
- name: --disable_adapter_trimming
|
||||
alternatives: [-A]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable adapter trimming.
|
||||
- name: --detect_adapter_for_pe
|
||||
type: boolean_true
|
||||
description: |
|
||||
By default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.
|
||||
- name: --adapter_sequence
|
||||
alternatives: [-a]
|
||||
type: string
|
||||
description: |
|
||||
The adapter sequence to be trimmed for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are not found to overlap.
|
||||
- name: --adapter_sequence_r2
|
||||
type: string
|
||||
description: |
|
||||
The adapter sequence to be trimmed for R2 (PE data only). This is used if R1/R2 are not found to overlap. If not specified, it will be the same as --adapter_sequence.
|
||||
- name: --adapter_fasta
|
||||
type: file
|
||||
description: |
|
||||
A FASTA file containing all the adapter sequences to be trimmed. Both read1 and read2 (if PE) are trimmed by all the sequences in this file.
|
||||
- name: Base trimming
|
||||
arguments:
|
||||
- name: --trim_front1
|
||||
alternatives: [-f]
|
||||
type: integer
|
||||
description: |
|
||||
Trimming how many bases in front for read1, default is 0.
|
||||
example: 0
|
||||
- name: --trim_tail1
|
||||
alternatives: [-t]
|
||||
type: integer
|
||||
description: |
|
||||
Trimming how many bases in tail for read1, default is 0.
|
||||
example: 0
|
||||
- name: --max_len1
|
||||
alternatives: [-b]
|
||||
type: integer
|
||||
min: 0
|
||||
description: |
|
||||
If read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation.
|
||||
- name: --trim_front2
|
||||
alternatives: [-F]
|
||||
type: integer
|
||||
description: |
|
||||
Trimming how many bases in front for read2, default is 0.
|
||||
example: 0
|
||||
- name: --trim_tail2
|
||||
alternatives: [-T]
|
||||
type: integer
|
||||
description: |
|
||||
Trimming how many bases in tail for read2, default is 0.
|
||||
example: 0
|
||||
- name: --max_len2
|
||||
alternatives: [-B]
|
||||
type: integer
|
||||
min: 0
|
||||
description: |
|
||||
If read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation.
|
||||
- name: Merging mode
|
||||
description: Allows merging paired-end reads into a single longer read if they are overlapping.
|
||||
arguments:
|
||||
- name: --merge
|
||||
alternatives: [-m]
|
||||
type: boolean_true
|
||||
description: |
|
||||
For paired-end input, merge each pair of reads into a single read if they are overlapped. The merged reads will be written to the file given by --merged_out, the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.
|
||||
- name: --merged_out
|
||||
type: file
|
||||
description: |
|
||||
In the merging mode, specify the file name to store merged output, or specify --stdout to stream the merged output.
|
||||
direction: output
|
||||
example: merged.fq.gz
|
||||
- name: --include_unmerged
|
||||
type: boolean_true
|
||||
description: |
|
||||
In the merging mode, write the unmerged or unpaired reads to the file specified by --merged_out. Disabled by default.
|
||||
- name: Additional input arguments
|
||||
description: Affects how the input is read.
|
||||
arguments:
|
||||
- name: --interleaved_in
|
||||
type: boolean_true
|
||||
description: |
|
||||
Indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
|
||||
- name: --fix_mgi_id
|
||||
type: boolean_true
|
||||
description: |
|
||||
The MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.
|
||||
- name: --phred64
|
||||
alternatives: ["-6"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
|
||||
- name: Additional output arguments
|
||||
description: Affects how the output is written.
|
||||
arguments:
|
||||
- name: --compression
|
||||
alternatives: ["-z"]
|
||||
type: integer
|
||||
description: |
|
||||
Compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4.
|
||||
example: 4
|
||||
min: 1
|
||||
max: 9
|
||||
- name: --dont_overwrite
|
||||
type: boolean_true
|
||||
description: |
|
||||
Don't overwrite existing files. Overwriting is allowed by default.
|
||||
- name: Logging arguments
|
||||
arguments:
|
||||
- name: --verbose
|
||||
alternatives: [-V]
|
||||
type: boolean_true
|
||||
description: Output verbose log information (i.e. when every 1M reads are processed).
|
||||
- name: Processing arguments
|
||||
arguments:
|
||||
- name: --reads_to_process
|
||||
type: long
|
||||
description: |
|
||||
Specify how many reads/pairs to be processed. Default 0 means process all reads.
|
||||
example: 1000000
|
||||
min: 0
|
||||
- name: Deduplication arguments
|
||||
arguments:
|
||||
- name: --dedup
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable deduplication to drop the duplicated reads/pairs
|
||||
- name: --dup_calc_accuracy
|
||||
type: integer
|
||||
description: |
|
||||
Accuracy level to calculate duplication (1~6). Higher level uses more memory (1G, 2G, 4G, 8G, 16G, 24G). Default 1 for no-dedup mode, and 3 for dedup mode.
|
||||
example: 3
|
||||
min: 1
|
||||
max: 6
|
||||
- name: --dont_eval_duplication
|
||||
type: boolean_true
|
||||
description: |
|
||||
Don't evaluate duplication rate to save time and use less memory.
|
||||
- name: PolyG tail trimming arguments
|
||||
arguments:
|
||||
- name: --trim_poly_g
|
||||
alternatives: [-g]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
|
||||
- name: --poly_g_min_len
|
||||
type: integer
|
||||
description: |
|
||||
The minimum length to detect polyG in the read tail. 10 by default.
|
||||
example: 10
|
||||
min: 1
|
||||
- name: --disable_trim_poly_g
|
||||
alternatives: [-G]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
|
||||
- name: PolyX tail trimming arguments
|
||||
arguments:
|
||||
- name: --trim_poly_x
|
||||
alternatives: [-x]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable polyX trimming in 3' ends.
|
||||
- name: --poly_x_min_len
|
||||
type: integer
|
||||
description: |
|
||||
The minimum length to detect polyX in the read tail. 10 by default.
|
||||
example: 10
|
||||
min: 1
|
||||
- name: Cut arguments
|
||||
arguments:
|
||||
- name: --cut_front
|
||||
alternatives: ["-5"]
|
||||
type: integer
|
||||
description: |
|
||||
Move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
|
||||
- name: --cut_tail
|
||||
alternatives: ["-3"]
|
||||
type: integer
|
||||
description: |
|
||||
Move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
|
||||
- name: --cut_right
|
||||
alternatives: ["-r"]
|
||||
type: integer
|
||||
description: |
|
||||
Move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.
|
||||
- name: --cut_window_size
|
||||
alternatives: ["-W"]
|
||||
type: integer
|
||||
description: |
|
||||
The window size option shared by cut_front, cut_tail or cut_right. Range: 1~1000, default: 4.
|
||||
example: 4
|
||||
min: 1
|
||||
- name: --cut_mean_quality
|
||||
alternatives: ["-M"]
|
||||
type: integer
|
||||
description: |
|
||||
The mean quality requirement option shared by cut_front, cut_tail or cut_right. Range: 1~36, default: 20 (Q20).
|
||||
example: 20
|
||||
min: 0
|
||||
- name: --cut_front_window_size
|
||||
type: integer
|
||||
description: |
|
||||
The window size option of cut_front, default to cut_window_size if not specified.
|
||||
example: 4
|
||||
min: 1
|
||||
- name: --cut_front_mean_quality
|
||||
type: integer
|
||||
description: |
|
||||
The mean quality requirement option of cut_front, default to cut_mean_quality if not specified.
|
||||
example: 20
|
||||
min: 0
|
||||
- name: --cut_tail_window_size
|
||||
type: integer
|
||||
description: |
|
||||
The window size option of cut_tail, default to cut_window_size if not specified.
|
||||
example: 4
|
||||
min: 1
|
||||
- name: --cut_tail_mean_quality
|
||||
type: integer
|
||||
description: |
|
||||
The mean quality requirement option of cut_tail, default to cut_mean_quality if not specified.
|
||||
example: 20
|
||||
min: 0
|
||||
- name: --cut_right_window_size
|
||||
type: integer
|
||||
description: |
|
||||
The window size option of cut_right, default to cut_window_size if not specified.
|
||||
example: 4
|
||||
min: 1
|
||||
- name: --cut_right_mean_quality
|
||||
type: integer
|
||||
description: |
|
||||
The mean quality requirement option of cut_right, default to cut_mean_quality if not specified.
|
||||
example: 20
|
||||
min: 0
|
||||
- name: Quality filtering arguments
|
||||
arguments:
|
||||
- name: --disable_quality_filtering
|
||||
alternatives: [-Q]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Quality filtering is enabled by default. If this option is specified, quality filtering is disabled.
|
||||
- name: --qualified_quality_phred
|
||||
alternatives: [-q]
|
||||
type: integer
|
||||
description: |
|
||||
The quality value at or above which a base is considered qualified. Default 15 means phred quality >=Q15 is qualified.
|
||||
example: 15
|
||||
min: 0
|
||||
- name: --unqualified_percent_limit
|
||||
alternatives: [-u]
|
||||
type: integer
|
||||
description: |
|
||||
The percentage of bases allowed to be unqualified (0~100). Default 40 means 40%.
|
||||
example: 40
|
||||
min: 0
|
||||
max: 100
|
||||
- name: --n_base_limit
|
||||
alternatives: [-n]
|
||||
type: integer
|
||||
description: |
|
||||
If a read contains more than n_base_limit N bases, the read/pair is discarded. Default is 5.
|
||||
example: 5
|
||||
min: 0
|
||||
- name: --average_qual
|
||||
alternatives: [-e]
|
||||
type: integer
|
||||
description: |
|
||||
If a read's average quality score is below average_qual, the read/pair is discarded. Default 0 means no requirement.
|
||||
example: 0
|
||||
min: 0
|
||||
- name: Length filtering arguments
|
||||
arguments:
|
||||
- name: --disable_length_filtering
|
||||
alternatives: [-L]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Length filtering is enabled by default. If this option is specified, length filtering is disabled.
|
||||
- name: --length_required
|
||||
alternatives: [-l]
|
||||
type: integer
|
||||
description: |
|
||||
Reads shorter than length_required will be discarded, default is 15.
|
||||
example: 15
|
||||
min: 0
|
||||
- name: --length_limit
|
||||
type: integer
|
||||
description: |
|
||||
Reads longer than length_limit will be discarded, default 0 means no limitation.
|
||||
example: 0
|
||||
min: 0
|
||||
- name: Low complexity filtering arguments
|
||||
arguments:
|
||||
- name: --low_complexity_filter
|
||||
alternatives: [-y]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable the low complexity filter. The complexity is defined as the percentage of bases that differ from their next base (base[i] != base[i+1]).
|
||||
- name: --complexity_threshold
|
||||
alternatives: [-Y]
|
||||
type: integer
|
||||
description: |
|
||||
The threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required.
|
||||
example: 30
|
||||
min: 0
|
||||
- name: Index filtering arguments
|
||||
arguments:
|
||||
- name: --filter_by_index1
|
||||
type: file
|
||||
description: |
|
||||
Specify a file containing a list of index1 barcodes to be filtered out, one barcode per line.
|
||||
- name: --filter_by_index2
|
||||
type: file
|
||||
description: |
|
||||
Specify a file containing a list of index2 barcodes to be filtered out, one barcode per line.
|
||||
- name: --filter_by_index_threshold
|
||||
type: integer
|
||||
description: |
|
||||
The allowed difference of index barcode for index filtering, default 0 means completely identical.
|
||||
example: 0
|
||||
min: 0
|
||||
- name: Overlapped region correction
|
||||
arguments:
|
||||
- type: boolean_true
|
||||
name: --correction
|
||||
alternatives: [-c]
|
||||
description: |
|
||||
Enable base correction in overlapped regions (only for PE data), default is disabled.
|
||||
- name: --overlap_len_require
|
||||
type: integer
|
||||
description: |
|
||||
The minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default.
|
||||
example: 30
|
||||
min: 0
|
||||
- name: --overlap_diff_limit
|
||||
type: integer
|
||||
description: |
|
||||
The maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default.
|
||||
example: 5
|
||||
min: 0
|
||||
- name: --overlap_diff_percent_limit
|
||||
type: integer
|
||||
description: |
|
||||
The maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%.
|
||||
example: 20
|
||||
min: 0
|
||||
max: 100
|
||||
- name: UMI arguments
|
||||
arguments:
|
||||
- name: --umi
|
||||
alternatives: [-U]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable unique molecular identifier (UMI) preprocessing.
|
||||
- name: --umi_loc
|
||||
type: string
|
||||
description: |
|
||||
Specify the location of the UMI. Can be index1/index2/read1/read2/per_index/per_read. Default is none.
|
||||
choices: [index1, index2, read1, read2, per_index, per_read]
|
||||
- name: --umi_len
|
||||
type: integer
|
||||
description: |
|
||||
If the UMI is in read1/read2, its length should be provided.
|
||||
example: 0
|
||||
min: 0
|
||||
- name: --umi_prefix
|
||||
type: string
|
||||
description: |
|
||||
If specified, an underscore will be used to connect the prefix and the UMI (e.g. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default.
|
||||
- name: --umi_skip
|
||||
type: integer
|
||||
description: |
|
||||
If the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0.
|
||||
example: 0
|
||||
min: 0
|
||||
- name: --umi_delim
|
||||
type: string
|
||||
description: |
|
||||
If the UMI is in index1/index2, fastp can use a delimiter to separate UMI from the read sequence, default is none.
|
||||
- name: Overrepresentation analysis arguments
|
||||
arguments:
|
||||
- name: --overrepresentation_analysis
|
||||
alternatives: [-p]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable overrepresentation analysis.
|
||||
- name: --overrepresentation_sampling
|
||||
type: integer
|
||||
description: |
|
||||
One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20.
|
||||
example: 20
|
||||
min: 1
|
||||
# # would need to set all outputs to multiple: true
|
||||
# - name: Split arguments
|
||||
# arguments:
|
||||
# - name: --split
|
||||
# alternatives: [-s]
|
||||
# type: boolean_true
|
||||
# description: |
|
||||
# Split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default.
|
||||
# - name: --split_by_lines
|
||||
# alternatives: [-S]
|
||||
# type: long
|
||||
# description: |
|
||||
# Split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default.
|
||||
# - name: --split_prefix_digits
|
||||
# type: integer
|
||||
# description: |
|
||||
# The digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding.
|
||||
# example: 4
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/fastp:0.23.4--hadf994f_2
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
fastp --version 2>&1 | sed 's# #: "#;s#$#"#' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
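The complexity measure described for `--low_complexity_filter` in the config above (the percentage of bases that differ from their next base) can be sketched in a few lines of bash. `complexity_percent` is a hypothetical helper for illustration only; it is not how `fastp` is implemented, and dividing by the number of adjacent pairs (length minus one) with integer truncation is an assumption.

```bash
#!/bin/bash

# Hypothetical helper: percentage of positions where base[i] != base[i+1].
# Assumption: the denominator is the number of adjacent pairs (length - 1).
complexity_percent() {
  local seq="$1"
  local len=${#seq}
  local diff=0
  local i

  # A sequence with fewer than two bases has no adjacent pairs.
  if (( len < 2 )); then
    echo 0
    return 0
  fi

  for (( i = 0; i < len - 1; i++ )); do
    if [[ "${seq:i:1}" != "${seq:i+1:1}" ]]; then
      diff=$(( diff + 1 ))
    fi
  done

  # Integer percentage, truncated toward zero.
  echo $(( diff * 100 / (len - 1) ))
}

complexity_percent "AAAAAAAA"   # homopolymer: no differing neighbours
complexity_percent "ACGT"       # every neighbour differs
complexity_percent "AACC"       # one differing pair out of three
```

Under these assumptions, a read would pass the filter when the returned value is at least the value given to `--complexity_threshold`.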
93
src/fastp/help.txt
Normal file
@@ -0,0 +1,93 @@
|
||||
```bash
|
||||
fastp --help
|
||||
```
|
||||
|
||||
usage: fastp [options] ...
|
||||
options:
|
||||
-i, --in1 read1 input file name (string [=])
|
||||
-o, --out1 read1 output file name (string [=])
|
||||
-I, --in2 read2 input file name (string [=])
|
||||
-O, --out2 read2 output file name (string [=])
|
||||
--unpaired1 for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
|
||||
--unpaired2 for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
|
||||
--overlapped_out for each read pair, output the overlapped region if it has no any mismatched base. (string [=])
|
||||
--failed_out specify the file to store reads that cannot pass the filters. (string [=])
|
||||
-m, --merge for paired-end input, merge each pair of reads into a single read if they are overlapped. The merged reads will be written to the file given by --merged_out, the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.
|
||||
--merged_out in the merging mode, specify the file name to store merged output, or specify --stdout to stream the merged output (string [=])
|
||||
--include_unmerged in the merging mode, write the unmerged or unpaired reads to the file specified by --merge. Disabled by default.
|
||||
-6, --phred64 indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
|
||||
-z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])
|
||||
--stdin input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
|
||||
--stdout stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.
|
||||
--interleaved_in indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
|
||||
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
|
||||
--dont_overwrite don't overwrite existing files. Overwritting is allowed by default.
|
||||
--fix_mgi_id the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.
|
||||
-V, --verbose output verbose log information (i.e. when every 1M reads are processed).
|
||||
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
|
||||
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
|
||||
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])
|
||||
--adapter_fasta specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])
|
||||
--detect_adapter_for_pe by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.
|
||||
-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])
|
||||
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])
|
||||
-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])
|
||||
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
|
||||
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
|
||||
-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])
|
||||
-D, --dedup enable deduplication to drop the duplicated reads/pairs
|
||||
--dup_calc_accuracy accuracy level to calculate duplication (1~6), higher level uses more memory (1G, 2G, 4G, 8G, 16G, 24G). Default 1 for no-dedup mode, and 3 for dedup mode. (int [=0])
|
||||
--dont_eval_duplication don't evaluate duplication rate to save time and use less memory.
|
||||
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
|
||||
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
|
||||
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
|
||||
-x, --trim_poly_x enable polyX trimming in 3' ends.
|
||||
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
|
||||
-5, --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
|
||||
-3, --cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
|
||||
-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.
|
||||
-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_right. Range: 1~1000, default: 4 (int [=4])
|
||||
-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_right. Range: 1~36, default: 20 (Q20) (int [=20])
|
||||
--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified (int [=4])
|
||||
--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])
|
||||
--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified (int [=4])
|
||||
--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])
|
||||
--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified (int [=4])
|
||||
--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])
|
||||
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
|
||||
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
|
||||
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
|
||||
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
|
||||
-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement (int [=0])
|
||||
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
|
||||
-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
|
||||
--length_limit reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])
|
||||
-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
|
||||
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
|
||||
--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
|
||||
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
|
||||
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
|
||||
-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
|
||||
--overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default. (int [=30])
|
||||
--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])
|
||||
--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])
|
||||
-U, --umi enable unique molecular identifier (UMI) preprocessing
|
||||
--umi_loc specify the location of UMI, can be index1/index2/read1/read2/per_index/per_read, default is none (string [=])
|
||||
--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
|
||||
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
|
||||
--umi_skip if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])
|
||||
--umi_delim delimiter to use between the read name and the UMI, default is : (string [=:])
|
||||
-p, --overrepresentation_analysis enable overrepresented sequence analysis.
|
||||
-P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])
|
||||
-j, --json the json format report file name (string [=fastp.json])
|
||||
-h, --html the html format report file name (string [=fastp.html])
|
||||
-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])
|
||||
-w, --thread worker thread number, default is 3 (int [=3])
|
||||
-s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
|
||||
-S, --split_by_lines split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
|
||||
-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
|
||||
--cut_by_quality5 DEPRECATED, use --cut_front instead.
|
||||
--cut_by_quality3 DEPRECATED, use --cut_tail instead.
|
||||
--cut_by_quality_aggressive DEPRECATED, use --cut_right instead.
|
||||
--discard_unmerged DEPRECATED, no effect now, see the introduction for merging.
|
||||
-?, --help print this message
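The `--low_complexity_filter` entry above defines complexity as the percentage of bases that differ from the next base. A minimal sketch of that definition follows. The function name `complexity_percent` is hypothetical, and using the number of adjacent pairs (length minus one) as the denominator is an assumption, not something the help text states. The result is truncated to an integer.

```bash
#!/bin/bash

# Hypothetical helper (not part of fastp or this component): integer
# percentage of positions i where base[i] != base[i+1].
# Assumption: the denominator is the number of adjacent pairs (length - 1).
complexity_percent() {
  local seq="$1"
  local len=${#seq}
  local diff=0
  local i

  # Fewer than two bases means there are no adjacent pairs to compare.
  if (( len < 2 )); then
    echo 0
    return 0
  fi

  for (( i = 0; i < len - 1; i++ )); do
    if [[ "${seq:i:1}" != "${seq:i+1:1}" ]]; then
      diff=$(( diff + 1 ))
    fi
  done

  # Integer division: the percentage is truncated, not rounded.
  echo $(( diff * 100 / (len - 1) ))
}

complexity_percent "ACGGCAT"   # 5 of 6 adjacent pairs differ, prints 83
complexity_percent "AAAAAAA"   # no adjacent pair differs, prints 0
```

Under these assumptions a read such as `ACGGCAT` scores well above the default threshold of 30, while a homopolymer scores 0.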
|
||||
105
src/fastp/script.sh
Normal file
@@ -0,0 +1,105 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# disable flags
|
||||
[[ "$par_disable_adapter_trimming" == "false" ]] && unset par_disable_adapter_trimming
|
||||
[[ "$par_detect_adapter_for_pe" == "false" ]] && unset par_detect_adapter_for_pe
|
||||
[[ "$par_merge" == "false" ]] && unset par_merge
|
||||
[[ "$par_include_unmerged" == "false" ]] && unset par_include_unmerged
|
||||
[[ "$par_interleaved_in" == "false" ]] && unset par_interleaved_in
|
||||
[[ "$par_fix_mgi_id" == "false" ]] && unset par_fix_mgi_id
|
||||
[[ "$par_phred64" == "false" ]] && unset par_phred64
|
||||
[[ "$par_dont_overwrite" == "false" ]] && unset par_dont_overwrite
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
[[ "$par_dedup" == "false" ]] && unset par_dedup
|
||||
[[ "$par_dont_eval_duplication" == "false" ]] && unset par_dont_eval_duplication
|
||||
[[ "$par_trim_poly_g" == "false" ]] && unset par_trim_poly_g
|
||||
[[ "$par_disable_trim_poly_g" == "false" ]] && unset par_disable_trim_poly_g
|
||||
[[ "$par_trim_poly_x" == "false" ]] && unset par_trim_poly_x
|
||||
[[ "$par_disable_quality_filtering" == "false" ]] && unset par_disable_quality_filtering
|
||||
[[ "$par_disable_length_filtering" == "false" ]] && unset par_disable_length_filtering
|
||||
[[ "$par_low_complexity_filter" == "false" ]] && unset par_low_complexity_filter
|
||||
[[ "$par_umi" == "false" ]] && unset par_umi
|
||||
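[[ "$par_overrepresentation_analysis" == "false" ]] && unset par_overrepresentation_analysis
[[ "$par_correction" == "false" ]] && unset par_correction
[[ "$par_cut_front" == "false" ]] && unset par_cut_front
[[ "$par_cut_tail" == "false" ]] && unset par_cut_tail
[[ "$par_cut_right" == "false" ]] && unset par_cut_right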
|
||||
|
||||
# run command
|
||||
fastp \
|
||||
-i "$par_in1" \
|
||||
-o "$par_out1" \
|
||||
${par_in2:+--in2 "${par_in2}"} \
|
||||
${par_out2:+--out2 "${par_out2}"} \
|
||||
${par_unpaired1:+--unpaired1 "${par_unpaired1}"} \
|
||||
${par_unpaired2:+--unpaired2 "${par_unpaired2}"} \
|
||||
${par_failed_out:+--failed_out "${par_failed_out}"} \
|
||||
${par_overlapped_out:+--overlapped_out "${par_overlapped_out}"} \
|
||||
${par_json:+--json "${par_json}"} \
|
||||
${par_html:+--html "${par_html}"} \
|
||||
${par_report_title:+--report_title "${par_report_title}"} \
|
||||
${par_disable_adapter_trimming:+--disable_adapter_trimming} \
|
||||
${par_detect_adapter_for_pe:+--detect_adapter_for_pe} \
|
||||
${par_adapter_sequence:+--adapter_sequence "${par_adapter_sequence}"} \
|
||||
${par_adapter_sequence_r2:+--adapter_sequence_r2 "${par_adapter_sequence_r2}"} \
|
||||
${par_adapter_fasta:+--adapter_fasta "${par_adapter_fasta}"} \
|
||||
${par_trim_front1:+--trim_front1 "${par_trim_front1}"} \
|
||||
${par_trim_tail1:+--trim_tail1 "${par_trim_tail1}"} \
|
||||
${par_max_len1:+--max_len1 "${par_max_len1}"} \
|
||||
${par_trim_front2:+--trim_front2 "${par_trim_front2}"} \
|
||||
${par_trim_tail2:+--trim_tail2 "${par_trim_tail2}"} \
|
||||
${par_max_len2:+--max_len2 "${par_max_len2}"} \
|
||||
${par_merge:+--merge} \
|
||||
${par_merged_out:+--merged_out "${par_merged_out}"} \
|
||||
${par_include_unmerged:+--include_unmerged} \
|
||||
${par_interleaved_in:+--interleaved_in} \
|
||||
${par_fix_mgi_id:+--fix_mgi_id} \
|
||||
${par_phred64:+--phred64} \
|
||||
${par_compression:+--compression "${par_compression}"} \
|
||||
${par_dont_overwrite:+--dont_overwrite} \
|
||||
${par_verbose:+--verbose} \
|
||||
${par_reads_to_process:+--reads_to_process "${par_reads_to_process}"} \
|
||||
${par_dedup:+--dedup} \
|
||||
${par_dup_calc_accuracy:+--dup_calc_accuracy "${par_dup_calc_accuracy}"} \
|
||||
${par_dont_eval_duplication:+--dont_eval_duplication} \
|
||||
${par_trim_poly_g:+--trim_poly_g} \
|
||||
${par_poly_g_min_len:+--poly_g_min_len "${par_poly_g_min_len}"} \
|
||||
${par_disable_trim_poly_g:+--disable_trim_poly_g} \
|
||||
${par_trim_poly_x:+--trim_poly_x} \
|
||||
${par_poly_x_min_len:+--poly_x_min_len "${par_poly_x_min_len}"} \
|
||||
${par_cut_front:+--cut_front} \
|
||||
${par_cut_tail:+--cut_tail} \
|
||||
${par_cut_right:+--cut_right} \
|
||||
${par_cut_window_size:+--cut_window_size "${par_cut_window_size}"} \
|
||||
${par_cut_mean_quality:+--cut_mean_quality "${par_cut_mean_quality}"} \
|
||||
${par_cut_front_window_size:+--cut_front_window_size "${par_cut_front_window_size}"} \
|
||||
${par_cut_front_mean_quality:+--cut_front_mean_quality "${par_cut_front_mean_quality}"} \
|
||||
${par_cut_tail_window_size:+--cut_tail_window_size "${par_cut_tail_window_size}"} \
|
||||
${par_cut_tail_mean_quality:+--cut_tail_mean_quality "${par_cut_tail_mean_quality}"} \
|
||||
${par_cut_right_window_size:+--cut_right_window_size "${par_cut_right_window_size}"} \
|
||||
${par_cut_right_mean_quality:+--cut_right_mean_quality "${par_cut_right_mean_quality}"} \
|
||||
${par_disable_quality_filtering:+--disable_quality_filtering} \
|
||||
${par_qualified_quality_phred:+--qualified_quality_phred "${par_qualified_quality_phred}"} \
|
||||
${par_unqualified_percent_limit:+--unqualified_percent_limit "${par_unqualified_percent_limit}"} \
|
||||
${par_n_base_limit:+--n_base_limit "${par_n_base_limit}"} \
|
||||
${par_average_qual:+--average_qual "${par_average_qual}"} \
|
||||
${par_disable_length_filtering:+--disable_length_filtering} \
|
||||
${par_length_required:+--length_required "${par_length_required}"} \
|
||||
${par_length_limit:+--length_limit "${par_length_limit}"} \
|
||||
${par_low_complexity_filter:+--low_complexity_filter} \
|
||||
${par_complexity_threshold:+--complexity_threshold "${par_complexity_threshold}"} \
|
||||
${par_filter_by_index1:+--filter_by_index1 "${par_filter_by_index1}"} \
|
||||
${par_filter_by_index2:+--filter_by_index2 "${par_filter_by_index2}"} \
|
||||
${par_filter_by_index_threshold:+--filter_by_index_threshold "${par_filter_by_index_threshold}"} \
|
||||
${par_correction:+--correction} \
|
||||
${par_overlap_len_require:+--overlap_len_require "${par_overlap_len_require}"} \
|
||||
${par_overlap_diff_limit:+--overlap_diff_limit "${par_overlap_diff_limit}"} \
|
||||
${par_overlap_diff_percent_limit:+--overlap_diff_percent_limit "${par_overlap_diff_percent_limit}"} \
|
||||
${par_umi:+--umi} \
|
||||
${par_umi_loc:+--umi_loc "${par_umi_loc}"} \
|
||||
${par_umi_len:+--umi_len "${par_umi_len}"} \
|
||||
${par_umi_prefix:+--umi_prefix "${par_umi_prefix}"} \
|
||||
${par_umi_skip:+--umi_skip "${par_umi_skip}"} \
|
||||
${par_umi_delim:+--umi_delim "${par_umi_delim}"} \
|
||||
${par_overrepresentation_analysis:+--overrepresentation_analysis} \
|
||||
${par_overrepresentation_sampling:+--overrepresentation_sampling "${par_overrepresentation_sampling}"} \
|
||||
${meta_cpus:+--thread "${meta_cpus}"}
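The script above relies on two shell idioms: unsetting a variable that holds the literal string `false`, and `${var:+...}`, which expands to nothing when the variable is unset or empty. The sketch below isolates both. `build_args`, `par_flag`, `par_value`, `--flag` and `--value` are hypothetical names used only for illustration.

```bash
#!/bin/bash

# Hypothetical helper for illustration; the parameter names are made up.
build_args() {
  local par_flag="$1"    # "true" or "false"
  local par_value="$2"   # empty string when the optional value is not given

  # A literal "false" is still a non-empty string, so drop it first.
  # Without this line, ${par_flag:+--flag} would emit --flag for "false" too.
  [[ "$par_flag" == "false" ]] && unset par_flag

  # Quotes inside the ${var:+...} word keep a value with spaces as one argument.
  local args=(
    ${par_flag:+--flag}
    ${par_value:+--value "${par_value}"}
  )

  # Print one argument per line; print nothing when the list is empty.
  if (( ${#args[@]} > 0 )); then
    printf '%s\n' "${args[@]}"
  fi
}

build_args false "a b"   # prints --value and "a b" on separate lines
build_args true ""       # prints --flag only
```

The first call shows why every boolean parameter needs a matching `unset` line: a flag whose variable still contains `false` would otherwise be passed to the tool.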
|
||||
74
src/fastp/test.sh
Normal file
@@ -0,0 +1,74 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
## VIASH START
|
||||
meta_executable="target/docker/fastp/fastp"
|
||||
meta_resources_dir="src/fastp"
|
||||
## VIASH END
|
||||
|
||||
#########################################################################################
|
||||
mkdir fastp_se
|
||||
cd fastp_se
|
||||
|
||||
echo "> Run fastp on SE"
|
||||
"$meta_executable" \
|
||||
--in1 "$meta_resources_dir/test_data/se/a.fastq" \
|
||||
--out1 "trimmed.fastq" \
|
||||
--failed_out "failed.fastq" \
|
||||
--json "report.json" \
|
||||
--html "report.html" \
|
||||
--adapter_sequence ACGGCTAGCTA
|
||||
|
||||
echo ">> Check if output exists"
|
||||
[ ! -f "trimmed.fastq" ] && echo ">> trimmed.fastq does not exist" && exit 1
|
||||
[ ! -f "failed.fastq" ] && echo ">> failed.fastq does not exist" && exit 1
|
||||
[ ! -f "report.json" ] && echo ">> report.json does not exist" && exit 1
|
||||
[ ! -f "report.html" ] && echo ">> report.html does not exist" && exit 1
|
||||
|
||||
#########################################################################################
|
||||
cd ..
|
||||
mkdir fastp_pe_minimal
|
||||
cd fastp_pe_minimal
|
||||
|
||||
echo ">> Run fastp on PE with minimal parameters"
|
||||
"$meta_executable" \
|
||||
--in1 "$meta_resources_dir/test_data/pe/a.1.fastq" \
|
||||
--in2 "$meta_resources_dir/test_data/pe/a.2.fastq" \
|
||||
--out1 "trimmed_1.fastq" \
|
||||
--out2 "trimmed_2.fastq"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
[ ! -f "trimmed_1.fastq" ] && echo ">> trimmed_1.fastq does not exist" && exit 1
|
||||
[ ! -f "trimmed_2.fastq" ] && echo ">> trimmed_2.fastq does not exist" && exit 1
|
||||
|
||||
#########################################################################################
|
||||
cd ..
|
||||
mkdir fastp_pe_many
|
||||
cd fastp_pe_many
|
||||
|
||||
echo ">> Run fastp on PE with many parameters"
|
||||
"$meta_executable" \
|
||||
--in1 "$meta_resources_dir/test_data/pe/a.1.fastq" \
|
||||
--in2 "$meta_resources_dir/test_data/pe/a.2.fastq" \
|
||||
--out1 "trimmed_1.fastq" \
|
||||
--out2 "trimmed_2.fastq" \
|
||||
--failed_out "failed.fastq" \
|
||||
--json "report.json" \
|
||||
--html "report.html" \
|
||||
--adapter_sequence ACGGCTAGCTA \
|
||||
--adapter_sequence_r2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
|
||||
--merge \
|
||||
--merged_out "merged.fastq"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
[ ! -f "trimmed_1.fastq" ] && echo ">> trimmed_1.fastq does not exist" && exit 1
|
||||
[ ! -f "trimmed_2.fastq" ] && echo ">> trimmed_2.fastq does not exist" && exit 1
|
||||
[ ! -f "failed.fastq" ] && echo ">> failed.fastq does not exist" && exit 1
|
||||
[ ! -f "report.json" ] && echo ">> report.json does not exist" && exit 1
|
||||
[ ! -f "report.html" ] && echo ">> report.html does not exist" && exit 1
|
||||
[ ! -f "merged.fastq" ] && echo ">> merged.fastq does not exist" && exit 1
|
||||
|
||||
#########################################################################################
|
||||
|
||||
echo "> Test successful"
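The repeated `[ ! -f ... ] && echo ... && exit 1` checks in the test above could be folded into one function. This is only a sketch of that refactoring: `assert_file_exists` is a hypothetical name and is not part of the test script.

```bash
#!/bin/bash

# Hypothetical helper, not part of the test script above.
# Returns non-zero and reports on stderr when the path is not a regular file.
assert_file_exists() {
  local path="$1"
  if [ ! -f "$path" ]; then
    echo ">> $path does not exist" >&2
    return 1
  fi
}

# Usage sketch with a temporary file standing in for a fastp output.
tmp_file="$(mktemp)"
assert_file_exists "$tmp_file" && echo ">> $tmp_file exists"
rm -f "$tmp_file"
```

In a script running under `set -e`, a bare `assert_file_exists "report.json"` line would stop the test on the first missing file.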
|
||||
4
src/fastp/test_data/pe/a.1.fastq
Normal file
@@ -0,0 +1,4 @@
|
||||
@1
|
||||
ACGGCAT
|
||||
+
|
||||
!!!!!!!
|
||||
4
src/fastp/test_data/pe/a.2.fastq
Normal file
@@ -0,0 +1,4 @@
|
||||
@1
|
||||
ACGGCAT
|
||||
+
|
||||
!!!!!!!
|
||||
10
src/fastp/test_data/script.sh
Executable file
@@ -0,0 +1,10 @@
|
||||
# fastp test data
|
||||
|
||||
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/fastp/test
|
||||
|
||||
if [ ! -d /tmp/snakemake-wrappers ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
|
||||
fi
|
||||
|
||||
cp -r /tmp/snakemake-wrappers/bio/fastp/test/reads/* src/fastp/test_data
|
||||
|
||||
4
src/fastp/test_data/se/a.fastq
Normal file
@@ -0,0 +1,4 @@
|
||||
@1
|
||||
ACGGCAT
|
||||
+
|
||||
!!!!!!!
|
||||
336
src/featurecounts/config.vsh.yaml
Normal file
@@ -0,0 +1,336 @@
|
||||
name: featurecounts
|
||||
description: |
|
||||
featureCounts is a read summarization program for counting reads generated from either RNA or genomic DNA sequencing experiments by implementing highly efficient chromosome hashing and feature blocking techniques. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications.
|
||||
keywords: ["Read counting", "Genomic features"]
|
||||
links:
|
||||
homepage: https://subread.sourceforge.net/
|
||||
documentation: https://subread.sourceforge.net/SubreadUsersGuide.pdf
|
||||
repository: https://github.com/ShiLab-Bioinformatics/subread
|
||||
references:
|
||||
doi: "10.1093/bioinformatics/btt656"
|
||||
license: GPL-3.0
|
||||
requirements:
|
||||
commands: [ featureCounts ]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --annotation
|
||||
alternatives: ["-a"]
|
||||
type: file
|
||||
description: |
|
||||
Name of an annotation file. GTF/GFF format by default. See '--format' option for more format information.
|
||||
required: true
|
||||
example: annotation.gtf
|
||||
- name: --input
|
||||
alternatives: ["-i"]
|
||||
type: file
|
||||
multiple: true
|
||||
description: |
|
||||
A list of SAM or BAM format files separated by semicolons (;). They can be either name or location sorted. Location-sorted paired-end reads are automatically sorted by read names.
|
||||
required: true
|
||||
example: input_file1.bam
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --counts
|
||||
alternatives: ["-o"]
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Name of output file including read counts in tab delimited format.
|
||||
required: true
|
||||
example: features.tsv
|
||||
- name: --summary
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Summary statistics of counting results in tab delimited format.
|
||||
required: false
|
||||
example: summary.tsv
|
||||
- name: --junctions
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Count number of reads supporting each exon-exon junction. Junctions were identified from those exon-spanning reads in the input (containing 'N' in CIGAR string).
|
||||
example: junctions.txt
|
||||
required: false
|
||||
|
||||
- name: Annotation
|
||||
arguments:
|
||||
- name: --format
|
||||
alternatives: ["-F"]
|
||||
type: string
|
||||
description: |
|
||||
Specify format of the provided annotation file. Acceptable formats include 'GTF' (or compatible GFF format) and 'SAF'. 'GTF' by default.
|
||||
choices: [GTF, GFF, SAF]
|
||||
example: "GTF"
|
||||
required: false
|
||||
- name: --feature_type
|
||||
alternatives: ["-t"]
|
||||
type: string
|
||||
description: |
|
||||
Specify feature type(s) in a GTF annotation. If multiple types are provided, they should be separated by ';' with no space in between. 'exon' by default. Rows in the annotation with a matched feature will be extracted and used for read mapping.
|
||||
example: "exon"
|
||||
required: false
|
||||
multiple: true
|
||||
- name: --attribute_type
|
||||
alternatives: ["-g"]
|
||||
type: string
|
||||
description: |
|
||||
Specify attribute type in GTF annotation. 'gene_id' by default. Meta-features used for read counting will be extracted from annotation using the provided value.
|
||||
example: "gene_id"
|
||||
required: false
|
||||
- name: --extra_attributes
|
||||
type: string
|
||||
description: |
|
||||
Extract extra attribute types from the provided GTF annotation and include them in the counting output. These attribute types will not be used to group features. If more than one attribute type is provided they should be separated by semicolon (;).
|
||||
required: false
|
||||
multiple: true
|
||||
- name: --chrom_alias
|
||||
alternatives: ["-A"]
|
||||
type: file
|
||||
description: |
|
||||
Provide a chromosome name alias file to match chr names in annotation with those in the reads. This should be a two-column comma-delimited text file. Its first column should include chr names in the annotation and its second column should include chr names in the reads. Chr names are case sensitive. No column header should be included in the file.
|
||||
required: false
|
||||
example: chrom_alias.csv
|
||||
|
||||
- name: Level of summarization
|
||||
arguments:
|
||||
- name: --feature_level
|
||||
alternatives: ["-f"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Perform read counting at feature level (e.g. counting reads for exons rather than genes).
|
||||
|
||||
- name: Overlap between reads and features
|
||||
arguments:
|
||||
- name: --overlapping
|
||||
alternatives: ["-O"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Assign reads to all their overlapping meta-features (or features if '--feature_level' is specified).
|
||||
- name: --min_overlap
|
||||
type: integer
|
||||
description: |
|
||||
Minimum number of overlapping bases in a read that is required for read assignment. 1 by default. Number of overlapping bases is counted from both reads if paired end. If a negative value is provided, then a gap of up to specified size will be allowed between read and the feature that the read is assigned to.
|
||||
required: false
|
||||
example: 1
|
||||
- name: --frac_overlap
|
||||
type: double
|
||||
description: |
|
||||
Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be within range [0,1]. 0 by default. Number of overlapping bases is counted from both reads if paired end. Both this option and '--min_overlap' option need to be satisfied for read assignment.
|
||||
required: false
|
||||
min: 0
|
||||
max: 1
|
||||
example: 0
|
||||
- name: --frac_overlap_feature
|
||||
type: double
|
||||
description: |
|
||||
Minimum fraction of overlapping bases in a feature that is required for read assignment. Value should be within range [0,1]. 0 by default.
|
||||
required: false
|
||||
min: 0
|
||||
max: 1
|
||||
example: 0
|
||||
- name: --largest_overlap
|
||||
type: boolean_true
|
||||
description: |
|
||||
Assign reads to a meta-feature/feature that has the largest number of overlapping bases.
|
||||
- name: --non_overlap
|
||||
type: integer
|
||||
description: |
|
||||
Maximum number of non-overlapping bases in a read (or a read pair) that is allowed when being assigned to a feature. No limit is set by default.
|
||||
required: false
|
||||
- name: --non_overlap_feature
|
||||
type: integer
|
||||
description: |
|
||||
Maximum number of non-overlapping bases in a feature that is allowed in read assignment. No limit is set by default.
|
||||
required: false
|
||||
- name: --read_extension5
|
||||
type: integer
|
||||
description: |
|
||||
Reads are extended upstream by <int> bases from their 5' end.
|
||||
required: false
|
||||
- name: --read_extension3
|
||||
type: integer
|
||||
description: |
|
||||
Reads are extended downstream by <int> bases from their 3' end.
|
||||
required: false
|
||||
- name: --read2pos
|
||||
type: integer
|
||||
description: |
|
||||
Reduce reads to their 5' most base or 3' most base. Read counting is then performed based on the single base the read is reduced to.
|
||||
required: false
|
||||
choices: [3, 5]
|
||||
|
||||
- name: Multi-mapping reads
|
||||
arguments:
|
||||
- name: --multi_mapping
|
||||
alternatives: ["-M"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Multi-mapping reads will also be counted. For a multi-mapping read, all its reported alignments will be counted. The 'NH' tag in BAM/SAM input is used to detect multi-mapping reads.
|
||||
|
||||
- name: Fractional counting
|
||||
arguments:
|
||||
- name: --fraction
|
||||
type: boolean_true
|
||||
description: |
|
||||
Assign fractional counts to features. This option must be used together with '--multi_mapping' or '--overlapping' or both. When '--multi_mapping' is specified, each reported alignment from a multi-mapping read (identified via 'NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When '--overlapping' is specified, each overlapping feature will receive a fractional count of 1/y, where y is the total number of features overlapping with the read. When both '--multi_mapping' and '--overlapping' are specified, each alignment will carry a fractional count of 1/(x*y).
|
||||
|
||||
- name: Read filtering
|
||||
arguments:
|
||||
- name: --min_map_quality
|
||||
alternatives: ["-Q"]
|
||||
type: integer
|
||||
description: |
|
||||
The minimum mapping quality score a read must satisfy in order to be counted. For paired-end reads, at least one end should satisfy this criterion. 0 by default.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --split_only
|
||||
type: boolean_true
|
||||
description: |
|
||||
Count split alignments only (i.e. alignments with CIGAR string containing 'N'). An example of split alignments is exon-spanning reads in RNA-seq data.
|
||||
- name: --non_split_only
|
||||
type: boolean_true
|
||||
description: |
|
||||
If specified, only non-split alignments (CIGAR strings do not contain letter 'N') will be counted. All the other alignments will be ignored.
|
||||
- name: --primary
|
||||
type: boolean_true
|
||||
description: |
|
||||
Count primary alignments only. Primary alignments are identified using bit 0x100 in SAM/BAM FLAG field.
|
||||
- name: --ignore_dup
|
||||
type: boolean_true
|
||||
description: |
|
||||
Ignore duplicate reads in read counting. Duplicate reads are identified using bit 0x400 in BAM/SAM FLAG field. The whole read pair is ignored if one of the reads is a duplicate read for paired-end data.
|
||||
|
||||
- name: Strandedness
|
||||
arguments:
|
||||
- name: --strand
|
||||
alternatives: ["-s"]
|
||||
type: integer
|
||||
description: |
|
||||
Perform strand-specific read counting. A single integer value (applied to all input files) should be provided. Possible values include: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). Default value is 0 (i.e. unstranded read counting carried out for all input files).
|
||||
choices: [0, 1, 2]
|
||||
example: 0
|
||||
required: false
|
||||
|
||||
- name: Exon-exon junctions
|
||||
arguments:
|
||||
- name: --ref_fasta
|
||||
alternatives: ["-G"]
|
||||
type: file
|
||||
description: |
|
||||
Provide the name of a FASTA-format file that contains the reference sequences used in read mapping that produced the provided SAM/BAM files.
|
||||
required: false
|
||||
example: reference.fasta
|
||||
|
||||
- name: Parameters specific to paired end reads
|
||||
arguments:
|
||||
- name: --paired
|
||||
alternatives: ["-p"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Specify that input data contain paired-end reads. To perform fragment counting (i.e. counting read pairs), the '--count_read_pairs' parameter should also be specified in addition to this parameter.
|
||||
- name: --count_read_pairs
|
||||
type: boolean_true
|
||||
description: |
|
||||
Count read pairs (fragments) instead of reads. This option is only applicable for paired-end reads.
|
||||
- name: --both_aligned
|
||||
alternatives: ["-B"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Only count read pairs that have both ends aligned.
|
||||
- name: --check_pe_dist
|
||||
alternatives: ["-P"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Check validity of paired-end distance when counting read pairs. Use '--min_length' and '--max_length' to set thresholds.
|
||||
- name: --min_length
|
||||
alternatives: ["-d"]
|
||||
type: integer
|
||||
description: |
|
||||
Minimum fragment/template length, 50 by default.
|
||||
required: false
|
||||
example: 50
|
||||
- name: --max_length
|
||||
alternatives: ["-D"]
|
||||
type: integer
|
||||
description: |
|
||||
Maximum fragment/template length, 600 by default.
|
||||
required: false
|
||||
example: 600
|
||||
- name: --same_strand
|
||||
alternatives: ["-C"]
|
||||
type: boolean_true
|
||||
description: |
|
||||
Do not count read pairs that have their two ends mapping to different chromosomes or mapping to same chromosome but on different strands.
|
||||
- name: --donotsort
|
||||
type: boolean_true
|
||||
description: |
|
||||
Do not sort reads in BAM/SAM input. Note that reads from the same pair are required to be located next to each other in the input.
|
||||
|
||||
- name: Read groups
|
||||
arguments:
|
||||
- name: --by_read_group
|
||||
type: boolean_true
|
||||
description: |
|
||||
Assign reads by read group. "RG" tag is required to be present in the input BAM/SAM files.
|
||||
|
||||
- name: Long reads
|
||||
arguments:
|
||||
- name: --long_reads
|
||||
type: boolean_true
|
||||
description: |
|
||||
Count long reads such as Nanopore and PacBio reads. Long read counting can only run in one thread and only reads (not read-pairs) can be counted. There is no limitation on the number of 'M' operations allowed in a CIGAR string in long read counting.
|
||||
|
||||
- name: Assignment results for each read
|
||||
arguments:
|
||||
- name: --detailed_results
|
||||
type: file
|
||||
direction: output
|
||||
description: |
|
||||
Directory to save the detailed assignment results. Use `--detailed_results_format` to determine the format of the detailed results.
|
||||
example: detailed_results/
|
||||
required: false
|
||||
- name: --detailed_results_format
|
||||
alternatives: ["-R"]
|
||||
type: string
|
||||
description: |
|
||||
Output detailed assignment results for each read or read-pair. Results are saved to a file that is in one of the following formats: CORE, SAM and BAM. See documentation for more info about these formats.
|
||||
required: false
|
||||
choices: [CORE, SAM, BAM]
|
||||
|
||||
- name: Miscellaneous
|
||||
arguments:
|
||||
- name: --max_M_op
|
||||
type: integer
|
||||
description: |
|
||||
Maximum number of 'M' operations allowed in a CIGAR string. 10 by default. Both 'X' and '=' are treated as 'M' and adjacent 'M' operations are merged in the CIGAR string.
|
||||
required: false
|
||||
example: 10
|
||||
- name: --verbose
|
||||
type: boolean_true
|
||||
description: |
|
||||
Output verbose information for debugging, such as un-matched chromosome/contig names.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/subread:2.0.6--he4a0461_0
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
featureCounts -v 2>&1 | sed 's/featureCounts v\([0-9.]*\)/featureCounts: \1/' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
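The Docker setup step above rewrites the version banner into a `name: version` line. A minimal sketch of that `sed` substitution follows; the function name `format_version` and the sample banner string are assumptions for illustration, not captured tool output.

```bash
# Hypothetical helper: apply the same sed expression used in the Docker setup step.
format_version() {
  # $1: a version banner line (the sample value below is an assumption)
  printf '%s\n' "$1" | sed 's/featureCounts v\([0-9.]*\)/featureCounts: \1/'
}

# Sample input chosen to match the pattern the expression looks for.
format_version "featureCounts v2.0.6"
```

Lines that do not match the pattern pass through `sed` unchanged, so any blank lines in the real banner would also end up in the versions file.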
242
src/featurecounts/help.txt
Normal file
@@ -0,0 +1,242 @@
|
||||
```bash
|
||||
featureCounts
|
||||
```
|
||||
|
||||
Version 2.0.3
|
||||
|
||||
Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
|
||||
|
||||
## Mandatory arguments:
|
||||
|
||||
-a <string> Name of an annotation file. GTF/GFF format by default. See
|
||||
-F option for more format information. Inbuilt annotations
|
||||
(SAF format) is available in 'annotation' directory of the
|
||||
package. Gzipped file is also accepted.
|
||||
|
||||
-o <string> Name of output file including read counts. A separate file
|
||||
including summary statistics of counting results is also
|
||||
included in the output ('<string>.summary'). Both files
|
||||
are in tab delimited format.
|
||||
|
||||
input_file1 [input_file2] ... A list of SAM or BAM format files. They can be
|
||||
either name or location sorted. If no files provided,
|
||||
<stdin> input is expected. Location-sorted paired-end reads
|
||||
are automatically sorted by read names.
|
||||
|
||||
## Optional arguments:
|
||||
# Annotation
|
||||
|
||||
-F <string> Specify format of the provided annotation file. Acceptable
|
||||
formats include 'GTF' (or compatible GFF format) and
|
||||
'SAF'. 'GTF' by default. For SAF format, please refer to
|
||||
Users Guide.
|
||||
|
||||
-t <string> Specify feature type(s) in a GTF annotation. If multiple
|
||||
types are provided, they should be separated by ',' with
|
||||
no space in between. 'exon' by default. Rows in the
|
||||
annotation with a matched feature will be extracted and
|
||||
used for read mapping.
|
||||
|
||||
-g <string> Specify attribute type in GTF annotation. 'gene_id' by
|
||||
default. Meta-features used for read counting will be
|
||||
extracted from annotation using the provided value.
|
||||
|
||||
--extraAttributes Extract extra attribute types from the provided GTF
|
||||
annotation and include them in the counting output. These
|
||||
attribute types will not be used to group features. If
|
||||
more than one attribute type is provided they should be
|
||||
separated by comma.
|
||||
|
||||
-A <string> Provide a chromosome name alias file to match chr names in
|
||||
annotation with those in the reads. This should be a two-
|
||||
column comma-delimited text file. Its first column should
|
||||
include chr names in the annotation and its second column
|
||||
should include chr names in the reads. Chr names are case
|
||||
sensitive. No column header should be included in the
|
||||
file.
|
||||
|
||||
# Level of summarization
|
||||
|
||||
-f Perform read counting at feature level (eg. counting
|
||||
reads for exons rather than genes).
|
||||
|
||||
# Overlap between reads and features
|
||||
|
||||
-O Assign reads to all their overlapping meta-features (or
|
||||
features if -f is specified).
|
||||
|
||||
--minOverlap <int> Minimum number of overlapping bases in a read that is
|
||||
required for read assignment. 1 by default. Number of
|
||||
overlapping bases is counted from both reads if paired
|
||||
end. If a negative value is provided, then a gap of up
|
||||
to specified size will be allowed between read and the
|
||||
feature that the read is assigned to.
|
||||
|
||||
--fracOverlap <float> Minimum fraction of overlapping bases in a read that is
|
||||
required for read assignment. Value should be within range
|
||||
[0,1]. 0 by default. Number of overlapping bases is
|
||||
counted from both reads if paired end. Both this option
|
||||
and '--minOverlap' option need to be satisfied for read
|
||||
assignment.
|
||||
|
||||
--fracOverlapFeature <float> Minimum fraction of overlapping bases in a
|
||||
feature that is required for read assignment. Value
|
||||
should be within range [0,1]. 0 by default.
|
||||
|
||||
--largestOverlap Assign reads to a meta-feature/feature that has the
|
||||
largest number of overlapping bases.
|
||||
|
||||
--nonOverlap <int> Maximum number of non-overlapping bases in a read (or a
|
||||
read pair) that is allowed when being assigned to a
|
||||
feature. No limit is set by default.
|
||||
|
||||
--nonOverlapFeature <int> Maximum number of non-overlapping bases in a feature
|
||||
that is allowed in read assignment. No limit is set by
|
||||
default.
|
||||
|
||||
--readExtension5 <int> Reads are extended upstream by <int> bases from their
|
||||
5' end.
|
||||
|
||||
--readExtension3 <int> Reads are extended downstream by <int> bases from their
|
||||
3' end.
|
||||
|
||||
--read2pos <5:3> Reduce reads to their 5' most base or 3' most base. Read
|
||||
counting is then performed based on the single base the
|
||||
read is reduced to.
|
||||
|
||||
# Multi-mapping reads
|
||||
|
||||
-M Multi-mapping reads will also be counted. For a multi-
|
||||
mapping read, all its reported alignments will be
|
||||
counted. The 'NH' tag in BAM/SAM input is used to detect
|
||||
multi-mapping reads.
|
||||
|
||||
# Fractional counting
|
||||
|
||||
--fraction Assign fractional counts to features. This option must
|
||||
be used together with '-M' or '-O' or both. When '-M' is
|
||||
specified, each reported alignment from a multi-mapping
|
||||
read (identified via 'NH' tag) will carry a fractional
|
||||
count of 1/x, instead of 1 (one), where x is the total
|
||||
number of alignments reported for the same read. When '-O'
|
||||
is specified, each overlapping feature will receive a
|
||||
fractional count of 1/y, where y is the total number of
|
||||
features overlapping with the read. When both '-M' and
|
||||
'-O' are specified, each alignment will carry a fractional
|
||||
count of 1/(x*y).
|
||||
|
||||
# Read filtering
|
||||
|
||||
-Q <int> The minimum mapping quality score a read must satisfy in
|
||||
order to be counted. For paired-end reads, at least one
|
||||
end should satisfy this criteria. 0 by default.
|
||||
|
||||
--splitOnly Count split alignments only (ie. alignments with CIGAR
|
||||
string containing 'N'). An example of split alignments is
|
||||
exon-spanning reads in RNA-seq data.
|
||||
|
||||
--nonSplitOnly If specified, only non-split alignments (CIGAR strings do
|
||||
not contain letter 'N') will be counted. All the other
|
||||
alignments will be ignored.
|
||||
|
||||
--primary Count primary alignments only. Primary alignments are
|
||||
identified using bit 0x100 in SAM/BAM FLAG field.
|
||||
|
||||
--ignoreDup Ignore duplicate reads in read counting. Duplicate reads
|
||||
are identified using bit 0x400 in BAM/SAM FLAG field. The
|
||||
whole read pair is ignored if one of the reads is a
|
||||
duplicate read for paired end data.
|
||||
|
||||
# Strandness
|
||||
|
||||
-s <int or string> Perform strand-specific read counting. A single integer
|
||||
value (applied to all input files) or a string of comma-
|
||||
separated values (applied to each corresponding input
|
||||
file) should be provided. Possible values include:
|
||||
0 (unstranded), 1 (stranded) and 2 (reversely stranded).
|
||||
Default value is 0 (ie. unstranded read counting carried
|
||||
out for all input files).
|
||||
|
||||
# Exon-exon junctions
|
||||
|
||||
-J Count number of reads supporting each exon-exon junction.
|
||||
Junctions were identified from those exon-spanning reads
|
||||
in the input (containing 'N' in CIGAR string). Counting
|
||||
results are saved to a file named '<output_file>.jcounts'
|
||||
|
||||
-G <string> Provide the name of a FASTA-format file that contains the
|
||||
reference sequences used in read mapping that produced the
|
||||
provided SAM/BAM files. This optional argument can be used
|
||||
with '-J' option to improve read counting for junctions.
|
||||
|
||||
# Parameters specific to paired end reads
|
||||
|
||||
-p Specify that input data contain paired-end reads. To
|
||||
perform fragment counting (ie. counting read pairs), the
|
||||
'--countReadPairs' parameter should also be specified in
|
||||
addition to this parameter.
|
||||
|
||||
--countReadPairs Count read pairs (fragments) instead of reads. This option
|
||||
is only applicable for paired-end reads.
|
||||
|
||||
-B Only count read pairs that have both ends aligned.
|
||||
|
||||
-P Check validity of paired-end distance when counting read
|
||||
pairs. Use -d and -D to set thresholds.
|
||||
|
||||
-d <int> Minimum fragment/template length, 50 by default.
|
||||
|
||||
-D <int> Maximum fragment/template length, 600 by default.
|
||||
|
||||
-C Do not count read pairs that have their two ends mapping
|
||||
to different chromosomes or mapping to same chromosome
|
||||
but on different strands.
|
||||
|
||||
--donotsort Do not sort reads in BAM/SAM input. Note that reads from
|
||||
the same pair are required to be located next to each
|
||||
other in the input.
|
||||
|
||||
# Number of CPU threads
|
||||
|
||||
-T <int> Number of the threads. 1 by default.
|
||||
|
||||
# Read groups
|
||||
|
||||
--byReadGroup Assign reads by read group. "RG" tag is required to be
|
||||
present in the input BAM/SAM files.
|
||||
|
||||
|
||||
# Long reads
|
||||
|
||||
-L Count long reads such as Nanopore and PacBio reads. Long
|
||||
read counting can only run in one thread and only reads
|
||||
(not read-pairs) can be counted. There is no limitation on
|
||||
the number of 'M' operations allowed in a CIGAR string in
|
||||
long read counting.
|
||||
|
||||
# Assignment results for each read
|
||||
|
||||
-R <format> Output detailed assignment results for each read or read-
|
||||
pair. Results are saved to a file that is in one of the
|
||||
following formats: CORE, SAM and BAM. See Users Guide for
|
||||
more info about these formats.
|
||||
|
||||
--Rpath <string> Specify a directory to save the detailed assignment
|
||||
results. If unspecified, the directory where counting
|
||||
results are saved is used.
|
||||
|
||||
# Miscellaneous
|
||||
|
||||
--tmpDir <string> Directory under which intermediate files are saved (later
|
||||
removed). By default, intermediate files will be saved to
|
||||
the directory specified in '-o' argument.
|
||||
|
||||
--maxMOp <int> Maximum number of 'M' operations allowed in a CIGAR
|
||||
string. 10 by default. Both 'X' and '=' are treated as 'M'
|
||||
and adjacent 'M' operations are merged in the CIGAR
|
||||
string.
|
||||
|
||||
--verbose Output verbose information for debugging, such as un-
|
||||
matched chromosome/contig names.
|
||||
|
||||
-v Output version of the program.
|
||||
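The `--fraction` description in the help text above gives the per-alignment weight as 1/x with `-M`, 1/y with `-O`, and 1/(x*y) with both. The arithmetic can be sketched as follows; `fractional_count` is a hypothetical helper written for this illustration and is not part of featureCounts.

```bash
# Hypothetical helper: weight carried by one alignment under --fraction.
#   $1 = x, number of alignments reported for the read (use 1 without -M)
#   $2 = y, number of features overlapping the read   (use 1 without -O)
fractional_count() {
  local alignments="$1" features="$2"
  LC_ALL=C awk -v x="$alignments" -v y="$features" \
    'BEGIN { printf "%.4f\n", 1 / (x * y) }'
}

fractional_count 2 1   # -M only: two reported alignments
fractional_count 1 3   # -O only: three overlapping features
fractional_count 2 3   # -M and -O combined
```

Passing 1 for the unused dimension reduces the combined formula to the single-option cases, which is why one function covers all three.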
94
src/featurecounts/script.sh
Normal file
@@ -0,0 +1,94 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# create temporary directory
|
||||
tmp_dir=$(mktemp -d -p "$meta_temp_dir" "${meta_functionality_name}_XXXXXX")
|
||||
mkdir -p "$tmp_dir/temp"
|
||||
|
||||
# create detailed_results directory if variable is set and directory does not exist
|
||||
if [[ ! -z "$par_detailed_results" ]] && [[ ! -d "$par_detailed_results" ]]; then
|
||||
mkdir -p "$par_detailed_results"
|
||||
fi
|
||||
|
||||
# replace semicolons with commas, the separator featureCounts expects
|
||||
par_feature_type=$(echo "$par_feature_type" | tr ';' ',')
|
||||
par_extra_attributes=$(echo "$par_extra_attributes" | tr ';' ',')
|
||||
|
||||
# unset flag variables
|
||||
[[ "$par_feature_level" == "false" ]] && unset par_feature_level
|
||||
[[ "$par_overlapping" == "false" ]] && unset par_overlapping
|
||||
[[ "$par_largest_overlap" == "false" ]] && unset par_largest_overlap
|
||||
[[ "$par_multi_mapping" == "false" ]] && unset par_multi_mapping
|
||||
[[ "$par_fraction" == "false" ]] && unset par_fraction
|
||||
[[ "$par_split_only" == "false" ]] && unset par_split_only
|
||||
[[ "$par_non_split_only" == "false" ]] && unset par_non_split_only
|
||||
[[ "$par_primary" == "false" ]] && unset par_primary
|
||||
[[ "$par_ignore_dup" == "false" ]] && unset par_ignore_dup
|
||||
[[ "$par_paired" == "false" ]] && unset par_paired
|
||||
[[ "$par_count_read_pairs" == "false" ]] && unset par_count_read_pairs
|
||||
[[ "$par_both_aligned" == "false" ]] && unset par_both_aligned
|
||||
[[ "$par_check_pe_dist" == "false" ]] && unset par_check_pe_dist
|
||||
[[ "$par_same_strand" == "false" ]] && unset par_same_strand
|
||||
[[ "$par_donotsort" == "false" ]] && unset par_donotsort
|
||||
[[ "$par_by_read_group" == "false" ]] && unset par_by_read_group
|
||||
[[ "$par_long_reads" == "false" ]] && unset par_long_reads
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
|
||||
IFS=";" read -ra input <<< "$par_input"
|
||||
|
||||
featureCounts \
|
||||
${par_format:+-F "${par_format}"} \
|
||||
${par_feature_type:+-t "${par_feature_type}"} \
|
||||
${par_attribute_type:+-g "${par_attribute_type}"} \
|
||||
${par_extra_attributes:+--extraAttributes "${par_extra_attributes}"} \
|
||||
${par_chrom_alias:+-A "${par_chrom_alias}"} \
|
||||
${par_feature_level:+-f} \
|
||||
${par_overlapping:+-O} \
|
||||
${par_min_overlap:+--minOverlap "${par_min_overlap}"} \
|
||||
${par_frac_overlap:+--fracOverlap "${par_frac_overlap}"} \
|
||||
${par_frac_overlap_feature:+--fracOverlapFeature "${par_frac_overlap_feature}"} \
|
||||
${par_largest_overlap:+--largestOverlap} \
|
||||
${par_non_overlap:+--nonOverlap "${par_non_overlap}"} \
|
||||
${par_non_overlap_feature:+--nonOverlapFeature "${par_non_overlap_feature}"} \
|
||||
${par_read_extension5:+--readExtension5 "${par_read_extension5}"} \
|
||||
${par_read_extension3:+--readExtension3 "${par_read_extension3}"} \
|
||||
${par_read2pos:+--read2pos "${par_read2pos}"} \
|
||||
${par_multi_mapping:+-M} \
|
||||
${par_fraction:+--fraction} \
|
||||
${par_min_map_quality:+-Q "${par_min_map_quality}"} \
|
||||
${par_split_only:+--splitOnly} \
|
||||
${par_non_split_only:+--nonSplitOnly} \
|
||||
${par_primary:+--primary} \
|
||||
${par_ignore_dup:+--ignoreDup} \
|
||||
${par_strand:+-s "${par_strand}"} \
|
||||
${par_junctions:+-J} \
|
||||
${par_ref_fasta:+-G "${par_ref_fasta}"} \
|
||||
${par_paired:+-p} \
|
||||
${par_count_read_pairs:+--countReadPairs} \
|
||||
${par_both_aligned:+-B} \
|
||||
${par_check_pe_dist:+-P} \
|
||||
${par_min_length:+-d "${par_min_length}"} \
|
||||
${par_max_length:+-D "${par_max_length}"} \
|
||||
${par_same_strand:+-C} \
|
||||
${par_donotsort:+--donotsort} \
|
||||
${par_by_read_group:+--byReadGroup} \
|
||||
${par_long_reads:+-L} \
|
||||
${par_detailed_results:+--Rpath "${par_detailed_results}"} \
|
||||
${par_detailed_results_format:+-R "${par_detailed_results_format}"} \
|
||||
${par_max_M_op:+--maxMOp "${par_max_M_op}"} \
|
||||
${par_verbose:+--verbose} \
|
||||
${meta_cpus:+-T "${meta_cpus}"} \
|
||||
--tmpDir "$tmp_dir/temp" \
|
||||
-a "$par_annotation" \
|
||||
-o "$tmp_dir/output.txt" \
|
||||
"${input[@]}"
|
||||
|
||||
[[ ! -z "$par_counts" ]] && mv "$tmp_dir/output.txt" "$par_counts"
|
||||
[[ ! -z "$par_summary" ]] && mv "$tmp_dir/output.txt.summary" "$par_summary"
|
||||
if [[ ! -z "$par_junctions" ]] && [[ -e "$tmp_dir/output.txt.jcounts" ]]; then
|
||||
mv "$tmp_dir/output.txt.jcounts" "$par_junctions"
|
||||
fi
|
||||
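The script above relies on two bash idioms: `${var:+...}` to emit a flag only when a variable is set and non-empty, and `read -ra` to split a `;`-separated list into an array. A minimal sketch of both follows. The `show_args` and `build_args` helpers and the sample values are hypothetical stand-ins for the variables Viash would inject.

```bash
# Hypothetical helper: print one argument per line so the expansion is visible.
show_args() { printf '%s\n' "$@"; }

# Hypothetical sample values standing in for Viash-provided variables.
par_input="a.bam;b.bam"
par_overlapping="true"
par_min_overlap=""        # empty: the flag should be dropped entirely
par_frac_overlap="0.2"

# Split the ';'-separated list into an array, one element per file.
IFS=";" read -ra input <<< "$par_input"

build_args() {
  # Unquoted ${var:+...} expands to nothing when var is empty,
  # and to the flag (plus its quoted value) otherwise.
  show_args \
    ${par_overlapping:+-O} \
    ${par_min_overlap:+--minOverlap "${par_min_overlap}"} \
    ${par_frac_overlap:+--fracOverlap "${par_frac_overlap}"} \
    "${input[@]}"
}

build_args
```

Expanding the array as `"${input[@]}"` keeps every file a separate argument, whereas `"${input[*]}"` would join all files into one string.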
59
src/featurecounts/test.sh
Normal file
@@ -0,0 +1,59 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
dir_in="$meta_resources_dir/test_data"
|
||||
|
||||
echo "> Run featureCounts (with junctions)"
|
||||
"$meta_executable" \
|
||||
--input "$dir_in/a.bam" \
|
||||
--annotation "$dir_in/annotation.gtf" \
|
||||
--counts "features.tsv" \
|
||||
--summary "summary.tsv" \
|
||||
--junctions "junction_counts.txt" \
|
||||
--ref_fasta "$dir_in/genome.fasta" \
|
||||
--overlapping \
|
||||
--frac_overlap 0.2 \
|
||||
--paired \
|
||||
--strand 0 \
|
||||
--detailed_results detailed_results \
|
||||
--detailed_results_format SAM
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "features.tsv" ] && echo "Output file features.tsv does not exist" && exit 1
|
||||
[ ! -f "summary.tsv" ] && echo "Output file summary.tsv does not exist" && exit 1
|
||||
[ ! -f "junction_counts.txt" ] && echo "Output file junction_counts.txt does not exist" && exit 1
|
||||
[ ! -d "detailed_results" ] && echo "Output directory detailed_results does not exist" && exit 1
|
||||
[ ! -f "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "features.tsv" ] && echo "Output file features.tsv is empty" && exit 1
|
||||
[ ! -s "summary.tsv" ] && echo "Output file summary.tsv is empty" && exit 1
|
||||
[ ! -s "junction_counts.txt" ] && echo "Output file junction_counts.txt is empty" && exit 1
|
||||
[ ! -s "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam is empty" && exit 1
|
||||
|
||||
echo "> Run featureCounts (without junctions)"
|
||||
"$meta_executable" \
|
||||
--input "$dir_in/a.bam" \
|
||||
--annotation "$dir_in/annotation.gtf" \
|
||||
--counts "features.tsv" \
|
||||
--summary "summary.tsv" \
|
||||
--overlapping \
|
||||
--frac_overlap 0.2 \
|
||||
--paired \
|
||||
--strand 0 \
|
||||
--detailed_results detailed_results \
|
||||
--detailed_results_format SAM
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "features.tsv" ] && echo "Output file features.tsv does not exist" && exit 1
|
||||
[ ! -f "summary.tsv" ] && echo "Output file summary.tsv does not exist" && exit 1
|
||||
[ ! -d "detailed_results" ] && echo "Output directory detailed_results does not exist" && exit 1
|
||||
[ ! -f "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "features.tsv" ] && echo "Output file features.tsv is empty" && exit 1
|
||||
[ ! -s "summary.tsv" ] && echo "Output file summary.tsv is empty" && exit 1
|
||||
[ ! -s "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam is empty" && exit 1
|
||||
|
||||
echo "> Test successful"
|
||||
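The test above repeats the same existence and non-empty checks for each output file. One way to reduce that repetition is sketched below; `assert_file_not_empty` is a hypothetical helper that is not part of the component, and the demonstration uses throwaway files rather than real featureCounts output.

```bash
# Hypothetical helper: succeed only if the file exists and is non-empty.
assert_file_not_empty() {
  local path="$1"
  if [ ! -f "$path" ]; then
    echo "Output file $path does not exist" >&2
    return 1
  fi
  if [ ! -s "$path" ]; then
    echo "Output file $path is empty" >&2
    return 1
  fi
}

# Demonstration on temporary files (names are illustrative only).
demo_dir=$(mktemp -d)
printf 'Geneid\tcount\n' > "$demo_dir/features.tsv"
: > "$demo_dir/empty.tsv"

assert_file_not_empty "$demo_dir/features.tsv" && echo "features.tsv ok"
assert_file_not_empty "$demo_dir/empty.tsv" 2>/dev/null || echo "empty.tsv rejected"
```

In a script that runs under `set -e`, calling the helper on its own line would stop the test at the first failing file, much like the explicit `exit 1` calls.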
BIN
src/featurecounts/test_data/a.bam
Normal file
Binary file not shown.
6
src/featurecounts/test_data/annotation.gtf
Normal file
@@ -0,0 +1,6 @@
|
||||
1 havana gene 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; gene_name "A"; gene_source "havana"; gene_biotype "gene";
|
||||
1 havana transcript 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
|
||||
1 havana exon 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; exon_number "1"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000000"; exon_version "1"; tag "basic"; transcript_support_level "1";
|
||||
2 havana gene 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; gene_name "B"; gene_source "havana"; gene_biotype "gene";
|
||||
2 havana transcript 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
|
||||
2 havana exon 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; exon_number "1"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000001"; exon_version "1"; tag "basic"; transcript_support_level "1";
|
||||
4
src/featurecounts/test_data/genome.fasta
Normal file
@@ -0,0 +1,4 @@
|
||||
>1
|
||||
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
|
||||
>2
|
||||
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
|
||||
9
src/featurecounts/test_data/script.sh
Normal file
@@ -0,0 +1,9 @@
|
||||
# featureCounts test data
|
||||
|
||||
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/subread/featurecounts/test
|
||||
|
||||
if [ ! -d /tmp/snakemake-wrappers ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
|
||||
fi
|
||||
|
||||
cp -r /tmp/snakemake-wrappers/bio/subread/featurecounts/test/* src/featurecounts/test_data
|
||||
397
src/gffread/config.vsh.yaml
Normal file
@@ -0,0 +1,397 @@
|
||||
name: gffread
|
||||
description: Validate, filter, convert and perform various other operations on GFF files.
|
||||
keywords: [gff, conversion, validation, filtering]
|
||||
links:
|
||||
homepage: https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
|
||||
documentation: https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
|
||||
repository: https://github.com/gpertea/gffread
|
||||
references:
|
||||
doi: 10.12688/f1000research.23297.2
|
||||
license: MIT
|
||||
requirements:
|
||||
commands: [ gffread ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
type: file
|
||||
direction: input
|
||||
description: |
|
||||
A reference file in either the GFF3, GFF2 or GTF format.
|
||||
required: true
|
||||
example: annotation.gff
|
||||
- name: --chr_mapping
|
||||
alternatives: -m
|
||||
type: file
|
||||
direction: input
|
||||
description: |
|
||||
<chr_replace> is a name mapping table for converting reference sequence names,
|
||||
having this 2-column format: <original_ref_ID> <new_ref_ID>.
|
||||
- name: --seq_info
|
||||
alternatives: -s
|
||||
type: file
|
||||
direction: input
|
||||
description: |
|
||||
<seq_info.fsize> is a tab-delimited file providing this info for each of the mapped
|
||||
sequences: <seq-name> <seq-length> <seq-description> (useful for --description option with
|
||||
mRNA/EST/protein mappings).
|
||||
- name: --genome
|
||||
alternatives: -g
|
||||
type: file
|
||||
description: |
|
||||
Full path to a multi-fasta file with the genomic sequences for all input mappings,
|
||||
OR a directory with single-fasta files (one per genomic sequence, with file names
|
||||
matching sequence names).
|
||||
example: genome.fa
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --outfile
|
||||
alternatives: -o
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: |
|
||||
Write the output records into <outfile>.
|
||||
default: output.gff
|
||||
- name: --force_exons
|
||||
type: boolean_true
|
||||
description: |
|
||||
Make sure that the lowest level GFF features are considered "exon" features.
|
||||
- name: --gene2exon
|
||||
type: boolean_true
|
||||
description: |
|
||||
For single-line genes not parenting any transcripts, add an exon feature spanning
|
||||
the entire gene (treat it as a transcript).
|
||||
- name: --t_adopt
|
||||
type: boolean_true
|
||||
description: |
|
||||
Try to find a parent gene overlapping/containing a transcript that does not have
|
||||
any explicit gene Parent.
|
||||
- name: --decode
|
||||
alternatives: -D
|
||||
type: boolean_true
|
||||
description: |
|
||||
Decode url encoded characters within attributes.
|
||||
- name: --merge_exons
|
||||
alternatives: -Z
|
||||
type: boolean_true
|
||||
description: |
|
||||
Merge very close exons into a single exon (when intron size<4).
|
||||
- name: --junctions
|
||||
alternatives: -j
|
||||
type: boolean_true
|
||||
description: |
|
||||
Output the junctions and the corresponding transcripts.
|
||||
- name: --spliced_exons
|
||||
alternatives: -w
|
||||
type: file
|
||||
direction: output
|
||||
must_exist: false
|
||||
description: |
|
||||
Write a fasta file with spliced exons for each transcript.
|
||||
example: exons.fa
|
||||
- name: --w_add
|
||||
type: integer
|
||||
description: |
|
||||
For the --spliced_exons option, extract additional <N> bases both upstream and
|
||||
downstream of the transcript boundaries.
|
||||
- name: --w_nocds
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --spliced_exons, disable the output of CDS info in the FASTA file.
|
||||
- name: --spliced_cds
|
||||
alternatives: -x
|
||||
type: file
|
||||
must_exist: false
|
||||
example: cds.fa
|
||||
description: |
|
||||
Write a fasta file with spliced CDS for each GFF transcript.
|
||||
- name: --tr_cds
|
||||
alternatives: -y
|
||||
type: file
|
||||
must_exist: false
|
||||
example: tr_cds.fa
|
||||
description: |
|
||||
Write a protein fasta file with the translation of CDS for each record.
|
||||
- name: --w_coords
|
||||
alternatives: -W
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --spliced_exons, --spliced_cds and --tr_cds options, write in the FASTA defline
|
||||
all the exon coordinates projected onto the spliced sequence.
|
||||
- name: --stop_dot
|
||||
alternatives: -S
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --tr_cds option, use '*' instead of '.' as stop codon translation.
|
||||
- name: --id_version
|
||||
alternatives: -L
|
||||
type: boolean_true
|
||||
description: |
|
||||
Ensembl GTF to GFF3 conversion, adds version to IDs.
|
||||
- name: --trackname
|
||||
alternatives: -t
|
||||
type: string
|
||||
description: |
|
||||
Use <trackname> in the 2nd column of each GFF/GTF output line.
|
||||
- name: --gtf_output
|
||||
alternatives: -T
|
||||
type: boolean_true
|
||||
description: |
|
||||
Main output will be GTF instead of GFF3.
|
||||
- name: --bed
|
||||
type: boolean_true
|
||||
description: |
|
||||
Output records in BED format instead of default GFF3.
|
||||
- name: --tlf
|
||||
type: boolean_true
|
||||
description: |
|
||||
Output "transcript line format" which is like GFF but with exons and CDS related
|
||||
features stored as GFF attributes in the transcript feature line, like this:
|
||||
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
|
||||
<exons> is a comma-delimited list of exon_start-exon_end coordinates;
|
||||
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>.
|
||||
- name: --table
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
description: |
|
||||
Output a simple tab delimited format instead of GFF, with columns having the values
|
||||
of GFF attributes given in <attrlist>; special pseudo-attributes (prefixed by @) are
|
||||
recognized:
|
||||
@id, @geneid, @chr, @start, @end, @strand, @numexons, @exons, @cds, @covlen, @cdslen
|
||||
If any of --spliced_exons/--tr_cds/--spliced_cds FASTA output files are enabled, the
|
||||
same fields (excluding @id) are appended to the definition line of corresponding FASTA
|
||||
records.
|
||||
- name: --expose_dups
|
||||
type: boolean_true
|
||||
alternatives: [-E, -v]
|
||||
description: |
|
||||
Expose (warn about) duplicate transcript IDs and other potential problems with the
|
||||
given GFF/GTF records.
|
||||
- name: Options
|
||||
arguments:
|
||||
- name: --ids
|
||||
type: file
|
||||
description: |
|
||||
Discard records/transcripts if their IDs are not listed in <IDs.lst>.
|
||||
- name: --nids
|
||||
type: file
|
||||
description: |
|
||||
Discard records/transcripts if their IDs are listed in <IDs.lst>.
|
||||
- name: --maxintron
|
||||
alternatives: -i
|
||||
type: integer
|
||||
description: |
|
||||
Discard transcripts having an intron larger than <maxintron>.
|
||||
- name: --minlen
|
||||
alternatives: -l
|
||||
type: integer
|
||||
description: |
|
||||
Discard transcripts shorter than <minlen> bases.
|
||||
- name: --range
|
||||
alternatives: -r
|
||||
type: string
|
||||
description: |
|
||||
Only show transcripts overlapping coordinate range <start>..<end> (on chromosome/contig
|
||||
<chr>, strand <strand> if provided).
|
||||
- name: --strict_range
|
||||
alternatives: -R
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --range option, discard all transcripts that are not fully contained within the given
|
||||
range.
|
||||
- name: --jmatch
|
||||
type: string
|
||||
description: |
|
||||
Only output transcripts matching the given junction.
|
||||
- name: --no_single_exon
|
||||
alternatives: -U
|
||||
type: boolean_true
|
||||
description: |
|
||||
Discard single-exon transcripts.
|
||||
- name: --coding
|
||||
alternatives: -C
|
||||
type: boolean_true
|
||||
description: |
|
||||
Coding only: discard mRNAs that have no CDS features.
|
||||
- name: --nc
|
||||
type: boolean_true
|
||||
description: |
|
||||
Non-coding only: discard mRNAs that have CDS features.
|
||||
- name: --ignore_locus
|
||||
type: boolean_true
|
||||
description: |
|
||||
Discard locus features and attributes found in the input.
|
||||
- name: --description
|
||||
alternatives: -A
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use the description field from <seq_info.fsize> and add it as the value for a 'descr'
|
||||
attribute to the GFF record.
|
||||
|
||||
- name: Sorting
|
||||
arguments:
|
||||
- name: --sort_alpha
|
||||
type: boolean_true
|
||||
description: |
|
||||
Chromosomes (reference sequences) are sorted alphabetically.
|
||||
- name: --sort_by
|
||||
type: file
|
||||
must_exist: true
|
||||
description: |
|
||||
Sort the reference sequences by the order in which their names are given in the
|
||||
<refseq.lst> file.
|
||||
- name: Misc options
|
||||
arguments:
|
||||
- name: --keep_attrs
|
||||
alternatives: -F
|
||||
type: boolean_true
|
||||
description: |
|
||||
Keep all GFF attributes (for non-exon features).
|
||||
- name: --keep_exon_attrs
|
||||
type: boolean_true
|
||||
description: |
|
||||
For -F option, do not attempt to reduce redundant exon/CDS attributes.
|
||||
- name: --no_exon_attrs
|
||||
alternatives: -G
|
||||
type: boolean_true
|
||||
description: |
|
||||
Do not keep exon attributes, move them to the transcript feature (for GFF3 output).
|
||||
- name: --attrs
|
||||
type: string
|
||||
description: |
|
||||
Only output the GTF/GFF attributes listed in <attr-list>, which is a comma-delimited
|
||||
list of attribute names.
|
||||
- name: --keep_genes
|
||||
type: boolean_true
|
||||
description: |
|
||||
In transcript-only mode (default), also preserve gene records.
|
||||
- name: --keep_comments
|
||||
type: boolean_true
|
||||
description: |
|
||||
For GFF3 input/output, try to preserve comments.
|
||||
- name: --process_other
|
||||
alternatives: -O
|
||||
type: boolean_true
|
||||
description: |
|
||||
Process other non-transcript GFF records (by default non-transcript records are ignored).
|
||||
- name: --rm_stop_codons
|
||||
alternatives: -V
|
||||
type: boolean_true
|
||||
description: |
|
||||
Discard any mRNAs with CDS having in-frame stop codons (requires --genome).
|
||||
- name: --adj_cds_start
|
||||
alternatives: -H
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --rm_stop_codons option, check and adjust the starting CDS phase if the original phase
|
||||
leads to a translation with an in-frame stop codon.
|
||||
- name: --opposite_strand
|
||||
alternatives: -B
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --rm_stop_codons option, single-exon transcripts are also checked on the opposite strand (requires
|
||||
--genome).
|
||||
- name: --coding_status
|
||||
alternatives: -P
|
||||
type: boolean_true
|
||||
description: |
|
||||
Add transcript level GFF attributes about the coding status of each transcript, including
|
||||
partialness or in-frame stop codons (requires --genome).
|
||||
- name: --add_hasCDS
|
||||
type: boolean_true
|
||||
description: |
|
||||
Add a "hasCDS" attribute with value "true" for transcripts that have CDS features.
|
||||
- name: --adj_stop
|
||||
type: boolean_true
|
||||
description: |
|
||||
Stop codon adjustment: enables --coding_status and performs automatic adjustment of the CDS stop
|
||||
coordinate if premature or downstream.
|
||||
- name: --rm_noncanon
|
||||
alternatives: -N
|
||||
type: boolean_true
|
||||
description: |
|
||||
Discard multi-exon mRNAs that have any intron with a non-canonical splice site consensus
|
||||
(i.e. not GT-AG, GC-AG or AT-AC).
|
||||
- name: --complete_cds
|
||||
alternatives: -J
|
||||
type: boolean_true
|
||||
description: |
|
||||
Discard any mRNAs that either lack initial START codon or the terminal STOP codon, or
|
||||
have an in-frame stop codon (i.e. only print mRNAs with a complete CDS).
|
||||
- name: --no_pseudo
|
||||
type: boolean_true
|
||||
description: |
|
||||
Filter out records matching the 'pseudo' keyword.
|
||||
- name: --in_bed
|
||||
type: boolean_true
|
||||
description: |
|
||||
Input should be parsed as BED format (automatic if the input filename ends with .bed*).
|
||||
- name: --in_tlf
|
||||
type: boolean_true
|
||||
description: |
|
||||
Input is in a GFF-like one-line-per-transcript format without exon/CDS features (see the --tlf
|
||||
option); automatic if the input filename ends with .tlf.
|
||||
- name: --stream
|
||||
type: boolean_true
|
||||
description: |
|
||||
Fast processing of input GFF/BED transcripts as they are received (no sorting, exons must
|
||||
be grouped by transcript in the input data).
|
||||
|
||||
- name: Clustering
|
||||
arguments:
|
||||
- name: --merge
|
||||
alternatives: -M
|
||||
type: boolean_true
|
||||
description: |
|
||||
Cluster the input transcripts into loci, discarding "redundant" transcripts (those with
|
||||
the same exact introns and fully contained or equal boundaries).
|
||||
- name: --dupinfo
|
||||
alternatives: -d
|
||||
type: file
|
||||
description: |
|
||||
For --merge option, write duplication info to file <dupinfo>.
|
||||
- name: --cluster_only
|
||||
type: boolean_true
|
||||
description: |
|
||||
Same as --merge but without discarding any of the "duplicate" transcripts, only create
|
||||
"locus" features.
|
||||
- name: --rm_redundant
|
||||
alternatives: -K
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --merge option: also discard as redundant the shorter, fully contained transcripts (intron
|
||||
chains matching a part of the container).
|
||||
- name: --no_boundary
|
||||
alternatives: -Q
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --merge option, no longer require boundary containment when assessing redundancy (can be
|
||||
combined with --rm_redundant); only introns have to match for multi-exon transcripts, and >=80%
|
||||
overlap for single-exon transcripts.
|
||||
- name: --no_overlap
|
||||
alternatives: -Y
|
||||
type: boolean_true
|
||||
description: |
|
||||
For --merge option, enforce --no_boundary but also discard overlapping single-exon transcripts,
|
||||
even on the opposite strand (can be combined with --rm_redundant).
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/gffread:0.12.7--hdcf5f25_3
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
echo "gffread: \"$(gffread --version 2>&1)\"" > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
140
src/gffread/help.txt
Normal file
@@ -0,0 +1,140 @@
|
||||
```sh
|
||||
gffread --help
|
||||
```
|
||||
|
||||
gffread v0.12.7. Usage:
|
||||
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>]
|
||||
[-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
|
||||
[--jmatch <chr>:<start>-<end>] [--no-pseudo]
|
||||
[-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
|
||||
[-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
|
||||
[--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
|
||||
[<input_gff>]
|
||||
|
||||
Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
|
||||
transcripts (exon or CDS) and more.
|
||||
By default (i.e. without -O) only transcripts are processed, discarding any
|
||||
other non-transcript features. Default output is a simplified GFF3 with only
|
||||
the basic attributes.
|
||||
|
||||
Options:
|
||||
--ids discard records/transcripts if their IDs are not listed in <IDs.lst>
|
||||
--nids discard records/transcripts if their IDs are listed in <IDs.lst>
|
||||
-i discard transcripts having an intron larger than <maxintron>
|
||||
-l discard transcripts shorter than <minlen> bases
|
||||
-r only show transcripts overlapping coordinate range <start>..<end>
|
||||
(on chromosome/contig <chr>, strand <strand> if provided)
|
||||
-R for -r option, discard all transcripts that are not fully
|
||||
contained within the given range
|
||||
--jmatch only output transcripts matching the given junction
|
||||
-U discard single-exon transcripts
|
||||
-C coding only: discard mRNAs that have no CDS features
|
||||
--nc non-coding only: discard mRNAs that have CDS features
|
||||
--ignore-locus : discard locus features and attributes found in the input
|
||||
-A use the description field from <seq_info.fsize> and add it
|
||||
as the value for a 'descr' attribute to the GFF record
|
||||
-s <seq_info.fsize> is a tab-delimited file providing this info
|
||||
for each of the mapped sequences:
|
||||
<seq-name> <seq-length> <seq-description>
|
||||
(useful for -A option with mRNA/EST/protein mappings)
|
||||
Sorting: (by default, chromosomes are kept in the order they were found)
|
||||
--sort-alpha : chromosomes (reference sequences) are sorted alphabetically
|
||||
--sort-by : sort the reference sequences by the order in which their
|
||||
names are given in the <refseq.lst> file
|
||||
Misc options:
|
||||
-F keep all GFF attributes (for non-exon features)
|
||||
--keep-exon-attrs : for -F option, do not attempt to reduce redundant
|
||||
exon/CDS attributes
|
||||
-G do not keep exon attributes, move them to the transcript feature
|
||||
(for GFF3 output)
|
||||
--attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>
|
||||
which is a comma delimited list of attribute names to
|
||||
--keep-genes : in transcript-only mode (default), also preserve gene records
|
||||
--keep-comments: for GFF3 input/output, try to preserve comments
|
||||
-O process other non-transcript GFF records (by default non-transcript
|
||||
records are ignored)
|
||||
-V discard any mRNAs with CDS having in-frame stop codons (requires -g)
|
||||
-H for -V option, check and adjust the starting CDS phase
|
||||
if the original phase leads to a translation with an
|
||||
in-frame stop codon
|
||||
-B for -V option, single-exon transcripts are also checked on the
|
||||
opposite strand (requires -g)
|
||||
-P add transcript level GFF attributes about the coding status of each
|
||||
transcript, including partialness or in-frame stop codons (requires -g)
|
||||
--add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
|
||||
that have CDS features
|
||||
--adj-stop stop codon adjustment: enables -P and performs automatic
|
||||
adjustment of the CDS stop coordinate if premature or downstream
|
||||
-N discard multi-exon mRNAs that have any intron with a non-canonical
|
||||
splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
|
||||
-J discard any mRNAs that either lack initial START codon
|
||||
or the terminal STOP codon, or have an in-frame stop codon
|
||||
(i.e. only print mRNAs with a complete CDS)
|
||||
--no-pseudo: filter out records matching the 'pseudo' keyword
|
||||
--in-bed: input should be parsed as BED format (automatic if the input
|
||||
filename ends with .bed*)
|
||||
--in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
|
||||
features (see --tlf option below); automatic if the input
|
||||
filename ends with .tlf)
|
||||
--stream: fast processing of input GFF/BED transcripts as they are received
|
||||
((no sorting, exons must be grouped by transcript in the input data)
|
||||
Clustering:
|
||||
-M/--merge : cluster the input transcripts into loci, discarding
|
||||
"redundant" transcripts (those with the same exact introns
|
||||
and fully contained or equal boundaries)
|
||||
-d <dupinfo> : for -M option, write duplication info to file <dupinfo>
|
||||
--cluster-only: same as -M/--merge but without discarding any of the
|
||||
"duplicate" transcripts, only create "locus" features
|
||||
-K for -M option: also discard as redundant the shorter, fully contained
|
||||
transcripts (intron chains matching a part of the container)
|
||||
-Q for -M option, no longer require boundary containment when assessing
|
||||
redundancy (can be combined with -K); only introns have to match for
|
||||
multi-exon transcripts, and >=80% overlap for single-exon transcripts
|
||||
-Y for -M option, enforce -Q but also discard overlapping single-exon
|
||||
transcripts, even on the opposite strand (can be combined with -K)
|
||||
Output options:
|
||||
--force-exons: make sure that the lowest level GFF features are considered
|
||||
"exon" features
|
||||
--gene2exon: for single-line genes not parenting any transcripts, add an
|
||||
exon feature spanning the entire gene (treat it as a transcript)
|
||||
--t-adopt: try to find a parent gene overlapping/containing a transcript
|
||||
that does not have any explicit gene Parent
|
||||
-D decode url encoded characters within attributes
|
||||
-Z merge very close exons into a single exon (when intron size<4)
|
||||
-g full path to a multi-fasta file with the genomic sequences
|
||||
for all input mappings, OR a directory with single-fasta files
|
||||
(one per genomic sequence, with file names matching sequence names)
|
||||
-j output the junctions and the corresponding transcripts
|
||||
-w write a fasta file with spliced exons for each transcript
|
||||
--w-add <N> for the -w option, extract additional <N> bases
|
||||
both upstream and downstream of the transcript boundaries
|
||||
--w-nocds for -w, disable the output of CDS info in the FASTA file
|
||||
-x write a fasta file with spliced CDS for each GFF transcript
|
||||
-y write a protein fasta file with the translation of CDS for each record
|
||||
-W for -w, -x and -y options, write in the FASTA defline all the exon
|
||||
coordinates projected onto the spliced sequence;
|
||||
-S for -y option, use '*' instead of '.' as stop codon translation
|
||||
-L Ensembl GTF to GFF3 conversion, adds version to IDs
|
||||
-m <chr_replace> is a name mapping table for converting reference
|
||||
sequence names, having this 2-column format:
|
||||
<original_ref_ID> <new_ref_ID>
|
||||
-t use <trackname> in the 2nd column of each GFF/GTF output line
|
||||
-o write the output records into <outfile> instead of stdout
|
||||
-T main output will be GTF instead of GFF3
|
||||
--bed output records in BED format instead of default GFF3
|
||||
--tlf output "transcript line format" which is like GFF
|
||||
but with exons and CDS related features stored as GFF
|
||||
attributes in the transcript feature line, like this:
|
||||
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
|
||||
<exons> is a comma-delimited list of exon_start-exon_end coordinates;
|
||||
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
|
||||
--table output a simple tab delimited format instead of GFF, with columns
|
||||
having the values of GFF attributes given in <attrlist>; special
|
||||
pseudo-attributes (prefixed by @) are recognized:
|
||||
@id, @geneid, @chr, @start, @end, @strand, @numexons, @exons,
|
||||
@cds, @covlen, @cdslen
|
||||
If any of -w/-y/-x FASTA output files are enabled, the same fields
|
||||
(excluding @id) are appended to the definition line of corresponding
|
||||
FASTA records
|
||||
-v,-E expose (warn about) duplicate transcript IDs and other potential
|
||||
problems with the given GFF/GTF records
|
||||
119
src/gffread/script.sh
Normal file
@@ -0,0 +1,119 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# unset flags
|
||||
[[ "$par_coding" == "false" ]] && unset par_coding
|
||||
[[ "$par_strict_range" == "false" ]] && unset par_strict_range
|
||||
[[ "$par_no_single_exon" == "false" ]] && unset par_no_single_exon
|
||||
[[ "$par_no_exon_attrs" == "false" ]] && unset par_no_exon_attrs
|
||||
[[ "$par_nc" == "false" ]] && unset par_nc
|
||||
[[ "$par_ignore_locus" == "false" ]] && unset par_ignore_locus
|
||||
[[ "$par_description" == "false" ]] && unset par_description
|
||||
[[ "$par_sort_alpha" == "false" ]] && unset par_sort_alpha
|
||||
[[ "$par_keep_genes" == "false" ]] && unset par_keep_genes
|
||||
[[ "$par_keep_attrs" == "false" ]] && unset par_keep_attrs
|
||||
[[ "$par_keep_exon_attrs" == "false" ]] && unset par_keep_exon_attrs
|
||||
[[ "$par_keep_comments" == "false" ]] && unset par_keep_comments
|
||||
[[ "$par_process_other" == "false" ]] && unset par_process_other
|
||||
[[ "$par_rm_stop_codons" == "false" ]] && unset par_rm_stop_codons
|
||||
[[ "$par_adj_cds_start" == "false" ]] && unset par_adj_cds_start
|
||||
[[ "$par_opposite_strand" == "false" ]] && unset par_opposite_strand
|
||||
[[ "$par_coding_status" == "false" ]] && unset par_coding_status
|
||||
[[ "$par_add_hasCDS" == "false" ]] && unset par_add_hasCDS
|
||||
[[ "$par_adj_stop" == "false" ]] && unset par_adj_stop
|
||||
[[ "$par_rm_noncanon" == "false" ]] && unset par_rm_noncanon
|
||||
[[ "$par_complete_cds" == "false" ]] && unset par_complete_cds
|
||||
[[ "$par_no_pseudo" == "false" ]] && unset par_no_pseudo
|
||||
[[ "$par_in_bed" == "false" ]] && unset par_in_bed
|
||||
[[ "$par_in_tlf" == "false" ]] && unset par_in_tlf
|
||||
[[ "$par_stream" == "false" ]] && unset par_stream
|
||||
[[ "$par_merge" == "false" ]] && unset par_merge
|
||||
[[ "$par_rm_redundant" == "false" ]] && unset par_rm_redundant
|
||||
[[ "$par_no_boundary" == "false" ]] && unset par_no_boundary
|
||||
[[ "$par_no_overlap" == "false" ]] && unset par_no_overlap
|
||||
[[ "$par_force_exons" == "false" ]] && unset par_force_exons
|
||||
[[ "$par_gene2exon" == "false" ]] && unset par_gene2exon
|
||||
[[ "$par_t_adopt" == "false" ]] && unset par_t_adopt
|
||||
[[ "$par_decode" == "false" ]] && unset par_decode
|
||||
[[ "$par_merge_exons" == "false" ]] && unset par_merge_exons
|
||||
[[ "$par_junctions" == "false" ]] && unset par_junctions
|
||||
[[ "$par_w_nocds" == "false" ]] && unset par_w_nocds
|
||||
[[ "$par_tr_cds" == "false" ]] && unset par_tr_cds
|
||||
[[ "$par_w_coords" == "false" ]] && unset par_w_coords
|
||||
[[ "$par_stop_dot" == "false" ]] && unset par_stop_dot
|
||||
[[ "$par_id_version" == "false" ]] && unset par_id_version
|
||||
[[ "$par_gtf_output" == "false" ]] && unset par_gtf_output
|
||||
[[ "$par_bed" == "false" ]] && unset par_bed
|
||||
[[ "$par_tlf" == "false" ]] && unset par_tlf
|
||||
[[ "$par_expose_dups" == "false" ]] && unset par_expose_dups
|
||||
[[ "$par_cluster_only" == "false" ]] && unset par_cluster_only
|
||||
|
||||
|
||||
gffread \
|
||||
"$par_input" \
|
||||
${par_chr_mapping:+-m "$par_chr_mapping"} \
|
||||
${par_seq_info:+-s "$par_seq_info"} \
|
||||
-o "$par_outfile" \
|
||||
${par_force_exons:+--force-exons} \
|
||||
${par_gene2exon:+--gene2exon} \
|
||||
${par_t_adopt:+--t-adopt} \
|
||||
${par_decode:+-D} \
|
||||
${par_merge_exons:+-Z} \
|
||||
${par_genome:+-g "$par_genome"} \
|
||||
${par_junctions:+-j} \
|
||||
${par_spliced_exons:+-w "$par_spliced_exons"} \
|
||||
${par_w_add:+--w-add "$par_w_add"} \
|
||||
${par_w_nocds:+--w-nocds} \
|
||||
${par_spliced_cds:+-x "$par_spliced_cds"} \
|
||||
${par_tr_cds:+-y "$par_tr_cds"} \
|
||||
${par_w_coords:+-W} \
|
||||
${par_stop_dot:+-S} \
|
||||
${par_id_version:+-L} \
|
||||
${par_trackname:+-t "$par_trackname"} \
|
||||
${par_gtf_output:+-T} \
|
||||
${par_bed:+--bed} \
|
||||
${par_tlf:+--tlf} \
|
||||
${par_table:+--table "$par_table"} \
|
||||
${par_expose_dups:+-E} \
|
||||
${par_ids:+--ids "$par_ids"} \
|
||||
${par_nids:+--nids "$par_nids"} \
|
||||
${par_maxintron:+-i "$par_maxintron"} \
|
||||
${par_minlen:+-l "$par_minlen"} \
|
||||
${par_range:+-r "$par_range"} \
|
||||
${par_strict_range:+-R} \
|
||||
${par_jmatch:+--jmatch "$par_jmatch"} \
|
||||
${par_no_single_exon:+-U} \
|
||||
${par_coding:+-C} \
|
||||
${par_nc:+--nc} \
|
||||
${par_ignore_locus:+--ignore-locus} \
|
||||
${par_description:+-A} \
|
||||
${par_sort_alpha:+--sort-alpha} \
|
||||
${par_sort_by:+--sort-by "$par_sort_by"} \
|
||||
${par_keep_attrs:+-F} \
|
||||
${par_keep_exon_attrs:+--keep-exon-attrs} \
|
||||
${par_no_exon_attrs:+-G} \
|
||||
${par_attrs:+--attrs "$par_attrs"} \
|
||||
${par_keep_genes:+--keep-genes} \
|
||||
${par_keep_comments:+--keep-comments} \
|
||||
${par_process_other:+-O} \
|
||||
${par_rm_stop_codons:+-V} \
|
||||
${par_adj_cds_start:+-H} \
|
||||
${par_opposite_strand:+-B} \
|
||||
${par_coding_status:+-P} \
|
||||
${par_add_hasCDS:+--add-hasCDS} \
|
||||
${par_adj_stop:+--adj-stop} \
|
||||
${par_rm_noncanon:+-N} \
|
||||
${par_complete_cds:+-J} \
|
||||
${par_no_pseudo:+--no-pseudo} \
|
||||
${par_in_bed:+--in-bed} \
|
||||
${par_in_tlf:+--in-tlf} \
|
||||
${par_stream:+--stream} \
|
||||
${par_merge:+-M} \
|
||||
${par_dupinfo:+-d "$par_dupinfo"} \
|
||||
${par_cluster_only:+--cluster-only} \
|
||||
${par_rm_redundant:+-K} \
|
||||
${par_no_boundary:+-Q} \
|
||||
${par_no_overlap:+-Y}
|
||||
|
||||
111
src/gffread/test.sh
Executable file
@@ -0,0 +1,111 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -e
|
||||
|
||||
test_output_dir="${meta_resources_dir}/test_data/test_output"
|
||||
test_dir="${meta_resources_dir}/test_data"
|
||||
expected_output_dir="${meta_resources_dir}/test_data/output"
|
||||
|
||||
mkdir -p "$test_output_dir"
|
||||
|
||||
|
||||
################################################################################
|
||||
|
||||
echo "> Test 1 - Read annotation file, output GFF"
|
||||
|
||||
"$meta_executable" \
|
||||
--expose_dups \
|
||||
--outfile "$test_output_dir/ann_simple.gff" \
|
||||
--input "$test_dir/sequence.gff3"
|
||||
|
||||
|
||||
echo ">> Check if output exists"
|
||||
[ ! -f "$test_output_dir/ann_simple.gff" ] \
|
||||
&& echo "Output file test_output/ann_simple.gff does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "$test_output_dir/ann_simple.gff" ] \
|
||||
&& echo "Output file test_output/ann_simple.gff is empty" && exit 1
|
||||
|
||||
echo ">> Compare output to expected output"
|
||||
|
||||
# compare files, except for lines starting with "#"
|
||||
diff <(grep -v "^#" "$expected_output_dir/ann_simple.gff") \
|
||||
<(grep -v "^#" "$test_output_dir/ann_simple.gff") || \
|
||||
(echo "Output file ann_simple.gff does not match expected output" && exit 1)
|
||||
|
||||
################################################################################
|
||||
|
||||
echo "> Test 2 - Read annotation file, output GTF"
|
||||
|
||||
"$meta_executable" \
|
||||
--gtf_output \
|
||||
--outfile "$test_output_dir/annotation.gtf" \
|
||||
--input "$test_dir/sequence.gff3"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
[ ! -f "$test_output_dir/annotation.gtf" ] \
|
||||
&& echo "Output file test_output/annotation.gtf does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "$test_output_dir/annotation.gtf" ] \
|
||||
&& echo "Output file test_output/annotation.gtf is empty" && exit 1
|
||||
|
||||
echo ">> Compare output to expected output"
|
||||
diff "$expected_output_dir/annotation.gtf" "$test_output_dir/annotation.gtf" || \
|
||||
(echo "Output file annotation.gtf does not match expected output" && exit 1)
|
||||
|
||||
################################################################################
|
||||
|
||||
echo "> Test 3 - Generate fasta file from annotation file"
|
||||
|
||||
|
||||
"$meta_executable" \
|
||||
--genome "$test_dir/sequence.fasta" \
|
||||
--spliced_exons "$test_output_dir/transcripts.fa" \
|
||||
--outfile "$test_output_dir/output.gff" \
|
||||
--input "$test_dir/sequence.gff3"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
[ ! -f "$test_output_dir/transcripts.fa" ] \
|
||||
&& echo "Output file transcripts.fa does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "$test_output_dir/transcripts.fa" ] \
|
||||
&& echo "Output file transcripts.fa is empty" && exit 1
|
||||
|
||||
echo ">> Compare output to expected output"
|
||||
diff "$expected_output_dir/transcripts.fa" "$test_output_dir/transcripts.fa" || \
|
||||
(echo "Output file transcripts.fa does not match expected output" && exit 1)
|
||||
|
||||
################################################################################
|
||||
|
||||
echo "> Test 4 - Generate table from GFF annotation file"
|
||||
|
||||
"$meta_executable" \
|
||||
--table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
|
||||
--outfile "$test_output_dir/annotation.tbl" \
|
||||
--input "$test_dir/sequence.gff3"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
[ ! -f "$test_output_dir/annotation.tbl" ] \
|
||||
&& echo "Output file test_output/annotation.tbl does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "$test_output_dir/annotation.tbl" ] \
|
||||
&& echo "Output file test_output/annotation.tbl is empty" && exit 1
|
||||
|
||||
echo ">> Compare output to expected output"
|
||||
diff "$expected_output_dir/annotation.tbl" "$test_output_dir/annotation.tbl" || \
|
||||
(echo "Output file annotation.tbl does not match expected output" && exit 1)
|
||||
|
||||
################################################################################
|
||||
|
||||
rm -r "$test_output_dir"
|
||||
|
||||
echo "> All tests successful"
|
||||
|
||||
exit 0
|
||||
38
src/gffread/test_data/README.md
Normal file
@@ -0,0 +1,38 @@
|
||||
## GffRead usage examples
|
||||
|
||||
GffRead can be used to simply read an annotation file in a GFF format, and print it in either GFF3 (default) or
|
||||
GTF2 format (with the -T option), while discarding any non-transcript features and optional attributes.
|
||||
It can also report some potential issues found in the input GFF records. The command line for such a quick GFF/GTF
|
||||
file cleanup would be:
|
||||
```
|
||||
gffread -E annotation.gff -o ann_simple.gff
|
||||
```
|
||||
|
||||
This will create a minimalist GFF3 re-formatting of the transcript records found in the input file (`annotation.gff` in this example).
|
||||
The -E option directs GffRead to "expose" (display warnings about) any potential formatting issues
|
||||
encountered while parsing the input file.
|
||||
|
||||
In order to obtain the GTF2 version of the same transcript records, the `-T` option should be added:
|
||||
```
|
||||
gffread annotation.gff -T -o annotation.gtf
|
||||
```
|
||||
|
||||
GffRead can be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation
|
||||
a fasta file with the genomic sequences has to be provided as well. This can be accomplished with a command line like this:
|
||||
```
|
||||
gffread -w transcripts.fa -g genome.fa annotation.gff
|
||||
```
|
||||
The file `genome.fa` in this example would be a multi-fasta file with the chromosome/contig sequences of the target genome.
|
||||
This also requires that every contig or chromosome name found in the 1st column of the input GFF file
|
||||
(`annotation.gff` in this example) must have a corresponding sequence entry in the `genome.fa` file.
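
The requirement above can be checked before running `gffread -w`. The following is a minimal sketch and not part of GffRead: `missing_seqids` is a hypothetical helper name, and it assumes that a FASTA record is identified by the first whitespace-delimited word of its header line and that annotation lines have the nine tab-separated GFF columns.

```sh
#!/bin/bash

# Hypothetical helper (not a GffRead feature): print each sequence name used
# in column 1 of a GFF/GTF file that has no record in a FASTA file.
#   $1: annotation file (GFF/GTF), $2: genome FASTA file
missing_seqids() {
  awk -F'\t' '
    FILENAME == ARGV[1] {              # first file argument: the FASTA
      if (substr($0, 1, 1) == ">") {   # header line
        split(substr($0, 2), parts, /[ \t]/)
        seen[parts[1]] = 1             # keep only the first word as the name
      }
      next
    }
    /^#/ || NF < 9 { next }            # skip comments and incomplete lines
    !($1 in seen) && !reported[$1]++ { print $1 }   # report each name once
  ' "$2" "$1"
}
```

Calling `missing_seqids annotation.gff genome.fa` should print nothing when every sequence name is covered; any name it does print would need a matching record in `genome.fa`.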
|
||||
|
||||
|
||||
```
|
||||
gffread --table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
|
||||
-o annotation.tbl annotation.gff
|
||||
```
|
||||
This shows how the `--table` option can make a tab delimited table out of a GFF3 input.
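
Because the table is plain tab-delimited text, it can be post-processed with standard tools. The sketch below is an illustration only: `table_spans` is a hypothetical helper name, and it assumes the column order of the `--table` list used above, so that `@start` is column 3 and `@end` is column 4.

```sh
#!/bin/bash

# Hypothetical helper: for each row of a gffread --table output, print the
# transcript ID and the genomic span (end - start + 1).
#   $1: tab-delimited table with @id in column 1, @start in 3, @end in 4
table_spans() {
  awk -F'\t' -v OFS='\t' '{ print $1, $4 - $3 + 1 }' "$1"
}
```

Running `table_spans annotation.tbl` on the single row of the example table gives a span of 774 for `gene-Dmel_CG16905` (795 - 22 + 1), which agrees with the `CDS=1-774` annotation in `transcripts.fa`.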
|
||||
|
||||
The `output` directory contains all the output files that should be generated by the above examples.
|
||||
|
||||
|
||||
5
src/gffread/test_data/output/ann_simple.gff
Normal file
@@ -0,0 +1,5 @@
|
||||
##gff-version 3
|
||||
# gffread v0.12.7
|
||||
# gffread -E -o output/ann_simple.gff sequence.gff3
|
||||
NM_141699.3 RefSeq gene 22 795 . + . ID=gene-Dmel_CG16905;gene_name=eloF
|
||||
NM_141699.3 RefSeq CDS 22 795 . + 0 Parent=gene-Dmel_CG16905
|
||||
2
src/gffread/test_data/output/annotation.gtf
Normal file
@@ -0,0 +1,2 @@
|
||||
NM_141699.3 RefSeq transcript 22 795 . + . transcript_id "gene-Dmel_CG16905"; gene_id "gene-Dmel_CG16905"; gene_name "eloF"
|
||||
NM_141699.3 RefSeq CDS 22 795 . + 0 transcript_id "gene-Dmel_CG16905"; gene_name "eloF";
|
||||
1
src/gffread/test_data/output/annotation.tbl
Normal file
@@ -0,0 +1 @@
|
||||
gene-Dmel_CG16905 NM_141699.3 22 795 + 22-795 eloF eloF elongase F
|
||||
13
src/gffread/test_data/output/transcripts.fa
Normal file
@@ -0,0 +1,13 @@
|
||||
>gene-Dmel_CG16905 CDS=1-774
|
||||
ATGTTCGCTCCGATAGATCCTGTAAAGATACCCGTTGTAAGCAATCCATGGATAACCATGGGCACATTGA
|
||||
TTGGCTATCTGCTGTTTGTGCTCAAGCTGGGCCCCAAAATCATGGAGCACCGAAAGCCCTTCCATTTGAA
|
||||
TGGCGTCATCAGGATCTACAACATATTCCAGATCCTTTACAATGGTCTAATACTCGTTTTAGGAGTTCAC
|
||||
TTCCTGTTTGTCCTGAAAGCCTACCAAATCAGTTGCATTGTTAGCCTGCCGATGGATCACAAATATAAGG
|
||||
ATAGAGAGCGTTTGATTTGCACTTTGTACCTGGTGAACAAATTCGTAGACCTTGTGGAAACCATTTTCTT
|
||||
TGTGCTCCGCAAAAAGGACAGACAGATATCCTTCCTGCACGTCTTCCATCATTTTGCGATGGCATTTTTT
|
||||
GGATATCTCTACTACTGCTTCCACGGATACGGTGGCGTTGCCTTTCCACAGTGCCTGCTAAACACCGCCG
|
||||
TCCACGTGATTATGTACGCCTACTACTATCTATCCTCGATCAGCAAGGAGGTGCAGAGAAGTCTCTGGTG
|
||||
GAAGAAATACATCACAATTGCTCAGCTGGTCCAGTTCGCCATTATTCTGCTCCACTGTACCATCACGCTG
|
||||
GCACAGCCCAACTGCGCGGTCAACAGACCCTTGACCTACGGATGCGGATCGCTTTCAGCGTTTTTTGCAG
|
||||
TGATATTTAGCCAATTTTATTACCACAACTACATAAAGCCAGGAAAGAAGTCAGCGAAACAAAACAAAAA
|
||||
TTAA
|
||||
9
src/gffread/test_data/script.sh
Executable file
@@ -0,0 +1,9 @@
|
||||
#!/bin/bash
|
||||
|
||||
# clone repo
|
||||
if [ ! -d /tmp/gffread_source ]; then
|
||||
git clone --depth 2 --single-branch --branch master https://github.com/gpertea/gffread.git /tmp/gffread_source
|
||||
fi
|
||||
|
||||
# copy test data
|
||||
cp -r /tmp/gffread_source/examples/* src/gffread/test_data
|
||||
16
src/gffread/test_data/sequence.fasta
Normal file
@@ -0,0 +1,16 @@
|
||||
>NM_141699.3 Drosophila melanogaster elongase F (eloF), mRNA
|
||||
CACAACTCGATTAGATTCGCCATGTTCGCTCCGATAGATCCTGTAAAGATACCCGTTGTAAGCAATCCAT
|
||||
GGATAACCATGGGCACATTGATTGGCTATCTGCTGTTTGTGCTCAAGCTGGGCCCCAAAATCATGGAGCA
|
||||
CCGAAAGCCCTTCCATTTGAATGGCGTCATCAGGATCTACAACATATTCCAGATCCTTTACAATGGTCTA
|
||||
ATACTCGTTTTAGGAGTTCACTTCCTGTTTGTCCTGAAAGCCTACCAAATCAGTTGCATTGTTAGCCTGC
|
||||
CGATGGATCACAAATATAAGGATAGAGAGCGTTTGATTTGCACTTTGTACCTGGTGAACAAATTCGTAGA
|
||||
CCTTGTGGAAACCATTTTCTTTGTGCTCCGCAAAAAGGACAGACAGATATCCTTCCTGCACGTCTTCCAT
|
||||
CATTTTGCGATGGCATTTTTTGGATATCTCTACTACTGCTTCCACGGATACGGTGGCGTTGCCTTTCCAC
|
||||
AGTGCCTGCTAAACACCGCCGTCCACGTGATTATGTACGCCTACTACTATCTATCCTCGATCAGCAAGGA
|
||||
GGTGCAGAGAAGTCTCTGGTGGAAGAAATACATCACAATTGCTCAGCTGGTCCAGTTCGCCATTATTCTG
|
||||
CTCCACTGTACCATCACGCTGGCACAGCCCAACTGCGCGGTCAACAGACCCTTGACCTACGGATGCGGAT
|
||||
CGCTTTCAGCGTTTTTTGCAGTGATATTTAGCCAATTTTATTACCACAACTACATAAAGCCAGGAAAGAA
|
||||
GTCAGCGAAACAAAACAAAAATTAACTAAATTTAAACTAAATCATGAGTACAAAGCCTAAAGATTCGTGA
|
||||
AGCAACAATAGCCACAGCCTATTTTTGAATATTTCATATATGATTTTATGGGGTAAATGAATTAAAAAAC
|
||||
ATTTGTTTTCTTGGCGTCAAACT
|
||||
|
||||
9
src/gffread/test_data/sequence.gff3
Normal file
@@ -0,0 +1,9 @@
|
||||
##gff-version 3
|
||||
#!gff-spec-version 1.21
|
||||
#!processor NCBI annotwriter
|
||||
##sequence-region NM_141699.3 1 933
|
||||
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7227
|
||||
NM_141699.3 RefSeq region 1 933 . + . ID=NM_141699.3:1..933;Dbxref=taxon:7227;Name=3R;chromosome=3R;gbkey=Src;genome=chromosome;genotype=y[1]%3B Gr22b[1] Gr22d[1] cn[1] CG33964[R4.2] bw[1] sp[1]%3B LysC[1] MstProx[1] GstD5[1] Rh6[1];mol_type=mRNA
|
||||
NM_141699.3 RefSeq gene 1 933 . + . ID=gene-Dmel_CG16905;Dbxref=FLYBASE:FBgn0037762,GeneID:41211;Name=eloF;cyt_map=85E10-85E10;description=elongase F;gbkey=Gene;gen_map=3-49 cM;gene=eloF;gene_synonym=CG16905,Dmel\CG16905,EloF;locus_tag=Dmel_CG16905
|
||||
NM_141699.3 RefSeq CDS 22 795 . + 0 ID=cds-NP_649956.1;Parent=gene-Dmel_CG16905;Dbxref=FLYBASE:FBpp0081622,GeneID:41211,GenBank:NP_649956.1,FLYBASE:FBgn0037762;Name=NP_649956.1;gbkey=CDS;gene=eloF;locus_tag=Dmel_CG16905;orig_transcript_id=gnl|FlyBase|CG16905-RA;product=elongase F;protein_id=NP_649956.1
|
||||
|
||||
250
src/lofreq/call/config.vsh.yaml
Normal file
@@ -0,0 +1,250 @@
|
||||
name: lofreq_call
|
||||
namespace: lofreq
|
||||
description: |
|
||||
Call variants from a BAM file.
|
||||
|
||||
LoFreq* (i.e. LoFreq version 2) is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. It makes full use of base-call qualities and other sources of errors inherent in sequencing (e.g. mapping or base/indel alignment uncertainty), which are usually ignored by other methods or only used for filtering.
|
||||
|
||||
LoFreq* can run on almost any type of aligned sequencing data (e.g. Illumina, IonTorrent or Pacbio) since no machine- or sequencing-technology dependent thresholds are used. It automatically adapts to changes in coverage and sequencing quality and can therefore be applied to a variety of data-sets e.g. viral/quasispecies, bacterial, metagenomics or somatic data.
|
||||
|
||||
LoFreq* is very sensitive; most notably, it is able to predict variants below the average base-call quality (i.e. sequencing error rate). Each variant call is assigned a p-value which allows for rigorous false positive control. Even though it uses no approximations or heuristics, it is very efficient due to several runtime optimizations and also provides a (pseudo-)parallel implementation. LoFreq* is generic and fast enough to be applied to high-coverage data and large genomes. On a single processor it takes a minute to analyze Dengue genome sequencing data with nearly 4000X coverage, roughly one hour to call SNVs on a 600X coverage E.coli genome and also roughly an hour to run on a 100X coverage human exome dataset.
|
||||
keywords: [ "variant calling", "low frequency variant calling", "lofreq", "lofreq/call"]
|
||||
links:
|
||||
homepage: https://csb5.github.io/lofreq/
|
||||
documentation: https://csb5.github.io/lofreq/commands/
|
||||
references:
|
||||
doi: 10.1093/nar/gks918
|
||||
license: "MIT"
|
||||
requirements:
|
||||
commands: [ lofreq ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
type: file
|
||||
description: |
|
||||
Input BAM file.
|
||||
required: true
|
||||
example: "normal.bam"
|
||||
- name: --input_bai
|
||||
type: file
|
||||
description: |
|
||||
Index file for the input BAM file.
|
||||
required: true
|
||||
example: "normal.bai"
|
||||
- name: --ref
|
||||
alternatives: -f
|
||||
type: file
|
||||
description: |
|
||||
Indexed reference fasta file (gzip supported). Default: none.
|
||||
required: true
|
||||
example: "reference.fasta"
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --out
|
||||
alternatives: -o
|
||||
type: file
|
||||
description: |
|
||||
Vcf output file. Default: stdout.
|
||||
required: true
|
||||
direction: output
|
||||
example: "output.vcf"
|
||||
- name: Arguments
|
||||
arguments:
|
||||
- name: --region
|
||||
alternatives: -r
|
||||
type: string
|
||||
description: |
|
||||
Limit calls to this region (chrom:start-end). Default: none.
|
||||
required: false
|
||||
example: "chr1:1000-2000"
|
||||
- name: --bed
|
||||
alternatives: -l
|
||||
type: file
|
||||
description: |
|
||||
List of positions (chr pos) or regions (BED). Default: none.
|
||||
required: false
|
||||
example: "regions.bed"
|
||||
- name: --min_bq
|
||||
alternatives: -q
|
||||
type: integer
|
||||
description: |
|
||||
Skip any base with baseQ smaller than INT. Default: 6.
|
||||
required: false
|
||||
example: 6
|
||||
- name: --min_alt_bq
|
||||
alternatives: -Q
|
||||
type: integer
|
||||
description: |
|
||||
Skip alternate bases with baseQ smaller than INT. Default: 6.
|
||||
required: false
|
||||
example: 6
|
||||
- name: --def_alt_bq
|
||||
alternatives: -R
|
||||
type: integer
|
||||
description: |
|
||||
Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep). Default: 0.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --min_jq
|
||||
alternatives: -j
|
||||
type: integer
|
||||
description: |
|
||||
Skip any base with joinedQ smaller than INT. Default: 0.
|
||||
example: 0
|
||||
- name: --min_alt_jq
|
||||
alternatives: -J
|
||||
type: integer
|
||||
description: |
|
||||
Skip alternate bases with joinedQ smaller than INT. Default: 0.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --def_alt_jq
|
||||
alternatives: -K
|
||||
type: integer
|
||||
description: |
|
||||
Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep). Default: 0.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --no_baq
|
||||
alternatives: -B
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable use of base-alignment quality (BAQ).
|
||||
- name: --no_idaq
|
||||
alternatives: -A
|
||||
type: boolean_true
|
||||
description: |
|
||||
Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging).
|
||||
- name: --del_baq
|
||||
alternatives: -D
|
||||
type: boolean_true
|
||||
description: |
|
||||
Delete pre-existing BAQ values, i.e. compute even if already present in BAM.
|
||||
- name: --no_ext_baq
|
||||
alternatives: -e
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag).
|
||||
- name: --min_mq
|
||||
alternatives: -m
|
||||
type: integer
|
||||
description: |
|
||||
Skip reads with mapping quality smaller than INT. Default: 0.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --max_mq
|
||||
alternatives: -M
|
||||
type: integer
|
||||
description: |
|
||||
Cap mapping quality at INT. Default: 255.
|
||||
required: false
|
||||
example: 255
|
||||
- name: --no_mq
|
||||
alternatives: -N
|
||||
type: boolean_true
|
||||
description: |
|
||||
Don't merge mapping quality in LoFreq's model.
|
||||
- name: --call_indels
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable indel calls (note: preprocess your file to include indel alignment qualities!).
|
||||
- name: --only_indels
|
||||
type: boolean_true
|
||||
description: |
|
||||
Only call indels; no SNVs.
|
||||
- name: --src_qual
|
||||
alternatives: -s
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable computation of source quality.
|
||||
- name: --ign_vcf
|
||||
alternatives: -S
|
||||
type: file
|
||||
description: |
|
||||
Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas.
|
||||
required: false
|
||||
example: "variants.vcf"
|
||||
- name: --def_nm_q
|
||||
alternatives: -T
|
||||
type: integer
|
||||
description: |
|
||||
If >= 0, then replace non-match base qualities with this default value. Default: -1.
|
||||
required: false
|
||||
example: -1
|
||||
- name: --sig
|
||||
alternatives: -a
|
||||
type: double
|
||||
description: |
|
||||
P-Value cutoff / significance level. Default: 0.010000.
|
||||
required: false
|
||||
example: 0.01
|
||||
- name: --bonf
|
||||
alternatives: -b
|
||||
type: string
|
||||
description: |
|
||||
Bonferroni factor. 'dynamic' (increase per actually performed test) or INT. Default: 'dynamic'.
|
||||
required: false
|
||||
example: "dynamic"
|
||||
- name: --min_cov
|
||||
alternatives: -C
|
||||
type: integer
|
||||
description: |
|
||||
Test only positions having at least this coverage. Default: 1.
|
||||
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done).
|
||||
required: false
|
||||
example: 1
|
||||
- name: --max_depth
|
||||
alternatives: -d
|
||||
type: integer
|
||||
description: |
|
||||
Cap coverage at this depth. Default: 1000000.
|
||||
required: false
|
||||
example: 1000000
|
||||
- name: --illumina_13
|
||||
type: boolean_true
|
||||
description: |
|
||||
Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded.
|
||||
- name: --use_orphan
|
||||
type: boolean_true
|
||||
description: |
|
||||
Count anomalous read pairs (i.e. where mate is not aligned properly).
|
||||
- name: --plp_summary_only
|
||||
type: boolean_true
|
||||
description: |
|
||||
No variant calling. Just output pileup summary per column.
|
||||
- name: --no_default_filter
|
||||
type: boolean_true
|
||||
description: |
|
||||
Don't run default 'lofreq filter' automatically after calling variants.
|
||||
- name: --force_overwrite
|
||||
type: boolean_true
|
||||
description: |
|
||||
Overwrite any existing output.
|
||||
- name: --verbose
|
||||
type: boolean_true
|
||||
description: |
|
||||
Be verbose.
|
||||
- name: --debug
|
||||
type: boolean_true
|
||||
description: |
|
||||
Enable debugging.
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/lofreq:2.1.5--py38h794fc9e_10
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
version=$(lofreq version | grep 'version' | sed 's/version: //') && \
|
||||
echo "lofreq: $version" > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
49
src/lofreq/call/help.txt
Normal file
@@ -0,0 +1,49 @@
|
||||
lofreq call: call variants from BAM file
|
||||
|
||||
Usage: lofreq call [options] in.bam
|
||||
|
||||
Options:
|
||||
- Reference:
|
||||
-f | --ref FILE Indexed reference fasta file (gzip supported) [null]
|
||||
- Output:
|
||||
-o | --out FILE Vcf output file [- = stdout]
|
||||
- Regions:
|
||||
-r | --region STR Limit calls to this region (chrom:start-end) [null]
|
||||
-l | --bed FILE List of positions (chr pos) or regions (BED) [null]
|
||||
- Base-call quality:
|
||||
-q | --min-bq INT Skip any base with baseQ smaller than INT [6]
|
||||
-Q | --min-alt-bq INT Skip alternate bases with baseQ smaller than INT [6]
|
||||
-R | --def-alt-bq INT Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep) [0]
|
||||
-j | --min-jq INT Skip any base with joinedQ smaller than INT [0]
|
||||
-J | --min-alt-jq INT Skip alternate bases with joinedQ smaller than INT [0]
|
||||
-K | --def-alt-jq INT Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep) [0]
|
||||
- Base-alignment (BAQ) and indel-aligment (IDAQ) qualities:
|
||||
-B | --no-baq Disable use of base-alignment quality (BAQ)
|
||||
-A | --no-idaq Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging)
|
||||
-D | --del-baq Delete pre-existing BAQ values, i.e. compute even if already present in BAM
|
||||
-e | --no-ext-baq Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag)
|
||||
- Mapping quality:
|
||||
-m | --min-mq INT Skip reads with mapping quality smaller than INT [0]
|
||||
-M | --max-mq INT Cap mapping quality at INT [255]
|
||||
-N | --no-mq Don't merge mapping quality in LoFreq's model
|
||||
- Indels:
|
||||
--call-indels Enable indel calls (note: preprocess your file to include indel alignment qualities!)
|
||||
--only-indels Only call indels; no SNVs
|
||||
- Source quality:
|
||||
-s | --src-qual Enable computation of source quality
|
||||
-S | --ign-vcf FILE Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas
|
||||
-T | --def-nm-q INT If >= 0, then replace non-match base qualities with this default value [-1]
|
||||
- P-values:
|
||||
-a | --sig P-Value cutoff / significance level [0.010000]
|
||||
-b | --bonf Bonferroni factor. 'dynamic' (increase per actually performed test) or INT ['dynamic']
|
||||
- Misc.:
|
||||
-C | --min-cov INT Test only positions having at least this coverage [1]
|
||||
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done)
|
||||
-d | --max-depth INT Cap coverage at this depth [1000000]
|
||||
--illumina-1.3 Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded
|
||||
--use-orphan Count anomalous read pairs (i.e. where mate is not aligned properly)
|
||||
--plp-summary-only No variant calling. Just output pileup summary per column
|
||||
--no-default-filter Don't run default 'lofreq filter' automatically after calling variants
|
||||
--force-overwrite Overwrite any existing output
|
||||
--verbose Be verbose
|
||||
--debug Enable debugging
|
||||
57
src/lofreq/call/script.sh
Normal file
@@ -0,0 +1,57 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Unset all parameters that are set to "false"
|
||||
[[ "$par_no_baq" == "false" ]] && unset par_no_baq
|
||||
[[ "$par_no_idaq" == "false" ]] && unset par_no_idaq
|
||||
[[ "$par_del_baq" == "false" ]] && unset par_del_baq
|
||||
[[ "$par_no_ext_baq" == "false" ]] && unset par_no_ext_baq
|
||||
[[ "$par_no_mq" == "false" ]] && unset par_no_mq
|
||||
[[ "$par_call_indels" == "false" ]] && unset par_call_indels
|
||||
[[ "$par_only_indels" == "false" ]] && unset par_only_indels
|
||||
[[ "$par_src_qual" == "false" ]] && unset par_src_qual
|
||||
[[ "$par_illumina_13" == "false" ]] && unset par_illumina_13
|
||||
[[ "$par_use_orphan" == "false" ]] && unset par_use_orphan
|
||||
[[ "$par_plp_summary_only" == "false" ]] && unset par_plp_summary_only
|
||||
[[ "$par_no_default_filter" == "false" ]] && unset par_no_default_filter
|
||||
[[ "$par_force_overwrite" == "false" ]] && unset par_force_overwrite
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
[[ "$par_debug" == "false" ]] && unset par_debug
|
||||
|
||||
# Run lofreq call
|
||||
lofreq call \
|
||||
-f "$par_ref" \
|
||||
-o "$par_out" \
|
||||
${par_region:+-r "${par_region}"} \
|
||||
${par_bed:+-l "${par_bed}"} \
|
||||
${par_min_bq:+-q "${par_min_bq}"} \
|
||||
${par_min_alt_bq:+-Q "${par_min_alt_bq}"} \
|
||||
${par_def_alt_bq:+-R "${par_def_alt_bq}"} \
|
||||
${par_min_jq:+-j "${par_min_jq}"} \
|
||||
${par_min_alt_jq:+-J "${par_min_alt_jq}"} \
|
||||
${par_def_alt_jq:+-K "${par_def_alt_jq}"} \
|
||||
${par_no_baq:+-B} \
|
||||
${par_no_idaq:+-A} \
|
||||
${par_del_baq:+-D} \
|
||||
${par_no_ext_baq:+-e} \
|
||||
${par_min_mq:+-m "${par_min_mq}"} \
|
||||
${par_max_mq:+-M "${par_max_mq}"} \
|
||||
${par_no_mq:+-N} \
|
||||
${par_call_indels:+--call-indels} \
|
||||
${par_only_indels:+--only-indels} \
|
||||
${par_src_qual:+-s} \
|
||||
${par_ign_vcf:+-S "${par_ign_vcf}"} \
|
||||
${par_def_nm_q:+-T "${par_def_nm_q}"} \
|
||||
${par_sig:+-a "${par_sig}"} \
|
||||
${par_bonf:+-b "${par_bonf}"} \
|
||||
${par_min_cov:+-C "${par_min_cov}"} \
|
||||
${par_max_depth:+-d "${par_max_depth}"} \
|
||||
${par_illumina_13:+--illumina-1.3} \
|
||||
${par_use_orphan:+--use-orphan} \
|
||||
${par_plp_summary_only:+--plp-summary-only} \
|
||||
${par_no_default_filter:+--no-default-filter} \
|
||||
${par_force_overwrite:+--force-overwrite} \
|
||||
${par_verbose:+--verbose} \
|
||||
${par_debug:+--debug} \
|
||||
"$par_input"
|
||||
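The script above relies on one idiom: Viash passes `boolean_true` arguments as the strings `true` or `false`, so a `false` value is unset first, and `${var:+word}` then emits the flag only when the variable is set and non-empty. A minimal sketch of that idiom follows; `build_flags` and its two positional parameters are hypothetical names used for illustration and are not part of the component.

```bash
#!/bin/bash
# Sketch only: build_flags is a hypothetical helper, not part of script.sh.
# $1 mimics a boolean_true parameter ("true"/"false"), $2 an optional integer.
build_flags() {
  local no_baq="$1" min_bq="$2"

  # A "false" boolean must become unset, otherwise ${no_baq:+-B} would fire.
  [[ "$no_baq" == "false" ]] && unset no_baq

  # ${var:+word} expands to word only if var is set and non-empty.
  local args=( ${no_baq:+-B} ${min_bq:+-q "${min_bq}"} )
  printf '%s\n' "${args[*]}"
}

build_flags true 20    # prints: -B -q 20
build_flags false 6    # prints: -q 6
build_flags false ""   # prints an empty line
```

Collecting the flags in an array before printing avoids `echo` interpreting a flag such as `-e` as one of its own options.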
20
src/lofreq/call/test.sh
Normal file
@@ -0,0 +1,20 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
dir_in="${meta_resources_dir%/}/test_data"
|
||||
|
||||
echo "> Run lofreq call"
|
||||
"$meta_executable" \
|
||||
--input "$dir_in/a.bam" \
|
||||
--input_bai "$dir_in/a.bai" \
|
||||
--ref "$dir_in/genome.fasta" \
|
||||
--out "output.vcf"
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "output.vcf" ] && echo "Output file output.vcf does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "output.vcf" ] && echo "Output file output.vcf is empty" && exit 1
|
||||
|
||||
echo "> Test successful"
|
||||
BIN
src/lofreq/call/test_data/a.bai
Normal file
Binary file not shown.
BIN
src/lofreq/call/test_data/a.bam
Normal file
Binary file not shown.
8
src/lofreq/call/test_data/genome.fasta
Normal file
@@ -0,0 +1,8 @@
|
||||
>SheilaA
|
||||
GCTAGCTCAGAAAAAAAAAA
|
||||
>SheilaB
|
||||
GCTAGCTCAGAAAAAAAAAA
|
||||
>SheilaC
|
||||
GCTAGCTCAGAAAAAAAAAA
|
||||
>SheilaD
|
||||
GCTAGCTCAGAAAAAAAAAA
|
||||
4
src/lofreq/call/test_data/genome.fasta.fai
Normal file
@@ -0,0 +1,4 @@
|
||||
SheilaA 20 9 20 21
|
||||
SheilaB 20 39 20 21
|
||||
SheilaC 20 69 20 21
|
||||
SheilaD 20 99 20 21
|
||||
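The five columns of `genome.fasta.fai` above are sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line. A rough sketch of how those values can be derived from the FASTA file is shown below. It assumes Unix line endings, ASCII content, and a uniform line width per record; `fai_from_fasta` is a hypothetical helper, and in practice `samtools faidx` should be used to build the index.

```bash
#!/bin/bash
# Sketch only: fai_from_fasta is a hypothetical helper illustrating the
# .fai columns; it is not a replacement for `samtools faidx`.
fai_from_fasta() {
  awk '
    /^>/ {
      # Flush the previous record before starting a new one.
      if (name != "") print name "\t" len "\t" offset "\t" linebases "\t" linewidth
      name = substr($1, 2); len = 0; linebases = 0; linewidth = 0
      pos += length($0) + 1          # header line plus its newline
      offset = pos                   # first base starts right after the header
      next
    }
    {
      if (linebases == 0) { linebases = length($0); linewidth = length($0) + 1 }
      len += length($0)
      pos += length($0) + 1
    }
    END { if (name != "") print name "\t" len "\t" offset "\t" linebases "\t" linewidth }
  ' "$1"
}

# Demo on the first two records of the test FASTA shown above.
demo=$(mktemp)
printf '>SheilaA\nGCTAGCTCAGAAAAAAAAAA\n>SheilaB\nGCTAGCTCAGAAAAAAAAAA\n' > "$demo"
fai_from_fasta "$demo"
rm -f "$demo"
```

For the demo input this reproduces the first two index rows above: `>SheilaA` plus its newline occupies 9 bytes, so the first offset is 9, and the second record starts 21 + 9 bytes later at offset 39.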
10
src/lofreq/call/test_data/script.sh
Normal file
@@ -0,0 +1,10 @@
|
||||
# lofreq call test data
|
||||
|
||||
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/lofreq/call/test/data
|
||||
|
||||
if [ ! -d /tmp/snakemake-wrappers ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
|
||||
fi
|
||||
|
||||
cp -r /tmp/snakemake-wrappers/bio/lofreq/call/test/data/* src/lofreq/call/test_data
|
||||
|
||||
82
src/lofreq/indelqual/config.vsh.yaml
Normal file
@@ -0,0 +1,82 @@
|
||||
name: lofreq_indelqual
|
||||
namespace: lofreq
|
||||
description: |
|
||||
Insert indel qualities into BAM file (required for indel predictions).
|
||||
|
||||
The preferred way of inserting indel qualities should be via GATK's BQSR (>=2). If that's not possible, use this subcommand.
|
||||
The command has two modes: 'uniform' and 'dindel':
|
||||
- 'uniform' will assign a given value uniformly, whereas
|
||||
- 'dindel' will insert indel qualities based on Dindel (PMID 20980555).
|
||||
Both will overwrite any existing values.
|
||||
Do not realign your BAM file afterwards!
|
||||
keywords: [ "bam", "indel", "qualities", "indelqual", "lofreq", "lofreq/indelqual"]
|
||||
links:
|
||||
homepage: https://csb5.github.io/lofreq/
|
||||
documentation: https://csb5.github.io/lofreq/commands/
|
||||
references:
|
||||
doi: 10.1093/nar/gks918
|
||||
license: "MIT"
|
||||
requirements:
|
||||
commands: [ lofreq ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
type: file
|
||||
description: |
|
||||
Input BAM file.
|
||||
required: true
|
||||
example: "normal.bam"
|
||||
- name: --ref
|
||||
alternatives: -f
|
||||
type: file
|
||||
description: |
|
||||
Reference sequence used for mapping (Only required for --dindel).
|
||||
required: false
|
||||
example: "reference.fasta"
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --out
|
||||
alternatives: -o
|
||||
type: file
|
||||
description: |
|
||||
Output BAM file.
|
||||
required: true
|
||||
direction: output
|
||||
example: "output.bam"
|
||||
- name: Arguments
|
||||
arguments:
|
||||
- name: --uniform
|
||||
alternatives: -u
|
||||
type: string
|
||||
description: |
|
||||
Add this indel quality uniformly to all bases. Use two comma separated values to specify insertion and deletion quality separately. (clashes with --dindel).
|
||||
required: false
|
||||
example: "50,50"
|
||||
- name: --dindel
|
||||
type: boolean_true
|
||||
description: |
|
||||
Add Dindel's indel qualities (Illumina specific) (clashes with -u; needs --ref).
|
||||
- name: --verbose
|
||||
type: boolean_true
|
||||
description: |
|
||||
Be verbose.
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/lofreq:2.1.5--py38h794fc9e_10
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
version=$(lofreq version | grep 'version' | sed 's/version: //') && \
|
||||
echo "lofreq: $version" > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
21
src/lofreq/indelqual/help.txt
Normal file
@@ -0,0 +1,21 @@
|
||||
lofreq indelqual: Insert indel qualities into BAM file (required for indel predictions)
|
||||
|
||||
Usage: lofreq indelqual [options] in.bam
|
||||
Options:
|
||||
-u | --uniform INT[,INT] Add this indel quality uniformly to all bases.
|
||||
Use two comma separated values to specify
|
||||
insertion and deletion quality separately.
|
||||
(clashes with --dindel)
|
||||
--dindel Add Dindel's indel qualities (Illumina specific)
|
||||
(clashes with -u; needs --ref)
|
||||
-f | --ref Reference sequence used for mapping
|
||||
(Only required for --dindel)
|
||||
-o | --out FILE Output BAM file [- = stdout = default]
|
||||
--verbose Be verbose
|
||||
|
||||
The preferred way of inserting indel qualities should be via GATK's BQSR (>=2) If that's not possible, use this subcommand.
|
||||
The command has two modes: 'uniform' and 'dindel':
|
||||
- 'uniform' will assign a given value uniformly, whereas
|
||||
- 'dindel' will insert indel qualities based on Dindel (PMID 20980555).
|
||||
Both will overwrite any existing values.
|
||||
Do not realign your BAM file afterwards!
|
||||
17
src/lofreq/indelqual/script.sh
Normal file
@@ -0,0 +1,17 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# Unset all parameters that are set to "false"
|
||||
[[ "$par_dindel" == "false" ]] && unset par_dindel
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
|
||||
# run lofreq indelqual
|
||||
lofreq indelqual \
|
||||
-o "$par_out" \
|
||||
${par_uniform:+-u "${par_uniform}"} \
|
||||
${par_dindel:+--dindel} \
|
||||
${par_ref:+-f "${par_ref}"} \
|
||||
${par_verbose:+--verbose} \
|
||||
"$par_input"
|
||||
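The help text states that `--uniform` clashes with `--dindel` and that `--dindel` needs `--ref`, but the wrapper above passes the arguments straight through to `lofreq indelqual`. One possible way to fail early with a clear message is sketched below; `check_indelqual_mode` is a hypothetical function that is not part of this commit, and its messages are illustrative only.

```bash
#!/bin/bash
# Sketch only: check_indelqual_mode is a hypothetical pre-flight check.
# Arguments mimic $par_uniform, $par_dindel ("true"/"false") and $par_ref.
check_indelqual_mode() {
  local uniform="$1" dindel="$2" ref="$3"

  # The two modes are mutually exclusive according to the help text.
  if [[ -n "$uniform" && "$dindel" == "true" ]]; then
    echo "--uniform clashes with --dindel" >&2
    return 1
  fi

  # Dindel mode requires the reference used for mapping.
  if [[ "$dindel" == "true" && -z "$ref" ]]; then
    echo "--dindel needs --ref" >&2
    return 1
  fi
}

check_indelqual_mode "15" "false" "" && echo "uniform mode ok"
check_indelqual_mode "" "true" "test.fa" && echo "dindel mode ok"
```

Such a check would run before the `lofreq indelqual` call, so a misconfigured invocation stops with a message from the wrapper rather than relying on the tool's own error handling.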
46
src/lofreq/indelqual/test.sh
Normal file
@@ -0,0 +1,46 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
dir_in="${meta_resources_dir%/}/test_data"
|
||||
|
||||
#############################################
|
||||
mkdir uniform
|
||||
cd uniform
|
||||
|
||||
echo "> Run lofreq indelqual uniform"
|
||||
"$meta_executable" \
|
||||
--input "$dir_in/test.bam" \
|
||||
-u 15 \
|
||||
--out "uniform.bam"
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "uniform.bam" ] && echo "Output file uniform.bam does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "uniform.bam" ] && echo "Output file uniform.bam is empty" && exit 1
|
||||
|
||||
cd ..
|
||||
|
||||
#############################################
|
||||
mkdir dindel
|
||||
cd dindel
|
||||
|
||||
echo "> Run lofreq indelqual dindel"
|
||||
"$meta_executable" \
|
||||
--input "$dir_in/test.bam" \
|
||||
--ref "$dir_in/test.fa" \
|
||||
--dindel \
|
||||
--out "dindel.bam"
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "dindel.bam" ] && echo "Output file dindel.bam does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "dindel.bam" ] && echo "Output file dindel.bam is empty" && exit 1
|
||||
|
||||
cd ..
|
||||
|
||||
#############################################
|
||||
|
||||
echo "> Test successful"
|
||||
44
src/lofreq/indelqual/test_data/script.sh
Executable file
@@ -0,0 +1,44 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
TMPDIR=$(mktemp -d)
|
||||
trap "rm -rf $TMPDIR" EXIT
|
||||
|
||||
### Step 1: Generate Test Reference FASTA File (`test.fa`)
|
||||
|
||||
cat > $TMPDIR/test.fa <<EOF
|
||||
>chr1
|
||||
AACTCTCCGTGCTGTCCGGGGTCACTGTGATGCCAGTGCCGTCGACGGACCACAGGAGCGCCGCCAATTACGATTTATA
|
||||
GGCGGCCCGGCCGATTATATCTTTGGCGGTCCCCTAGGCTCTCTAGGGGCCCGCACTGAAGAGGGCAACTCTGCAAGGA
|
||||
CACGAATCTGACTCCTTAATAAAGGTGTGAAATCTGTCCGGTCGTCTCCTAATATGGGGCTTCATCATCTCAGGCGAAA
|
||||
TCAGCGCCCGACGGGCCATAGTAAGCGGTGTTGTGGCATAGGTGCAGGTGGCCACCGATTATAACAGGATGACATACGC
|
||||
GGAATTCGGGGTATGATGCTCTCCCGACACTTTGAGACAATAAATAGTTTAGTGTCCTGATGGTCTAAACCGAAGTCAT
|
||||
TCAAAATAGCTAAGTGTAGTCTTCCCGTTCTAGGGATAGTCTAGGACATGCCCTATATTGGTTTTCTCTTACCGCGGAC
|
||||
TACTCCCGCGCCCTCGGAGGTGTCTCAATTCATCCATGTTGATCCTTCAAATCGGGGCAGCGACGGGGGCACGGAGGGG
|
||||
GTACGATAACCGCTAAATTGACCACCACCATCGATGATTCTACCATCTCTATCCATCCAACCCTTTTTTTGTTTATTTC
|
||||
CTCTATGGGTTACAGCTA
|
||||
EOF
|
||||
|
||||
### Step 2: Index the Reference FASTA File
|
||||
|
||||
samtools faidx $TMPDIR/test.fa
|
||||
|
||||
### Step 3: Generate Test Reads with `wgsim`
|
||||
|
||||
wgsim -N 100 -1 70 -2 70 $TMPDIR/test.fa $TMPDIR/reads1.fq $TMPDIR/reads2.fq
|
||||
|
||||
### Step 4: Align Reads to Generate BAM File
|
||||
|
||||
bwa index $TMPDIR/test.fa
|
||||
|
||||
bwa mem $TMPDIR/test.fa $TMPDIR/reads1.fq $TMPDIR/reads2.fq > $TMPDIR/aligned_reads.sam
|
||||
|
||||
### Step 5: Convert SAM to BAM
|
||||
|
||||
samtools view -Sb $TMPDIR/aligned_reads.sam > $TMPDIR/test.bam
|
||||
|
||||
### Step 6: Copy output
|
||||
|
||||
cp $TMPDIR/test.bam src/lofreq/indelqual/test_data/test.bam
|
||||
cp $TMPDIR/test.fa src/lofreq/indelqual/test_data/test.fa
|
||||
BIN
src/lofreq/indelqual/test_data/test.bam
Normal file
Binary file not shown.
10
src/lofreq/indelqual/test_data/test.fa
Normal file
@@ -0,0 +1,10 @@
|
||||
>chr1
|
||||
AACTCTCCGTGCTGTCCGGGGTCACTGTGATGCCAGTGCCGTCGACGGACCACAGGAGCGCCGCCAATTACGATTTATA
|
||||
GGCGGCCCGGCCGATTATATCTTTGGCGGTCCCCTAGGCTCTCTAGGGGCCCGCACTGAAGAGGGCAACTCTGCAAGGA
|
||||
CACGAATCTGACTCCTTAATAAAGGTGTGAAATCTGTCCGGTCGTCTCCTAATATGGGGCTTCATCATCTCAGGCGAAA
|
||||
TCAGCGCCCGACGGGCCATAGTAAGCGGTGTTGTGGCATAGGTGCAGGTGGCCACCGATTATAACAGGATGACATACGC
|
||||
GGAATTCGGGGTATGATGCTCTCCCGACACTTTGAGACAATAAATAGTTTAGTGTCCTGATGGTCTAAACCGAAGTCAT
|
||||
TCAAAATAGCTAAGTGTAGTCTTCCCGTTCTAGGGATAGTCTAGGACATGCCCTATATTGGTTTTCTCTTACCGCGGAC
|
||||
TACTCCCGCGCCCTCGGAGGTGTCTCAATTCATCCATGTTGATCCTTCAAATCGGGGCAGCGACGGGGGCACGGAGGGG
|
||||
GTACGATAACCGCTAAATTGACCACCACCATCGATGATTCTACCATCTCTATCCATCCAACCCTTTTTTTGTTTATTTC
|
||||
CTCTATGGGTTACAGCTA
|
||||
229
src/multiqc/config.vsh.yaml
Normal file
@@ -0,0 +1,229 @@
|
||||
name: "multiqc"
|
||||
description: |
|
||||
MultiQC aggregates results from bioinformatics analyses across many samples into a single report.
|
||||
It searches a given directory for analysis logs and compiles an HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.
|
||||
info:
|
||||
keywords: [QC, html report, aggregate analysis]
|
||||
links:
|
||||
homepage: https://multiqc.info/
|
||||
documentation: https://multiqc.info/docs/
|
||||
repository: https://github.com/MultiQC/MultiQC
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btw354
|
||||
license: GPL v3 or later
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
multiple: true
|
||||
required: true
|
||||
example: data/results/
|
||||
description: |
|
||||
File paths to be searched for analysis results to be included in the report.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_report"
|
||||
type: file
|
||||
direction: output
|
||||
must_exist: false
|
||||
example: multiqc_report.html
|
||||
description: |
|
||||
Filepath of the generated report.
|
||||
- name: "--output_data"
|
||||
type: file
|
||||
required: false
|
||||
direction: output
|
||||
example: multiqc_data
|
||||
must_exist: false
|
||||
description: |
|
||||
Output directory for parsed data files. If not provided, parsed data will not be published.
|
||||
- name: "--output_plots"
|
||||
type: file
|
||||
required: false
|
||||
direction: output
|
||||
must_exist: false
|
||||
example: multiqc_plots
|
||||
description: |
|
||||
Output directory for generated plots. If not provided, plots will not be published.
|
||||
|
||||
- name: "Modules and analyses to run"
|
||||
arguments:
|
||||
- name: "--include_modules"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: fastqc,cutadapt
|
||||
description: Use only these modules
|
||||
- name: "--exclude_modules"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: fastqc,cutadapt
|
||||
description: Do not use these modules
|
||||
- name: "--ignore_analysis"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: run_one/*,run_two/*
|
||||
- name: "--ignore_samples"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: sample_1*,sample_3*
|
||||
- name: "--ignore_symlinks"
|
||||
type: boolean_true
|
||||
description: Ignore symlinked directories and files
|
||||
|
||||
- name: "Sample name handling"
|
||||
arguments:
|
||||
- name: "--dirs"
|
||||
type: boolean_true
|
||||
description: Prepend directory to sample names to avoid clashing filenames
|
||||
- name: "--dirs_depth"
|
||||
type: integer
|
||||
description: Prepend n directories to sample names. Negative number to take from start of path.
|
||||
- name: "--full_names"
|
||||
type: boolean_true
|
||||
description: Do not clean the sample names (leave as full file name)
|
||||
- name: "--fn_as_s_name"
|
||||
type: boolean_true
|
||||
description: Use the log filename as the sample name
|
||||
- name: "--replace_names"
|
||||
type: file
|
||||
example: replace_names.tsv
|
||||
description: TSV file to rename sample names during report generation
|
||||
|
||||
- name: "Report Customisation"
|
||||
arguments:
|
||||
- name: "--title"
|
||||
type: string
|
||||
description: |
|
||||
Report title. Printed as page header, used for filename if not otherwise specified.
|
||||
- name: "--comment"
|
||||
type: string
|
||||
description: |
|
||||
Custom comment, will be printed at the top of the report.
|
||||
- name: "--template"
|
||||
type: string
|
||||
choices: [default, gathered, geo, highcharts, sections, simple]
|
||||
description: |
|
||||
Report template to use.
|
||||
- name: "--sample_names"
|
||||
type: file
|
||||
description: |
|
||||
TSV file containing alternative sample names for renaming buttons in the report.
|
||||
example: sample_names.tsv
|
||||
- name: "--sample_filters"
|
||||
type: file
|
||||
description: |
|
||||
TSV file containing show/hide patterns for the report
|
||||
example: sample_filters.tsv
|
||||
- name: "--custom_css_file"
|
||||
type: file
|
||||
description: |
|
||||
Custom CSS file to add to the final report
|
||||
example: custom_style_sheet.css
|
||||
- name: "--profile_runtime"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Add analysis of how long MultiQC takes to run to the report
|
||||
|
||||
- name: "MultiQC behaviour"
|
||||
arguments:
|
||||
- name: "--verbose"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Increase output verbosity.
|
||||
- name: "--quiet"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Only show log warnings
|
||||
- name: "--strict"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Don't catch exceptions, run additional code checks to help development.
|
||||
- name: "--development"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Development mode. Do not compress and minimise JS, export uncompressed plot data.
|
||||
- name: "--require_logs"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Require all explicitly requested modules to have log files. If not, MultiQC will exit with an error.
|
||||
- name: "--no_megaqc_upload"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Don't upload generated report to MegaQC, even if MegaQC options are found.
|
||||
- name: "--no_ansi"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable coloured log output.
|
||||
- name: "--cl_config"
|
||||
type: string
|
||||
required: false
|
||||
description: |
|
||||
YAML-formatted string that customizes MultiQC behaviour, such as input file detection.
|
||||
example: "qualimap_config: { general_stats_coverage: [20,40,200] }"
|
||||
|
||||
- name: "Output format"
|
||||
arguments:
|
||||
- name: "--flat"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use only flat plots (static images).
|
||||
- name: "--interactive"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use only interactive plots (in-browser Javascript).
|
||||
- name: "--data_dir"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Force the parsed data directory to be created.
|
||||
- name: "--no_data_dir"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Prevent the parsed data directory from being created.
|
||||
- name: "--zip_data_dir"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Compress the data directory.
|
||||
- name: "--data_format"
|
||||
type: string
|
||||
choices: [tsv, csv, json, yaml]
|
||||
description: |
|
||||
Output parsed data in a different format than the default 'txt'.
|
||||
- name: "--pdf"
|
||||
type: boolean_true
|
||||
description: |
|
||||
Creates PDF report with the 'simple' template. Requires Pandoc to be installed.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/multiqc:1.21--pyhdfd78af_0
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
multiqc --version | sed 's/multiqc, version\s\(.*\)/multiqc: "\1"/' > /var/software_versions.txt
|
||||
test_setup:
|
||||
- type: apt
|
||||
packages:
|
||||
- jq
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
|
||||
|
||||
67
src/multiqc/help.txt
Normal file
@@ -0,0 +1,67 @@
|
||||
```bash
|
||||
multiqc --help
|
||||
```
|
||||
|
||||
/// MultiQC 🔍 | v1.20
|
||||
|
||||
Usage: multiqc [OPTIONS] [ANALYSIS DIRECTORY]
|
||||
|
||||
MultiQC aggregates results from bioinformatics analyses across many samples into a single report.
|
||||
It searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.
|
||||
To run, supply with one or more directory to scan for analysis results. For example, to run in the current working directory, use 'multiqc .'
|
||||
|
||||
╭─ Main options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ --force -f Overwrite any existing reports │
|
||||
│ --config -c Specific config file to load, after those in MultiQC dir / home dir / working dir. (PATH) │
|
||||
│ --cl-config Specify MultiQC config YAML on the command line (TEXT) │
|
||||
│ --filename -n Report filename. Use 'stdout' to print to standard out. (TEXT) │
|
||||
│ --outdir -o Create report in the specified output directory. (TEXT) │
|
||||
│ --ignore -x Ignore analysis files (GLOB EXPRESSION) │
|
||||
│ --ignore-samples Ignore sample names (GLOB EXPRESSION) │
|
||||
│ --ignore-symlinks Ignore symlinked directories and files │
|
||||
│ --file-list -l Supply a file containing a list of file paths to be searched, one per row │
|
||||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
╭─ Choosing modules to run ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ --module -m Use only this module. Can specify multiple times. (MODULE NAME) │
|
||||
│ --exclude -e Do not use this module. Can specify multiple times. (MODULE NAME) │
|
||||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
╭─ Sample handling ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ --dirs -d Prepend directory to sample names │
|
||||
│ --dirs-depth -dd Prepend n directories to sample names. Negative number to take from start of path. (INTEGER) │
|
||||
│ --fullnames -s Do not clean the sample names (leave as full file name) │
|
||||
│ --fn_as_s_name Use the log filename as the sample name │
|
||||
│ --replace-names TSV file to rename sample names during report generation (PATH) │
|
||||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
╭─ Report customisation ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ --title -i Report title. Printed as page header, used for filename if not otherwise specified. (TEXT) │
|
||||
│ --comment -b Custom comment, will be printed at the top of the report. (TEXT) │
|
||||
│ --template -t Report template to use. (default|gathered|geo|highcharts|sections|simple) │
|
||||
│ --sample-names TSV file containing alternative sample names for renaming buttons in the report (PATH) │
|
||||
│ --sample-filters TSV file containing show/hide patterns for the report (PATH) │
|
||||
│ --custom-css-file Custom CSS file to add to the final report (PATH) │
|
||||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
╭─ Output files ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ --flat -fp Use only flat plots (static images) │
|
||||
│ --interactive -ip Use only interactive plots (in-browser Javascript) │
|
||||
│ --export -p Export plots as static images in addition to the report │
|
||||
│ --data-dir Force the parsed data directory to be created. │
|
||||
│ --no-data-dir Prevent the parsed data directory from being created. │
|
||||
│ --data-format -k Output parsed data in a different format. (tsv|csv|json|yaml) │
|
||||
│ --zip-data-dir -z Compress the data directory. │
|
||||
│ --no-report Do not generate a report, only export data and plots │
|
||||
│ --pdf Creates PDF report with the 'simple' template. Requires Pandoc to be installed. │
|
||||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
╭─ MultiQC behaviour ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ --verbose -v Increase output verbosity. (INTEGER RANGE) │
|
||||
│ --quiet -q Only show log warnings │
|
||||
│ --strict Don't catch exceptions, run additional code checks to help development. │
|
||||
│ --development,--dev Development mode. Do not compress and minimise JS, export uncompressed plot data │
|
||||
│ --require-logs Require all explicitly requested modules to have log files. If not, MultiQC will exit with an error. │
|
||||
│ --profile-runtime Add analysis of how long MultiQC takes to run to the report │
|
||||
│ --no-megaqc-upload Don't upload generated report to MegaQC, even if MegaQC options are found │
|
||||
│ --no-ansi Disable coloured log output │
|
||||
│ --version Show the version and exit. │
|
||||
│ --help -h Show this message and exit. │
|
||||
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
|
||||
See http://multiqc.info for more details.
|
||||
130
src/multiqc/script.sh
Executable file
@@ -0,0 +1,130 @@
|
||||
#!/bin/bash
|
||||
|
||||
# disable flags
|
||||
[[ "$par_ignore_symlinks" == "false" ]] && unset par_ignore_symlinks
|
||||
[[ "$par_dirs" == "false" ]] && unset par_dirs
|
||||
[[ "$par_full_names" == "false" ]] && unset par_full_names
|
||||
[[ "$par_fn_as_s_name" == "false" ]] && unset par_fn_as_s_name
|
||||
[[ "$par_profile_runtime" == "false" ]] && unset par_profile_runtime
|
||||
[[ "$par_verbose" == "false" ]] && unset par_verbose
|
||||
[[ "$par_quiet" == "false" ]] && unset par_quiet
|
||||
[[ "$par_strict" == "false" ]] && unset par_strict
|
||||
[[ "$par_development" == "false" ]] && unset par_development
|
||||
[[ "$par_require_logs" == "false" ]] && unset par_require_logs
|
||||
[[ "$par_no_megaqc_upload" == "false" ]] && unset par_no_megaqc_upload
|
||||
[[ "$par_no_ansi" == "false" ]] && unset par_no_ansi
|
||||
[[ "$par_flat" == "false" ]] && unset par_flat
|
||||
[[ "$par_interactive" == "false" ]] && unset par_interactive
|
||||
[[ "$par_static_plot_export" == "false" ]] && unset par_static_plot_export
|
||||
[[ "$par_data_dir" == "false" ]] && unset par_data_dir
|
||||
[[ "$par_no_data_dir" == "false" ]] && unset par_no_data_dir
|
||||
[[ "$par_zip_data_dir" == "false" ]] && unset par_zip_data_dir
|
||||
[[ "$par_pdf" == "false" ]] && unset par_pdf
|
||||
|
||||
|
||||
# handle inputs
|
||||
out_dir=$(dirname "$par_output_report")
|
||||
output_report_file=$(basename "$par_output_report")
|
||||
report_name="${output_report_file%.*}"
|
||||
|
||||
# handle outputs
|
||||
[[ -z "$par_output_report" ]] && no_report=true
|
||||
[[ -z "$par_output_data" ]] && no_data_dir=true
|
||||
[[ ! -z "$par_output_data" ]] && data_dir=true
|
||||
[[ ! -z "$par_output_plots" ]] && export=true
|
||||
|
||||
# handle multiples
|
||||
IFS=";" read -ra inputs <<< "$par_input"
|
||||
|
||||
if [[ -n "$par_include_modules" ]]; then
|
||||
include_modules=""
|
||||
IFS="," read -ra incl_modules <<< "$par_include_modules"
|
||||
for i in "${incl_modules[@]}"; do
|
||||
include_modules+="--module $i "
|
||||
done
|
||||
unset IFS
|
||||
fi
|
||||
|
||||
if [[ -n "$par_exclude_modules" ]]; then
|
||||
exclude_modules=""
|
||||
IFS="," read -ra excl_modules <<< "$par_exclude_modules"
|
||||
for i in "${excl_modules[@]}"; do
|
||||
exclude_modules+="--exclude $i "
|
||||
done
|
||||
unset IFS
|
||||
fi
|
||||
|
||||
if [[ -n "$par_ignore_analysis" ]]; then
|
||||
ignore=""
|
||||
IFS="," read -ra ignore_analysis <<< "$par_ignore_analysis"
|
||||
for i in "${ignore_analysis[@]}"; do
|
||||
ignore+="--ignore $i "
|
||||
done
|
||||
unset IFS
|
||||
fi
|
||||
|
||||
if [[ -n "$par_ignore_samples" ]]; then
|
||||
ignore_samples=""
|
||||
IFS="," read -ra ign_samples <<< "$par_ignore_samples"
|
||||
for i in "${ign_samples[@]}"; do
|
||||
ignore_samples+="--ignore-samples $i "
|
||||
done
|
||||
unset IFS
|
||||
fi
|
||||
|
||||
# run multiqc
|
||||
multiqc \
|
||||
${par_output_report:+--filename "$report_name"} \
|
||||
${out_dir:+--outdir "$out_dir"} \
|
||||
${no_report:+--no-report} \
|
||||
${no_data_dir:+--no-data-dir} \
|
||||
${data_dir:+--data-dir} \
|
||||
${export:+--export} \
|
||||
${par_title:+--title "$par_title"} \
|
||||
${par_comment:+--comment "$par_comment"} \
|
||||
${par_template:+--template "$par_template"} \
|
||||
${par_sample_names:+--sample-names "$par_sample_names"} \
|
||||
${par_sample_filters:+--sample-filters "$par_sample_filters"} \
|
||||
${par_custom_css_file:+--custom-css-file "$par_custom_css_file"} \
|
||||
${par_profile_runtime:+--profile-runtime} \
|
||||
${par_dirs:+--dirs} \
|
||||
${par_dirs_depth:+--dirs-depth "$par_dirs_depth"} \
|
||||
${par_full_names:+--fullnames} \
|
||||
${par_fn_as_s_name:+--fn_as_s_name} \
|
||||
${par_ignore_names:+--ignore-names "$par_ignore_names"} \
|
||||
${par_ignore_symlinks:+--ignore-symlinks} \
|
||||
${ignore_samples} \
|
||||
${ignore} \
|
||||
${exclude_modules} \
|
||||
${include_modules} \
|
||||
|
||||
${par_data_format:+--data-format "$par_data_format"} \
|
||||
${par_cl_config:+--cl-config "$par_cl_config"} \
|
||||
${par_zip_data_dir:+--zip-data-dir} \
|
||||
${par_pdf:+--pdf} \
|
||||
${par_interactive:+--interactive} \
|
||||
${par_flat:+--flat} \
|
||||
${par_verbose:+--verbose} \
|
||||
${par_quiet:+--quiet} \
|
||||
${par_strict:+--strict} \
|
||||
${par_no_megaqc_upload:+--no-megaqc-upload} \
|
||||
${par_no_ansi:+--no-ansi} \
|
||||
|
||||
${par_require_logs:+--require-logs} \
|
||||
${par_development:+--development} \
|
||||
--force \
|
||||
"${inputs[@]}"
|
||||
|
||||
# Move outputs
|
||||
|
||||
if [[ -n "$par_output_data" ]] && [[ -d "${out_dir}/${report_name}_data" ]]; then
|
||||
mv "${out_dir}/${report_name}_data" "$par_output_data"
|
||||
elif [[ -n "$par_output_data" ]] && [[ ! -d "${out_dir}/${report_name}_data" ]]; then
|
||||
echo "WARNING: Data could not be saved because data folder was not generated by multiqc. This could be due to filtering out of modules or samples."
|
||||
fi
|
||||
|
||||
if [[ -n "$par_output_plots" ]] && [[ -d "${out_dir}/${report_name}_plots" ]]; then
|
||||
mv "${out_dir}/${report_name}_plots" "$par_output_plots"
|
||||
elif [[ -n "$par_output_plots" ]] && [[ ! -d "${out_dir}/${report_name}_plots" ]]; then
|
||||
echo "WARNING: Plots could not be saved because plots folder was not generated by multiqc. This could be due to filtering out of modules or samples."
|
||||
fi
|
||||
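Note on the list handling in `src/multiqc/script.sh`: the script turns comma-separated values into repeated flags by appending to a string, which only works if every appended piece ends in a space. A minimal sketch of the same idea using a bash array is shown below. The helper name `split_into_flags` and the global array `flag_args` are hypothetical and not part of the component.

```bash
#!/bin/bash

# Hypothetical helper (an illustration, not part of the component):
# expand "a,b" into the array (<flag> a <flag> b), stored in the
# global array `flag_args`.
# Usage: split_into_flags <flag> <separator> <list>
split_into_flags() {
  local flag="$1" sep="$2" list="$3" item
  local -a items=()
  flag_args=()
  # An empty list yields no flags at all.
  [[ -z "$list" ]] && return 0
  # The IFS assignment only applies to this single `read` call.
  IFS="$sep" read -ra items <<< "$list"
  for item in "${items[@]}"; do
    flag_args+=("$flag" "$item")
  done
}

split_into_flags --exclude "," "samtools,fastqc"
printf '%s\n' "${flag_args[@]}"
```

The resulting array could then be passed to the command as `"${flag_args[@]}"`, which keeps every value a separate word even if it contains spaces.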
44
src/multiqc/test.sh
Normal file
@@ -0,0 +1,44 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing input/output handling"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/" \
|
||||
--output_report test1.html \
|
||||
--output_data data1 \
|
||||
--output_plots plots1 \
|
||||
--quiet
|
||||
|
||||
[ ! -f test1.html ] && echo "MultiQC report does not exist!" && exit 1
|
||||
[ ! -d data1 ] && echo "MultiQC data directory does not exist!" && exit 1
|
||||
[ ! -d plots1 ] && echo "MultiQC plots directory does not exist!" && exit 1
|
||||
|
||||
echo ">>> Testing module exclusion"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/" \
|
||||
--output_report test2.html \
|
||||
--output_data data2 \
|
||||
--output_plots plots2 \
|
||||
--exclude_modules samtools \
|
||||
--quiet
|
||||
|
||||
[ -f test2.html ] && echo "MultiQC report should not exist!" && exit 1
|
||||
[ -d data2 ] && echo "MultiQC data directory should not exist!" && exit 1
|
||||
[ -d plots2 ] && echo "MultiQC plots directory should not exist!" && exit 1
|
||||
|
||||
echo ">>> Testing sample exclusion"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/" \
|
||||
--output_report test3.html \
|
||||
--output_data data3 \
|
||||
--ignore_samples a \
|
||||
--quiet
|
||||
|
||||
key_to_check=".report_general_stats_data[0].a"
|
||||
json_file="data3/multiqc_data.json"
|
||||
[[ $(jq -r "$key_to_check" "$json_file") != null ]] && echo "$key_to_check should not be present in $json_file" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
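The checks in `src/multiqc/test.sh` repeat the `[ ... ] && echo ... && exit 1` pattern. One possible way to factor this out is sketched below. `assert_file_exists` and `assert_dir_missing` are hypothetical names, not helpers provided by Viash, and the demo runs on a scratch directory rather than on real MultiQC output.

```bash
#!/bin/bash

# Hypothetical helpers (not part of the component or of Viash).
# Each one prints a message to stderr and exits with status 1 on failure.
assert_file_exists() {
  [[ -f "$1" ]] || { echo "Expected file '$1' does not exist!" >&2; exit 1; }
}

assert_dir_missing() {
  [[ ! -d "$1" ]] || { echo "Directory '$1' should not exist!" >&2; exit 1; }
}

# Demo on a scratch directory.
scratch=$(mktemp -d)
touch "$scratch/test1.html"
assert_file_exists "$scratch/test1.html"
assert_dir_missing "$scratch/data2"
echo "All assertions passed"
```

Because the helpers call `exit`, a failing assertion stops the test script at the first problem, which matches the behaviour of the inline checks.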
1504
src/multiqc/test_data/a.txt
Normal file
File diff suppressed because it is too large
1505
src/multiqc/test_data/b.txt
Normal file
File diff suppressed because it is too large
9
src/multiqc/test_data/script.sh
Normal file
@@ -0,0 +1,9 @@
|
||||
# multiqc test data
|
||||
|
||||
# Test data from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/multiqc/test/samtools_stats
|
||||
|
||||
if [ ! -d /tmp/snakemake-wrappers ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
|
||||
fi
|
||||
|
||||
cp -r /tmp/snakemake-wrappers/bio/multiqc/test/samtools_stats/* src/multiqc/test_data
|
||||
161
src/pear/config.vsh.yaml
Normal file
@@ -0,0 +1,161 @@
|
||||
name: pear
|
||||
description: |
|
||||
PEAR is an ultrafast, memory-efficient and highly accurate paired-end read merger. It is fully parallelized and can run with as little as a few kilobytes of memory.
|
||||
|
||||
PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, it can merge millions of paired-end reads within a couple of minutes on a standard desktop computer.
|
||||
keywords: [ "pair-end", "read", "merge" ]
|
||||
links:
|
||||
homepage: https://cme.h-its.org/exelixis/web/software/pear
|
||||
repository: https://github.com/tseemann/PEAR
|
||||
documentation: https://cme.h-its.org/exelixis/web/software/pear/doc.html
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btt593
|
||||
license: "CC-BY-NC-SA-3.0"
|
||||
requirements:
|
||||
commands: [ pear , gzip ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --forward_fastq
|
||||
alternatives: -f
|
||||
type: file
|
||||
description: Forward paired-end FASTQ file
|
||||
required: true
|
||||
example: "forward.fastq"
|
||||
- name: --reverse_fastq
|
||||
alternatives: -r
|
||||
type: file
|
||||
description: Reverse paired-end FASTQ file
|
||||
required: true
|
||||
example: "reverse.fastq"
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --assembled
|
||||
type: file
|
||||
description: The output file containing assembled reads. Can be compressed with gzip.
|
||||
required: true
|
||||
direction: output
|
||||
- name: --unassembled_forward
|
||||
type: file
|
||||
description: The output file containing forward reads that could not be assembled. Can be compressed with gzip.
|
||||
required: true
|
||||
direction: output
|
||||
- name: --unassembled_reverse
|
||||
type: file
|
||||
description: The output file containing reverse reads that could not be assembled. Can be compressed with gzip.
|
||||
required: true
|
||||
direction: output
|
||||
- name: --discarded
|
||||
type: file
|
||||
description: The output file containing reads that were discarded due to too low quality or too many uncalled bases. Can be compressed with gzip.
|
||||
required: true
|
||||
direction: output
|
||||
- name: Arguments
|
||||
arguments:
|
||||
- name: --p_value
|
||||
alternatives: -p
|
||||
type: double
|
||||
description: |
|
||||
Specify a p-value for the statistical test. If the computed p-value of a possible assembly exceeds the specified p-value then paired-end read will not be assembled. Valid options are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 disables the test.
|
||||
example: 0.01
|
||||
required: false
|
||||
- name: --min_overlap
|
||||
alternatives: -v
|
||||
type: integer
|
||||
description: |
|
||||
Specify the minimum overlap size. The minimum overlap may be set to 1 when the statistical test is used. However, further restricting the minimum overlap size to a proper value may reduce false-positive assemblies.
|
||||
required: false
|
||||
example: 10
|
||||
- name: --max_assembly_length
|
||||
alternatives: -m
|
||||
type: integer
|
||||
description: |
|
||||
Specify the maximum possible length of the assembled sequences. Setting this value to 0 disables the restriction and assembled sequences may be arbitrarily long.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --min_assembly_length
|
||||
alternatives: -n
|
||||
type: integer
|
||||
description: |
|
||||
Specify the minimum possible length of the assembled sequences. Setting this value to 0 disables the restriction and assembled sequences may be arbitrarily short.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --min_trim_length
|
||||
alternatives: -t
|
||||
type: integer
|
||||
description: |
|
||||
Specify the minimum length of reads after trimming the low quality part (see option -q)
|
||||
required: false
|
||||
example: 1
|
||||
- name: --quality_threshold
|
||||
alternatives: -q
|
||||
type: integer
|
||||
description: |
|
||||
Specify the quality threshold for trimming the low quality part of a read. If the quality scores of two consecutive bases are strictly less than the specified threshold, the rest of the read will be trimmed.
|
||||
required: false
|
||||
example: 0
|
||||
- name: --max_uncalled_base
|
||||
alternatives: -u
|
||||
type: double
|
||||
description: |
|
||||
Specify the maximal proportion of uncalled bases in a read. Setting this value to 0 will cause PEAR to discard all reads containing uncalled bases. The other extreme setting is 1, which causes PEAR to process all reads regardless of the number of uncalled bases.
|
||||
example: 1.0
|
||||
required: false
|
||||
- name: --test_method
|
||||
alternatives: -g
|
||||
type: integer
|
||||
description: |
|
||||
Specify the type of statistical test. Two options are available. 1: Given the minimum allowed overlap, test using the highest OES. Note that due to its discrete nature, this test usually yields a lower p-value for the assembled read than the cut-off (specified by -p). For example, setting the cut-off to 0.05 using this test, the assembled reads might have an actual p-value of 0.02.
|
||||
2: Use the acceptance probability (m.a.p). This test method computes the same probability as test method 1. However, it assumes that the minimal overlap is the observed overlap with the highest OES, instead of the one specified by -v. Therefore, this is not a valid statistical test and the 'p-value' is in fact the maximal probability for accepting the assembly. Nevertheless, we observed in practice that when the actual overlap sizes are relatively small, test 2 can correctly assemble more reads with only a slightly higher false-positive rate.
|
||||
required: false
|
||||
example: 1
|
||||
- name: --emperical_freqs
|
||||
alternatives: -e
|
||||
type: boolean_true
|
||||
description: |
|
||||
Disable empirical base frequencies.
|
||||
- name: --score_method
|
||||
alternatives: -s
|
||||
type: integer
|
||||
description: |
|
||||
Specify the scoring method. 1: OES with +1 for match and -1 for mismatch. 2: Assembly score (AS). Use +1 for match and -1 for mismatch multiplied by base quality scores. 3: Ignore quality scores and use +1 for a match and -1 for a mismatch.
|
||||
required: false
|
||||
example: 2
|
||||
- name: --phred_base
|
||||
alternatives: -b
|
||||
type: integer
|
||||
description: |
|
||||
Base PHRED quality score.
|
||||
required: false
|
||||
example: 33
|
||||
- name: --cap
|
||||
alternatives: -c
|
||||
type: integer
|
||||
description: |
|
||||
Specify the upper bound for the resulting quality score. If set to zero, capping is disabled.
|
||||
required: false
|
||||
example: 40
|
||||
- name: --nbase
|
||||
alternatives: -z
|
||||
type: boolean_true
|
||||
description: |
|
||||
When merging a base-pair that consists of two non-equal bases out of which none is degenerate, set the merged base to N and use the highest quality score of the two bases
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/pear:0.9.6--h9d449c0_10
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
version=$(pear -h | grep 'PEAR v' | sed 's/PEAR v//' | sed 's/ .*//') && \
|
||||
echo "pear: $version" > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
91
src/pear/help.txt
Normal file
@@ -0,0 +1,91 @@
|
||||
```bash
|
||||
pear -h
|
||||
```
|
||||
|
||||
____ _____ _ ____
|
||||
| _ \| ____| / \ | _ \
|
||||
| |_) | _| / _ \ | |_) |
|
||||
| __/| |___ / ___ \| _ <
|
||||
|_| |_____/_/ \_\_| \_\
|
||||
PEAR v0.9.6 [January 15, 2015] - [+bzlib +zlib]
|
||||
|
||||
Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
|
||||
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593
|
||||
|
||||
License: Creative Commons Licence
|
||||
Bug-reports and requests to: Tomas.Flouri@h-its.org and Jiajie.Zhang@h-its.org
|
||||
|
||||
|
||||
Usage: pear <options>
|
||||
Standard (mandatory):
|
||||
-f, --forward-fastq <str> Forward paired-end FASTQ file.
|
||||
-r, --reverse-fastq <str> Reverse paired-end FASTQ file.
|
||||
-o, --output <str> Output filename.
|
||||
Optional:
|
||||
-p, --p-value <float> Specify a p-value for the statistical test. If the computed
|
||||
p-value of a possible assembly exceeds the specified p-value
|
||||
then paired-end read will not be assembled. Valid options
|
||||
are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 disables
|
||||
the test. (default: 0.01)
|
||||
-v, --min-overlap <int> Specify the minimum overlap size. The minimum overlap may be
|
||||
set to 1 when the statistical test is used. However, further
|
||||
restricting the minimum overlap size to a proper value may
|
||||
reduce false-positive assembles. (default: 10)
|
||||
-m, --max-assembly-length <int> Specify the maximum possible length of the assembled
|
||||
sequences. Setting this value to 0 disables the restriction
|
||||
and assembled sequences may be arbitrary long. (default: 0)
|
||||
-n, --min-assembly-length <int> Specify the minimum possible length of the assembled
|
||||
sequences. Setting this value to 0 disables the restriction
|
||||
and assembled sequences may be arbitrary short. (default:
|
||||
50)
|
||||
-t, --min-trim-length <int> Specify the minimum length of reads after trimming the low
|
||||
quality part (see option -q). (default: 1)
|
||||
-q, --quality-threshold <int> Specify the quality score threshold for trimming the low
|
||||
quality part of a read. If the quality scores of two
|
||||
consecutive bases are strictly less than the specified
|
||||
threshold, the rest of the read will be trimmed. (default:
|
||||
0)
|
||||
-u, --max-uncalled-base <float> Specify the maximal proportion of uncalled bases in a read.
|
||||
Setting this value to 0 will cause PEAR to discard all reads
|
||||
containing uncalled bases. The other extreme setting is 1
|
||||
which causes PEAR to process all reads independent on the
|
||||
number of uncalled bases. (default: 1)
|
||||
-g, --test-method <int> Specify the type of statistical test. Two options are
|
||||
available. (default: 1)
|
||||
1: Given the minimum allowed overlap, test using the highest
|
||||
OES. Note that due to its discrete nature, this test usually
|
||||
yields a lower p-value for the assembled read than the cut-
|
||||
off (specified by -p). For example, setting the cut-off to
|
||||
0.05 using this test, the assembled reads might have an
|
||||
actual p-value of 0.02.
|
||||
|
||||
2. Use the acceptance probability (m.a.p). This test methods
|
||||
computes the same probability as test method 1. However, it
|
||||
assumes that the minimal overlap is the observed overlap
|
||||
with the highest OES, instead of the one specified by -v.
|
||||
Therefore, this is not a valid statistical test and the
|
||||
'p-value' is in fact the maximal probability for accepting
|
||||
the assembly. Nevertheless, we observed in practice that for
|
||||
the case the actual overlap sizes are relatively small, test
|
||||
2 can correctly assemble more reads with only slightly
|
||||
higher false-positive rate.
|
||||
-e, --empirical-freqs Disable empirical base frequencies. (default: use empirical
|
||||
base frequencies)
|
||||
-s, --score-method <int> Specify the scoring method. (default: 2)
|
||||
1. OES with +1 for match and -1 for mismatch.
|
||||
2: Assembly score (AS). Use +1 for match and -1 for mismatch
|
||||
multiplied by base quality scores.
|
||||
3: Ignore quality scores and use +1 for a match and -1 for a
|
||||
mismatch.
|
||||
-b, --phred-base <int> Base PHRED quality score. (default: 33)
|
||||
-y, --memory <str> Specify the amount of memory to be used. The number may be
|
||||
followed by one of the letters K, M, or G denoting
|
||||
Kilobytes, Megabytes and Gigabytes, respectively. Bytes are
|
||||
assumed in case no letter is specified.
|
||||
-c, --cap <int> Specify the upper bound for the resulting quality score. If
|
||||
set to zero, capping is disabled. (default: 40)
|
||||
-j, --threads <int> Number of threads to use
|
||||
-z, --nbase When merging a base-pair that consists of two non-equal
|
||||
bases out of which none is degenerate, set the merged base
|
||||
to N and use the highest quality score of the two bases
|
||||
-h, --help This help screen.
|
||||
65
src/pear/script.sh
Normal file
@@ -0,0 +1,65 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
[[ "$par_emperical_freqs" == "false" ]] && unset par_emperical_freqs
|
||||
[[ "$par_nbase" == "false" ]] && unset par_nbase
|
||||
|
||||
if [[ "${par_forward_fastq##*.}" == "gz" ]]; then
|
||||
gunzip "$par_forward_fastq"
|
||||
par_forward_fastq=${par_forward_fastq%.*}
|
||||
fi
|
||||
if [[ "${par_reverse_fastq##*.}" == "gz" ]]; then
|
||||
gunzip "$par_reverse_fastq"
|
||||
par_reverse_fastq=${par_reverse_fastq%.*}
|
||||
fi
|
||||
|
||||
output_dir=$(mktemp -d -p "$meta_temp_dir" "pear.XXXXXX")
|
||||
|
||||
pear \
|
||||
-f "$par_forward_fastq" \
|
||||
-r "$par_reverse_fastq" \
|
||||
-o "$output_dir" \
|
||||
${par_p_value:+-p "${par_p_value}"} \
|
||||
${par_min_overlap:+-v "${par_min_overlap}"} \
|
||||
${par_max_assembly_length:+-m "${par_max_assembly_length}"} \
|
||||
${par_min_assembly_length:+-n "${par_min_assembly_length}"} \
|
||||
${par_min_trim_length:+-t "${par_min_trim_length}"} \
|
||||
${par_quality_threshold:+-q "${par_quality_threshold}"} \
|
||||
${par_max_uncalled_base:+-u "${par_max_uncalled_base}"} \
|
||||
${par_test_method:+-g "${par_test_method}"} \
|
||||
${par_score_method:+-s "${par_score_method}"} \
|
||||
${par_phred_base:+-b "${par_phred_base}"} \
|
||||
${meta_memory_mb:+--memory "${meta_memory_mb}M"} \
|
||||
${par_cap:+-c "${par_cap}"} \
|
||||
${meta_cpus:+-j "${meta_cpus}"} \
|
||||
${par_emperical_freqs:+-e} \
|
||||
${par_nbase:+-z}
|
||||
|
||||
|
||||
if [[ "${par_assembled##*.}" == "gz" ]]; then
|
||||
gzip -9 -c "${output_dir}.assembled.fastq" > "${par_assembled}"
|
||||
else
|
||||
mv "${output_dir}.assembled.fastq" "${par_assembled}"
|
||||
fi
|
||||
|
||||
if [[ "${par_unassembled_forward##*.}" == "gz" ]]; then
|
||||
gzip -9 -c "${output_dir}.unassembled.forward.fastq" > "${par_unassembled_forward}"
|
||||
else
|
||||
mv "${output_dir}.unassembled.forward.fastq" "${par_unassembled_forward}"
|
||||
fi
|
||||
|
||||
if [[ "${par_unassembled_reverse##*.}" == "gz" ]]; then
|
||||
gzip -9 -c "${output_dir}.unassembled.reverse.fastq" > "${par_unassembled_reverse}"
|
||||
else
|
||||
mv "${output_dir}.unassembled.reverse.fastq" "${par_unassembled_reverse}"
|
||||
fi
|
||||
|
||||
if [[ "${par_discarded##*.}" == "gz" ]]; then
|
||||
gzip -9 -c "${output_dir}.discarded.fastq" > "${par_discarded}"
|
||||
else
|
||||
mv "${output_dir}.discarded.fastq" "${par_discarded}"
|
||||
fi
|
||||
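The four output blocks at the end of `src/pear/script.sh` share one rule: compress with gzip when the requested output path ends in `.gz`, otherwise move the file. A minimal sketch of that rule as a single function is given below. The name `write_output` and the demo file names are assumptions made for illustration and are not part of the component.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper (not part of the component): deliver a plain-text
# result file to its destination, compressing it when the destination
# name ends in ".gz".
# Usage: write_output <source_file> <destination>
write_output() {
  local src="$1" dest="$2"
  if [[ "${dest##*.}" == "gz" ]]; then
    # As in the original script, the uncompressed source is left in place.
    gzip -9 -c "$src" > "$dest"
  else
    mv "$src" "$dest"
  fi
}

# Demo with a tiny FASTQ record in a scratch directory.
tmp_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp_dir/demo.assembled.fastq"
write_output "$tmp_dir/demo.assembled.fastq" "$tmp_dir/assembled.fastq.gz"
gzip -dc "$tmp_dir/assembled.fastq.gz"
```

With such a helper, each of the four outputs would reduce to one call, and all paths stay quoted so file names containing spaces are handled.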