Build branch qualimap with version qualimap (58d7dac)

Build pipeline: viash-hub.biobox.qualimap-nmgj2

Source commit: 58d7dacfc2

Source message: Update src/qualimap/qualimap_rnaseq/script.sh

Co-authored-by: Robrecht Cannoodt <rcannood@gmail.com>
CI
2024-07-29 13:51:23 +00:00
commit 1e77c47f20
438 changed files with 264552 additions and 0 deletions

18
.gitignore vendored Normal file

@@ -0,0 +1,18 @@
*.DS_Store
*__pycache__
# IDE ignores
.idea/
.vscode/
# R specific ignores
.Rhistory
.Rproj.user
*.Rproj
# viash specific ignores
target/
# nextflow specific ignores
.nextflow*
work

107
CHANGELOG.md Normal file

@@ -0,0 +1,107 @@
# biobox x.x.x
## BUG FIXES
* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
* `multiqc`: update multiple separator to `;` (PR #81).
## MINOR CHANGES
* `busco` components: update BUSCO to `5.7.1`.
## NEW FEATURES
* `qualimap/qualimap_rnaseq`: RNA-seq QC analysis using qualimap (PR #74).
# biobox 0.1.0
## BREAKING CHANGES
* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
splitting up certain file paths.
## NEW FEATURES
* `arriba`: Detect gene fusions from RNA-seq data (PR #1).
* `fastp`: An ultra-fast all-in-one FASTQ preprocessor (PR #3).
* `busco`:
  - `busco/busco_run`: Assess genome assembly and annotation completeness with single copy orthologs (PR #6).
  - `busco/busco_list_datasets`: Lists available busco datasets (PR #18).
  - `busco/busco_download_datasets`: Download busco datasets (PR #19).
* `cutadapt`: Remove adapter sequences from high-throughput sequencing reads (PR #7).
* `featurecounts`: Assign sequence reads to genomic features (PR #11).
* `bgzip`: Add bgzip functionality to compress and decompress files (PR #13).
* `pear`: Paired-end read merger (PR #10).
* `lofreq/call`: Call variants from a BAM file (PR #17).
* `lofreq/indelqual`: Insert indel qualities into BAM file (PR #17).
* `multiqc`: Aggregate results from bioinformatics analyses across many samples into a single report (PR #42).
* `star`:
  - `star/star_align_reads`: Align reads to a reference genome (PR #22).
  - `star/star_genome_generate`: Generate a genome index for STAR alignment (PR #58).
* `gffread`: Validate, filter, convert and perform other operations on GFF files (PR #29).
* `salmon`:
  - `salmon/salmon_index`: Create a salmon index for the transcriptome to use Salmon in the mapping-based mode (PR #24).
  - `salmon/salmon_quant`: Transcript quantification from RNA-seq data (PR #24).
* `samtools`:
  - `samtools/samtools_flagstat`: Counts the number of alignments in SAM/BAM/CRAM files for each FLAG type (PR #31).
  - `samtools/samtools_idxstats`: Reports alignment summary statistics for a SAM/BAM/CRAM file (PR #32).
  - `samtools/samtools_index`: Index SAM/BAM/CRAM files (PR #35).
  - `samtools/samtools_sort`: Sort SAM/BAM/CRAM files (PR #36).
  - `samtools/samtools_stats`: Reports alignment summary statistics for a BAM file (PR #39).
  - `samtools/samtools_faidx`: Indexes FASTA files to enable random access to fasta and fastq files (PR #41).
  - `samtools/samtools_collate`: Shuffles and groups reads in SAM/BAM/CRAM files together by their names (PR #42).
  - `samtools/samtools_view`: Views and converts SAM/BAM/CRAM files (PR #48).
  - `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
  - `samtools/samtools_fasta`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).
* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).
* `umitools`:
  - `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
* `bedtools`:
  - `bedtools_getfasta`: extract sequences from a FASTA file for each of the
    intervals defined in a BED/GFF/VCF file (PR #59).
## MINOR CHANGES
* Uniformize component metadata (PR #23).
* Update to Viash 0.8.5 (PR #25).
* Update to Viash 0.9.0-RC3 (PR #51).
* Update to Viash 0.9.0-RC6 (PR #63).
* Switch to viash-hub/toolbox actions (PR #64).
## DOCUMENTATION
* Update README (PR #64).
## BUG FIXES
* Add escaping character before leading hashtag in the description field of the config file (PR #50).
* Format URL in biobase/bcl_convert description (PR #55).

383
CONTRIBUTING.md Normal file

@@ -0,0 +1,383 @@
# Contributing guidelines
We encourage contributions from the community. To contribute:
1. **Fork the Repository**: Start by forking this repository to your account.
2. **Develop Your Component**: Create your Viash component, ensuring it aligns with our best practices (detailed below).
3. **Submit a Pull Request**: After testing your component, submit a pull request for review.
## Procedure for adding a component
### Step 1: Find a component to contribute
* Find a tool to contribute to this repo.
* Check whether it is already in the [Project board](https://github.com/orgs/viash-hub/projects/1).
* Check whether there is a corresponding [Snakemake wrapper](https://github.com/snakemake/snakemake-wrappers/blob/master/bio) or [nf-core module](https://github.com/nf-core/modules/tree/master/modules/nf-core) which we can use as inspiration.
* Create an issue to show that you are working on this component.
### Step 2: Add config template
Change all occurrences of `xxx` to the name of the component.
Create a file at `src/xxx/config.vsh.yaml` with contents:
```yaml
name: xxx
description: xxx
keywords: [tag1, tag2]
links:
  homepage: yyy
  documentation: yyy
  issue_tracker: yyy
  repository: yyy
references:
  doi: 12345/12345678.yz
license: MIT/Apache-2.0/GPL-3.0/...
argument_groups:
  - name: Inputs
    arguments: <...>
  - name: Outputs
    arguments: <...>
  - name: Arguments
    arguments: <...>
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - <...>
runners:
  - type: executable
  - type: nextflow
```
### Step 3: Fill in the metadata
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
```yaml
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
  homepage: https://arriba.readthedocs.io/en/latest/
  documentation: https://arriba.readthedocs.io/en/latest/
  repository: https://github.com/suhrig/arriba
  issue_tracker: https://github.com/suhrig/arriba/issues
references:
  doi: 10.1101/gr.257246.119
  bibtex: |
    @article{
      ... a bibtex entry in case the doi is not available ...
    }
license: MIT
```
### Step 4: Find a suitable container
Google `biocontainer <name of component>` and find the container that is most suitable. Typically the link will be `https://quay.io/repository/biocontainers/xxx?tab=tags`.
If no such container is found, you can create a custom container in Step 10.
### Step 5: Create help file
To help develop the component, we store the `--help` output of the tool in a file at `src/xxx/help.txt`.
````bash
cat <<'EOF' > src/xxx/help.txt
```sh
xxx --help
```
EOF
docker run quay.io/biocontainers/xxx:tag xxx --help >> src/xxx/help.txt
````
Notes:
* This help file has no functional purpose, but it is useful for the developer to see the help output of the tool.
* Some tools might not have a `--help` argument but instead have a `-h` argument. For example, for `arriba`, the help message is obtained by running `arriba -h`:
```bash
docker run quay.io/biocontainers/arriba:2.4.0--h0033a41_2 arriba -h
```
### Step 6: Create or fetch test data
To help develop the component, it helps to have some test data available. In most cases, we can use the test data from the Snakemake wrappers.
To make sure we can reproduce the test data in the future, we store the command to fetch the test data in a file at `src/xxx/test_data/script.sh`.
```bash
cat <<EOF > src/xxx/test_data/script.sh
# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
  git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
# copy test data
cp -r /tmp/snakemake-wrappers/bio/xxx/test/* src/xxx/test_data
EOF
```
The test data should be suitable for testing this component. Ensure that the test data is small enough: ideally <1KB, preferably <10KB, if need be <100KB.
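The size guideline can be checked mechanically before committing. The following is a minimal sketch, not part of the repository: `list_large_files` is a hypothetical helper, and the demo directory is a throwaway stand-in for `src/xxx/test_data`.

```bash
#!/bin/bash
# Hypothetical helper: list files under a directory that are larger than a byte limit.
# Usage: list_large_files <directory> <max_bytes>
list_large_files() {
  local dir="$1" max_bytes="$2"
  # `-size +Nc` matches files strictly larger than N bytes.
  find "$dir" -type f -size +"${max_bytes}"c
}

# Demo on a temporary directory with one small and one large file.
demo_dir="$(mktemp -d)"
printf 'ACGT\n' > "$demo_dir/small.fa"
head -c 200000 /dev/zero > "$demo_dir/large.bin"

# 102400 bytes = 100KB, the upper bound mentioned above.
large_files="$(list_large_files "$demo_dir" 102400)"
echo "Files over 100KB: ${large_files:-none}"
```

Pointing the helper at `src/xxx/test_data` with a limit of `10240` or `1024` would check the stricter thresholds in the same way.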
### Step 7: Add arguments for the input files
Using the help file, add the input arguments to the config file. The example below is taken from an existing component.
For instance, in the [arriba help file](src/arriba/help.txt), we see the following:

    Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
                  -g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
                  [-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
                  -o fusions.tsv [-O fusions.discarded.tsv] \
                  [OPTIONS]

     -x FILE  File in SAM/BAM/CRAM format with main alignments as generated by STAR
              (Aligned.out.sam). Arriba extracts candidate reads from this file.

Based on this information, we can add the following input arguments to the config file.
```yaml
argument_groups:
  - name: Inputs
    arguments:
      - name: --bam
        alternatives: -x
        type: file
        description: |
          File in SAM/BAM/CRAM format with main alignments as generated by STAR
          (Aligned.out.sam). Arriba extracts candidate reads from this file.
        required: true
        example: Aligned.out.bam
```
Check the [documentation](https://viash.io/reference/config/functionality/arguments) for more information on the format of input arguments.
Several notes:
* Argument names should be formatted in `--snake_case`. This means arguments like `--foo-bar` should be formatted as `--foo_bar`, and short arguments like `-f` should receive a longer name like `--foo`.
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
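
Inside the script, an argument with `multiple: true` arrives as a single string rather than as separate values. The sketch below shows one way to split it into a Bash array, assuming the values are joined by `;` (the `multiple_sep` default described in the changelog) and using `par_input` as a stand-in for a hypothetical `--input` argument.

```bash
#!/bin/bash
# Assumption: Viash passes `multiple: true` values as one string joined by ';'.
# `par_input` stands in for a hypothetical `--input` argument.
par_input="sample1.bam;sample2.bam;dir with spaces/sample3.bam"

# Split on ';' only, so spaces inside a path are preserved.
IFS=';' read -r -a input_files <<< "$par_input"

for f in "${input_files[@]}"; do
  echo "input: $f"
done
```

Splitting on `;` rather than `:` matters for values such as URLs or paths that themselves contain a colon.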
### Step 8: Add arguments for the output files
By looking at the help file, we now also add output arguments to the config file.
For example, in the [arriba help file](src/arriba/help.txt), we see the following:

    Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
                  -g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
                  [-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
                  -o fusions.tsv [-O fusions.discarded.tsv] \
                  [OPTIONS]

     -o FILE  Output file with fusions that have passed all filters.

     -O FILE  Output file with fusions that were discarded due to filtering.

Based on this information, we can add the following output arguments to the config file.
```yaml
argument_groups:
  - name: Outputs
    arguments:
      - name: --fusions
        alternatives: -o
        type: file
        direction: output
        description: |
          Output file with fusions that have passed all filters.
        required: true
        example: fusions.tsv
      - name: --fusions_discarded
        alternatives: -O
        type: file
        direction: output
        description: |
          Output file with fusions that were discarded due to filtering.
        required: false
        example: fusions.discarded.tsv
```
Note:
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
### Step 9: Add arguments for the other arguments
Finally, add all other arguments to the config file. There are a few exceptions:
* Arguments related to specifying CPU and memory requirements are handled separately and should not be added to the config file.
* Arguments that only print information, such as the version (`-v`, `--version`) or the help (`-h`, `--help`), should not be added to the config file.
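
Since CPU and memory settings are kept out of the config arguments, the script has to pick them up some other way. The sketch below assumes Viash exposes them as the variables `meta_cpus` and `meta_memory_gb`, which are empty when not set; `--threads` and `--memory` are hypothetical flags of the wrapped tool, and the values are hard-coded here only so the example runs on its own.

```bash
#!/bin/bash
# Assumption: in a built component Viash fills in `meta_cpus` and `meta_memory_gb`.
# They are set by hand here so that the sketch is self-contained.
meta_cpus="4"
meta_memory_gb=""

# Only pass a resource flag when the corresponding value is set.
# `--threads` and `--memory` are hypothetical tool flags.
resource_args=()
if [ -n "$meta_cpus" ]; then
  resource_args+=(--threads "$meta_cpus")
fi
if [ -n "$meta_memory_gb" ]; then
  resource_args+=(--memory "${meta_memory_gb}G")
fi

echo "Would run: xxx ${resource_args[*]}"
```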
### Step 10: Add a Docker engine
To ensure reproducibility of components, we require that all components are run in a Docker container.
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:0.1.0--py_0
```
The container should have your tool installed, as well as `ps`.
If you didn't find a suitable container in Step 4, you can create a custom container. For example:
```yaml
engines:
  - type: docker
    image: python:3.10
    setup:
      - type: python
        packages: numpy
```
For more information on how to do this, see the [documentation](https://viash.io/guide/component/add-dependencies.html#steps-for-creating-a-custom-docker-platform).
Here is a list of base containers we can recommend:
* Bash: [`bash`](https://hub.docker.com/_/bash), [`ubuntu`](https://hub.docker.com/_/ubuntu)
* C#: [`ghcr.io/data-intuitive/dotnet-script`](https://github.com/data-intuitive/ghcr-dotnet-script/pkgs/container/dotnet-script)
* JavaScript: [`node`](https://hub.docker.com/_/node)
* Python: [`python`](https://hub.docker.com/_/python), [`nvcr.io/nvidia/pytorch`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
* R: [`eddelbuettel/r2u`](https://hub.docker.com/r/eddelbuettel/r2u), [`rocker/tidyverse`](https://hub.docker.com/r/rocker/tidyverse)
* Scala: [`sbtscala/scala-sbt`](https://hub.docker.com/r/sbtscala/scala-sbt)
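
Whichever container is chosen, Step 10 asks that it provides both the tool and `ps`. A quick way to confirm this is to check the commands from a shell inside the container. The sketch below is illustrative only: `missing_commands` is a hypothetical helper, and the demo checks `ls` plus a deliberately non-existent command so that it runs anywhere.

```bash
#!/bin/bash
# Hypothetical helper: print each of the given commands that is not found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" > /dev/null 2>&1 || echo "$cmd"
  done
}

# Demo with one command that exists and one that does not.
missing="$(missing_commands ls no_such_command_xyz)"
echo "Missing commands: ${missing:-none}"
```

Inside the container, the same function would be called with the tool name and `ps`, for example `missing_commands xxx ps`.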
### Step 11: Write a runner script
Next, create a Bash script named `src/xxx/script.sh` which runs the tool with the arguments defined in the config.
```bash
#!/bin/bash
## VIASH START
## VIASH END
xxx \
  --input "$par_input" \
  --output "$par_output" \
  $([ "$par_option" = "true" ] && echo "--option")
```
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
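
Because that region is overwritten at build time, it can hold placeholder values that let the script run directly during development. The following is a sketch under that assumption; the argument names and paths are hypothetical, and the command is printed rather than executed since `xxx` is a stand-in tool name.

```bash
#!/bin/bash

## VIASH START
# Placeholder values for running the script outside Viash.
# This region is replaced when the component is built; all names are hypothetical.
par_input="test_data/input.txt"
par_output="output.txt"
par_option="true"
## VIASH END

# Assemble the command from the par_* variables.
cmd=(xxx --input "$par_input" --output "$par_output")
if [ "$par_option" = "true" ]; then
  cmd+=(--option)
fi

echo "Would run: ${cmd[*]}"
```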
As an example, this is what the Bash script for the `arriba` component looks like:
```bash
#!/bin/bash
## VIASH START
## VIASH END
arriba \
  -x "$par_bam" \
  -a "$par_genome" \
  -g "$par_gene_annotation" \
  -o "$par_fusions" \
  ${par_known_fusions:+-k "${par_known_fusions}"} \
  ${par_blacklist:+-b "${par_blacklist}"} \
  ${par_structural_variants:+-d "${par_structural_variants}"} \
  $([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
  $([ "$par_extra_information" = "true" ] && echo "-X") \
  $([ "$par_fill_gaps" = "true" ] && echo "-I")
```
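
The script above relies on two Bash idioms: `${var:+-k "${var}"}` expands to the flag and its value only when `var` is non-empty, and `$([ "$var" = "true" ] && echo "-I")` emits a flag only when a boolean is `true`. The sketch below isolates both; `show_args` is a hypothetical helper that reports the arguments it receives, and the variable values are made up for the demo.

```bash
#!/bin/bash
# Hypothetical helper: report how many arguments were received and what they are.
show_args() { echo "$# args: $*"; }

par_known_fusions=""           # optional argument that was not provided
par_blacklist="blacklist.tsv"  # optional argument that was provided
par_fill_gaps="true"           # boolean flag that is enabled

result="$(show_args \
  ${par_known_fusions:+-k "${par_known_fusions}"} \
  ${par_blacklist:+-b "${par_blacklist}"} \
  $([ "$par_fill_gaps" = "true" ] && echo "-I"))"

echo "$result"
# prints: 3 args: -b blacklist.tsv -I
```

The empty `par_known_fusions` contributes no arguments at all, which is why the tool never sees a dangling `-k`.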
### Step 12: Create test script
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
```yaml
# ...
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
```
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
```bash
#!/bin/bash
## VIASH START
## VIASH END
echo "> Run xxx with test data"
"$meta_executable" \
  --input "$meta_resources_dir/test_data/input.txt" \
  --output "output.txt" \
  --option

echo ">> Checking output"
[ ! -f "output.txt" ] && echo "Output file output.txt does not exist" && exit 1

echo "> Test successful"
```
This is what the test script for the `arriba` component looks like:
```bash
#!/bin/bash
## VIASH START
## VIASH END
echo "> Run arriba with blacklist"
"$meta_executable" \
  --bam "$meta_resources_dir/test_data/A.bam" \
  --genome "$meta_resources_dir/test_data/genome.fasta" \
  --gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
  --blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
  --fusions "fusions.tsv" \
  --fusions_discarded "fusions_discarded.tsv" \
  --interesting_contigs "1,2"
echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
```
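
The repeated `[ ... ] && echo ... && exit 1` checks can be wrapped in small helper functions. The following is a sketch, not an existing biobox utility: the `assert_*` names are hypothetical, and the demo file is created on the spot so the example is self-contained.

```bash
#!/bin/bash
# Hypothetical helpers: each prints a message and returns non-zero when its check fails.
assert_file_exists() {
  [ -f "$1" ] || { echo "File '$1' does not exist" >&2; return 1; }
}
assert_file_not_empty() {
  [ -s "$1" ] || { echo "File '$1' is empty" >&2; return 1; }
}
assert_file_contains() {
  grep -q -- "$2" "$1" || { echo "File '$1' does not contain '$2'" >&2; return 1; }
}

# Demo on a temporary file standing in for a component output.
tmp_out="$(mktemp)"
printf 'gene1\tgene2\n' > "$tmp_out"

assert_file_exists "$tmp_out" \
  && assert_file_not_empty "$tmp_out" \
  && assert_file_contains "$tmp_out" "gene1" \
  && echo "> Test successful"
```

In a test script each call would typically be followed by `|| exit 1`. Note that a script whose last command is a check of the form `[ ! -f file ] && echo ... && exit 1` returns a non-zero status when the check passes, so it is safer to end with a command that succeeds, such as a final `echo`.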
### Step 13: Create a `/var/software_versions.txt` file
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
For now, this is managed by creating a file `/var/software_versions.txt` in the `setup` section of the Docker engine.
```yaml
engines:
  - type: docker
    image: quay.io/biocontainers/xxx:0.1.0--py_0
    setup:
      - type: docker
        run: |
          echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```
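
Instead of hard-coding the version string, it could be derived from the tool's own version output. This is only a sketch of the idea: the `xxx` function below fakes a tool whose `--version` output format is an assumption, and a temporary file stands in for `/var/software_versions.txt`.

```bash
#!/bin/bash
# Stand-in for a real tool; the output format of `--version` is an assumption.
xxx() { echo "xxx version 0.1.0"; }

# Keep only the first version-like token (digits separated by dots).
version="$(xxx --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"

# A temporary file is used here; in the container this would be /var/software_versions.txt.
versions_file="$(mktemp)"
echo "xxx: \"$version\"" > "$versions_file"
cat "$versions_file"
```

Real tools print their version in many different formats, so the extraction pattern needs to be checked against the actual output of each tool.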

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 Data Intuitive
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

72
README.md Normal file

@@ -0,0 +1,72 @@
# 🌱📦 biobox
[![ViashHub](https://img.shields.io/badge/ViashHub-biobox-7a4baa.png)](https://web.viash-hub.com/packages/biobox)
[![GitHub](https://img.shields.io/badge/GitHub-viash--hub%2Fbiobox-blue.png)](https://github.com/viash-hub/biobox)
[![GitHub
License](https://img.shields.io/github/license/viash-hub/biobox.png)](https://github.com/viash-hub/biobox/blob/main/LICENSE)
[![GitHub
Issues](https://img.shields.io/github/issues/viash-hub/biobox.png)](https://github.com/viash-hub/biobox/issues)
[![Viash
version](https://img.shields.io/badge/Viash-v0.9.0--RC6-blue)](https://viash.io)
A collection of bioinformatics tools for working with sequence data.
## Objectives
- **Reusability**: Facilitating the use of components across various
projects and contexts.
- **Reproducibility**: Ensuring that components are reproducible and can
be easily shared.
- **Best Practices**: Adhering to established standards in software
development and bioinformatics.
## Contributing
We encourage contributions from the community. To contribute:
1. **Fork the Repository**: Start by forking this repository to your
account.
2. **Develop Your Component**: Create your Viash component, ensuring it
aligns with our best practices (detailed below).
3. **Submit a Pull Request**: After testing your component, submit a
pull request for review.
## Contribution Guidelines
The contribution guidelines describe which steps you should follow to
contribute a component to this repository.
1. Find a component to contribute
2. Add config template
3. Fill in the metadata
4. Find a suitable container
5. Create help file
6. Create or fetch test data
7. Add arguments for the input files
8. Add arguments for the output files
9. Add arguments for the other arguments
10. Add a Docker engine
11. Write a runner script
12. Create test script
13. Create a `/var/software_versions.txt` file
See the
[CONTRIBUTING](https://github.com/viash-hub/biobox/blob/main/CONTRIBUTING.md)
file for more details.
## Support and Community
For support, questions, or to join our community:
- **Issues**: Submit questions or issues via the [GitHub issue
tracker](https://github.com/viash-hub/biobox/issues).
- **Discussions**: Join our discussions via [GitHub
Discussions](https://github.com/viash-hub/biobox/discussions).
## License
This repository is licensed under an MIT license. See the
[LICENSE](https://github.com/viash-hub/biobox/blob/main/LICENSE) file
for details.

62
README.qmd Normal file

@@ -0,0 +1,62 @@
---
format: gfm
---
```{r setup, include=FALSE}
project <- yaml::read_yaml("_viash.yaml")
license <- paste0(project$links$repository, "/blob/main/LICENSE")
contributing <- paste0(project$links$repository, "/blob/main/CONTRIBUTING.md")
```
# 🌱📦 `r project$name`
[![ViashHub](https://img.shields.io/badge/ViashHub-`r project$name`-7a4baa)](https://web.viash-hub.com/packages/`r project$name`)
[![GitHub](https://img.shields.io/badge/GitHub-viash--hub%2F`r project$name`-blue)](`r project$links$repository`)
[![GitHub License](https://img.shields.io/github/license/viash-hub/`r project$name`)](`r license`)
[![GitHub Issues](https://img.shields.io/github/issues/viash-hub/`r project$name`)](`r project$links$issue_tracker`)
[![Viash version](https://img.shields.io/badge/Viash-v`r gsub("-", "--", project$viash_version)`-blue)](https://viash.io)
`r project$description`
## Objectives
- **Reusability**: Facilitating the use of components across various projects and contexts.
- **Reproducibility**: Ensuring that components are reproducible and can be easily shared.
- **Best Practices**: Adhering to established standards in software development and bioinformatics.
## Contributing
We encourage contributions from the community. To contribute:
1. **Fork the Repository**: Start by forking this repository to your account.
2. **Develop Your Component**: Create your Viash component, ensuring it aligns with our best practices (detailed below).
3. **Submit a Pull Request**: After testing your component, submit a pull request for review.
## Contribution Guidelines
The contribution guidelines describe which steps you should follow to contribute a component to this repository.
```{r echo=FALSE}
lines <- readr::read_lines("CONTRIBUTING.md")
index_start <- grep("^### Step [0-9]*:", lines)
index_end <- c(index_start[-1] - 1, length(lines))
name <- gsub("^### Step [0-9]*: *", "", lines[index_start])
knitr::asis_output(
  paste(paste0(" 1. ", name, "\n"), collapse = "")
)
```
See the [CONTRIBUTING](`r contributing`) file for more details.
## Support and Community
For support, questions, or to join our community:
- **Issues**: Submit questions or issues via the [GitHub issue tracker](`r project$links$issue_tracker`).
- **Discussions**: Join our discussions via [GitHub Discussions](`r project$links$repository`/discussions).
## License
This repository is licensed under an MIT license. See the [LICENSE](`r license`) file for details.

13
_viash.yaml Normal file

@@ -0,0 +1,13 @@
name: biobox
description: |
  A collection of bioinformatics tools for working with sequence data.
license: MIT
keywords: [bioinformatics, modules, sequencing]
links:
  issue_tracker: https://github.com/viash-hub/biobox/issues
  repository: https://github.com/viash-hub/biobox
viash_version: 0.9.0-RC6
config_mods: |
  .requirements.commands := ['ps']

3
main.nf Normal file

@@ -0,0 +1,3 @@
workflow {
  print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
}

6
nextflow.config Normal file

@@ -0,0 +1,6 @@
manifest {
  name = "biobox"
  version = "qualimap"
  defaultBranch = "main"
  nextflowVersion = "!>=20.12.1-edge"
}

385
src/arriba/config.vsh.yaml Normal file

@@ -0,0 +1,385 @@
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
  homepage: https://arriba.readthedocs.io/en/latest/
  documentation: https://arriba.readthedocs.io/en/latest/
  repository: https://github.com/suhrig/arriba
references:
  doi: 10.1101/gr.257246.119
license: MIT
requirements:
  cpus: 1
  commands: [ arriba ]
argument_groups:
  - name: Inputs
    arguments:
      - name: --bam
        alternatives: -x
        type: file
        description: |
          File in SAM/BAM/CRAM format with main alignments as generated by STAR
          (Aligned.out.sam). Arriba extracts candidate reads from this file.
        required: true
        example: Aligned.out.bam
      - name: --genome
        alternatives: -a
        type: file
        description: |
          FastA file with genome sequence (assembly). The file may be gzip-compressed. An
          index with the file extension .fai must exist only if CRAM files are processed.
        required: true
        example: assembly.fa
      - name: --gene_annotation
        alternatives: -g
        type: file
        description: |
          GTF file with gene annotation. The file may be gzip-compressed.
        required: true
        example: annotation.gtf
      - name: --known_fusions
        alternatives: -k
        type: file
        description: |
          File containing known/recurrent fusions. Some cancer entities are often
          characterized by fusions between the same pair of genes. In order to boost
          sensitivity, a list of known fusions can be supplied using this parameter. The list
          must contain two columns with the names of the fused genes, separated by tabs.
        required: false
        example: known_fusions.tsv
      - name: --blacklist
        alternatives: -b
        type: file
        description: |
          File containing blacklisted events (recurrent artifacts and transcripts
          observed in healthy tissue).
        required: false
        example: blacklist.tsv
      - name: --structural_variants
        alternatives: -d
        type: file
        description: |
          Tab-separated file with coordinates of structural variants found using
          whole-genome sequencing data. These coordinates serve to increase sensitivity
          towards weakly expressed fusions and to eliminate fusions with low evidence.
        required: false
        example: structural_variants_from_WGS.tsv
      - name: --tags
        alternatives: -t
        type: file
        description: |
          Tab-separated file containing fusions to annotate with tags in the 'tags' column.
          The first two columns specify the genes; the third column specifies the tag. The
          file may be gzip-compressed.
        required: false
        example: tags.tsv
      - name: --protein_domains
        alternatives: -p
        type: file
        description: |
          File in GFF3 format containing coordinates of the protein domains of genes. The
          protein domains retained in a fusion are listed in the column
          'retained_protein_domains'. The file may be gzip-compressed.
        required: false
        example: protein_domains.gff3
  - name: Outputs
    arguments:
      - name: --fusions
        alternatives: -o
        type: file
        direction: output
        description: |
          Output file with fusions that have passed all filters.
        required: true
        example: fusions.tsv
      - name: --fusions_discarded
        alternatives: -O
        type: file
        direction: output
        description: |
          Output file with fusions that were discarded due to filtering.
        required: false
        example: fusions.discarded.tsv
  - name: Arguments
    arguments:
      - name: --max_genomic_breakpoint_distance
        alternatives: -D
        type: long
        description: |
          When a file with genomic breakpoints obtained via
          whole-genome sequencing is supplied via the --structural_variants
          parameter, this parameter determines how far a
          genomic breakpoint may be away from a
          transcriptomic breakpoint to consider it as a
          related event. For events inside genes, the
          distance is added to the end of the gene; for
          intergenic events, the distance threshold is
          applied as is. Default: 100000.
        required: false
      - name: --strandedness
        alternatives: -s
        type: string
        description: |
          Whether a strand-specific protocol was used for library preparation,
          and if so, the type of strandedness (auto/yes/no/reverse). When
          unstranded data is processed, the strand can sometimes be inferred from
          splice-patterns. But in unclear situations, stranded data helps
          resolve ambiguities. Default: auto
        choices: ["auto", "yes", "no", "reverse"]
        required: false
      - name: --interesting_contigs
        alternatives: -i
        type: string
        description: |
          List of interesting contigs. Fusions between genes
          on other contigs are ignored. Contigs can be specified with or without the
          prefix "chr". Asterisks (*) are treated as wild-cards.
          Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*
        required: false
        multiple: true
        example: ["1", "2", "AC_*", "NC_*"]
      - name: --viral_contigs
        alternatives: -v
        type: string
        description: |
          List of viral contigs. Asterisks (*) are treated as
          wild-cards.
          Default: AC_* NC_*
        required: false
        multiple: true
        example: ["AC_*", "NC_*"]
      - name: --disable_filters
        alternatives: -f
        type: string
        description: |
          List of filters to disable. By default all filters are
          enabled.
        choices: [ homologs, low_entropy, isoforms,
          top_expressed_viral_contigs, viral_contigs, uninteresting_contigs,
          non_coding_neighbors, mismatches, duplicates, no_genomic_support,
          genomic_support, intronic, end_to_end, relative_support,
          low_coverage_viral_contigs, merge_adjacent, mismappers, multimappers,
          same_gene, long_gap, internal_tandem_duplication, small_insert_size,
          read_through, inconsistently_clipped, intragenic_exonic,
          marginal_read_through, spliced, hairpin, blacklist, min_support,
          select_best, in_vitro, short_anchor, known_fusions, no_coverage,
          homopolymer, many_spliced ]
        required: false
        multiple: true
      - name: --max_e_value
        alternatives: -E
        type: double
        description: |
          Arriba estimates the number of fusions with a given number of supporting
          reads which one would expect to see by random chance. If the expected number
          of fusions (e-value) is higher than this threshold, the fusion is
          discarded by the 'relative_support' filter. Note: Increasing this
          threshold can dramatically increase the number of false positives and may
          increase the runtime of resource-intensive steps. Fractional values are
          possible. Default: 0.300000
        required: false
      - name: --min_supporting_reads
        alternatives: -S
        type: integer
        description: |
          The 'min_support' filter discards all fusions with fewer than
          this many supporting reads (split reads and discordant mates
          combined). Default: 2
        required: false
        example: 2
      - name: --max_mismappers
        alternatives: -m
        type: double
        description: |
          When more than this fraction of supporting reads turns out to be
          mismappers, the 'mismappers' filter discards the fusion. Default:
          0.800000
        required: false
        example: 0.8
      - name: --max_homolog_identity
        alternatives: -L
        type: double
        description: |
          Genes with more than the given fraction of sequence identity are
          considered homologs and removed by the 'homologs' filter.
          Default: 0.300000
        required: false
        example: 0.3
      - name: --homopolymer_length
        alternatives: -H
        type: integer
        description: |
          The 'homopolymer' filter removes breakpoints adjacent to
          homopolymers of the given length or more. Default: 6
        required: false
        example: 6
      - name: --read_through_distance
        alternatives: -R
        type: integer
        description: |
          The 'read_through' filter removes read-through fusions
          where the breakpoints are less than the given distance away
          from each other. Default: 10000
        required: false
        example: 10000
      - name: --min_anchor_length
        alternatives: -A
        type: integer
        description: |
          Alignment artifacts are often characterized by split reads coming
          from only one gene and no discordant mates. Moreover, the split
          reads only align to a short stretch in one of the genes. The
          'short_anchor' filter removes these fusions. This parameter sets
          the threshold in bp for what the filter considers short. Default: 23
        required: false
        example: 23
      - name: --many_spliced_events
        alternatives: -M
        type: integer
        description: |
          The 'many_spliced' filter recovers fusions between genes that
          have at least this many spliced breakpoints. Default: 4
        required: false
        example: 4
      - name: --max_kmer_content
        alternatives: -K
        type: double
        description: |
          The 'low_entropy' filter removes reads with repetitive 3-mers. If
          the 3-mers make up more than the given fraction of the sequence, then
          the read is discarded. Default: 0.600000
        required: false
        example: 0.6
      - name: --max_mismatch_pvalue
        alternatives: -V
        type: double
        description: |
          The 'mismatches' filter uses a binomial model to calculate a
          p-value for observing a given number of mismatches in a read. If
          the number of mismatches is too high, the read is discarded.
          Default: 0.010000
        required: false
        example: 0.05
      - name: --fragment_length
        alternatives: -F
        type: integer
        description: |
          When paired-end data is given, the fragment length is estimated
          automatically and this parameter has no effect. But when single-end
          data is given, the mean fragment length should be specified to
          effectively filter fusions that arise from hairpin structures.
          Default: 200
        required: false
        example: 200
      - name: --max_reads
        alternatives: -U
        type: integer
        description: |
          Subsample fusions with more than the given number of supporting reads. This
          improves performance without compromising sensitivity, as long as the
          threshold is high. Counting of supporting reads beyond the threshold is
          inaccurate, obviously. Default: 300
        required: false
        example: 300
      - name: --quantile
        alternatives: -Q
        type: double
        description: |
          Highly expressed genes are prone to produce artifacts during library
          preparation. Genes with an expression above the given quantile are eligible
          for filtering by the 'in_vitro' filter. Default: 0.998000
        required: false
        example: 0.998
      - name: --exonic_fraction
        alternatives: -e
        type: double
        description: |
          The breakpoints of false-positive predictions of intragenic events
          are often both in exons. True predictions are more likely to have at
          least one breakpoint in an intron, because introns are larger. If the
          fraction of exonic sequence between two breakpoints is smaller than
          the given fraction, the 'intragenic_exonic' filter discards the
          event. Default: 0.330000
        required: false
        example: 0.33
      - name: --top_n
        alternatives: -T
        type: integer
        description: |
          Only report viral integration sites of the top N most highly expressed viral
          contigs. Default: 5
        required: false
        example: 5
      - name: --covered_fraction
        alternatives: -C
        type: double
        description: |
          Ignore virally associated events if the virus is not fully
          expressed, i.e., less than the given fraction of the viral contig is
          transcribed. Default: 0.050000
        required: false
        example: 0.05
      - name: --max_itd_length
        alternatives: -l
        type: integer
        description: |
          Maximum length of internal tandem duplications. Note: Increasing
          this value beyond the default can impair performance and lead to many
false positives. Default: 100
required: false
example: 100
- name: --min_itd_allele_fraction
alternatives: -z
type: double
description: |
Required fraction of supporting reads to report an internal
tandem duplication. Default: 0.070000
required: false
example: 0.07
- name: --min_itd_supporting_reads
alternatives: -Z
type: integer
description: |
Required absolute number of supporting reads to report an
internal tandem duplication. Default: 10
required: false
example: 10
- name: --skip_duplicate_marking
alternatives: -u
type: boolean_true
description: |
Instead of performing duplicate marking itself, Arriba relies on duplicate marking by a
preceding program using the BAM_FDUP flag. This makes sense when unique molecular
identifiers (UMI) are used.
- name: --extra_information
alternatives: -X
type: boolean_true
description: |
To reduce the runtime and file size, by default, the columns 'fusion_transcript',
'peptide_sequence', and 'read_identifiers' are left empty in the file containing
discarded fusion candidates (see parameter -O). When this flag is set, this extra
information is reported in the discarded fusions file.
- name: --fill_gaps
alternatives: -I
type: boolean_true
description: |
If assembly of the fusion transcript sequence from the supporting reads is incomplete
(denoted as '...'), fill the gaps using the assembly sequence wherever possible.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/arriba:2.4.0--h0033a41_2
setup:
- type: docker
run: |
arriba -h 2>&1 | grep 'Version:' | sed 's/Version:\s\(.*\)/arriba: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
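
The setup step above derives the version entry from the `Version:` line of the help output. A minimal sketch of that transformation on a hard-coded sample line follows. The sample text is copied from help.txt; the POSIX class `[[:space:]]` is used in place of `\s` as a portability assumption, and `sample_line` and `version_entry` are illustrative names only.

```bash
#!/bin/bash
# Sample line as printed by `arriba -h` (copied from help.txt).
sample_line='Version: 2.4.0'

# Keep only the version line and rewrite "Version: X" into `arriba: "X"`.
version_entry=$(printf '%s\n' "$sample_line" \
  | grep 'Version:' \
  | sed 's/Version:[[:space:]]\(.*\)/arriba: "\1"/')

echo "$version_entry"
```

In the real image the result is redirected to /var/software_versions.txt rather than echoed.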

198
src/arriba/help.txt Normal file

@@ -0,0 +1,198 @@
```bash
arriba -h
```
Arriba gene fusion detector
---------------------------
Version: 2.4.0
Arriba is a fast tool to search for aberrant transcripts such as gene fusions.
It is based on chimeric alignments found by the STAR RNA-Seq aligner.
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
-o fusions.tsv [-O fusions.discarded.tsv] \
[OPTIONS]
-c FILE File in SAM/BAM/CRAM format with chimeric alignments as generated by STAR
(Chimeric.out.sam). This parameter is only required, if STAR was run with the
parameter '--chimOutType SeparateSAMold'. When STAR was run with the parameter
'--chimOutType WithinBAM', it suffices to pass the parameter -x to Arriba and -c
can be omitted.
-x FILE File in SAM/BAM/CRAM format with main alignments as generated by STAR
(Aligned.out.sam). Arriba extracts candidate reads from this file.
-g FILE GTF file with gene annotation. The file may be gzip-compressed.
-G GTF_FEATURES Comma-/space-separated list of names of GTF features.
Default: gene_name=gene_name|gene_id gene_id=gene_id
transcript_id=transcript_id feature_exon=exon feature_CDS=CDS
-a FILE FastA file with genome sequence (assembly). The file may be gzip-compressed. An
index with the file extension .fai must exist only if CRAM files are processed.
-b FILE File containing blacklisted events (recurrent artifacts and transcripts
observed in healthy tissue).
-k FILE File containing known/recurrent fusions. Some cancer entities are often
characterized by fusions between the same pair of genes. In order to boost
sensitivity, a list of known fusions can be supplied using this parameter. The list
must contain two columns with the names of the fused genes, separated by tabs.
-o FILE Output file with fusions that have passed all filters.
-O FILE Output file with fusions that were discarded due to filtering.
-t FILE Tab-separated file containing fusions to annotate with tags in the 'tags' column.
The first two columns specify the genes; the third column specifies the tag. The
file may be gzip-compressed.
-p FILE File in GFF3 format containing coordinates of the protein domains of genes. The
protein domains retained in a fusion are listed in the column
'retained_protein_domains'. The file may be gzip-compressed.
-d FILE Tab-separated file with coordinates of structural variants found using
whole-genome sequencing data. These coordinates serve to increase sensitivity
towards weakly expressed fusions and to eliminate fusions with low evidence.
-D MAX_GENOMIC_BREAKPOINT_DISTANCE When a file with genomic breakpoints obtained via
whole-genome sequencing is supplied via the -d
parameter, this parameter determines how far a
genomic breakpoint may be away from a
transcriptomic breakpoint to consider it as a
related event. For events inside genes, the
distance is added to the end of the gene; for
intergenic events, the distance threshold is
applied as is. Default: 100000
-s STRANDEDNESS Whether a strand-specific protocol was used for library preparation,
and if so, the type of strandedness (auto/yes/no/reverse). When
unstranded data is processed, the strand can sometimes be inferred from
splice-patterns. But in unclear situations, stranded data helps
resolve ambiguities. Default: auto
-i CONTIGS Comma-/space-separated list of interesting contigs. Fusions between genes
on other contigs are ignored. Contigs can be specified with or without the
prefix "chr". Asterisks (*) are treated as wild-cards.
Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*
-v CONTIGS Comma-/space-separated list of viral contigs. Asterisks (*) are treated as
wild-cards.
Default: AC_* NC_*
-f FILTERS Comma-/space-separated list of filters to disable. By default all filters are
enabled. Valid values: homologs, low_entropy, isoforms,
top_expressed_viral_contigs, viral_contigs, uninteresting_contigs,
non_coding_neighbors, mismatches, duplicates, no_genomic_support,
genomic_support, intronic, end_to_end, relative_support,
low_coverage_viral_contigs, merge_adjacent, mismappers, multimappers,
same_gene, long_gap, internal_tandem_duplication, small_insert_size,
read_through, inconsistently_clipped, intragenic_exonic,
marginal_read_through, spliced, hairpin, blacklist, min_support,
select_best, in_vitro, short_anchor, known_fusions, no_coverage,
homopolymer, many_spliced
-E MAX_E-VALUE Arriba estimates the number of fusions with a given number of supporting
reads which one would expect to see by random chance. If the expected number
of fusions (e-value) is higher than this threshold, the fusion is
discarded by the 'relative_support' filter. Note: Increasing this
threshold can dramatically increase the number of false positives and may
increase the runtime of resource-intensive steps. Fractional values are
possible. Default: 0.300000
-S MIN_SUPPORTING_READS The 'min_support' filter discards all fusions with fewer than
this many supporting reads (split reads and discordant mates
combined). Default: 2
-m MAX_MISMAPPERS When more than this fraction of supporting reads turns out to be
mismappers, the 'mismappers' filter discards the fusion. Default:
0.800000
-L MAX_HOMOLOG_IDENTITY Genes with more than the given fraction of sequence identity are
considered homologs and removed by the 'homologs' filter.
Default: 0.300000
-H HOMOPOLYMER_LENGTH The 'homopolymer' filter removes breakpoints adjacent to
homopolymers of the given length or more. Default: 6
-R READ_THROUGH_DISTANCE The 'read_through' filter removes read-through fusions
where the breakpoints are less than the given distance away
from each other. Default: 10000
-A MIN_ANCHOR_LENGTH Alignment artifacts are often characterized by split reads coming
from only one gene and no discordant mates. Moreover, the split
reads only align to a short stretch in one of the genes. The
'short_anchor' filter removes these fusions. This parameter sets
the threshold in bp for what the filter considers short. Default: 23
-M MANY_SPLICED_EVENTS The 'many_spliced' filter recovers fusions between genes that
have at least this many spliced breakpoints. Default: 4
-K MAX_KMER_CONTENT The 'low_entropy' filter removes reads with repetitive 3-mers. If
the 3-mers make up more than the given fraction of the sequence, then
the read is discarded. Default: 0.600000
-V MAX_MISMATCH_PVALUE The 'mismatches' filter uses a binomial model to calculate a
p-value for observing a given number of mismatches in a read. If
the number of mismatches is too high, the read is discarded.
Default: 0.010000
-F FRAGMENT_LENGTH When paired-end data is given, the fragment length is estimated
automatically and this parameter has no effect. But when single-end
data is given, the mean fragment length should be specified to
effectively filter fusions that arise from hairpin structures.
Default: 200
-U MAX_READS Subsample fusions with more than the given number of supporting reads. This
improves performance without compromising sensitivity, as long as the
threshold is high. Counting of supporting reads beyond the threshold is
inaccurate, obviously. Default: 300
-Q QUANTILE Highly expressed genes are prone to produce artifacts during library
preparation. Genes with an expression above the given quantile are eligible
for filtering by the 'in_vitro' filter. Default: 0.998000
-e EXONIC_FRACTION The breakpoints of false-positive predictions of intragenic events
are often both in exons. True predictions are more likely to have at
least one breakpoint in an intron, because introns are larger. If the
fraction of exonic sequence between two breakpoints is smaller than
the given fraction, the 'intragenic_exonic' filter discards the
event. Default: 0.330000
-T TOP_N Only report viral integration sites of the top N most highly expressed viral
contigs. Default: 5
-C COVERED_FRACTION Ignore virally associated events if the virus is not fully
expressed, i.e., less than the given fraction of the viral contig is
transcribed. Default: 0.050000
-l MAX_ITD_LENGTH Maximum length of internal tandem duplications. Note: Increasing
this value beyond the default can impair performance and lead to many
false positives. Default: 100
-z MIN_ITD_ALLELE_FRACTION Required fraction of supporting reads to report an internal
tandem duplication. Default: 0.070000
-Z MIN_ITD_SUPPORTING_READS Required absolute number of supporting reads to report an
internal tandem duplication. Default: 10
-u Instead of performing duplicate marking itself, Arriba relies on duplicate marking by a
preceding program using the BAM_FDUP flag. This makes sense when unique molecular
identifiers (UMI) are used.
-X To reduce the runtime and file size, by default, the columns 'fusion_transcript',
'peptide_sequence', and 'read_identifiers' are left empty in the file containing
discarded fusion candidates (see parameter -O). When this flag is set, this extra
information is reported in the discarded fusions file.
-I If assembly of the fusion transcript sequence from the supporting reads is incomplete
(denoted as '...'), fill the gaps using the assembly sequence wherever possible.
-h Print help and exit.
Code repository: https://github.com/suhrig/arriba
Get help/report bugs: https://github.com/suhrig/arriba/issues
User manual: https://arriba.readthedocs.io/
Please cite: https://doi.org/10.1101/gr.257246.119

54
src/arriba/script.sh Normal file

@@ -0,0 +1,54 @@
#!/bin/bash
## VIASH START
## VIASH END
# unset flags
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
# replace ';' with ','
# (variables are quoted so that wildcards such as AC_* are not glob-expanded)
par_interesting_contigs=$(echo "$par_interesting_contigs" | tr ';' ',')
par_viral_contigs=$(echo "$par_viral_contigs" | tr ';' ',')
par_disable_filters=$(echo "$par_disable_filters" | tr ';' ',')
# run arriba
arriba \
-x "$par_bam" \
-a "$par_genome" \
-g "$par_gene_annotation" \
-o "$par_fusions" \
${par_known_fusions:+-k "${par_known_fusions}"} \
${par_blacklist:+-b "${par_blacklist}"} \
${par_structural_variants:+-d "${par_structural_variants}"} \
${par_tags:+-t "${par_tags}"} \
${par_protein_domains:+-p "${par_protein_domains}"} \
${par_fusions_discarded:+-O "${par_fusions_discarded}"} \
${par_max_genomic_breakpoint_distance:+-D "${par_max_genomic_breakpoint_distance}"} \
${par_strandedness:+-s "${par_strandedness}"} \
${par_interesting_contigs:+-i "${par_interesting_contigs}"} \
${par_viral_contigs:+-v "${par_viral_contigs}"} \
${par_disable_filters:+-f "${par_disable_filters}"} \
${par_max_e_value:+-E "${par_max_e_value}"} \
${par_min_supporting_reads:+-S "${par_min_supporting_reads}"} \
${par_max_mismappers:+-m "${par_max_mismappers}"} \
${par_max_homolog_identity:+-L "${par_max_homolog_identity}"} \
${par_homopolymer_length:+-H "${par_homopolymer_length}"} \
${par_read_through_distance:+-R "${par_read_through_distance}"} \
${par_min_anchor_length:+-A "${par_min_anchor_length}"} \
${par_many_spliced_events:+-M "${par_many_spliced_events}"} \
${par_max_kmer_content:+-K "${par_max_kmer_content}"} \
${par_max_mismatch_pvalue:+-V "${par_max_mismatch_pvalue}"} \
${par_fragment_length:+-F "${par_fragment_length}"} \
${par_max_reads:+-U "${par_max_reads}"} \
${par_quantile:+-Q "${par_quantile}"} \
${par_exonic_fraction:+-e "${par_exonic_fraction}"} \
${par_top_n:+-T "${par_top_n}"} \
${par_covered_fraction:+-C "${par_covered_fraction}"} \
${par_max_itd_length:+-l "${par_max_itd_length}"} \
${par_min_itd_allele_fraction:+-z "${par_min_itd_allele_fraction}"} \
${par_min_itd_supporting_reads:+-Z "${par_min_itd_supporting_reads}"} \
${par_skip_duplicate_marking:+-u} \
${par_extra_information:+-X} \
${par_fill_gaps:+-I}
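
The script above relies on the `${var:+...}` expansion so that an option is passed only when its parameter is set. A minimal sketch of how that idiom behaves is given here, assuming bash. `count_args` is a hypothetical helper, and the parameter values are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

par_blacklist="black list.tsv"   # set, and contains a space on purpose
unset par_tags                   # unset: its option should vanish entirely

# Same idiom as in script.sh: each expansion yields either nothing
# or exactly two words (the flag and the quoted value).
n_args=$(count_args \
  ${par_blacklist:+-b "${par_blacklist}"} \
  ${par_tags:+-t "${par_tags}"})

echo "$n_args"   # prints 2
```

The inner double quotes keep a value with spaces as a single word, while the unset parameter contributes no arguments at all.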

45
src/arriba/test.sh Normal file

@@ -0,0 +1,45 @@
#!/bin/bash
set -e
dir_in="$meta_resources_dir/test_data"
echo "> Run arriba with blacklist"
"$meta_executable" \
--bam "$dir_in/A.bam" \
--genome "$dir_in/genome.fasta" \
--gene_annotation "$dir_in/annotation.gtf" \
--blacklist "$dir_in/blacklist.tsv" \
--fusions "fusions.tsv" \
--fusions_discarded "fusions_discarded.tsv" \
--interesting_contigs "1,2"
echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
rm fusions.tsv fusions_discarded.tsv
echo "> Run arriba without blacklist"
"$meta_executable" \
--bam "$dir_in/A.bam" \
--genome "$dir_in/genome.fasta" \
--gene_annotation "$dir_in/annotation.gtf" \
--fusions "fusions.tsv" \
--fusions_discarded "fusions_discarded.tsv" \
--interesting_contigs "1,2" \
--disable_filters blacklist
echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
echo "> Test successful"
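
The test above repeats the same two checks (file exists, file is not empty) for every output. One possible way to factor them out is sketched below. `assert_file_not_empty` is a hypothetical helper that is not part of test.sh, and the temporary files stand in for real arriba output.

```bash
#!/bin/bash
# Hypothetical helper: returns non-zero with a message when the file
# is missing or has zero size.
assert_file_not_empty() {
  local path="$1"
  [ -f "$path" ] || { echo "Output file $path does not exist"; return 1; }
  [ -s "$path" ] || { echo "Output file $path is empty"; return 1; }
}

# Self-contained demonstration on temporary files.
tmp_dir=$(mktemp -d)
printf 'gene1\tgene2\n' > "$tmp_dir/fusions.tsv"
: > "$tmp_dir/empty.tsv"

assert_file_not_empty "$tmp_dir/fusions.tsv" && ok_status="pass"
assert_file_not_empty "$tmp_dir/empty.tsv" || empty_status="fail"
rm -r "$tmp_dir"
```

In a test script running under `set -e`, calling the helper on its own line would abort the test on the first failing check.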

BIN
src/arriba/test_data/A.bam Normal file

Binary file not shown.


@@ -0,0 +1,6 @@
1 havana gene 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; gene_name "A"; gene_source "havana"; gene_biotype "gene";
1 havana transcript 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1 havana exon 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; exon_number "1"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000000"; exon_version "1"; tag "basic"; transcript_support_level "1";
2 havana gene 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; gene_name "B"; gene_source "havana"; gene_biotype "gene";
2 havana transcript 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
2 havana exon 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; exon_number "1"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000001"; exon_version "1"; tag "basic"; transcript_support_level "1";



@@ -0,0 +1,4 @@
>1
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

10
src/arriba/test_data/script.sh Executable file

@@ -0,0 +1,10 @@
# arriba test data
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/arriba/test
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
cp -r /tmp/snakemake-wrappers/bio/arriba/test/* src/arriba/test_data


@@ -0,0 +1,159 @@
name: bcl_convert
description: |
Convert bcl files to fastq files using bcl-convert.
Information about upgrading from bcl2fastq via
[Upgrading from bcl2fastq to BCL Convert](https://emea.support.illumina.com/bulletins/2020/10/upgrading-from-bcl2fastq-to-bcl-convert.html)
and [BCL Convert Compatible Products](https://support.illumina.com/sequencing/sequencing_software/bcl-convert/compatibility.html)
argument_groups:
- name: Input arguments
arguments:
- name: "--bcl_input_directory"
alternatives: ["-i"]
type: file
required: true
description: Input run directory
example: bcl_dir
- name: "--sample_sheet"
alternatives: ["-s"]
type: file
description: Path to SampleSheet.csv file (default searched for in --bcl_input_directory)
example: bcl_dir/sample_sheet.csv
- name: --run_info
type: file
description: Path to RunInfo.xml file (default root of BCL input directory)
example: bcl_dir/RunInfo.xml
- name: Lane and tile settings
arguments:
- name: "--bcl_only_lane"
type: integer
description: Convert only specified lane number (default all lanes)
example: 1
- name: --first_tile_only
type: boolean
description: Only convert first tile of input (for testing & debugging)
example: true
- name: --tiles
type: string
description: Process only a subset of tiles by a regular expression
example: "s_[0-9]+_1"
- name: --exclude_tiles
type: string
description: Exclude set of tiles by a regular expression
example: "s_[0-9]+_1"
- name: Resource arguments
arguments:
- name: --shared_thread_odirect_output
type: boolean
description: Use linux native asynchronous io (io_submit) for file output (Default=false)
example: true
- name: --bcl_num_parallel_tiles
type: integer
description: "\\# of tiles to process in parallel (default 1)"
example: 1
- name: --bcl_num_conversion_threads
type: integer
description: "\\# of threads for conversion (per tile, default # cpu threads)"
example: 1
- name: --bcl_num_compression_threads
type: integer
description: "\\# of threads for fastq.gz output compression (per tile, default # cpu threads, or HW+12)"
example: 1
- name: --bcl_num_decompression_threads
type: integer
description:
"\\# of threads for bcl/cbcl input decompression (per tile, default half # cpu threads, or HW+8).
Only applies when preloading files"
example: 1
- name: Run arguments
arguments:
- name: --bcl_only_matched_reads
type: boolean
description: For pure BCL conversion, do not output files for 'Undetermined' [unmatched] reads (output by default)
example: true
- name: --no_lane_splitting
type: boolean
description: Do not split FASTQ file by lane (false by default)
example: true
- name: --num_unknown_barcodes_reported
type: integer
description: "\\# of Top Unknown Barcodes to output (1000 by default)"
example: 1000
- name: --bcl_validate_sample_sheet_only
type: boolean
description: Only validate RunInfo.xml & SampleSheet files (produce no FASTQ files)
example: true
- name: --strict_mode
type: boolean
description: Abort if any files are missing (false by default)
example: true
- name: --sample_name_column_enabled
type: boolean
description: Use sample sheet 'Sample_Name' column when naming fastq files & subdirectories
example: true
- name: Output arguments
arguments:
- name: "--output_directory"
alternatives: ["-o"]
type: file
direction: output
required: true
description: Output directory containing fastq files
example: fastq_dir
- name: --bcl_sampleproject_subdirectories
type: boolean
description: Output to subdirectories based upon sample sheet 'Sample_Project' column
example: true
- name: --fastq_gzip_compression_level
type: integer
description: Set fastq output compression level 0-9 (default 1)
example: 1
- name: "--reports"
type: file
direction: output
required: false
description: Reports directory
example: reports_dir
- name: "--logs"
type: file
direction: output
required: false
description: Logs directory
example: logs_dir
# bcl-convert arguments not taken into account
# --force
# --output-legacy-stats arg Also output stats in legacy (bcl2fastq2) format (false by default)
# --no-sample-sheet arg Enable legacy no-sample-sheet operation (No demux or trimming. No settings
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: debian:trixie-slim
# https://support.illumina.com/sequencing/sequencing_software/bcl-convert/downloads.html
setup:
- type: apt
packages: [wget, gdb, which, hostname, alien, procps]
- type: docker
run: |
wget https://s3.amazonaws.com/webdata.illumina.com/downloads/software/bcl-convert/bcl-convert-4.2.7-2.el8.x86_64.rpm -O /tmp/bcl-convert.rpm && \
alien -i /tmp/bcl-convert.rpm && \
rm -rf /var/lib/apt/lists/* && \
rm /tmp/bcl-convert.rpm
- type: docker
run: |
echo "bcl-convert: \"$(bcl-convert -V 2>&1 >/dev/null | sed -n '/Version/ s/^bcl-convert\ Version //p')\"" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
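
The version step above uses `2>&1 >/dev/null` so that only stderr reaches `sed`. A minimal sketch of that redirect order follows. `fake_bcl_convert` is a hypothetical stand-in for `bcl-convert -V`, assumed (as the redirect in the config implies) to print its version banner on stderr. The banner text is copied from help.txt, and the backslash before the space in the sed pattern is dropped here for portability.

```bash
#!/bin/bash
# Hypothetical stand-in for `bcl-convert -V`: version banner on stderr,
# unrelated text on stdout.
fake_bcl_convert() {
  echo "unrelated stdout text"
  echo "bcl-convert Version 00.000.000.4.2.7" >&2
}

# `2>&1` first points stderr at the pipe, then `>/dev/null` discards stdout,
# so only the stderr banner is seen by sed.
version=$(fake_bcl_convert 2>&1 >/dev/null \
  | sed -n '/Version/ s/^bcl-convert Version //p')

echo "bcl-convert: \"$version\""
```

Reversing the two redirections (`>/dev/null 2>&1`) would send both streams to /dev/null and leave the version empty.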

38
src/bcl_convert/help.txt Normal file

@@ -0,0 +1,38 @@
bcl-convert Version 00.000.000.4.2.7
Copyright (c) 2014-2022 Illumina, Inc.
Run BCL Conversion (BCL directory to *.fastq.gz)
bcl-convert --bcl-input-directory <BCL_ROOT_DIR> --output-directory <PATH> [options]
Options:
-h [ --help ] Print this help message
-V [ --version ] Print the version and exit
--output-directory arg Output BCL directory for BCL conversion (must be specified)
-f [ --force ] Force: allow destination diretory to already exist
--bcl-input-directory arg Input BCL directory for BCL conversion (must be specified)
--sample-sheet arg Path to SampleSheet.csv file (default searched for in --bcl-input-directory)
--bcl-only-lane arg Convert only specified lane number (default all lanes)
--strict-mode arg Abort if any files are missing (false by default)
--first-tile-only arg Only convert first tile of input (for testing & debugging)
--tiles arg Process only a subset of tiles by a regular expression
--exclude-tiles arg Exclude set of tiles by a regular expression
--bcl-sampleproject-subdirectories arg Output to subdirectories based upon sample sheet 'Sample_Project' column
--sample-name-column-enabled arg Use sample sheet 'Sample_Name' column when naming fastq files & subdirectories
--fastq-gzip-compression-level arg Set fastq output compression level 0-9 (default 1)
--shared-thread-odirect-output arg Use linux native asynchronous io (io_submit) for file output (Default=false)
--bcl-num-parallel-tiles arg # of tiles to process in parallel (default 1)
--bcl-num-conversion-threads arg # of threads for conversion (per tile, default # cpu threads)
--bcl-num-compression-threads arg # of threads for fastq.gz output compression (per tile, default # cpu threads,
or HW+12)
--bcl-num-decompression-threads arg # of threads for bcl/cbcl input decompression (per tile, default half # cpu
threads, or HW+8. Only applies when preloading files)
--bcl-only-matched-reads arg For pure BCL conversion, do not output files for 'Undetermined' [unmatched]
reads (output by default)
--run-info arg Path to RunInfo.xml file (default root of BCL input directory)
--no-lane-splitting arg Do not split FASTQ file by lane (false by default)
--num-unknown-barcodes-reported arg # of Top Unknown Barcodes to output (1000 by default)
--bcl-validate-sample-sheet-only arg Only validate RunInfo.xml & SampleSheet files (produce no FASTQ files)
--output-legacy-stats arg Also output stats in legacy (bcl2fastq2) format (false by default)
--no-sample-sheet arg Enable legacy no-sample-sheet operation (No demux or trimming. No settings
supported. False by default, not recommended

40
src/bcl_convert/script.sh Normal file

@@ -0,0 +1,40 @@
#!/bin/bash
set -eo pipefail
$(which bcl-convert) \
--bcl-input-directory "$par_bcl_input_directory" \
--output-directory "$par_output_directory" \
${par_sample_sheet:+ --sample-sheet "$par_sample_sheet"} \
${par_run_info:+ --run-info "$par_run_info"} \
${par_bcl_only_lane:+ --bcl-only-lane "$par_bcl_only_lane"} \
${par_first_tile_only:+ --first-tile-only "$par_first_tile_only"} \
${par_tiles:+ --tiles "$par_tiles"} \
${par_exclude_tiles:+ --exclude-tiles "$par_exclude_tiles"} \
${par_shared_thread_odirect_output:+ --shared-thread-odirect-output "$par_shared_thread_odirect_output"} \
${par_bcl_num_parallel_tiles:+ --bcl-num-parallel-tiles "$par_bcl_num_parallel_tiles"} \
${par_bcl_num_conversion_threads:+ --bcl-num-conversion-threads "$par_bcl_num_conversion_threads"} \
${par_bcl_num_compression_threads:+ --bcl-num-compression-threads "$par_bcl_num_compression_threads"} \
${par_bcl_num_decompression_threads:+ --bcl-num-decompression-threads "$par_bcl_num_decompression_threads"} \
${par_bcl_only_matched_reads:+ --bcl-only-matched-reads "$par_bcl_only_matched_reads"} \
${par_no_lane_splitting:+ --no-lane-splitting "$par_no_lane_splitting"} \
${par_num_unknown_barcodes_reported:+ --num-unknown-barcodes-reported "$par_num_unknown_barcodes_reported"} \
${par_bcl_validate_sample_sheet_only:+ --bcl-validate-sample-sheet-only "$par_bcl_validate_sample_sheet_only"} \
${par_strict_mode:+ --strict-mode "$par_strict_mode"} \
${par_sample_name_column_enabled:+ --sample-name-column-enabled "$par_sample_name_column_enabled"} \
${par_bcl_sampleproject_subdirectories:+ --bcl-sampleproject-subdirectories "$par_bcl_sampleproject_subdirectories"} \
${par_fastq_gzip_compression_level:+ --fastq-gzip-compression-level "$par_fastq_gzip_compression_level"}
if [ ! -z "$par_reports" ]; then
echo "Moving reports to their own location"
mv "${par_output_directory}/Reports" "$par_reports"
else
echo "Leaving reports alone"
fi
if [ ! -z "$par_logs" ]; then
echo "Moving logs to their own location"
mv "${par_output_directory}/Logs" "$par_logs"
else
echo "Leaving logs alone"
fi

70
src/bcl_convert/test.sh Normal file

@@ -0,0 +1,70 @@
#!/bin/bash
# Tests are sourced from:
# https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/inputs/cr-direct-demultiplexing-bcl-convert
# Test input files are fetched from:
# https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/iseq-DI.tar.gz
# https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/bcl_convert_samplesheet.csv
set -eo pipefail
echo ">> Fetching and preparing test data"
data_src="https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/iseq-DI.tar.gz"
sample_sheet_src="https://cf.10xgenomics.com/supp/spatial-exp/demultiplexing/bcl_convert_samplesheet.csv"
test_data_dir="test_data"
mkdir $test_data_dir
wget -q $data_src -O $test_data_dir/data.tar.gz
wget -q $sample_sheet_src -O $test_data_dir/sample_sheet.csv
tar xzf $test_data_dir/data.tar.gz -C $test_data_dir
rm $test_data_dir/data.tar.gz
echo ">> Execute and verify output"
$meta_executable \
--bcl_input_directory "$test_data_dir/iseq-DI" \
--sample_sheet "$test_data_dir/sample_sheet.csv" \
--output_directory fastq \
--reports reports \
--logs logs
echo ">>> Checking whether the output dir exists"
[[ ! -d fastq ]] && echo "Output dir could not be found!" && exit 1
echo ">>> Checking whether output fastq files are created"
[[ ! -f fastq/Undetermined_S0_L001_R1_001.fastq.gz ]] && echo "Output fastq files could not be found!" && exit 1
[[ ! -f fastq/iseq-DI_S1_L001_R1_001.fastq.gz ]] && echo "Output fastq files could not be found!" && exit 1
echo ">>> Checking whether the report dir exists"
[[ ! -d reports ]] && echo "Reports dir could not be found!" && exit 1
echo ">>> Checking whether the log dir exists"
[[ ! -d logs ]] && echo "Logs dir could not be found!" && exit 1
# print final message
echo ">>> Test finished successfully"
echo ">> Execute with additional arguments and verify output"
$meta_executable \
--bcl_input_directory "$test_data_dir/iseq-DI" \
--sample_sheet "$test_data_dir/sample_sheet.csv" \
--output_directory fastq1 \
--bcl_only_matched_reads true \
--bcl_num_compression_threads 1 \
--no_lane_splitting false \
--fastq_gzip_compression_level 9
echo ">> Checking whether the output dir exists"
[[ ! -d fastq1 ]] && echo "Output dir could not be found!" && exit 1
echo ">> Checking whether output fastq files are created"
[[ -f fastq1/Undetermined_S0_L001_R1_001.fastq.gz ]] && echo "Undetermined should not be generated!" && exit 1
[[ ! -f fastq1/iseq-DI_S1_L001_R1_001.fastq.gz ]] && echo "Output fastq files could not be found!" && exit 1
# print final message
echo ">> Test finished successfully"
# do not remove this
# as otherwise your test might exit with a different exit code
exit 0
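
The checks in this test use the form `[[ ! -f file ]] && echo ... && exit 1`. When the file exists, that list itself returns a non-zero status, which is why the script ends with an explicit `exit 0`. A minimal sketch of an alternative, assuming two hypothetical helper functions (`assert_file_exists` and `assert_file_not_empty` are not part of this repository), could look like this:

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of the component's test script.

# Succeeds if the path is a regular file; otherwise reports to stderr and returns 1.
assert_file_exists() {
  if [[ ! -f "$1" ]]; then
    echo "File not found: $1" >&2
    return 1
  fi
}

# Succeeds if the path exists and has a size greater than zero.
assert_file_not_empty() {
  if [[ ! -s "$1" ]]; then
    echo "File is missing or empty: $1" >&2
    return 1
  fi
}

# Demo on a throwaway file created with mktemp.
demo_file=$(mktemp)
echo "read" > "$demo_file"
assert_file_exists "$demo_file" && assert_file_not_empty "$demo_file" && echo "checks passed"
rm -f "$demo_file"
```

Because each helper returns 0 on success, a test script built from them (for example `assert_file_exists fastq/some_file.fastq.gz || exit 1`) would not depend on a trailing `exit 0` to report success.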


@@ -0,0 +1,103 @@
name: bedtools_getfasta
namespace: bedtools
description: Extract sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.
keywords: [sequencing, fasta, BED, GFF, VCF]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
repository: https://github.com/arq5x/bedtools2
references:
doi: 10.1093/bioinformatics/btq033
license: GPL-2.0
requirements:
commands: [bedtools]
argument_groups:
- name: Input arguments
arguments:
- name: --input_fasta
type: file
description: |
FASTA file containing sequences for each interval specified in the input BED file.
The headers in the input FASTA file must exactly match the chromosome column in the BED file.
- name: "--input_bed"
type: file
description: |
BED file containing intervals to extract from the FASTA file.
BED files containing a single region require a newline character
at the end of the line, otherwise a blank output file is produced.
- name: --rna
type: boolean_true
description: |
The FASTA is RNA not DNA. Reverse complementation handled accordingly.
- name: Run arguments
arguments:
- name: "--strandedness"
type: boolean_true
alternatives: ["-s"]
description: |
Force strandedness. If the feature occupies the antisense strand, the output sequence will
be reverse complemented. By default strandedness is not taken into account.
- name: Output arguments
arguments:
- name: --output
alternatives: [-o]
required: true
type: file
direction: output
description: |
Output file where the output from the 'bedtools getfasta' command will
be written to.
- name: --tab
type: boolean_true
description: |
Report extracted sequences in a tab-delimited format instead of in FASTA format.
- name: --bed_out
type: boolean_true
description: |
Report extracted sequences in a tab-delimited BED format instead of in FASTA format.
- name: "--name"
type: boolean_true
description: |
Set the FASTA header for each extracted sequence to be the "name" and coordinate columns from the BED feature.
- name: "--name_only"
type: boolean_true
description: |
Set the FASTA header for each extracted sequence to be the "name" column from the BED feature.
- name: "--split"
type: boolean_true
description: |
When --input_bed is in BED12 format, create a separate FASTA entry for each block in a BED12 record,
blocks being described in the 11th and 12th columns of the BED file.
- name: "--full_header"
type: boolean_true
description: |
Use full fasta header. By default, only the word before the first space or tab is used.
# Arguments not taken into account:
#
# -fo Specify an output file name. By default, output goes to stdout.
#
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: debian:stable-slim
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,22 @@
#!/usr/bin/env bash
set -eo pipefail
unset_if_false=( par_rna par_strandedness par_tab par_bed_out par_name par_name_only par_split par_full_header )
for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done
bedtools getfasta \
-fi "$par_input_fasta" \
-bed "$par_input_bed" \
${par_rna:+-rna} \
${par_name:+-name} \
${par_name_only:+-nameOnly} \
${par_tab:+-tab} \
${par_bed_out:+-bedOut} \
${par_strandedness:+-s} \
${par_split:+-split} \
${par_full_header:+-fullHeader} > "$par_output"
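
The script above relies on two bash features: indirect expansion (`${!par}` reads the variable whose name is stored in `par`) and the `${var:+word}` expansion, which yields `word` only when `var` is set and non-empty. The following is a minimal sketch of the same idea, assuming a hypothetical helper `build_flags` and illustrative `name=flag` pairs; it does not call bedtools.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the component): turn boolean par_* variables
# into command-line flags, mirroring the unset-if-false / ${var:+flag} idiom.
build_flags() {
  local -a flags=()
  local spec par flag
  # Each argument is "variable_name=flag"; both parts are illustrative.
  for spec in "$@"; do
    par="${spec%%=*}"
    flag="${spec#*=}"
    # ${!par:-} is indirect expansion with an empty default for unset variables.
    if [[ -n "${!par:-}" && "${!par}" != "false" ]]; then
      flags+=("$flag")
    fi
  done
  printf '%s\n' "${flags[*]}"
}

par_rna=true
par_tab=false
unset par_split
# Only par_rna is "true", so only its flag is emitted.
build_flags par_rna=-rna par_tab=-tab par_split=-split
```

Unsetting the variables that hold "false" is what makes `${par_rna:+-rna}` safe in the original script, because the string "false" is itself non-empty and would otherwise still produce the flag.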


@@ -0,0 +1,119 @@
#!/usr/bin/env bash
set -eo pipefail
TMPDIR=$(mktemp -d)
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
}
trap clean_up EXIT
# Create dummy test fasta file
cat > "$TMPDIR/test.fa" <<EOF
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
EOF
TAB="$(printf '\t')"
# Create dummy bed file
cat > "$TMPDIR/test.bed" <<EOF
chr1${TAB}5${TAB}10${TAB}myseq
EOF
# Create expected fasta file
cat > "$TMPDIR/expected.fasta" <<EOF
>chr1:5-10
AAACC
EOF
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--output "$TMPDIR/output.fasta"
cmp --silent "$TMPDIR/output.fasta" "$TMPDIR/expected.fasta" || { echo "Files are different."; exit 1; }
# Create expected fasta file for --name
cat > "$TMPDIR/expected_with_name.fasta" <<EOF
>myseq::chr1:5-10
AAACC
EOF
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--name \
--output "$TMPDIR/output_with_name.fasta"
cmp --silent "$TMPDIR/output_with_name.fasta" "$TMPDIR/expected_with_name.fasta" || { echo "Files when using --name are different."; exit 1; }
# Create expected fasta file for --name_only
cat > "$TMPDIR/expected_with_name_only.fasta" <<EOF
>myseq
AAACC
EOF
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--name_only \
--output "$TMPDIR/output_with_name_only.fasta"
cmp --silent "$TMPDIR/output_with_name_only.fasta" "$TMPDIR/expected_with_name_only.fasta" || { echo "Files when using --name_only are different."; exit 1; }
# Create expected tab-delimited file for --tab
cat > "$TMPDIR/expected_tab.out" <<EOF
myseq${TAB}AAACC
EOF
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--name_only \
--tab \
--output "$TMPDIR/tab.out"
cmp --silent "$TMPDIR/expected_tab.out" "$TMPDIR/tab.out" || { echo "Files when using --tab are different."; exit 1; }
# Create expected tab-delimited file for --bed_out
cat > "$TMPDIR/expected.bed" <<EOF
chr1${TAB}5${TAB}10${TAB}myseq${TAB}AAACC
EOF
"$meta_executable" \
--input_bed "$TMPDIR/test.bed" \
--input_fasta "$TMPDIR/test.fa" \
--bed_out \
--output "$TMPDIR/output.bed"
cmp --silent "$TMPDIR/expected.bed" "$TMPDIR/output.bed" || { echo "Files when using --bed_out are different."; exit 1; }
# Create dummy bed file for strandedness
cat > "$TMPDIR/test_strandedness.bed" <<EOF
chr1${TAB}20${TAB}25${TAB}forward${TAB}1${TAB}+
chr1${TAB}20${TAB}25${TAB}reverse${TAB}1${TAB}-
EOF
# Create expected fasta file for -s (strandedness)
cat > "$TMPDIR/expected_strandedness.fasta" <<EOF
>forward(+)
CGCTA
>reverse(-)
TAGCG
EOF
"$meta_executable" \
--input_bed "$TMPDIR/test_strandedness.bed" \
--input_fasta "$TMPDIR/test.fa" \
-s \
--name_only \
--output "$TMPDIR/output_strandedness.fasta"
cmp --silent "$TMPDIR/expected_strandedness.fasta" "$TMPDIR/output_strandedness.fasta" || { echo "Files when using -s are different."; exit 1; }
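
The expected outputs in this test follow from two conventions: BED intervals are 0-based and half-open (start included, end excluded), and features on the minus strand are reverse complemented when `-s` is set. The sketch below reproduces both with plain bash string operations for a single-line sequence. The functions `extract` and `revcomp` are hypothetical names used for illustration only; the sketch does not call bedtools.

```bash
#!/usr/bin/env bash
# Illustrative only: mimic the coordinate and strand handling that the tests
# expect, using bash substring expansion instead of 'bedtools getfasta'.

# Print the bases of sequence $1 in the BED interval [start, end).
extract() {
  local seq="$1" start="$2" end="$3"
  local len=$(( end - start ))
  printf '%s\n' "${seq:$start:$len}"
}

# Print the reverse complement of a DNA string.
revcomp() {
  local s="$1" out="" i
  # Walk the string backwards to reverse it.
  for (( i = ${#s} - 1; i >= 0; i-- )); do
    out+="${s:i:1}"
  done
  # Swap each base for its complement.
  printf '%s\n' "$out" | tr 'ACGTacgt' 'TGCAtgca'
}

# Same sequence as the dummy FASTA file in the test.
seq="AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"
extract "$seq" 5 10                 # interval of the 'myseq' feature
revcomp "$(extract "$seq" 20 25)"   # interval of the 'reverse' feature
```

The two printed lines correspond to the sequences the test expects for the `myseq` feature and for the minus-strand `reverse` feature.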


@@ -0,0 +1,47 @@
name: busco_download_datasets
namespace: busco
description: Downloads available busco datasets
keywords: [lineage datasets]
links:
homepage: https://busco.ezlab.org/
documentation: https://busco.ezlab.org/busco_userguide.html
repository: https://gitlab.com/ezlab/busco
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
argument_groups:
- name: Inputs
arguments:
- name: --download
type: string
description: |
Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus".
The full list of available datasets can be viewed [here](https://busco-data.ezlab.org/v5/data/lineages/) or by running the busco/busco_list_datasets component.
required: true
example: stramenopiles_odb10
- name: Outputs
arguments:
- name: --download_path
direction: output
type: file
description: |
Local filepath for storing BUSCO dataset downloads
required: false
default: busco_downloads
example: busco_downloads
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: quay.io/biocontainers/busco:5.7.1--pyhdfd78af_0
setup:
- type: docker
run: |
busco --version | sed 's/BUSCO\s\(.*\)/busco: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,14 @@
#!/bin/bash
## VIASH START
## VIASH END
if [ ! -d "$par_download_path" ]; then
mkdir -p "$par_download_path"
fi
busco \
--download_path "$par_download_path" \
--download "$par_download"


@@ -0,0 +1,15 @@
echo "> Downloading busco stramenopiles_odb10 dataset"
"$meta_executable" \
--download stramenopiles_odb10 \
--download_path downloads
echo ">> Checking output"
[ ! -f "downloads/file_versions.tsv" ] && echo "file_versions.tsv does not exist" && exit 1
[ ! -f "downloads/lineages/stramenopiles_odb10/dataset.cfg" ] && echo "dataset.cfg does not exist" && exit 1
echo ">> Checking if output is empty"
[ ! -s "downloads/file_versions.tsv" ] && echo "file_versions.tsv is empty" && exit 1
[ ! -s "downloads/lineages/stramenopiles_odb10/dataset.cfg" ] && echo "dataset.cfg is empty" && exit 1
rm -r downloads


@@ -0,0 +1,39 @@
name: busco_list_datasets
namespace: busco
description: Lists the available busco datasets
keywords: [lineage datasets]
links:
homepage: https://busco.ezlab.org/
documentation: https://busco.ezlab.org/busco_userguide.html
repository: https://gitlab.com/ezlab/busco
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
argument_groups:
- name: Outputs
arguments:
- name: --output
alternatives: ["-o"]
direction: output
type: file
description: |
Output file of the available busco datasets
required: false
default: busco_dataset_list.txt
example: file.txt
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: quay.io/biocontainers/busco:5.7.1--pyhdfd78af_0
setup:
- type: docker
run: |
busco --version | sed 's/BUSCO\s\(.*\)/busco: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,6 @@
#!/bin/bash
## VIASH START
## VIASH END
busco --list-datasets | awk '/^#{40}/{flag=1; next} flag{print}' > "$par_output"


@@ -0,0 +1,15 @@
#!/bin/bash
## VIASH START
## VIASH END
"$meta_executable" \
--output datasets.txt
echo ">> Checking output"
[ ! -f "datasets.txt" ] && echo "datasets.txt does not exist" && exit 1
echo ">> Checking if output is empty"
[ ! -s "datasets.txt" ] && echo "datasets.txt is empty" && exit 1
rm datasets.txt


@@ -0,0 +1,218 @@
name: busco_run
namespace: busco
description: Assessment of genome assembly and annotation completeness with single copy orthologs
keywords: [Genome assembly, quality control]
links:
homepage: https://busco.ezlab.org/
documentation: https://busco.ezlab.org/busco_userguide.html
repository: https://gitlab.com/ezlab/busco
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: ["-i"]
type: file
description: |
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
required: true
example: file.fasta
- name: --mode
alternatives: ["-m"]
type: string
choices: [genome, geno, transcriptome, tran, proteins, prot]
required: true
description: |
Specify which BUSCO analysis mode to run. There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
example: proteins
- name: --lineage_dataset
alternatives: ["-l"]
type: string
required: false
description: |
Specify a BUSCO lineage dataset that is most closely related to the assembly or gene set being assessed.
The full list of available datasets can be viewed [here](https://busco-data.ezlab.org/v5/data/lineages/) or by running the busco/busco_list_datasets component.
When unsure, the "--auto_lineage" flag can be set to automatically find the optimal lineage path.
BUSCO will automatically download the requested dataset if it is not already present in the download folder.
You can optionally provide a path to a local dataset instead of a name, e.g. path/to/dataset.
Datasets can be downloaded using the busco/busco_download_datasets component.
example: stramenopiles_odb10
- name: Outputs
arguments:
- name: --short_summary_json
required: false
direction: output
type: file
example: short_summary.json
description: |
Output file for short summary in JSON format.
- name: --short_summary_txt
required: false
direction: output
type: file
example: short_summary.txt
description: |
Output file for short summary in TXT format.
- name: --full_table
required: false
direction: output
type: file
example: full_table.tsv
description: |
Full table output in TSV format.
- name: --missing_busco_list
required: false
direction: output
type: file
example: missing_busco_list.tsv
description: |
Missing list output in TSV format.
- name: --output_dir
required: false
direction: output
type: file
example: output_dir/
description: |
The full output directory, if so desired.
- name: Resource and Run Settings
arguments:
- name: --force
type: boolean_true
description: |
Force rewriting of existing files. Must be used when output files with the provided name already exist.
- name: --quiet
alternatives: ["-q"]
type: boolean_true
description: |
Disable the info logs, displays only errors.
- name: --restart
alternatives: ["-r"]
type: boolean_true
description: |
Continue a run that had already partially completed. Restarting skips calls to tools that have completed but performs all pre- and post-processing steps.
- name: --tar
type: boolean_true
description: |
Compress some subdirectories with many files to save space.
- name: Lineage Dataset Settings
arguments:
- name: --auto_lineage
type: boolean_true
description: |
Run auto-lineage pipeline to automatically determine the BUSCO lineage dataset that is most closely related to the assembly or gene set being assessed.
- name: --auto_lineage_euk
type: boolean_true
description: |
Run auto-placement just on eukaryota tree to find optimal lineage path.
- name: --auto_lineage_prok
type: boolean_true
description: |
Run auto_lineage just on prokaryota trees to find optimum lineage path.
- name: --datasets_version
type: string
required: false
description: |
Specify the version of BUSCO datasets
example: odb10
- name: Augustus Settings
arguments:
- name: --augustus
type: boolean_true
description: |
Use augustus gene predictor for eukaryote runs.
- name: --augustus_parameters
type: string
required: false
description: |
Additional parameters to be passed to Augustus (see Augustus documentation: https://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md).
Parameters should be contained within a single string, without whitespace and separated by commas.
example: "--PARAM1=VALUE1,--PARAM2=VALUE2"
- name: --augustus_species
type: string
required: false
description: |
Specify the augustus species
- name: --long
type: boolean_true
description: |
Optimize Augustus self-training mode. This adds considerably to the run time, but can improve results for some non-model organisms.
- name: BBTools Settings
arguments:
- name: --contig_break
type: integer
required: false
description: |
Number of contiguous Ns to signify a break between contigs in BBTools analysis.
- name: --limit
type: integer
required: false
description: |
Number of candidate regions (contig or transcript) from the BLAST output to consider per BUSCO.
This option is only effective in pipelines using BLAST, i.e. the genome pipeline (see --augustus) or the prokaryota transcriptome pipeline.
- name: --scaffold_composition
type: boolean_true
description: |
Writes ACGTN content per scaffold to a file scaffold_composition.txt.
- name: BLAST Settings
arguments:
- name: --e_value
type: double
required: false
description: |
E-value cutoff for BLAST searches.
- name: Protein Gene Prediction settings
arguments:
- name: --miniprot
type: boolean_true
description: |
Use Miniprot gene predictor.
- name: MetaEuk Settings
arguments:
- name: --metaeuk
type: boolean_true
description: |
Use Metaeuk gene predictor.
- name: --metaeuk_parameters
type: string
description: |
Pass additional arguments to Metaeuk for the first run (see Metaeuk documentation https://github.com/soedinglab/metaeuk).
All parameters should be contained within a single string with no white space, with each parameter separated by a comma.
example: "--max-overlap=15,--min-exon-aa=15"
- name: --metaeuk_rerun_parameters
type: string
description: |
Pass additional arguments to Metaeuk for the second run (see Metaeuk documentation https://github.com/soedinglab/metaeuk).
All parameters should be contained within a single string with no white space, with each parameter separated by a comma.
example: "--max-overlap=15,--min-exon-aa=15"
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/busco:5.7.1--pyhdfd78af_0
setup:
- type: docker
run: |
busco --version | sed 's/BUSCO\s\(.*\)/busco: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,63 @@
```bash
busco -h
```
usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
Welcome to BUSCO 5.7.1: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide. Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO
optional arguments:
-i SEQUENCE_FILE, --in SEQUENCE_FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with --out_path.
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
-l LINEAGE, --lineage_dataset LINEAGE
Specify the name of the BUSCO lineage to be used.
--augustus Use augustus gene predictor for eukaryote runs
--augustus_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--augustus_species AUGUSTUS_SPECIES
Specify a species for Augustus training.
--auto-lineage Run auto-lineage to find optimum lineage path
--auto-lineage-euk Run auto-placement just on eukaryote tree to find optimum lineage path
--auto-lineage-prok Run auto-lineage just on non-eukaryote trees to find optimum lineage path
-c N, --cpu N Specify the number (N=integer) of threads/cores to use.
--config CONFIG_FILE Provide a config file
--contig_break n Number of contiguous Ns to signify a break between contigs. Default is n=10.
--datasets_version DATASETS_VERSION
Specify the version of BUSCO datasets, e.g. odb10
--download [dataset [dataset ...]]
Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus". If used together with other command line arguments, make sure to place this last.
--download_base_url DOWNLOAD_BASE_URL
Set the url to the remote BUSCO dataset location
--download_path DOWNLOAD_PATH
Specify local filepath for storing BUSCO dataset downloads
-e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
-f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
-h, --help Show this help message and exit
--limit N How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
--list-datasets Print the list of available BUSCO datasets
--long Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
--metaeuk Use Metaeuk gene predictor
--metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--metaeuk_rerun_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--miniprot Use Miniprot gene predictor
--skip_bbtools Skip BBTools for assembly statistics
--offline To indicate that BUSCO cannot attempt to download files
--opt-out-run-stats Opt out of data collection. Information on the data collected is available in the user guide.
--out_path OUTPUT_PATH
Optional location for results folder, excluding results folder name. Default is current working directory.
-q, --quiet Disable the info logs, displays only errors
-r, --restart Continue a run that had already partially completed.
--scaffold_composition
Writes ACGTN content per scaffold to a file scaffold_composition.txt
--tar Compress some subdirectories with many files to save space
-v, --version Show this version and exit


@@ -0,0 +1,72 @@
#!/bin/bash
## VIASH START
## VIASH END
[[ "$par_tar" == "false" ]] && unset par_tar
[[ "$par_force" == "false" ]] && unset par_force
[[ "$par_quiet" == "false" ]] && unset par_quiet
[[ "$par_restart" == "false" ]] && unset par_restart
[[ "$par_auto_lineage" == "false" ]] && unset par_auto_lineage
[[ "$par_auto_lineage_euk" == "false" ]] && unset par_auto_lineage_euk
[[ "$par_auto_lineage_prok" == "false" ]] && unset par_auto_lineage_prok
[[ "$par_augustus" == "false" ]] && unset par_augustus
[[ "$par_long" == "false" ]] && unset par_long
[[ "$par_scaffold_composition" == "false" ]] && unset par_scaffold_composition
[[ "$par_miniprot" == "false" ]] && unset par_miniprot
[[ "$par_metaeuk" == "false" ]] && unset par_metaeuk
tmp_dir=$(mktemp -d -p "$meta_temp_dir" busco_XXXXXXXXX)
prefix=$(openssl rand -hex 8)
busco \
--in "$par_input" \
--mode "$par_mode" \
--out "$prefix" \
--out_path "$tmp_dir" \
--opt-out-run-stats \
${meta_cpus:+--cpu "${meta_cpus}"} \
${par_lineage_dataset:+--lineage_dataset "$par_lineage_dataset"} \
${par_augustus:+--augustus} \
${par_augustus_parameters:+--augustus_parameters "$par_augustus_parameters"} \
${par_augustus_species:+--augustus_species "$par_augustus_species"} \
${par_auto_lineage:+--auto-lineage} \
${par_auto_lineage_euk:+--auto-lineage-euk} \
${par_auto_lineage_prok:+--auto-lineage-prok} \
${par_contig_break:+--contig_break $par_contig_break} \
${par_datasets_version:+--datasets_version "$par_datasets_version"} \
${par_e_value:+--evalue "$par_e_value"} \
${par_force:+--force} \
${par_limit:+--limit "$par_limit"} \
${par_long:+--long} \
${par_metaeuk:+--metaeuk} \
${par_metaeuk_parameters:+--metaeuk_parameters "$par_metaeuk_parameters"} \
${par_metaeuk_rerun_parameters:+--metaeuk_rerun_parameters "$par_metaeuk_rerun_parameters"} \
${par_miniprot:+--miniprot} \
${par_quiet:+--quiet} \
${par_restart:+--restart} \
${par_scaffold_composition:+--scaffold_composition} \
${par_tar:+--tar}
out_dir=$(find "$tmp_dir/$prefix" -maxdepth 1 -name 'run_*')
if [[ -n "$par_short_summary_json" ]]; then
cp "$out_dir/short_summary.json" "$par_short_summary_json"
fi
if [[ -n "$par_short_summary_txt" ]]; then
cp "$out_dir/short_summary.txt" "$par_short_summary_txt"
fi
if [[ -n "$par_full_table" ]]; then
cp "$out_dir/full_table.tsv" "$par_full_table"
fi
if [[ -n "$par_missing_busco_list" ]]; then
cp "$out_dir/missing_busco_list.tsv" "$par_missing_busco_list"
fi
if [[ -n "$par_output_dir" ]]; then
if [[ -d "$par_output_dir" ]]; then
rm -r "$par_output_dir"
fi
cp -r -L "$out_dir" "$par_output_dir"
fi
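
The optional outputs above all follow one pattern: copy a file from the BUSCO run directory only when the matching output parameter is non-empty. A minimal sketch of that pattern as a single function, assuming a hypothetical helper named `copy_if_requested` that is not part of this script:

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a result file only when a destination was requested.
copy_if_requested() {
  local src="$1" dest="$2"
  # An empty destination means the optional output was not requested.
  [[ -z "$dest" ]] && return 0
  if [[ ! -f "$src" ]]; then
    echo "Expected output not found: $src" >&2
    return 1
  fi
  cp "$src" "$dest"
}

# Demo in a throwaway directory; the file names mirror those used above.
demo_dir=$(mktemp -d)
echo '{}' > "$demo_dir/short_summary.json"
copy_if_requested "$demo_dir/short_summary.json" "$demo_dir/copy.json"
copy_if_requested "$demo_dir/full_table.tsv" ""   # not requested: nothing happens
ls "$demo_dir"
rm -r "$demo_dir"
```

Compared with the repeated `if` blocks, the helper also reports a missing source file explicitly instead of leaving the failure message to `cp`; that extra check is a design choice of this sketch, not something the original script does.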


@@ -0,0 +1,88 @@
test_dir="$meta_resources_dir/test_data"
mkdir "run_prot_stramenopiles"
cd "run_prot_stramenopiles"
echo "> Running busco with lineage dataset"
"$meta_executable" \
--input $test_dir/protein.fasta \
--mode proteins \
--lineage_dataset stramenopiles_odb10 \
--output_dir output \
--short_summary_json short_summary.json \
--short_summary_txt short_summary.txt \
--full_table full_table.tsv \
--missing_busco_list missing_busco_list.tsv
echo ">> Checking output"
[ ! -f "output/full_table.tsv" ] && echo "full_table.tsv does not exist" && exit 1
[ ! -f "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist" && exit 1
[ ! -f "output/short_summary.json" ] && echo "short_summary.json does not exist" && exit 1
[ ! -f "output/short_summary.txt" ] && echo "short_summary.txt does not exist" && exit 1
[ ! -f "full_table.tsv" ] && echo "full_table.tsv does not exist" && exit 1
[ ! -f "missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist" && exit 1
[ ! -f "short_summary.json" ] && echo "short_summary.json does not exist" && exit 1
[ ! -f "short_summary.txt" ] && echo "short_summary.txt does not exist" && exit 1
echo ">> Checking if output is empty"
[ ! -s "output/full_table.tsv" ] && echo "full_table.tsv is empty" && exit 1
[ ! -s "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv is empty" && exit 1
[ ! -s "output/short_summary.json" ] && echo "short_summary.json is empty" && exit 1
[ ! -s "output/short_summary.txt" ] && echo "short_summary.txt is empty" && exit 1
[ ! -s "full_table.tsv" ] && echo "full_table.tsv is empty" && exit 1
[ ! -s "missing_busco_list.tsv" ] && echo "missing_busco_list.tsv is empty" && exit 1
[ ! -s "short_summary.json" ] && echo "short_summary.json is empty" && exit 1
[ ! -s "short_summary.txt" ] && echo "short_summary.txt is empty" && exit 1
cd ..
mkdir "run_prot_autolineage"
cd "run_prot_autolineage"
echo "> Running busco with auto lineage"
"$meta_executable" \
--input $test_dir/protein.fasta \
--mode proteins \
--auto_lineage \
--output_dir output
echo ">> Checking output"
[ ! -f "output/full_table.tsv" ] && echo "full_table.tsv does not exist in output folder" && exit 1
[ ! -f "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist in output folder" && exit 1
[ ! -f "output/short_summary.json" ] && echo "short_summary.json does not exist in output folder" && exit 1
[ ! -f "output/short_summary.txt" ] && echo "short_summary.txt does not exist in output folder" && exit 1
echo ">> Checking if output is empty"
[ ! -s "output/full_table.tsv" ] && echo "full_table.tsv in output folder is empty" && exit 1
[ ! -s "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv in output folder is empty" && exit 1
[ ! -s "output/short_summary.json" ] && echo "short_summary.json in output folder is empty" && exit 1
[ ! -s "output/short_summary.txt" ] && echo "short_summary.txt in output folder is empty" && exit 1
rm -r output/
cd ..
mkdir "run_genome"
cd "run_genome"
echo "> Running busco with genome data"
"$meta_executable" \
--input $test_dir/genome.fna \
--mode genome \
--lineage_dataset saccharomycetes_odb10 \
--output_dir output
echo ">> Checking output"
[ ! -f "output/full_table.tsv" ] && echo "full_table.tsv does not exist in output folder" && exit 1
[ ! -f "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist in output folder" && exit 1
[ ! -f "output/short_summary.json" ] && echo "short_summary.json does not exist in output folder" && exit 1
[ ! -f "output/short_summary.txt" ] && echo "short_summary.txt does not exist in output folder" && exit 1
echo ">> Checking if output is empty"
[ ! -s "output/full_table.tsv" ] && echo "full_table.tsv in output folder is empty" && exit 1
[ ! -s "output/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv in output folder is empty" && exit 1
[ ! -s "output/short_summary.json" ] && echo "short_summary.json in output folder is empty" && exit 1
[ ! -s "output/short_summary.txt" ] && echo "short_summary.txt in output folder is empty" && exit 1
rm -r output/

File diff suppressed because it is too large


@@ -0,0 +1,64 @@
>341721at2759_1001832_1:000010
MASRPVKKRKLTPPGDDEASSRKSGGKIQKAFLKNAANWDLEQDYETRARKGKKKEKESTRLPLKLPGGRVQHVSAPDNDFQAIESDEDWLDGAEDVSEDEESKDKKAPEEPEKPEHEQILEAKEELAKIALMLNESPDENTGAFKALAKIGQSRIITIKKLALATQLTVYKDVIPGYRIRPVAEDGPEEKLSKDVRKLRTYETCLISGYQAYVKELTKHAKTGHANGLASVAITCACNLLTAVPHFNFRSDLVKILVGKLSTRRVDDDFNKCLQALETLFEEDEEGRPSMEAVSLLSKMMKAREYQVNESVVNLFLHLRLLSDFSGKGSKDSVDRMDDGPSKKPKSKREFRTKRERKQIKEQKALQKDMAQADALVQHEERDRMEGETLKLVFGTYFRVLKMRVPHLMGAVLEGLSKYAHLINQNFFGDLLEALKDLIRHSDASEKDDAEEKEDEEADDDAPVRNPSREALLCTTTAFALLAGQDAHNARADLHLDLSFFTTHLYQSLFPLSLHPDLELGARSLHLPDPDKPSQNRKSNSSNKVNLQTTTVLLIRCLTAVLLPPWNVRSVPPVRLAAFAKQLMTAALHVPEKSAQALLALLADVAGTHGRRIAALWNTEERKGDGAFNPLAESAEASNPFAATVWEGEILRRHYCPAVRRGVGIVEKSLSLAER
>296129at2759_1069680_1:000010
MMKKKQIDSRIPTLIKNGVQEKKRTLFVIVGDRGRDQIVNLHWLLSQTRIASRPSVLWMYKKDLLGFTSHRKKREAKIKKEIKKGIRDPNEATTPFELFISVTNIRYTYYKESEKILGQTFGMLVLQDFEAITPNLLARTIETVEGGGIIVILFKTMENLKQLYTMTMDIHSRYRTEAHQDVVARFNGRFILSLGHCSSCLFVDDELNVLPISEAKKVKPLPKPQLEEPKKELEELKQKYEDKQLLRSLIDVAKTVDQARALITFVEAISEKTLRSTVALTAARGRGKSAALGLAISAAVAYGYSNIFITSPNPENLKTLFEFTFKGFNSLKYEEHIDYDIIQSLNPSFNKSIVRVNIFRNHRQTIQYIHPSDAYVLGQAELLVIDEAAAIPMPLVKKLLGPYLTFMASTVNGYEGTGRSLSLKLIQQLREQSRGFAHENTKSGNSEKSMINRSEKLNKESGINSIGGRKLREITLEEPIRYSYGDPVEEWLNKLLCLDINISLKQFLEQGCPHPSQCELYYVNRDTLFSYHPVSESFLQMMMSLYVASHYKNSPNDLQLMADAPAHQLFVLLPPVKEDDNKLPEPLCVIQVALEGEISRESVVNNLTRGYRTGGDLIPWVITEQFQDDKFASLSGARIVRIATNPEYIRMGYGSHALKLLENFYEGKYLNLSEETISESNENIKIINNNLESSLLTDDIKIKDLKIMPPLLLKLSEKKPGLIHYLGVSYGLTPQLYKFWKRAEFIPVYLRQTPNDLTGEHTCLMLKLLQDKSETWLNEFSNDFRKRFLSLLSFSFRSFPTILCLNIIESINNDLIQKDNVHVITKSEIDINLSPFDLKRLESYANNMLDYHTIIDMLPYIADLYFKGRFGKDLKMTGVQSAILLALGLQKRLLEDIEKELNLPSNQVLAMLVKILRKLSSFFKDIYYKAIDNTLPIERKNLKNQLQTHADENDNFRGFIPLKATLKEELDHLSSEMEDSIKEKQRELINSLDLQKYIIKGQEEDWDKAEQHIKNGIYSGKSSVVSIQSHSLKREHESLTDIPHIKKKHQKKHKRKV
>1217666at2759_1073089_1:000010
MPINQPSNQIKFTNVSVVRLKKGKKRFELACYKNKLLEYRSGAEKDLDNVLQVPTIFLSVSKAQTAPSAELTKAFGANIPADEIRQEILRKGEVQVGERERKEISERVEKELLDIVSGRLVDPTTKRVYTPGMISKALDQLSSASGQMQQTQGEGSGATDEKGAAQPRKPMWTGVAPNKSAKSQALDAMKALIAWQPIPVMRARMRLRVTCPVSILKHSVKAPSGGGASKEKEAPSGNSKSNKGKKGPKSRAARQQDSDAEDGKSDAEAAPKTPSNVKDKILGYIESIESQEVIGGDEWEVVGFAEPGAYKGLNEFVGNETRGRGRVEVLDMTVTHEE
>513979at2759_1159556_1:000010
MAVVDIQARFSPHHPLEPDLLYEIQSILRLHGLSVDDLFFKWDAYCIRMDLDAQAALSLANVRSLKQSIQDDLEKSHRSTTQVRSERKVAAAPKAVSGGDVYGMLDGLVPSTPAAGGKRSRGVAAGGGGSGLKKKMDSLKMNSSPAGMKEQLSAFNGLPATSFAERANAGDVVEILNAQLPPCEAPLAPFPEPRIKLTAASDQKKMAYKPLAVKLSEASEVLDDRIDEFAALVQDYHGLEDSAFGSAASQGTTEVVAVGRIASDAMEGKLNAAALVLETSRRTGMGLRVPLKMHKVPSWSFFPGQVVALRGTNATGGEFVVEQVLDVPLLPSAASTPSALEAHRARMSGVPPGGGAAAATTDSDAAAPAPAPAPLTILYAAGPYTADDNLDYEPLHALCSQAADALADALVLAGPFLDIDHPLVAAGDFDLPPEDEAALDPDTATMSAVFRHLVAPALNRACAANPHLTVVLVPSVRDVLARHVSWPQDAIARKELGLAKAARIVSNPMTLSMNEVVVGVSSQDVLHELRNEECSRACPPGDLMGRLCRYLVEQRHYFPLFPPTDRARLPRTGTQSGLATGAVLDPSYLRLGEMVNVRPDVMVVPSSLPPFAKASSVVESVLAINPGPLSKRKGAGTFARMTLHAPPVGGGSEMTSHRVFDRARVEIVRI
>543764at2759_1165861_1:000010
MALGRAARPVGWTDCCAAVEKKPNYKSGMTQPARTITAGDNLLLKLPSGQTRTIKNVTSDSSISLGKFGKFQTNELIDQPFGLTFDILEDGKLVRNEQINLALELNPMLDELNSFESIKGMANGISNVEDIEATNEMIKESDGAQKLTNVEIEELKKSGLSGREIILRQIQQHSAFELKSEFSKAKYIKRKEKKFLKMFTCIDPTIHNMSQYLFENHNFAIKGLRPDTLSQMLSLSNVRPGWKGIVVDDIGGLLVAAVLIRMGGEGTIFVLNNADSPPDLHLLELFNLPKSVLGPLKSLNWAQTEADWTTSDIEELLLLHRDPPQPLPILDSTLPDPQLKQLSQRTKKQPNNRSKSMRKFERVQELLSMRQEFLDTQFEGLLTCSEYEPESIVTKLVNKLSGSSTIVIYSCHLRPLSDLQTLLKKSSMPSTSSSSLGGSSSLVEQNELTKRMKENKTEFIQITISEPWLRAYQVLVGRTHPEMAGTHHGGFVFSAIKVFNSCS
>1558822at2759_1266660_1:000010
MSIAEILPLEIIDKTVGQPVLVMLTSHREFSGTLVGYDDFVNVVLEEVVEYDHDQEIKRHAGKMLLSGNNIAMLVPGGKRVQ
>1287094at2759_1291522_0:000010
MGNILVKKNRVTITEADRAILTLRTQRRKMEEHRRRVEALMERETTVARTLVAKQQRPAALLALKKKRLHETQLEGLDNCLLTLEETLTQVESAQRTARLMAALKQGADVLSALQRAMPLESVEQLMEQGAESREYEMRLQALLGESLGEDQSAAAERELDEMEAQLIEEDVLDLPKVPSHAVARPASARAIGQAASERQLEPEIAA
>83779at2759_1296121_1:000010
MCGLTLTIRPLSLSLSSPSVSDCSSSDSTEDADLALLDSFRSTNAQRGPDSQRTFKHTVTLDDDDNGVTTTTTTKSTTKSKVEICLTATVLGLRGDLTAQPLVGNRGVLGWNGQVFEGIDIGTEENDTRKIFERLEKGERVEDVLSGVEGPFAFIYLDLENDILHYQLDPLSRRSLLIHPAEVAVDSNPSVTRHFILSSSRSTLAREHGVDMRALLGGEGGTIDLRRIKVVQNQGFLTMDMSDALKHRHTLSPDQDASCSSSSGSWTKVAPINTALPPDNLPLDNPKIKEEVPKFIEQLKESVKRRVENIPNPEKGCSRVAVLFSGGIDCTFLAYLIHLCLPPEDPIDLINVAFSPAPKLSSLSSNGADKGKGKSPALPAAPTYDVPDRLSGRDALVELKQVCPDREWRFVEIDVPYDEARAHRQNVLDLMYPSSTEMDHSLALPLYFASRGYGSVRKEGSNHSEPYRVKAKVYISGLGADEQLGGYARHRHAYQREGWQGLISETQMDIARLPTRNLSRDDRMLSSHARDARYPYLSLSFISYLSSLPVHLKCDPRLGEGQGDKILLRKAVESVGLVRASGRVKRAMQFGTRSSKLGGRGSGVKGPKAGERQVE
>1057950at2759_1314783_1:000010
MSSRQATHADSWYVGDGRRLDSELSKNLAAVEGDANYSPPIKGCKAVIAPHAGYSYSGRAAAWAYKSIDTTGIKRIFILGPSHHVYLDGCALSKCEKYETPLGELPIDLDTVKELRATGEFQDMDIQTDEDEHSIEMHLPYVRKVFEGLDIAIVPILIGAINLNKENKFGTVLAPYLAKDDTFFVISSDFCHWGTRFQYTFYYPRPPPTSTPAIRLSKADPNPSTLATHPIHASISAIDHEAMDLMTMPPQTAQQAHIDFAEYLRTTKNTICGRHPIGVLLGALAVLQSQGRVPHLKFVRYEQSSQCQTVRDSSVSYASAYITV
>453044at2759_1330018_1:000010
MPAAPQDPFFKSIGSAAADTEALREQPDEQDEQETDLEPIDEDRPLQEVESLCMSCGEQGVTRMLLTSIPYFREVIVMSFRCEHCGNQNNEIQSASTIREHGAMYTVKILNQGDLNRQLVKSEAATVTIPEFELTIPPLRGQLTTVEGTLRDTIQDLAADQPLRRIQDPPTFDKIEALLAKLKEVVPDDEDEAAPTMKERHPEDPVRPFTVILDDPTGNSFIEFSGSMSDPKWSLREYARSMDQNITLGLSQPEDEEKEKVTQKGGPFTEEDEDGLPAEEVFIFPGICSSCGHPVDTRMKKVNIPYFKDIIIMSTNCSACGYRDNEVKSGGAISDKGKRITLKVEDAEDLSRDILKSETCGLEIPEIDLALHAGTLGGRFTTVEGILTQVYDELSEKVFRGDSVGSANSKDNQEFETFLGSMKEVMTAARPFTLILDDPLANSYLQNLYAPDPDPNMEIVTYDRTFDQNEDLGLNDMKVEGYEAPS
>1323575at2759_1392248_1:000010
MSQPQPPPLRYIRYEPSREDEYVAAMRQLISKDLSEPYSIYVYRYFLYQWGDLCFMTVDDSRPEDPIVGVVVSKLEPHRGGPMRGYIAMLAVREEYRGRGIATKLVRMAIDAMIARDADEIALETEITNTAAMKLYERLGFLRSKRLHRYYLNGNSAYRLVLYLKEGVGNMRTSFDPYAAPAEARPEMSGAAAVPAAPAPPPLLQGNGR
>160593at2759_139723_0:000010
MADAELAKALKDLPNRVLNVPVEERPELFQNVIAVLPNPGINATIVRGICKVIGTTLTKYKDPESQTLVKELLVAVLKQHPDLTYEHFNAVLKALLAKDLAGAPPIKAAQASALALGWANLIALHADHETAVGKKEFPKLLEVQAGLYQLSLTSGIQKISDKAYSFLRDFFASDESLAQRYFDKLLAMEPSSGVIVMLCTIVRYLHQEQGTVELLDQHKPKLLDHLVKGLITVKTKPHASDIVACSILLKAITKDELRTIIVPALQRSMLRSAEVILRAVGAIVNEIELDVSDYALDLGKPLVQNLASKEETVRQEAVESLKQVALKCGTPNAIETLLKEVFAVLNGSGGKITVAELRINLLQGAGNLSYNKIPSQKIQTILPAACDHFTKVIEAEIQEKVVCHALEMFGLWTVNHRGEIPAKIVQLFKKGLDAKAQTIRTSYLQWFLSCLHDGKLPNGIDFTTTLSKIVERAAQSPTQTPVVSEGVGAACILLLTNPSVSEKLKDFWNIVLDTNKSPFLSERFLSTTNAETRCYVMVICEQLLIKHRNELKGSSTTDPLIRAATVCVMSAQAKVRRYCLPLVTKIVNSEDGVSLAKFLLAELTRYVECTKILSEGEPAEEGIAPAQALVDAVCTVCNVEKVANPDAQSLALSALLCSHHPAAVSVRGDLWESILERYGLYGKQFIALNTAQIEEVFFNSYKATAMYENTLATLSRISPELILSVLVKNVTDQLNNSRMSNVTDEEYFTYLTPDGELYDKSVIPNTDEQVQTAHLKRENKAYSYKEQLEELQLRRELEEKRRKEGKWKPPQLTPKQKEVIDKQREKENAIKARLQALHDTITTLISQIEGAAKGTPKQLPLFFPALLPAILRVFSSPLAAPAMVKLYYRLKDICFGEERVELGRDIAIATIRLSKPHCDLEESWCTANLVELVSDILVALYDETIDMYNVHREEEASKRYLLDAPAFSYTFEFLKRALTLPEAKKDESLLINGVQIIAYHAQLKGDTVDGKDLGDVYHPLYMPRLEMIRLLLRLIQQHRGRVQTQAVAALLDVAESCSGREYTTRAEQREIEALLVALQEELDAVRDVALRALAIMIDVLPSIADDYEFGLRLTRRLWVAKHDLSADIKQLATGIWQDGAYEVPIVMADELMKDIIHPELCVQKAAAAALVSILVEDSSTIDGVVEQLLEIYREKVVMIPAKLDQFDREVEPAIDPWGPRRGVAITLGSISPFLTPELVKSVIQFMVRSGLRDRQEIVHKEMLAASLAIVEHHGKDSVTYLLPTFEYFLDKAPSKGAYDNIRQAVVILMGSLARHLDREDERIQPIIDRLLAALETPSQQVQEAVANCIPHLIPSVKDKAPEIVKKLLQQLVKSEKYGVRRGAAYGIAGVVKGLGILSLKQLDIMSKLTHYIQDKKNYKSREGALFAFEMLCSTLGRLFEPYIVHVLPHLLQCFGDSSVYVRQAADECAKTVMAKLSAHGVKLVLPSLLNALDEDSWRTKTASVELLGSMAFCAPKQLSSCLPSIVPKLMEVLGDSHIKVQEAGANALRVIGSVIKNPEIQAIVPVLLTALEDPSSKTSACLQSLLETKFVHFIDAPSLALIMPVVQRAFMDRSTETRKMAAQIIGNMYSLTDQKDLTPYLPNIIPGLKTSLLDPVPEVRGVSARALGAMVRGMGESSFEDLLPWLMQTLTSESSSVDRSGAAQGLSEVVGGLGVEKLHKLMPEIIATAERTDIAPHVKDGYIMMFIYMPSAFPNDFTPYIGQIINPILKALADENEYVRDTALKAGQRIVNLYAESAITLLLPELEKGLFDDNWRIRYSSVQLLGDLLYKISGVSGKMTTQTASEDDNFGTEQSHKAIIRSLGADRRNRVLAGLYMGRSDVSLMVRQAALHVWKVVVTNTPRTLREILPTLFSLLLGCLASTSYDKRQVAARTLGDLVRKLGERVLPEIIPILERGLSSDQADQRQGVCIGLSEIMASTSRDMVLTFVNSLVPTVRKAL
ADPLPEVRHAAAKTFDSLHTTVGARALEDILPSMLESLADPDPDVAEWTLDGLRQVMAIKSRVVLPYLIPQLTAKPVNTKALSILASVAGEALTKYLPKILPALLAALAAAQGTPEEVQELEYCQAVILSVSDEVGIRTIMDTVMESTKSEIPETRRAAATLLCAFCTHSPGDYSQYVPQLLRGLLWLLSDGDREVLQRSWDALNAVTKTLDSAQQIAHVTDVRQAVKFASSDLPKGGELPGFCLPKGITPLLPVFREAILNGLPEEKENAAQGLGEVIKLTSPASLQPSVVHITGPLIRILGDRFNAGVKAAVLETLAILLHKVGIMLKQFLPQLQTTFLKALHDPSRTVRIKAGHALAELIVIHTRPDPLFVEMHNGIKSADDSAVRETMLQALRGIVTPAGDKMTEPLRKQIYATLAGMLAHPEDVSRAAAAGCFGALCRWLTPEQVDDALTSHMLNEDYGDDATLRHGRTAALFVALKEHPGGIVTTKYEPKICKVITGALVSDKISVAMNGVRAGGYLLQYGMTDGTAKLSTAVIGPFVKSMNHSSNEVKQLLAKTCTYLARVVPAERIAPEYLKLAIPMLVNGTKEKNGYVRSNSEIALVHVLRLRDGEEFHQRCITLLEPGARESLSEVVSKVLRKVAMQAVGKEEELDDTILT
>1346432at2759_1447883_1:000010
MSSMRNAVQRRVHRERAQPANREKWGILEKHKDYSLRARDYSVKKAKLQRLREKADTRNPDEFAFGMMSGKSRTQGKHGARDTESAALSLETVKLLKTQDAGYLRVVGERIRRQMMAVDEEVRVQEGISGVSANGAAAGGGGGGGRKVVFVDSVEEQRERALEDEGKSDDDEEQGDFDEVDEEEQRQQKTQPKSKKQLEAEKLAQKEMLKARKLKIKAAEARSKKLQALTDQHKNIVAAEQELDWQRGKMENSVGGVNKHGLRWKVRERKR
>761109at2759_198730_1:000010
MAMTFTEDSIKELRLRLEDAVVKCSERCLYQSAKWAAEMLNSLVSTDGNDTDAESPMETDLQPTVNPFSLQSDPTEATLELQEAHKYLLAKSYFDTREYDRCAAVFLPPTIPPVPLSTVSPNVKSRASLTPQKGKRKSFIRPGLKSGQALPRNPYPNLSQKSLFLALYAKYLAGEKRRDEETEMVLGPADGGMTVNRELPDLARGLEGWFEERRERGLQDQGQGWLEYLYAVILIKGKNEEEAKKWLVRSVHLFPFHWGAWQELNDLLPSVDDLKQVAETLPQNIMSFIFQVHCSQELYQATDETHQTLNGLESIFPTSAFLKTERALLYYHSRDFEDASAIFADILIDSPHRLDSLDHYSNILYVMGARPQLAFVAQLATATDKFRPETCCVVGNYYSLKSEHEKAVMYFRRALTLDRNFLSAWTLMGHEYIEMKNTHAAIESYRRAVDVNRKDYRAWYGLGQAYEVLDMCFYALYYYQRTAALKPYDPKMWQAVGTCYAKMNQIPQSIKAMKRALVAGAYYEQRADAATADHPAAGRKILDPDLLHQIALLYEKMNNEDEAAAYMELTLQQESGEIERTETDSDDDDGDDNSDDGTTQRRSRRQRRRQKSRDDDNEIEAVGGTGVTATTSKARLWLARWALKHGDLNRADQLAGELCQDGVEVEEAKALMRDVRARREGGGG
>1617752at2759_2004952_1:000010
MPSSFVTPGQQRYLRACMVCSIVMTYSRFRDEGCPNCDEFLHLAGSQDQIESCTSQVFEGLITLANPAKSWIAKWQRLDGYVGGVYAIKVSGQLPDEIRTTLEDEYRIQYIPRDGTQTEADA
>1588798at2759_215358_0:000010
MTLPPTQQEPHTPEAFSLFVSFNHREPQNDDVMADLGIKAGDKVMMVWTQPSAPEGLKQHAEELAAIVGADGKVSVENLERLLLSSHSASSFDCVLSCLLADSSPVHTSETLEELARVLKPGGKLVLDEAVTGAETSQVRTAEKLISALKLSGFMSVTEVSKAELTAEALSALRTATGYQGNTLSRVRVSASKPNFEVGSSSQIKLSFGKKTPKPAEKPALDPNTVKMWTLSANDMGDDDVDLVDSDALLDEEDLKKPDPASLKVSCRDSGKKKACKNCSCGLAEELEQESTGKQKTNLPKSACGSCYLGDAFRCASCPYAGMPAFKPGEKIVLDKKTLTDA
>1275837at2759_28005_1:000010
MSSRDKASPSSPKETKGEHHLNEESDNDNNERRDEQQVTASAYLPSASRVDVHPLVLLSLVDHFARMNTKVRQKKRVVGLLLGRYKTDAAGTQVLDINNSFAVPFDEDPHNSDVWFFDTNYAEEMFVMHRRVHPKTKIVGWYASGPTVQQNDMLLHLLVADRFCANPVYCVVNTDPSHKGVPVLAYTTVQGREGARSLEFRNIPTHVGAEEAEEIGVEHLLRDLTDSTVTTLSSQLEERERSLEHMARVLVQIEEYLSDVASGALPASEDVLEALQELISLQPETYLKKKSLELNRFTNDRTIATFLGSIARCIGGLHEVILNRRVLARELKEIKARRAEAEEQRMDNEKNKIAEASPERKQ
>1264469at2759_29058_0:000010
MRPPLAIVRTYCTTAAPKSSNFIDEMKRNFIATNTFQKTLLSCGSAAISLLNPHRGDMIACLGEVTGESAIKYMRQKMTETEEGTEILKEKPRINSGTVSFDKLSQMPDNTLGRVYADFMTENNITADSRLPVQFIEDPELAYVMQRYREVHDLVHATLFMRTSMLGEVTVKWVEGIQTRLPMCISGGIWGAARLKPKHRQMYLKYYLPWAIKTGNNAKFMQGIYFEKRWDQDIDDFHKEMNIVRLVKK
>673132at2759_326594_0:000010
MTLLTVFKQFKKFQDAGKSVARSLSIKDDQESKKTCLYDLHIENNGKMVNFSGWLLPIQYRDSITASHQHTRTHASLFDVGHMLQSHVSGCDSGEFLESLTTADLQNLAQGGAALTVFTNKSGGILDDLIITKDRNDRFFVVSNAGRRNEDIELMLGRQAEMKSQGKNVTIEFLDPLEQGLIALQGPSAATTLQTLVKIDLTKLKFMNSVETKINQKSVRISRCGYTGEDGFEISVNGKDARTISEMILEVPDIKLAGLGARDSLRLEAGFCLYGHDINESITPVEASLQWLIAKRRREAANFPGAEFILEQIKNGPKKKRVGLILGQGPPARENATILTSAGERVGIVTSGGPSPTLGKPIAMGYVPLEHVHTGTPVLTEIRGKTYKALITKMPFVKPHYYSDKR
>887370at2759_331117_1:000010
MVVRSFLPLLSLLIALATFTSAASDYHEALVLQPLPQSSLLASFNFRGNTSQEAFDQRHFRYFPRALGQILQHTHTKELHIRFTTGRWDAESWGTRPWNGTKEGNTGVELWAWIDAPDSESAFARWISLTQSLSGLFCASLNFIDSTRTTRPVVSFEPIGDHSPSSDLHLLHGTLPGEVVCTENLTPFLKLLPCKGKAGVSSLLDGHKLFDASWQSMSVDVRPVCPQGGECLMQIEQTVDIVLDIERSKRPRDNPIPRPVPNDQLNCDNSKPYHSDDTCYPLERGSGKGWSLNEIFGRTLNGVCSLDEGQRPGEEAICLRVPHEQGVYTTSGVEETKRPDGYTRCFTLQPSGTFDLVIPEQSHTSLAPRDEPVLSAERTIVGHGQERGGMRIIFDNPSDAHPVDFIYFETLPWFLRPYVHTLRATITGRDGATRSVPVSHIVKETFYRPAIDRERGTQLELALSVPAASIVTLTYDFEKAILRYTEYPPDANRGFNVAPAVIKLSSANGNTIAHDTPIYMRTTSLLLPLPTPDFSMPYNVIILTSTVIALAFGSIFNLLVRRFVAADQAAALTAQTLKGRLLGKIVALRDRISGKRSKVE
>166920at2759_38123_0:000010
MAFLDFVFPLSKDELLERSDSQYYVRDQVTTSELPEKLKGCFESLHDDGPLFILENFDTLYGLLAHFKSVDFNQLHKVYTKLLIKSITEFIPILENYFSKETPDDELQNKYLNVIKMTVYILTEFIISFESRLQKEYQKVVIDVRARKVKVRAAIKHKEKYNWDWDFHLSNGLNSIHQLLKAKINKLWDPPVVEEEFVNTIANCCYKIIEDPCIASVKHKELRIFIFQVIGYLIKKYNHGISCTVKIVQLLKNCDHLVSPLAQAVTMFIRNHGCKSLVREIVREISEMDDGNEAAGQGQDNSKMVAAFLNEIAAEGPEYVIPAMDELLLNLEKESYMMRNCTLTILTELLLQVYKKENLSSEAKDQRDEYLNSLMEHIYDVHTFVRTKVLQLFQKLVIEKALPLAFTLQLVDRAIGRLMDKSSNVVKYAVQLLRTMIVSNPFAAKLGVEELKKKLAEAKATLTELEKNLPETSAQLSLVDEWNNIHYPVLLKIIREILEDGMYGCFLFYFL
>1275837at2759_402676_1:000010
MESMNDMFKKINAREKLVGWYHTGPQLRSSDLEINNLFKKYIPNPVLVIIDVQSKAVGLPTSAYFAVDEIKDDGTKSSLTFVHLPSSIEAEEAEEIGVEHLLRDTRDITAGTLATRVTEQVQSLRALEQRLDEIAVYLRKVVDGQLPINHTILGELQGVFNLLPNIFKTSNENDPLGLENGDERSFNINSNDQLMTVYLSSIVRSVIALHDLLDSLAASKAAEQEQDKLDLKQESTDSEKRATTAAVDEDPFMPN
>1284731at2759_42254_0:000010
MAEAGAVAAEYPSGGRARAARTLLDQVVLPGEELLLPEQEDADGPGGAGERPLQARDPYLKWGVRRACCEIPYVPVRGDHVIGIVTAKSGDTFKVDVGGSEPASLSYLAFEGATKRNRPNVQVGDLIYGQFVVANKDMEPEMVCIDGCGRANGMGVIGQDGLLFKVTLGLIRKLLAPDCEIIQELGKLYPLEIVFGMNGRIWVKAKTIQQTLILANILEACEHMTTDQRKQIFSRLAES
>1228942at2759_45354_1:000010
MNHDPFQWGRPRDEIYGHYDHKIAQASTSEFPSMHTQQPIITGTSVLGLKFDTGVVIAADHMGSYGSLLRFNNLERLICVGSETIVGVSGDISDFQHIERLLHELETEEEVYDTDGGHNLRAPNIHEYLSRVLYNRRLKMDPLWNAILVAGFNDDRTPFIRYVDLLGVTYGALALATGFGAHLAIPLLRKLVPYDLDYVKVKEADAREAVVNAMRVLYYRDARASDKYTLAVLSFKDGKVDVHFDQELKVTNQSWKFAEKVIGYGSKQQ
>759498at2759_502779_1:000010
MDGSRGSRKRKAVTRDLGEEPGVVSGNELHLDSADGSLADHSEDLDGSSDSEIELADDLNSDDDEEEEEEEEEDEDEINSDEVPSDIEPKVVGKKSGPGGEVDIIVRGDDTASDDDDDDDDDFESDDRPNYRVVKDANGNERYVYDEINPDDNSDYSETDENANTIGNIPLSFYDQYPHIGYNINGKKIMRPAKGQALDALLDSIELPKGFTGLTDPATGKPLELTQDELELLRKVQMNEITEEGYDPYQPTIEYFTSKLEVMPLSAAPEPKRRFVPSKHEAKRVMKLVKAIREGRILPYKQPAEEDEAEEGVQTYDIWANETPRADHPMHIPAPKLPPPGYEESYHPPPEYLPDEKEKSAWLNTDPEDRETEYLPTDHDALRKVPGYESFVKEKFERCLDLYLAPRVRRSKLNIDPESLLPKLPSPEELKPFPSTCATLFRGHQGRVRTLAIDPTGVWLASGGDDGTVRVWDILTGRQFWSVALSGDDAINVVRWRPGKDAVVLAAAAGDSIFLMVPPVLDPEMEKASFEVVDAGWGYAKTSPSTFTSTDSTKTSPVQWTRPSSSLLDSGVQAVISLGYVAKSLSWHRRGDYFVTVCPGTSTPVSLAIAIHTLSKHLTQQPFRRRLKGGGPPQTAHFHPSKPILFVANQRTIRAYDLSRQTLVKILQPGARWISSFDIHPTSSSTSGGDNLIVGSYDRRLLWHDVDLSPRPYKTLRYHQKAIRAVRYHANYPLFADASDDGSLQIFHGSVTGDLLSNASIVPLKVLRGHKVTGELGVLDLDWHPKEAWCVSAGADGTCRLWM
>375960at2759_51337_0:000010
MFFREHIFNIIGAFDIPRFVYNSERKKFLPLLMTNHPAPNLLGTAKDKAELYRERYTLLHQRTHRHELFTPPVIGSYPNESGSKFQLKTIETLLGSTTKIGDVIVLGMITQLKEGKFFLEDPTRTVQLDLSQAQFHSGLYTEACFVLAEGKAYYGSINFFGGPSNTSVKTSTKLKQLEEENKDAMFVFVSDVWLDRAEVLEKLHIMFSGYSPAPPSCFILCGNFSSAPYGKNQIQALKDSLKTLADIICEYPNIHQSSRFVFVPGPKDPGFGSILPRPPLAESITSEFRQKIPFSVFTTNPCRIQYCTEEIIIFREDIVNKMCRNCVRFPSSNLDIPNHFVKTILSQGHLTPLPLYVCPVYWARFPSSNLDIPNHGSFPRSGFSFKVFYPSSKTVEDSKLQGF
>919955at2759_5643_1:000010
MAAPMAVDKAKAPKIDVDEFLTLAISETPAELHPFFESFRSLYSRKLWHQLTNKLFEFFDHPLSKPYRVDVFNKFVRDFGLRLNQLRLVEMGVKVSKEIDNPVTHLQFLTDLLERVNIEKSPEAHVLLLSSLAHAKLLYGDHEGTKNDIDAAWKVLDELSSVDPSVNAAYYGVAADYYKSKAEYAPYYKNSLLYLACIDPAKDLTAEERLLRAHDLGIAAFLGDTIYNFGELPILQENYPFLRQKICLMALIESVFKRGSYDRTMSFQTIAEETHLPLDEVEHLVMKALSLKLIKGSLDQVDQKAQITWVQPRVLSREQIGQLAQRLAAWNSKLHQVEERIAPEVLVNS
>817008at2759_5849_1:000010
MDKLKTIYIDSALSIIKGALCVILQIPTGRTTESIKKKQNNVGIITVKSIFKEPTISQYNDIKQLIKTKIEENCPFYNYQINRTIAEKIYGDTIYDNYGLSKEINEVNLIILEEWNINCNRNRVLKHSGLIKNIEINKFKYLNNKESLEVHFLVNPKYTFEELNTIYKNEEELNNFLLSPIIKVTNKKIYEIEDKKSEFSYLYEEDILPKNKVLPPSGIENVNYESSKVVTPWDVNIGEEGINYNKLIKEFGCSKISDEHIRKIEKLTNRKAHHFIRRGIFFSHRDLDFLLNYYEQNGYFYIYTGRGPSSLSMHLGHLIPFYFCKYLQDAFNVPLIIQLSDDEKFLFNQNYSLDDINRFTKENVKDIIAVGFNPELTFIFKNTEYANHLYPTVLAIHKKTTLNQSMNVFGFNNSDNIGKISYPSFQIAPCFSQCFPNFLKKNIPCLVPQGIDQDPYFRLSRDIAVKLALYKPVVIHSVFMPGLQGVNTKMSSTKKKDNKNMDSKQDINNSVIFLTDSPEQIKNKINKYAFSGGGATIAEHKEKGADLEKDISYQYLRYFLVDDEKLNEIGEKYKKGEMLSGEIKKILIDILTDLVQKHQEKRNSLTDEDILYFFNDNKSSLKKFKDM
>1426075at2759_61621_0:000010
MTASQPNPQLPQSLPALKTSGTCARLPSTGRKLHLRIARAHPRVSRELFRRSGCGCGAGLSSAETDIAFLFSASGYRSHILKTMSGSFYFVIVGHHDNPVLKWSFXPAGKAESKDDHRHLNQFIAHAALDLVDENMWLSNNMYLKTVDKFNEWFVSAFVTAGHMRFIMLHDIRQEDGIKNFFTDVYDLYIKFSMNPFYEPNSPIRSSAFDRKVQFLGKKHLLS
>655400at2759_688394_1:000010
MAASRSPRLSSLLLRTTPLSRPTWQRTLSTRGFATAISNKLDNVYDMVIVGGGIAGTALACSLATNPSMKDYRIALIEAMDLSNTNNWAPATGRYSNRVVSLTPASMQFFEKIGVADELYRDRIQPYNCMKVSDGVTNASIEFDTNLLSSSTNPDDLPIAYMIENVHLQHSILKTLQTSKGKGATVDILQKARVASIRMQEQDAKETKDTLDLSDWPIIEMENGQSLQARLLVGADGVNSPVRSFAKIESLGWDYNMHGVVATFKTDPSRKNDTAYQRFLPTGPIAMLPLGDGHASMVWSMPPDMAHKVKKIPAQAFCTLVNSAFRLSMEDLDYLRSKIDPTTFEPLCDFDSEYNWRQGVAKHGLGDMEMMERELAFPPIVESVDETSRASFPLRMRNSQQYFADRVVLVGDAAHTVHPLAGQGLNQGILDVACLSDILQRGASEGQDIGNLHLLREYASVRYLRNLLMISACDKLHRLYSTDFAPITWIRSLGLSSVNQLDFVKAEIMKYAMGIEQ
>946128at2759_765440_1:000010
MPTTVCTAKASYKKTPGQLELTETHLQWFADGKKAPSVRVLYAEAASLFCSKEGAAQIRLKLGLVGDDTGHNFTFTSPQSVAYKERETFKKELTNIISRNRSVPNVTTPRPPLNTSISSTTPAISNAPTPRSVVPPSRASTSRAPSVSSDGRTPIVPGSDPTSDFRLRKQVLVSNPELGALHRDLVMSGQITEAEFWEGREHLLLAQTATESQKRGRPGQLVDPRPETVEGGEVKIVITPQLVHDIFEEYPVVAKAYNDNVPNKLSEAEFWKRYFQSKLFNAHRASIRSSAAQHVVKDDKIFDKYLEKDDDELEPRRQRDEGINLFVNLGATREDHGETGNEQDITMQAGRQRGALPLIRKFNEHSERLLNSALGDEPTAKRRRIDAGKEDAYSQIDLDDLHDPEASAGIILEMQDRQRYFEGQMASAASAEAAAGKNLDIRAILGETKVNLHDWETNLAQLKINKKSGDAALLSMTENVSARLEIKMKKNDIPPELFSQMTTCQTAANEFLRQFWLSMYPPAADHQVLAPATPAQKAAKAAKMIGYLGKTHEKVDALIRTAQVEAVDAAKVEIVRAVCFVYIITVNFNANLQAMKPILDAVDRALAFYRSRKPPK
>1287401at2759_870435_1:000010
MSSSIVGSLTRGCRTPSVNINPHPFFRCRTSLYHGIGKPPSWLHSRTQLWRTIGTSSSKHTPPSSASVSARRPTAIPSYNASREQMYKTRNRNLLMYTSAVVILGVGITYAAVPLYRMFCSATGFAGTPSVVSTSSGRFDPSRLTPDTDARRIRVHFNADRAEALPWKFFPQQKYVEVLPGESSLAFYKARNESKKDIIGIATYNVTPDRVAPYFSKVECFCFEEQKLLAGEEVDMPLLFFIDKDILDDPSCRGVNDVVLSYTFFKARRNAQGHLEPDAEEDVVQRSLGFEGYEHSPRAETKKVEGSKANS


@@ -0,0 +1,12 @@
# busco test data
# Test data from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/busco/test
if [ ! -d /tmp/snakemake-wrappers ]; then
  git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
cp -r /tmp/snakemake-wrappers/bio/busco/test/protein.fasta src/busco/test_data
# Test data from busco test data at https://gitlab.com/ezlab/busco/-/tree/master/test_data?ref_type=heads
wget -O src/busco/test_data/genome.fna "https://gitlab.com/ezlab/busco/-/raw/master/test_data/eukaryota/genome.fna?ref_type=heads&inline=false"
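
The script above only fetches the two test files; it does not verify them. A minimal sketch of a post-download check follows, assuming a hypothetical helper `check_fasta` (not part of biobox or BUSCO) and assuming that a usable file is non-empty and begins with a `>` header line:

```shell
# check_fasta is a hypothetical helper for illustration only.
# Usage: check_fasta PATH
check_fasta() {
  file="$1"
  # -s is true only if the file exists and is non-empty
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # a FASTA file is expected to start with a '>' header line
  [ "$(head -c 1 "$file")" = ">" ] || { echo "not FASTA: $file" >&2; return 1; }
  # report how many records (header lines) the file contains
  echo "$file: $(grep -c '^>' "$file") record(s)"
}
```

Calling `check_fasta src/busco/test_data/protein.fasta` and `check_fasta src/busco/test_data/genome.fna` after the download would surface a failed `wget` or a missing clone before any test runs.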


@@ -0,0 +1,481 @@
name: cutadapt
description: |
Cutadapt removes adapter sequences from high-throughput sequencing reads.
keywords: [RNA-seq, scRNA-seq, high-throughput]
links:
homepage: https://cutadapt.readthedocs.io
documentation: https://cutadapt.readthedocs.io
repository: https://github.com/marcelm/cutadapt
references:
doi: 10.14806/ej.17.1.200
license: MIT
argument_groups:
####################################################################
- name: Specify Adapters for R1
arguments:
- name: --adapter
alternatives: [-a]
type: string
multiple: true
description: |
Sequence of an adapter ligated to the 3' end (paired data:
of the first read). The adapter and subsequent bases are
trimmed. If a '$' character is appended ('anchoring'), the
adapter is only found if it is a suffix of the read.
required: false
- name: --front
alternatives: [-g]
type: string
multiple: true
description: |
Sequence of an adapter ligated to the 5' end (paired data:
of the first read). The adapter and any preceding bases
are trimmed. Partial matches at the 5' end are allowed. If
a '^' character is prepended ('anchoring'), the adapter is
only found if it is a prefix of the read.
required: false
- name: --anywhere
alternatives: [-b]
type: string
multiple: true
description: |
Sequence of an adapter that may be ligated to the 5' or 3'
end (paired data: of the first read). Both types of
matches as described under -a and -g are allowed. If the
first base of the read is part of the match, the behavior
is as with -g, otherwise as with -a. This option is mostly
for rescuing failed library preparations - do not use if
you know which end your adapter was ligated to!
required: false
####################################################################
- name: Specify Adapters using Fasta files for R1
arguments:
- name: --adapter_fasta
type: file
multiple: true
description: |
Fasta file containing sequences of an adapter ligated to the 3' end (paired data:
of the first read). The adapter and subsequent bases are
trimmed. If a '$' character is appended ('anchoring'), the
adapter is only found if it is a suffix of the read.
required: false
- name: --front_fasta
type: file
description: |
Fasta file containing sequences of an adapter ligated to the 5' end (paired data:
of the first read). The adapter and any preceding bases
are trimmed. Partial matches at the 5' end are allowed. If
a '^' character is prepended ('anchoring'), the adapter is
only found if it is a prefix of the read.
required: false
- name: --anywhere_fasta
type: file
description: |
Fasta file containing sequences of an adapter that may be ligated to the 5' or 3'
end (paired data: of the first read). Both types of
matches as described under -a and -g are allowed. If the
first base of the read is part of the match, the behavior
is as with -g, otherwise as with -a. This option is mostly
for rescuing failed library preparations - do not use if
you know which end your adapter was ligated to!
required: false
####################################################################
- name: Specify Adapters for R2
arguments:
- name: --adapter_r2
alternatives: [-A]
type: string
multiple: true
description: |
Sequence of an adapter ligated to the 3' end (paired data:
of the second read). The adapter and subsequent bases are
trimmed. If a '$' character is appended ('anchoring'), the
adapter is only found if it is a suffix of the read.
required: false
- name: --front_r2
alternatives: [-G]
type: string
multiple: true
description: |
Sequence of an adapter ligated to the 5' end (paired data:
of the second read). The adapter and any preceding bases
are trimmed. Partial matches at the 5' end are allowed. If
a '^' character is prepended ('anchoring'), the adapter is
only found if it is a prefix of the read.
required: false
- name: --anywhere_r2
alternatives: [-B]
type: string
multiple: true
description: |
Sequence of an adapter that may be ligated to the 5' or 3'
end (paired data: of the second read). Both types of
matches as described under -a and -g are allowed. If the
first base of the read is part of the match, the behavior
is as with -g, otherwise as with -a. This option is mostly
for rescuing failed library preparations - do not use if
you know which end your adapter was ligated to!
required: false
####################################################################
- name: Specify Adapters using Fasta files for R2
arguments:
- name: --adapter_r2_fasta
type: file
description: |
Fasta file containing sequences of an adapter ligated to the 3' end (paired data:
of the second read). The adapter and subsequent bases are
trimmed. If a '$' character is appended ('anchoring'), the
adapter is only found if it is a suffix of the read.
required: false
- name: --front_r2_fasta
type: file
description: |
Fasta file containing sequences of an adapter ligated to the 5' end (paired data:
of the second read). The adapter and any preceding bases
are trimmed. Partial matches at the 5' end are allowed. If
a '^' character is prepended ('anchoring'), the adapter is
only found if it is a prefix of the read.
required: false
- name: --anywhere_r2_fasta
type: file
description: |
Fasta file containing sequences of an adapter that may be ligated to the 5' or 3'
end (paired data: of the second read). Both types of
matches as described under -a and -g are allowed. If the
first base of the read is part of the match, the behavior
is as with -g, otherwise as with -a. This option is mostly
for rescuing failed library preparations - do not use if
you know which end your adapter was ligated to!
required: false
####################################################################
- name: Paired-end options
arguments:
- name: --pair_adapters
type: boolean_true
description: |
Treat adapters given with -a/-A etc. as pairs. Either both
or none are removed from each read pair.
- name: --pair_filter
type: string
choices: [any, both, first]
description: |
Which of the reads in a paired-end read have to match the
filtering criterion in order for the pair to be filtered.
- name: --interleaved
type: boolean_true
description: |
Read and/or write interleaved paired-end reads.
####################################################################
- name: Input parameters
arguments:
- name: --input
type: file
required: true
description: |
Input fastq file for single-end reads or R1 for paired-end reads.
- name: --input_r2
type: file
required: false
description: |
Input fastq file for R2 in the case of paired-end reads.
- name: --error_rate
alternatives: [-E, --errors]
type: double
description: |
Maximum allowed error rate (if 0 <= E < 1), or absolute
number of errors for full-length adapter match (if E is an
integer >= 1). Error rate = no. of errors divided by
length of matching region. Default: 0.1 (10%).
example: 0.1
- name: --no_indels
type: boolean_false
description: |
Allow only mismatches in alignments.
- name: --times
type: integer
alternatives: [-n]
description: |
Remove up to COUNT adapters from each read. Default: 1.
example: 1
- name: --overlap
alternatives: [-O]
type: integer
description: |
Require MINLENGTH overlap between read and adapter for an
adapter to be found. The default is 3.
example: 3
- name: --match_read_wildcards
type: boolean_true
description: |
Interpret IUPAC wildcards in reads.
- name: --no_match_adapter_wildcards
type: boolean_false
description: |
Do not interpret IUPAC wildcards in adapters.
- name: --action
type: string
choices:
- trim
- retain
- mask
- lowercase
- none
description: |
What to do if a match was found. trim: trim adapter and
up- or downstream sequence; retain: trim, but retain
adapter; mask: replace with 'N' characters; lowercase:
convert to lowercase; none: leave unchanged.
The default is trim.
example: trim
- name: --revcomp
alternatives: [--rc]
type: boolean_true
description: |
Check both the read and its reverse complement for adapter
matches. If match is on reverse-complemented version,
output that one.
####################################################################
- name: "Demultiplexing options"
arguments:
- name: "--demultiplex_mode"
type: string
choices: ["single", "unique_dual", "combinatorial_dual"]
required: false
description: |
Enable demultiplexing and set its mode.
With mode 'unique_dual', adapters from the first and second read are used,
and the indexes from the reads are only used in pairs. This implies
--pair_adapters.
Mode 'combinatorial_dual' allows all combinations of the sets of indexes
on R1 and R2. Each read pair is then written to an output file that
depends on the adapters found on both R1 and R2.
Mode 'single' uses indexes or barcodes located at the 5' end of the
R1 read.
####################################################################
- name: Read modifications
arguments:
- name: --cut
alternatives: [-u]
type: integer
multiple: true
description: |
Remove LEN bases from each read (or R1 if paired; use --cut_r2
option for R2). If LEN is positive, remove bases from the
beginning. If LEN is negative, remove bases from the end.
Can be used twice if LENs have different signs. Applied
*before* adapter trimming.
- name: --cut_r2
type: integer
multiple: true
description: |
Remove LEN bases from each read (for R2). If LEN is positive, remove bases from the
beginning. If LEN is negative, remove bases from the end.
Can be used twice if LENs have different signs. Applied
*before* adapter trimming.
- name: --nextseq_trim
type: string
description: |
NextSeq-specific quality trimming (each read). Trims also
dark cycles appearing as high-quality G bases.
- name: --quality_cutoff
alternatives: [-q]
type: string
description: |
Trim low-quality bases from 5' and/or 3' ends of each read
before adapter removal. Applied to both reads if data is
paired. If one value is given, only the 3' end is trimmed.
If two comma-separated cutoffs are given, the 5' end is
trimmed with the first cutoff, the 3' end with the second.
- name: --quality_cutoff_r2
alternatives: [-Q]
type: string
description: |
Quality-trimming cutoff for R2. Default: same as for R1
- name: --quality_base
type: integer
description: |
Assume that quality values in FASTQ are encoded as
ascii(quality + N). This needs to be set to 64 for some
old Illumina FASTQ files. The default is 33.
example: 33
- name: --poly_a
type: boolean_true
description: Trim poly-A tails
- name: --length
alternatives: [-l]
type: integer
description: |
Shorten reads to LENGTH. Positive values remove bases at
the end while negative ones remove bases at the beginning.
This and the following modifications are applied after
adapter trimming.
- name: --trim_n
type: boolean_true
description: Trim N's on ends of reads.
- name: --length_tag
type: string
description: |
Search for TAG followed by a decimal number in the
description field of the read. Replace the decimal number
with the correct length of the trimmed read. For example,
use --length-tag 'length=' to correct fields like
'length=123'.
example: "length="
- name: --strip_suffix
type: string
description: |
Remove this suffix from read names if present. Can be
given multiple times.
- name: --prefix
alternatives: [-x]
type: string
description: |
Add this prefix to read names. Use {name} to insert the
name of the matching adapter.
- name: --suffix
alternatives: [-y]
type: string
description: |
Add this suffix to read names; can also include {name}
- name: --rename
type: string
description: |
Rename reads using TEMPLATE containing variables such as
{id}, {adapter_name} etc. (see documentation)
- name: --zero_cap
alternatives: [-z]
type: boolean_true
description: Change negative quality values to zero.
####################################################################
- name: Filtering of processed reads
description: |
Filters are applied after above read modifications. Paired-end reads are
always discarded pairwise (see also --pair_filter).
arguments:
- name: --minimum_length
alternatives: [-m]
type: string
description: |
Discard reads shorter than LEN. Default is 0.
When trimming paired-end reads, the minimum lengths for R1 and R2 can be specified separately by separating them with a colon (:).
If the colon syntax is not used, the same minimum length applies to both reads.
Also, one of the values can be omitted to impose no restrictions.
For example, with -m 17:, the length of R1 must be at least 17, but the length of R2 is ignored.
example: "0"
- name: --maximum_length
alternatives: [-M]
type: string
description: |
Discard reads longer than LEN. Default: no limit.
For paired reads, see the remark for --minimum_length
- name: --max_n
type: string
description: |
Discard reads with more than COUNT 'N' bases. If COUNT is
a number between 0 and 1, it is interpreted as a fraction
of the read length.
- name: --max_expected_errors
alternatives: [--max_ee]
type: double
description: |
Discard reads whose expected number of errors (computed
from quality values) exceeds ERRORS.
- name: --max_average_error_rate
alternatives: [--max_aer]
type: double
description: |
Same as --max_expected_errors (see above), but divided by the read
length to account for reads of varying length.
- name: --discard_trimmed
alternatives: [--discard]
type: boolean_true
description: |
Discard reads that contain an adapter. Use also -O to
avoid discarding too many randomly matching reads.
- name: --discard_untrimmed
alternatives: [--trimmed_only]
type: boolean_true
description: |
Discard reads that do not contain an adapter.
- name: --discard_casava
type: boolean_true
description: |
Discard reads that did not pass CASAVA filtering (header
has :Y:).
####################################################################
- name: Output parameters
arguments:
- name: --report
type: string
choices: [full, minimal]
description: |
Which type of report to print: 'full' (default) or 'minimal'.
example: full
- name: --json
type: boolean_true
description: |
Write the report in JSON format.
- name: --output
type: file
description: |
Glob pattern for matching the expected output files.
Should include `$output_dir`.
example: "fastq/*_001.fast[a,q]"
direction: output
required: true
must_exist: true
multiple: true
- name: --fasta
type: boolean_true
description: |
Output FASTA to standard output even on FASTQ input.
- name: --info_file
type: boolean_true
description: |
Write information about each read and its adapter matches
into info.txt in the output directory.
See the documentation for the file format.
# - name: -Z
# - name: --rest_file
# - name: --wildcard_file
# - name: --too_short_output
# - name: --too_long_output
# - name: --untrimmed_output
# - name: --untrimmed_paired_output
# - name: --too_short_paired_output
# - name: --too_long_paired_output
- name: Debug
arguments:
- type: boolean_true
name: --debug
description: Print debug information
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: python:3.12
setup:
- type: python
pip:
- cutadapt
- type: docker
run: |
cutadapt --version | sed 's/\(.*\)/cutadapt: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
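
The `--error_rate` description in the config above defines the error rate as the number of errors divided by the length of the matching region. A small worked sketch of what that implies for partial adapter matches follows. It assumes, as described in the cutadapt user guide, that the permitted error count is the rate multiplied by the match length, rounded down. `allowed_errors` is a hypothetical helper for illustration, not a cutadapt or biobox function:

```shell
# allowed_errors is a hypothetical helper, not part of cutadapt or biobox.
# Usage: allowed_errors ERROR_RATE MATCH_LENGTH
# Assumption: allowed errors = floor(ERROR_RATE * MATCH_LENGTH), for 0 <= ERROR_RATE < 1.
allowed_errors() {
  # awk's int() truncates toward zero, which rounds down for non-negative input
  awk -v e="$1" -v n="$2" 'BEGIN { printf "%d\n", int(e * n) }'
}
```

With the default rate of 0.1, `allowed_errors 0.1 20` prints `2` and `allowed_errors 0.1 9` prints `0`, so under this assumption a partial match shorter than 10 bases has to be exact.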

src/cutadapt/help.txt Normal file

@@ -0,0 +1,218 @@
cutadapt version 4.6
Copyright (C) 2010 Marcel Martin <marcel.martin@scilifelab.se> and contributors
Cutadapt removes adapter sequences from high-throughput sequencing reads.
Usage:
cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
For paired-end reads:
cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. All reads from input.fastq will be written to
output.fastq with the adapter sequence removed. Adapter matching is
error-tolerant. Multiple adapter sequences can be given (use further -a
options), but only the best-matching adapter will be removed.
Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.
Citation:
Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200
Run "cutadapt --help" to see all command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.
Options:
-h, --help Show this help message and exit
--version Show version number and exit
--debug Print debug log. Use twice to also print DP matrices
-j CORES, --cores CORES
Number of CPU cores to use. Use 0 to auto-detect. Default:
1
Finding adapters:
Parameters -a, -g, -b specify adapters to be removed from each read (or from
R1 if data is paired-end. If specified multiple times, only the best matching
adapter is trimmed (but see the --times option). Use notation 'file:FILE' to
read adapter sequences from a FASTA file.
-a ADAPTER, --adapter ADAPTER
Sequence of an adapter ligated to the 3' end (paired data:
of the first read). The adapter and subsequent bases are
trimmed. If a '$' character is appended ('anchoring'), the
adapter is only found if it is a suffix of the read.
-g ADAPTER, --front ADAPTER
Sequence of an adapter ligated to the 5' end (paired data:
of the first read). The adapter and any preceding bases
are trimmed. Partial matches at the 5' end are allowed. If
a '^' character is prepended ('anchoring'), the adapter is
only found if it is a prefix of the read.
-b ADAPTER, --anywhere ADAPTER
Sequence of an adapter that may be ligated to the 5' or 3'
end (paired data: of the first read). Both types of
matches as described under -a and -g are allowed. If the
first base of the read is part of the match, the behavior
is as with -g, otherwise as with -a. This option is mostly
for rescuing failed library preparations - do not use if
you know which end your adapter was ligated to!
-e E, --error-rate E, --errors E
Maximum allowed error rate (if 0 <= E < 1), or absolute
number of errors for full-length adapter match (if E is an
integer >= 1). Error rate = no. of errors divided by
length of matching region. Default: 0.1 (10%)
--no-indels Allow only mismatches in alignments. Default: allow both
mismatches and indels
-n COUNT, --times COUNT
Remove up to COUNT adapters from each read. Default: 1
-O MINLENGTH, --overlap MINLENGTH
Require MINLENGTH overlap between read and adapter for an
adapter to be found. Default: 3
--match-read-wildcards
Interpret IUPAC wildcards in reads. Default: False
-N, --no-match-adapter-wildcards
Do not interpret IUPAC wildcards in adapters.
--action {trim,retain,mask,lowercase,none}
What to do if a match was found. trim: trim adapter and
up- or downstream sequence; retain: trim, but retain
adapter; mask: replace with 'N' characters; lowercase:
convert to lowercase; none: leave unchanged. Default: trim
--rc, --revcomp Check both the read and its reverse complement for adapter
matches. If match is on reverse-complemented version,
output that one. Default: check only read
Additional read modifications:
-u LEN, --cut LEN Remove LEN bases from each read (or R1 if paired; use -U
option for R2). If LEN is positive, remove bases from the
beginning. If LEN is negative, remove bases from the end.
Can be used twice if LENs have different signs. Applied
*before* adapter trimming.
--nextseq-trim 3'CUTOFF
NextSeq-specific quality trimming (each read). Trims also
dark cycles appearing as high-quality G bases.
-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff [5'CUTOFF,]3'CUTOFF
Trim low-quality bases from 5' and/or 3' ends of each read
before adapter removal. Applied to both reads if data is
paired. If one value is given, only the 3' end is trimmed.
If two comma-separated cutoffs are given, the 5' end is
trimmed with the first cutoff, the 3' end with the second.
--quality-base N Assume that quality values in FASTQ are encoded as
ascii(quality + N). This needs to be set to 64 for some
old Illumina FASTQ files. Default: 33
--poly-a Trim poly-A tails
--length LENGTH, -l LENGTH
Shorten reads to LENGTH. Positive values remove bases at
the end while negative ones remove bases at the beginning.
This and the following modifications are applied after
adapter trimming.
--trim-n Trim N's on ends of reads.
--length-tag TAG Search for TAG followed by a decimal number in the
description field of the read. Replace the decimal number
with the correct length of the trimmed read. For example,
use --length-tag 'length=' to correct fields like
'length=123'.
--strip-suffix STRIP_SUFFIX
Remove this suffix from read names if present. Can be
given multiple times.
-x PREFIX, --prefix PREFIX
Add this prefix to read names. Use {name} to insert the
name of the matching adapter.
-y SUFFIX, --suffix SUFFIX
Add this suffix to read names; can also include {name}
--rename TEMPLATE Rename reads using TEMPLATE containing variables such as
{id}, {adapter_name} etc. (see documentation)
--zero-cap, -z Change negative quality values to zero.
Filtering of processed reads:
Filters are applied after above read modifications. Paired-end reads are
always discarded pairwise (see also --pair-filter).
-m LEN[:LEN2], --minimum-length LEN[:LEN2]
Discard reads shorter than LEN. Default: 0
-M LEN[:LEN2], --maximum-length LEN[:LEN2]
Discard reads longer than LEN. Default: no limit
--max-n COUNT Discard reads with more than COUNT 'N' bases. If COUNT is
a number between 0 and 1, it is interpreted as a fraction
of the read length.
--max-expected-errors ERRORS, --max-ee ERRORS
Discard reads whose expected number of errors (computed
from quality values) exceeds ERRORS.
--max-average-error-rate ERROR_RATE, --max-aer ERROR_RATE
as --max-expected-errors (see above), but divided by
length to account for reads of varying length.
--discard-trimmed, --discard
Discard reads that contain an adapter. Use also -O to
avoid discarding too many randomly matching reads.
--discard-untrimmed, --trimmed-only
Discard reads that do not contain an adapter.
--discard-casava Discard reads that did not pass CASAVA filtering (header
has :Y:).
Output:
--quiet Print only error messages.
--report {full,minimal}
Which type of report to print: 'full' or 'minimal'.
Default: full
--json FILE Dump report in JSON format to FILE
-o FILE, --output FILE
Write trimmed reads to FILE. FASTQ or FASTA format is
chosen depending on input. Summary report is sent to
standard output. Use '{name}' for demultiplexing (see
docs). Default: write to standard output
--fasta Output FASTA to standard output even on FASTQ input.
-Z Use compression level 1 for gzipped output files (faster,
but uses more space)
--info-file FILE Write information about each read and its adapter matches
into FILE. See the documentation for the file format.
-r FILE, --rest-file FILE
When the adapter matches in the middle of a read, write
the rest (after the adapter) to FILE.
--wildcard-file FILE When the adapter has N wildcard bases, write adapter bases
matching wildcard positions to FILE. (Inaccurate with
indels.)
--too-short-output FILE
Write reads that are too short (according to length
specified by -m) to FILE. Default: discard reads
--too-long-output FILE
Write reads that are too long (according to length
specified by -M) to FILE. Default: discard reads
--untrimmed-output FILE
Write reads that do not contain any adapter to FILE.
Default: output to same file as trimmed reads
Paired-end options:
The -A/-G/-B/-U/-Q options work like their lowercase counterparts, but are
applied to R2 (second read in pair)
-A ADAPTER 3' adapter to be removed from R2
-G ADAPTER 5' adapter to be removed from R2
-B ADAPTER 5'/3 adapter to be removed from R2
-U LENGTH Remove LENGTH bases from R2
-Q [5'CUTOFF,]3'CUTOFF
Quality-trimming cutoff for R2. Default: same as for R1
-p FILE, --paired-output FILE
Write R2 to FILE.
--pair-adapters Treat adapters given with -a/-A etc. as pairs. Either both
or none are removed from each read pair.
--pair-filter {any,both,first}
Which of the reads in a paired-end read have to match the
filtering criterion in order for the pair to be filtered.
Default: any
--interleaved Read and/or write interleaved paired-end reads.
--untrimmed-paired-output FILE
Write second read in a pair to this FILE when no adapter
was found. Use with --untrimmed-output. Default: output to
same file as trimmed reads
--too-short-paired-output FILE
Write second read in a pair to this file if pair is too
short.
--too-long-paired-output FILE
Write second read in a pair to this file if pair is too
long.
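
The -e/--error-rate option above defines the error rate as the number of errors divided by the length of the matching region. A minimal sketch of that arithmetic, assuming the number of tolerated errors is the rate multiplied by the match length and rounded down; the function name `max_errors` is hypothetical and not part of cutadapt:

```bash
#!/bin/bash
# max_errors RATE LENGTH
# Hypothetical helper: number of errors tolerated for a match of LENGTH bases
# at error rate RATE, assuming the product is rounded down.
max_errors() {
  awk -v rate="$1" -v len="$2" 'BEGIN { print int(rate * len) }'
}

max_errors 0.1 10   # full 10-base adapter at the default rate
max_errors 0.1 9    # a 9-base partial match
```

Under this assumption, short partial matches tolerate no errors at the default rate, which is one reason to raise -O/--overlap when random matches are a concern.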

258
src/cutadapt/script.sh Normal file

@@ -0,0 +1,258 @@
#!/bin/bash
## VIASH START
par_adapter='AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;GGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
par_input='src/cutadapt/test_data/se/a.fastq'
par_report='full'
par_json='false'
par_fasta='false'
par_info_file='false'
par_debug='true'
## VIASH END
function debug {
[[ "$par_debug" == "true" ]] && echo "DEBUG: $*"
}
# Quote the path so a pattern such as "out/*.fastq" is not glob-expanded
output_dir=$(dirname "$par_output")
[[ ! -d "$output_dir" ]] && mkdir -p "$output_dir"
# Init
###########################################################
echo ">> Paired-end data or not?"
mode=""
if [[ -z $par_input_r2 ]]; then
mode="se"
echo " Single end"
input="$par_input"
else
echo " Paired end"
mode="pe"
input="$par_input $par_input_r2"
fi
# Adapter arguments
# - paired and single-end
# - string and fasta
###########################################################
function add_flags {
local arg=$1
local flag=$2
local prefix=$3
[[ -z $prefix ]] && prefix=""
# This function should not be called if the input is empty
# but check for it just in case
if [[ -z $arg ]]; then
return
fi
local output=""
IFS=';' read -r -a array <<< "$arg"
for a in "${array[@]}"; do
output="$output $flag $prefix$a"
done
echo $output
}
debug ">> Parsing arguments dealing with adapters"
adapter_args=$(echo \
${par_adapter:+$(add_flags "$par_adapter" "--adapter")} \
${par_adapter_fasta:+$(add_flags "$par_adapter_fasta" "--adapter" "file:")} \
${par_front:+$(add_flags "$par_front" "--front")} \
${par_front_fasta:+$(add_flags "$par_front_fasta" "--front" "file:")} \
${par_anywhere:+$(add_flags "$par_anywhere" "--anywhere")} \
${par_anywhere_fasta:+$(add_flags "$par_anywhere_fasta" "--anywhere" "file:")} \
${par_adapter_r2:+$(add_flags "$par_adapter_r2" "-A")} \
${par_adapter_fasta_r2:+$(add_flags "$par_adapter_fasta_r2" "-A" "file:")} \
${par_front_r2:+$(add_flags "$par_front_r2" "-G")} \
${par_front_fasta_r2:+$(add_flags "$par_front_fasta_r2" "-G" "file:")} \
${par_anywhere_r2:+$(add_flags "$par_anywhere_r2" "-B")} \
${par_anywhere_fasta_r2:+$(add_flags "$par_anywhere_fasta_r2" "-B" "file:")} \
)
debug "Arguments to cutadapt:"
debug "$adapter_args"
debug
# Paired-end options
###########################################################
echo ">> Parsing arguments for paired-end reads"
[[ "$par_pair_adapters" == "false" ]] && unset par_pair_adapters
[[ "$par_interleaved" == "false" ]] && unset par_interleaved
paired_args=$(echo \
${par_pair_adapters:+--pair-adapters} \
${par_pair_filter:+--pair-filter "${par_pair_filter}"} \
${par_interleaved:+--interleaved}
)
debug "Arguments to cutadapt:"
debug $paired_args
debug
# Input arguments
###########################################################
echo ">> Parsing input arguments"
[[ "$par_no_indels" == "true" ]] && unset par_no_indels
[[ "$par_match_read_wildcards" == "false" ]] && unset par_match_read_wildcards
[[ "$par_no_match_adapter_wildcards" == "true" ]] && unset par_no_match_adapter_wildcards
[[ "$par_revcomp" == "false" ]] && unset par_revcomp
input_args=$(echo \
${par_error_rate:+--error-rate "${par_error_rate}"} \
${par_no_indels:+--no-indels} \
${par_times:+--times "${par_times}"} \
${par_overlap:+--overlap "${par_overlap}"} \
${par_match_read_wildcards:+--match-read-wildcards} \
${par_no_match_adapter_wildcards:+--no-match-adapter-wildcards} \
${par_action:+--action "${par_action}"} \
${par_revcomp:+--revcomp} \
)
debug "Arguments to cutadapt:"
debug $input_args
debug
# Read modifications
###########################################################
echo ">> Parsing read modification arguments"
[[ "$par_poly_a" == "false" ]] && unset par_poly_a
[[ "$par_trim_n" == "false" ]] && unset par_trim_n
[[ "$par_zero_cap" == "false" ]] && unset par_zero_cap
mod_args=$(echo \
${par_cut:+--cut "${par_cut}"} \
${par_cut_r2:+-U "${par_cut_r2}"} \
${par_nextseq_trim:+--nextseq-trim "${par_nextseq_trim}"} \
${par_quality_cutoff:+--quality-cutoff "${par_quality_cutoff}"} \
${par_quality_cutoff_r2:+-Q "${par_quality_cutoff_r2}"} \
${par_quality_base:+--quality-base "${par_quality_base}"} \
${par_poly_a:+--poly-a} \
${par_length:+--length "${par_length}"} \
${par_trim_n:+--trim-n} \
${par_length_tag:+--length-tag "${par_length_tag}"} \
${par_strip_suffix:+--strip-suffix "${par_strip_suffix}"} \
${par_prefix:+--prefix "${par_prefix}"} \
${par_suffix:+--suffix "${par_suffix}"} \
${par_rename:+--rename "${par_rename}"} \
${par_zero_cap:+--zero-cap} \
)
debug "Arguments to cutadapt:"
debug $mod_args
debug
# Filtering of processed reads arguments
###########################################################
echo ">> Filtering of processed reads arguments"
[[ "$par_discard_trimmed" == "false" ]] && unset par_discard_trimmed
[[ "$par_discard_untrimmed" == "false" ]] && unset par_discard_untrimmed
[[ "$par_discard_casava" == "false" ]] && unset par_discard_casava
# The minimum and maximum length arguments are passed through as given (LEN[:LEN2])
filter_args=$(echo \
${par_minimum_length:+--minimum-length "${par_minimum_length}"} \
${par_maximum_length:+--maximum-length "${par_maximum_length}"} \
${par_max_n:+--max-n "${par_max_n}"} \
${par_max_expected_errors:+--max-expected-errors "${par_max_expected_errors}"} \
${par_max_average_error_rate:+--max-average-error-rate "${par_max_average_error_rate}"} \
${par_discard_trimmed:+--discard-trimmed} \
${par_discard_untrimmed:+--discard-untrimmed} \
${par_discard_casava:+--discard-casava} \
)
debug "Arguments to cutadapt:"
debug $filter_args
debug
# Optional output arguments
###########################################################
echo ">> Optional arguments"
[[ "$par_json" == "false" ]] && unset par_json
[[ "$par_fasta" == "false" ]] && unset par_fasta
[[ "$par_info_file" == "false" ]] && unset par_info_file
optional_output_args=$(echo \
${par_report:+--report "${par_report}"} \
${par_json:+--json "report.json"} \
${par_fasta:+--fasta} \
${par_info_file:+--info-file "info.txt"} \
)
debug "Arguments to cutadapt:"
debug $optional_output_args
debug
# Output arguments
# We write the output to a directory rather than
# individual files.
###########################################################
if [[ -z $par_fasta ]]; then
ext="fastq"
else
ext="fasta"
fi
demultiplex_mode="$par_demultiplex_mode"
if [[ $mode == "se" ]]; then
if [[ "$demultiplex_mode" == "unique_dual" ]] || [[ "$demultiplex_mode" == "combinatorial_dual" ]]; then
echo "Demultiplexing dual indexes is not possible with single-end data."
exit 1
fi
prefix="trimmed_"
if [[ ! -z "$demultiplex_mode" ]]; then
prefix="{name}_"
fi
output_args=$(echo \
--output "$output_dir/${prefix}001.$ext" \
)
else
demultiplex_indicator_r1='{name}_'
demultiplex_indicator_r2=$demultiplex_indicator_r1
if [[ "$demultiplex_mode" == "combinatorial_dual" ]]; then
demultiplex_indicator_r1='{name1}_{name2}_'
demultiplex_indicator_r2='{name1}_{name2}_'
fi
prefix_r1="trimmed_"
prefix_r2="trimmed_"
if [[ ! -z "$demultiplex_mode" ]]; then
prefix_r1=$demultiplex_indicator_r1
prefix_r2=$demultiplex_indicator_r2
fi
output_args=$(echo \
--output "$output_dir/${prefix_r1}R1_001.$ext" \
--paired-output "$output_dir/${prefix_r2}R2_001.$ext" \
)
fi
debug "Arguments to cutadapt:"
debug $output_args
debug
# Full CLI
# Set the --cores argument to 0 unless meta_cpus is set
###########################################################
echo ">> Running cutadapt"
par_cpus=0
[[ ! -z $meta_cpus ]] && par_cpus=$meta_cpus
cli=$(echo \
$input \
$adapter_args \
$paired_args \
$input_args \
$mod_args \
$filter_args \
$optional_output_args \
$output_args \
--cores $par_cpus
)
debug ">> Full CLI to be run:"
debug cutadapt $cli | sed -e 's/--/\r\n --/g'
debug
cutadapt $cli
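
The `add_flags` helper in the script above turns one `;`-separated parameter value into a repeated command-line flag. A minimal standalone sketch of that splitting, assuming made-up adapter values and file names; it prints the result with `printf` rather than an unquoted `echo`, which is a small deviation from the script:

```bash
#!/bin/bash
# Same splitting logic as add_flags in the script above; the adapter values
# below are made up for illustration.
add_flags() {
  local arg=$1 flag=$2 prefix=${3:-}
  local output="" a
  local -a array
  [[ -z $arg ]] && return 0
  # Split the value on ';' into an array, one element per adapter.
  IFS=';' read -r -a array <<< "$arg"
  for a in "${array[@]}"; do
    output+=" $flag $prefix$a"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${output# }"
}

add_flags "AAAAA;CCCCC" "--adapter"
# prints: --adapter AAAAA --adapter CCCCC
add_flags "adapters1.fasta;adapters2.fasta" "-A" "file:"
# prints: -A file:adapters1.fasta -A file:adapters2.fasta
```

Because the script later expands the collected string unquoted, values containing spaces would be split again; an array-based command line would be needed to support them.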

261
src/cutadapt/test.sh Normal file

@@ -0,0 +1,261 @@
#!/bin/bash
set -eo pipefail
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
}
assert_file_empty() {
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
}
assert_file_not_empty() {
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
}
assert_file_contains() {
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
}
assert_file_not_contains() {
! grep -q "$2" "$1" || (echo "File '$1' contains '$2' but shouldn't" && exit 1)
}
#############################################
mkdir test_multiple_output
cd test_multiple_output
echo "#############################################"
echo "> Run cutadapt with multiple outputs"
cat > example.fa <<'EOF'
>read1
MYSEQUENCEADAPTER
>read2
MYSEQUENCEADAP
>read3
MYSEQUENCEADAPTERSOMETHINGELSE
>read4
MYSEQUENCEADABTER
>read5
MYSEQUENCEADAPTR
>read6
MYSEQUENCEADAPPTER
>read7
ADAPTERMYSEQUENCE
>read8
PTERMYSEQUENCE
>read9
SOMETHINGADAPTERMYSEQUENCE
EOF
"$meta_executable" \
--report minimal \
--output "out_test/*.fasta" \
--adapter ADAPTER \
--input example.fa \
--fasta \
--demultiplex_mode single \
--no_match_adapter_wildcards \
--json
echo ">> Checking output"
assert_file_exists "report.json"
assert_file_exists "out_test/1_001.fasta"
assert_file_exists "out_test/unknown_001.fasta"
cd ..
echo
#############################################
mkdir test_simple_single_end
cd test_simple_single_end
echo "#############################################"
echo "> Run cutadapt on single-end data"
cat > example.fa <<'EOF'
>read1
MYSEQUENCEADAPTER
>read2
MYSEQUENCEADAP
>read3
MYSEQUENCEADAPTERSOMETHINGELSE
>read4
MYSEQUENCEADABTER
>read5
MYSEQUENCEADAPTR
>read6
MYSEQUENCEADAPPTER
>read7
ADAPTERMYSEQUENCE
>read8
PTERMYSEQUENCE
>read9
SOMETHINGADAPTERMYSEQUENCE
EOF
"$meta_executable" \
--report minimal \
--output "out_test1/*.fasta" \
--adapter ADAPTER \
--input example.fa \
--demultiplex_mode single \
--fasta \
--no_match_adapter_wildcards \
--json
echo ">> Checking output"
assert_file_exists "report.json"
assert_file_exists "out_test1/1_001.fasta"
assert_file_exists "out_test1/unknown_001.fasta"
echo ">> Check that output is not empty"
assert_file_not_empty "report.json"
assert_file_not_empty "out_test1/1_001.fasta"
assert_file_not_empty "out_test1/unknown_001.fasta"
echo ">> Check contents"
for i in 1 2 3 7 9; do
assert_file_contains "out_test1/1_001.fasta" ">read$i"
done
for i in 4 5 6 8; do
assert_file_contains "out_test1/unknown_001.fasta" ">read$i"
done
cd ..
echo
#############################################
mkdir test_multiple_single_end
cd test_multiple_single_end
echo "#############################################"
echo "> Run with a combination of inputs"
cat > example.fa <<'EOF'
>read1
ACGTACGTACGTAAAAA
>read2
ACGTACGTACGTCCCCC
>read3
ACGTACGTACGTGGGGG
>read4
ACGTACGTACGTTTTTT
EOF
cat > adapters1.fasta <<'EOF'
>adapter1
CCCCC
EOF
cat > adapters2.fasta <<'EOF'
>adapter2
GGGGG
EOF
"$meta_executable" \
--report minimal \
--output "out_test2/*.fasta" \
--adapter AAAAA \
--adapter_fasta adapters1.fasta \
--adapter_fasta adapters2.fasta \
--demultiplex_mode single \
--input example.fa \
--fasta \
--json
echo ">> Checking output"
assert_file_exists "report.json"
assert_file_exists "out_test2/1_001.fasta"
assert_file_exists "out_test2/adapter1_001.fasta"
assert_file_exists "out_test2/adapter2_001.fasta"
assert_file_exists "out_test2/unknown_001.fasta"
echo ">> Check that output is not empty"
assert_file_not_empty "report.json"
assert_file_not_empty "out_test2/1_001.fasta"
assert_file_not_empty "out_test2/adapter1_001.fasta"
assert_file_not_empty "out_test2/adapter2_001.fasta"
assert_file_not_empty "out_test2/unknown_001.fasta"
echo ">> Check contents"
assert_file_contains "out_test2/1_001.fasta" ">read1"
assert_file_contains "out_test2/adapter1_001.fasta" ">read2"
assert_file_contains "out_test2/adapter2_001.fasta" ">read3"
assert_file_contains "out_test2/unknown_001.fasta" ">read4"
cd ..
echo
#############################################
mkdir test_simple_paired_end
cd test_simple_paired_end
echo "#############################################"
echo "> Run cutadapt on paired-end data"
cat > example_R1.fastq <<'EOF'
@read1
ACGTACGTACGTAAAAA
+
IIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTCCCCC
+
IIIIIIIIIIIIIIIII
EOF
cat > example_R2.fastq <<'EOF'
@read1
ACGTACGTACGTGGGGG
+
IIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTTTTTT
+
IIIIIIIIIIIIIIIII
EOF
"$meta_executable" \
--report minimal \
--output "out_test3/*.fastq" \
--adapter AAAAA \
--adapter_r2 GGGGG \
--input example_R1.fastq \
--input_r2 example_R2.fastq \
--quality_cutoff 20 \
--demultiplex_mode unique_dual \
--json \
---cpus 1
echo ">> Checking output"
assert_file_exists "report.json"
assert_file_exists "out_test3/1_R1_001.fastq"
assert_file_exists "out_test3/1_R2_001.fastq"
assert_file_exists "out_test3/unknown_R1_001.fastq"
assert_file_exists "out_test3/unknown_R2_001.fastq"
echo ">> Check that output is not empty"
assert_file_not_empty "report.json"
assert_file_not_empty "out_test3/1_R1_001.fastq"
assert_file_not_empty "out_test3/1_R2_001.fastq"
assert_file_not_empty "out_test3/unknown_R1_001.fastq"
echo ">> Check contents"
assert_file_contains "out_test3/1_R1_001.fastq" "@read1"
assert_file_contains "out_test3/1_R2_001.fastq" "@read1"
assert_file_contains "out_test3/unknown_R1_001.fastq" "@read2"
assert_file_contains "out_test3/unknown_R2_001.fastq" "@read2"
cd ..
echo
#############################################
echo "#############################################"
echo "> Test successful"
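
The assertion helpers at the top of this test rely on `set -e` to stop the run once a check ends with a non-zero status. A minimal sketch of the same idea using `return` instead of a subshell, with sample file contents that are purely illustrative; the negated `grep` keeps the "not contains" check from failing when the pattern is absent:

```bash
#!/bin/bash
set -e

assert_file_contains() {
  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" >&2; return 1; }
}

assert_file_not_contains() {
  # '! grep' succeeds when the pattern is absent, so the function returns 0.
  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" >&2; return 1; }
}

# Illustrative data, not taken from the test suite.
sample=$(mktemp)
printf '>read1\nACGT\n' > "$sample"

assert_file_contains "$sample" ">read1"
assert_file_not_contains "$sample" ">read2"
echo "all assertions passed"
rm -f "$sample"
```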

196
src/falco/config.vsh.yaml Normal file

@@ -0,0 +1,196 @@
name: falco
description: A C++ drop-in replacement of FastQC to assess the quality of sequence read data
keywords: [qc, fastqc, sequencing]
links:
documentation: https://falco.readthedocs.io/en/latest/
repository: https://github.com/smithlabcode/falco
references:
doi: 10.12688/f1000research.21142.2
license: GPL-3.0
requirements:
commands: [falco]
# Notes:
# - falco has single-dash arguments such as -subsample; we expose those as --subsample
# - The outdir argument is not required
# - The input argument in falco is positional but we changed this to --input
argument_groups:
- name: Input arguments
arguments:
- name: --input
required: true
type: file
multiple: true
description: input fastq files
example: input1.fastq;input2.fastq
- name: Run arguments
arguments:
- name: --nogroup
type: boolean_true
description: |
Disable grouping of bases for reads >50bp.
All reports will show data for every base in
the read. WARNING: When using this option,
your plots may end up a ridiculous size. You
have been warned!
- name: --contaminants
type: file
description: |
Specifies a non-default file which contains
the list of contaminants to screen
overrepresented sequences against. The file
must contain sets of named contaminants in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
https://github.com/smithlabcode/falco/blob/v1.2.2/Configuration/contaminant_list.txt
- name: --adapters
type: file
description: |
Specifies a non-default file which contains
the list of adapter sequences which will be
explicitly searched against the library. The
file must contain sets of named adapters in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
https://github.com/smithlabcode/falco/blob/v1.2.2/Configuration/adapter_list.txt
- name: --limits
type: file
description: |
Specifies a non-default file which contains
a set of criteria which will be used to
determine the warn/error limits for the
various modules. This file can also be used
to selectively remove some modules from the
output altogether. The format needs to
mirror the default limits.txt file found in
the Configuration folder. Default:
https://github.com/smithlabcode/falco/blob/v1.2.2/Configuration/limits.txt
- name: --subsample
alternatives: [-s]
type: integer
example: 10
description: |
[Falco only] makes falco faster (but
possibly less accurate) by only processing
reads that are a multiple of this value (using
0-based indexing to number reads).
- name: --bisulfite
alternatives: [-b]
type: boolean_true
description: |
[Falco only] reads are whole genome
bisulfite sequencing, and more Ts and fewer
Cs are therefore expected and will be
accounted for in base content.
- name: --reverse_complement
alternatives: [-r]
type: boolean_true
description: |
[Falco only] The input is a
reverse-complement. All modules will be
tested by swapping A/T and C/G
- name: Output arguments
arguments:
- name: --outdir
alternatives: [-o]
required: true
type: file
direction: output
description: |
Create all output files in the specified
output directory. FALCO-SPECIFIC: If the
directory does not exist, the program will
create it.
example: output
- name: --format
type: string
choices: [bam, sam, bam_mapped, sam_mapped, fastq, fq, fastq.gz, fq.gz]
alternatives: ["-f"]
description: |
Bypasses the normal sequence file format
detection and forces the program to use the
specified format. Valid formats are bam, sam,
bam_mapped, sam_mapped, fastq, fq, fastq.gz
or fq.gz.
- name: --data_filename
alternatives: [-D]
type: file
direction: output
description: |
[Falco only] Specify filename for FastQC
data output (TXT). If not specified, it will
be called fastqc_data.txt in either the input
file's directory or the one specified in the
--outdir flag. Only available when running
falco with a single input.
- name: --report_filename
alternatives: [-R]
type: file
direction: output
description: |
[Falco only] Specify filename for FastQC
report output (HTML). If not specified, it
will be called fastqc_report.html in either
the input file's directory or the one
specified in the --outdir flag. Only
available when running falco with a single
input.
- name: --summary_filename
alternatives: [-S]
type: file
direction: output
description: |
[Falco only] Specify filename for the short
summary output (TXT). If not specified, it
will be called summary.txt in either
the input file's directory or the one
specified in the --outdir flag. Only
available when running falco with a single
input.
# Arguments not taken into account:
#
# -skip-data [Falco only] Do not create FastQC data text
# file.
# -skip-report [Falco only] Do not create FastQC report
# HTML file.
# -skip-summary [Falco only] Do not create FastQC summary
# file
# -K, -add-call [Falco only] add the command call to
# FastQC data output and FastQC report HTML
# (this may break the parse of fastqc_data.txt
# in programs that are very strict about the
# FastQC output format).
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: debian:trixie-slim
setup:
- type: apt
packages: [wget, build-essential, g++, zlib1g-dev, procps]
- type: docker
run: |
wget https://github.com/smithlabcode/falco/releases/download/v1.2.2/falco-1.2.2.tar.gz -O /tmp/falco.tar.gz && \
cd /tmp && \
tar xvf falco.tar.gz && \
cd falco-1.2.2 && \
./configure && \
make all && \
make install
- type: docker
run: |
echo "falco: \"$(falco -v | sed -n 's/^falco //p')\"" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
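
Viash exposes each argument `--name` to the script as a variable `par_name`, so a misspelled argument name in the config silently leaves the script's variable empty. A rough sketch of a consistency check, assuming a hypothetical helper `find_unread_args` and small inline stand-ins for the config and the script:

```bash
#!/bin/bash
# find_unread_args CONFIG SCRIPT
# Hypothetical helper: print every '--name' argument declared in CONFIG
# for which SCRIPT never mentions the variable 'par_name'.
find_unread_args() {
  local config=$1 script=$2 name
  grep -oE '^[[:space:]]*- name: --[A-Za-z0-9_]+' "$config" \
    | sed -E 's/^.*--//' \
    | while read -r name; do
        grep -qw "par_${name}" "$script" || echo "$name"
      done
}

# Inline stand-ins; '--adapterz' is a deliberate misspelling.
workdir=$(mktemp -d)
cat > "$workdir/config.yaml" <<'EOF'
- name: --input
- name: --adapterz
- name: --limits
EOF
cat > "$workdir/script.sh" <<'EOF'
falco ${par_adapters:+--adapters "$par_adapters"} \
  ${par_limits:+--limits "$par_limits"} $par_input
EOF

find_unread_args "$workdir/config.yaml" "$workdir/script.sh"
rm -r "$workdir"
```

The check is purely textual, so it would also flag arguments that are intentionally unused; it is meant as a quick lint rather than a validation of the component.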

156
src/falco/help.txt Normal file

@@ -0,0 +1,156 @@
Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...
Options:
-h, --help Print this help file and exit
-v, --version Print the version of the program and exit
-o, --outdir Create all output files in the specified
output directory. FALCO-SPECIFIC: If the
directory does not exists, the program will
create it. If this option is not set then
the output file for each sequence file is
created in the same directory as the
sequence file which was processed.
--casava [IGNORED BY FALCO] Files come from raw
casava output. Files in the same sample
group (differing only by the group number)
will be analysed as a set rather than
individually. Sequences with the filter flag
set in the header will be excluded from the
analysis. Files must have the same names
given to them by casava (including being
gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano [IGNORED BY FALCO] Files come from nanopore
sequences and are in fast5 format. In this
mode you can pass in directories to process
and the program will take in all fast5 files
within those directories and produce a
single output file from the sequences found
in all files.
--nofilter [IGNORED BY FALCO] If running with --casava
then don't remove read flagged by casava as
poor quality when performing the QC
analysis.
--extract [ALWAYS ON IN FALCO] If set then the zipped
output file will be uncompressed in the same
directory after it has been created. By
default this option will be set if fastqc is
run in non-interactive mode.
-j, --java [IGNORED BY FALCO] Provides the full path to
the java binary you want to use to launch
fastqc. If not supplied then java is assumed
to be in your path.
--noextract [IGNORED BY FALCO] Do not uncompress the
output file after creating it. You should
set this option if you do not wish to
uncompress the output when running in
non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp.
All reports will show data for every base in
the read. WARNING: When using this option,
your plots may end up a ridiculous size. You
have been warned!
--min_length [NOT YET IMPLEMENTED IN FALCO] Sets an
artificial lower limit on the length of the
sequence to be shown in the report. As long
as you set this to a value greater or equal
to your longest read length then this will
be the sequence length used to create your
read groups. This can be useful for making
directly comaparable statistics from
datasets with somewhat variable read
lengths.
-f, --format Bypasses the normal sequence file format
detection and forces the program to use the
specified format. Validformats are bam, sam,
bam_mapped, sam_mapped, fastq, fq, fastq.gz
or fq.gz.
-t, --threads [NOT YET IMPLEMENTED IN FALCO] Specifies the
number of files which can be processed
simultaneously. Each thread will be
allocated 250MB of memory so you shouldn't
run more threads than your available memory
will cope with, and not more than 6 threads
on a 32 bit machine [1]
-c, --contaminants Specifies a non-default file which contains
the list of contaminants to screen
overrepresented sequences against. The file
must contain sets of named contaminants in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/tmp/falco-1.2.2/Configuration/contaminant_list.txt
-a, --adapters Specifies a non-default file which contains
the list of adapter sequences which will be
explicity searched against the library. The
file must contain sets of named adapters in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/tmp/falco-1.2.2/Configuration/adapter_list.txt
-l, --limits Specifies a non-default file which contains
a set of criteria which will be used to
determine the warn/error limits for the
various modules. This file can also be used
to selectively remove some modules from the
output all together. The format needs to
mirror the default limits.txt file found in
the Configuration folder. Default:
/tmp/falco-1.2.2/Configuration/limits.txt
-k, --kmers [IGNORED BY FALCO AND ALWAYS SET TO 7]
Specifies the length of Kmer to look for in
the Kmer content module. Specified Kmer
length must be between 2 and 10. Default
length is 7 if not specified.
-q, --quiet Supress all progress messages on stdout and
only report errors.
-d, --dir [IGNORED: FALCO DOES NOT CREATE TMP FILES]
Selects a directory to be used for temporary
files written when generating report images.
Defaults to system temp directory if not
specified.
-s, -subsample [Falco only] makes falco faster (but
possibly less accurate) by only processing
reads that are multiple of this value (using
0-based indexing to number reads). [1]
-b, -bisulfite [Falco only] reads are whole genome
bisulfite sequencing, and more Ts and fewer
Cs are therefore expected and will be
accounted for in base content.
-r, -reverse-complement [Falco only] The input is a
reverse-complement. All modules will be
tested by swapping A/T and C/G
-skip-data [Falco only] Do not create FastQC data text
file.
-skip-report [Falco only] Do not create FastQC report
HTML file.
-skip-summary [Falco only] Do not create FastQC summary
file
-D, -data-filename [Falco only] Specify filename for FastQC
data output (TXT). If not specified, it will
be called fastq_data.txt in either the input
file's directory or the one specified in the
--output flag. Only available when running
falco with a single input.
-R, -report-filename [Falco only] Specify filename for FastQC
report output (HTML). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-S, -summary-filename [Falco only] Specify filename for the short
summary output (TXT). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-K, -add-call [Falco only] add the command call call to
FastQC data output and FastQC report HTML
(this may break the parse of fastqc_data.txt
in programs that are very strict about the
FastQC output format).
Help options:
-?, -help print this help message
-about print about message
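
The -subsample option above is described as processing only reads whose 0-based index is a multiple of the given value. A minimal sketch of that selection rule with awk, assuming four-line FASTQ records; it illustrates the rule as worded in the help text and is not falco's implementation (the function name `subsample_fastq` is hypothetical):

```bash
#!/bin/bash
# subsample_fastq N < reads.fastq
# Keep every record whose 0-based index is a multiple of N.
subsample_fastq() {
  awk -v n="$1" 'int((NR - 1) / 4) % n == 0'
}

# Four illustrative reads (r0..r3); with N=2 the reads at index 0 and 2 are kept.
printf '@r%s\nACGT\n+\nIIII\n' 0 1 2 3 | subsample_fastq 2 | grep '^@'
```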

24
src/falco/script.sh Normal file

@@ -0,0 +1,24 @@
#!/bin/bash
set -eo pipefail
# Boolean parameters arrive as "true"/"false"; unset them when false so that
# the ${var:+...} expansions below only emit the flag when it was requested.
[[ "$par_nogroup" == "false" ]] && unset par_nogroup
[[ "$par_bisulfite" == "false" ]] && unset par_bisulfite
[[ "$par_reverse_complement" == "false" ]] && unset par_reverse_complement
# Split the ';'-separated list of input files into an array
IFS=";" read -ra input <<< "$par_input"
$(which falco) \
${par_nogroup:+--nogroup} \
${par_contaminants:+--contaminants "$par_contaminants"} \
${par_adapters:+--adapters "$par_adapters"} \
${par_limits:+--limits "$par_limits"} \
${par_subsample:+-subsample "$par_subsample"} \
${par_bisulfite:+-bisulfite} \
${par_reverse_complement:+-reverse-complement} \
${par_outdir:+--outdir "$par_outdir"} \
${par_format:+--format "$par_format"} \
${par_data_filename:+-data-filename "$par_data_filename"} \
${par_report_filename:+-report-filename "$par_report_filename"} \
${par_summary_filename:+-summary-filename "$par_summary_filename"} \
"${input[@]}"
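
The script above relies on `${var:+...}` expansions, which emit a flag only when the variable is set and non-empty; that is why boolean parameters equal to "false" are unset first. A small sketch of the pattern collected into a bash array, with `par_*` values that are illustrative stand-ins for what Viash would provide and a hypothetical function name `build_falco_args`:

```bash
#!/bin/bash
# Stand-in values; in the component these come from Viash.
par_nogroup="false"
par_subsample="10"
par_input="a.fastq;b.fastq"

build_falco_args() {
  local nogroup=$par_nogroup
  local -a args input
  # A "false" boolean must become empty, otherwise ${nogroup:+...} would fire.
  [[ "$nogroup" == "false" ]] && nogroup=""
  args=(
    ${nogroup:+--nogroup}
    ${par_subsample:+-subsample "$par_subsample"}
  )
  # Positional inputs go last, one array element per file.
  IFS=';' read -r -a input <<< "$par_input"
  args+=("${input[@]}")
  printf '%s\n' "${args[@]}"
}

build_falco_args
```

Keeping the arguments in an array preserves file names that contain spaces, which a flat string of arguments cannot do.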

79
src/falco/test.sh Normal file

@@ -0,0 +1,79 @@
#!/bin/bash
set -e
echo "> Prepare test data"
# We use data from this repo: https://github.com/hartwigmedical/testData
echo ">> Fetching and preparing test data"
fastq1="https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz"
fastq2="https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R2_001.fastq.gz"
TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -r "$TMPDIR"
}
trap clean_up EXIT
test_data_dir="$TMPDIR/test_data"
mkdir "$test_data_dir"
wget -q "$fastq1" -O "$test_data_dir/R1.fastq.gz"
wget -q "$fastq2" -O "$test_data_dir/R2.fastq.gz"
echo ">> Run falco on test data, output to dir"
echo ">>> Run falco"
$meta_executable \
--input "$test_data_dir/R1.fastq.gz;$test_data_dir/R2.fastq.gz" \
--outdir "$TMPDIR/output1"
echo ">>> Checking whether output exists"
[ ! -d "$TMPDIR/output1" ] && echo "Output directory not created" && exit 1
[ ! -f "$TMPDIR/output1/R1.fastq.gz_fastqc_report.html" ] && echo "Report not created" && exit 1
[ ! -f "$TMPDIR/output1/R1.fastq.gz_summary.txt" ] && echo "Summary not created" && exit 1
[ ! -f "$TMPDIR/output1/R1.fastq.gz_fastqc_data.txt" ] && echo "fastqc_data not created" && exit 1
[ ! -f "$TMPDIR/output1/R2.fastq.gz_fastqc_report.html" ] && echo "Report not created" && exit 1
[ ! -f "$TMPDIR/output1/R2.fastq.gz_summary.txt" ] && echo "Summary not created" && exit 1
[ ! -f "$TMPDIR/output1/R2.fastq.gz_fastqc_data.txt" ] && echo "fastqc_data not created" && exit 1
echo ">>> cleanup"
rm -rf "$TMPDIR/output1"
echo ">> Run falco on test data, output to individual files"
echo ">>> Please note this is only possible for 1 input fastq file!"
echo ">>> Run falco"
$meta_executable \
--input "$test_data_dir/R1.fastq.gz" \
--data_filename "$TMPDIR/output2/data.txt" \
--report_filename "$TMPDIR/output2/report.html" \
--summary_filename "$TMPDIR/output2/summary.txt" \
--outdir "$TMPDIR/output2/"
echo ">>> Checking whether output exists"
[ ! -d "$TMPDIR/output2" ] && echo "Output directory not created" && exit 1
[ ! -f "$TMPDIR/output2/report.html" ] && echo "Report not created" && exit 1
[ ! -f "$TMPDIR/output2/summary.txt" ] && echo "Summary not created" && exit 1
[ ! -f "$TMPDIR/output2/data.txt" ] && echo "fastqc_data not created" && exit 1
echo ">>> cleanup"
rm -rf "$TMPDIR/output2/"
echo ">> Run falco on test data, subsample"
echo ">>> Run falco"
$meta_executable \
--input "$test_data_dir/R1.fastq.gz" \
--data_filename "$TMPDIR/output3/data.txt" \
--report_filename "$TMPDIR/output3/report.html" \
--summary_filename "$TMPDIR/output3/summary.txt" \
--subsample 100 \
--outdir "$TMPDIR/output3"
echo ">>> Checking whether output exists"
[ ! -d "$TMPDIR/output3" ] && echo "Output directory not created" && exit 1
[ ! -f "$TMPDIR/output3/report.html" ] && echo "Report not created" && exit 1
[ ! -f "$TMPDIR/output3/summary.txt" ] && echo "Summary not created" && exit 1
[ ! -f "$TMPDIR/output3/data.txt" ] && echo "fastqc_data not created" && exit 1
echo ">>> cleanup"
rm -rf "$TMPDIR/output3/"
echo "All tests succeeded!"
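The test script repeats the same `[ ! -f ... ] && echo ... && exit 1` check for every expected output. One possible way to reduce that repetition is a small helper function. The sketch below is only an illustration: `assert_file_exists` and `demo_dir` are hypothetical names that do not appear in the component.

```shell
#!/bin/bash
# Hypothetical helper: print a message and return non-zero if a file is missing.
assert_file_exists() {
  if [ ! -f "$1" ]; then
    echo "File '$1' does not exist" >&2
    return 1
  fi
}

# Demonstration on a throwaway directory (not real falco output).
demo_dir=$(mktemp -d)
touch "$demo_dir/report.html"

# Succeeds because the file was just created.
assert_file_exists "$demo_dir/report.html" && echo "report found"

# Fails because the file was never created; the message goes to stderr.
assert_file_exists "$demo_dir/summary.txt" || echo "summary missing"

rm -r "$demo_dir"
```

In a test script running under `set -e`, calling the helper on its own line would abort the test at the first missing file.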

576
src/fastp/config.vsh.yaml Normal file

@@ -0,0 +1,576 @@
name: fastp
description: |
An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...).
Features:
- comprehensive quality profiling for both before and after filtering data (quality curves, base contents, KMER, Q20/Q30, GC Ratio, duplication, adapter contents...)
- filter out bad reads (too low quality, too short, or too many N...)
- cut low quality bases per read at its 5' and 3' ends by evaluating the mean quality from a sliding window (like Trimmomatic but faster).
- trim all reads in front and tail
- cut adapters. Adapter sequences can be automatically detected, which means you don't have to input the adapter sequences to trim them.
- correct mismatched base pairs in overlapped regions of paired end reads, if one base has high quality while the other has ultra-low quality
- trim polyG in 3' ends, which is commonly seen in NovaSeq/NextSeq data. Trim polyX in 3' ends to remove unwanted polyX tailing (e.g. polyA tailing for mRNA-Seq data)
- preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name.
- report JSON format result for further interpreting.
- visualize quality control and filtering results on a single HTML page (like FASTQC but faster and more informative).
- split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing. Two modes can be used: limiting the total number of split files, or limiting the lines of each split file.
- support long reads (data from PacBio / Nanopore devices).
- support reading from STDIN and writing to STDOUT
- support interleaved input
- support ultra-fast FASTQ-level deduplication
keywords: [RNA-Seq, Trimming, Quality control]
links:
repository: https://github.com/OpenGene/fastp
documentation: https://github.com/OpenGene/fastp/blob/master/README.md
references:
doi: "10.1093/bioinformatics/bty560"
license: MIT
argument_groups:
- name: Inputs
description: |
`fastp` supports both single-end (SE) and paired-end (PE) input.
- for SE data, you only have to specify read1 input by `-i` or `--in1`.
- for PE data, you should also specify read2 input by `-I` or `--in2`.
arguments:
- name: --in1
alternatives: [-i]
type: file
description: Input FastQ file. Must be single-end or paired-end R1. Can be gzipped.
required: true
example: in.R1.fq.gz
- name: --in2
alternatives: [-I]
type: file
description: Input FastQ file. Must be paired-end R2. Can be gzipped.
required: false
example: in.R2.fq.gz
- name: Outputs
description: |
- for SE data, you only have to specify read1 output by `-o` or `--out1`.
- for PE data, you should also specify read2 output by `-O` or `--out2`.
- if you don't specify the output file names, no output files will be written, but the QC will still be done for both data before and after filtering.
- the output will be gzip-compressed if its file name ends with `.gz`
arguments:
- name: --out1
alternatives: [-o]
type: file
description: The single-end or paired-end R1 reads that pass QC. Will be gzipped if its file name ends with `.gz`.
required: true
example: out.R1.fq.gz
direction: output
- name: --out2
alternatives: [-O]
type: file
description: The paired-end R2 reads that pass QC. Will be gzipped if its file name ends with `.gz`.
required: false
example: out.R2.fq.gz
direction: output
- name: --unpaired1
type: file
description: Store the reads where `read1` passes the filters but its paired `read2` doesn't.
required: false
example: unpaired.R1.fq.gz
direction: output
- name: --unpaired2
type: file
description: Store the reads where `read2` passes the filters but its paired `read1` doesn't.
required: false
example: unpaired.R2.fq.gz
direction: output
- name: --failed_out
type: file
description: |
Store the reads that fail filters.
If one read failed and is written to --failed_out, its failure reason will be appended to its read name. For example, failed_quality_filter, failed_too_short etc.
For PE data, if unpaired reads are not stored (by giving --unpaired1 or --unpaired2), the failed pair of reads will be put together. If one read passes the filters but its pair doesn't, the failure reason will be paired_read_is_failing.
required: false
example: failed.fq.gz
direction: output
- name: --overlapped_out
type: file
description: |
For each read pair, output the overlapped region if it contains no mismatched bases.
direction: output
- name: Report output arguments
arguments:
- name: --json
alternatives: [-j]
type: file
description: |
The json format report file name
example: out.json
direction: output
- name: --html
type: file
description: |
The html format report file name
example: out.html
direction: output
- name: --report_title
type: string
description: |
The title of the html report, default is "fastp report".
example: fastp report
- name: Adapter trimming
description: |
Adapter trimming is enabled by default, but you can disable it by `-A` or `--disable_adapter_trimming`. Adapter sequences can be automatically detected for both PE/SE data.
- For SE data, the adapters are evaluated by analyzing the tails of the first ~1M reads. This evaluation may be inaccurate, so you can specify the adapter sequence with the `-a` or `--adapter_sequence` option. If an adapter sequence is specified, auto-detection for SE data is disabled.
- For PE data, the adapters can be detected by per-read overlap analysis, which looks for the overlap of each pair of reads. This method is robust and fast, so normally you don't have to input the adapter sequence even if you know it. But you can still specify the adapter sequences for read1 by `--adapter_sequence`, and for read2 by `--adapter_sequence_r2`. If `fastp` fails to find an overlap (e.g. due to low quality bases), it will use these sequences to trim adapters for read1 and read2 respectively.
- For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify `--detect_adapter_for_pe` to enable it.
- For PE data, `fastp` will run a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
- The most widely used adapters are the Illumina TruSeq adapters. If your data is from a TruSeq library, you can add `--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT` to your command line, or enable auto-detection for PE data by specifying `--detect_adapter_for_pe`.
- `fastp` contains some built-in known adapter sequences for better auto-detection. If you want to make some adapters to be a part of the built-in adapters, please file an issue.
You can also specify --adapter_fasta to give a FASTA file to tell fastp to trim multiple adapters in this FASTA file. Here is a sample of such adapter FASTA file:
```
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>polyA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
```
Each adapter sequence in this file should be at least 6bp long, otherwise it will be skipped. You can also give any other sequence you want to trim, not only regular sequencing adapters (e.g. polyA).
`fastp` first trims the auto-detected adapter or the adapter sequences given by `--adapter_sequence | --adapter_sequence_r2`, then trims the adapters given by `--adapter_fasta` one by one.
The sequence distribution of trimmed adapters can be found in the HTML/JSON reports.
arguments:
- name: --disable_adapter_trimming
alternatives: [-A]
type: boolean_true
description: |
Disable adapter trimming.
- name: --detect_adapter_for_pe
type: boolean_true
description: |
By default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.
- name: --adapter_sequence
alternatives: [-a]
type: string
description: |
The adapter sequences to be trimmed. For SE data, if not specified, the adapters will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped
- name: --adapter_sequence_r2
type: string
description: |
The adapter sequences to be trimmed for R2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as --adapter_sequence.
- name: --adapter_fasta
type: file
description: |
A FASTA file containing adapter sequences. Both read1 and read2 (if PE) are trimmed by all the sequences in this file.
- name: Base trimming
arguments:
- name: --trim_front1
alternatives: [-f]
type: integer
description: |
Trimming how many bases in front for read1, default is 0.
example: 0
- name: --trim_tail1
alternatives: [-t]
type: integer
description: |
Trimming how many bases in tail for read1, default is 0.
example: 0
- name: --max_len1
alternatives: [-b]
type: integer
min: 0
description: |
If read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation.
- name: --trim_front2
alternatives: [-F]
type: integer
description: |
Trimming how many bases in front for read2. If it's not specified, it will follow read1's settings.
example: 0
- name: --trim_tail2
alternatives: [-T]
type: integer
description: |
Trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings.
example: 0
- name: --max_len2
alternatives: [-B]
type: integer
min: 0
description: |
If read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings.
- name: Merging mode
description: Allows merging paired-end reads into a single longer read if they are overlapping.
arguments:
- name: --merge
alternatives: [-m]
type: boolean_true
description: |
For paired-end input, merge each pair of reads into a single read if they are overlapped. The merged reads will be written to the file given by --merged_out, the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.
- name: --merged_out
type: file
description: |
In the merging mode, specify the file name to store merged output, or specify --stdout to stream the merged output.
direction: output
example: merged.fq.gz
- name: --include_unmerged
type: boolean_true
description: |
In the merging mode, write the unmerged or unpaired reads to the file specified by --merged_out. Disabled by default.
- name: Additional input arguments
description: Affects how the input is read.
arguments:
- name: --interleaved_in
type: boolean_true
description: |
Indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
- name: --fix_mgi_id
type: boolean_true
description: |
The MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.
- name: --phred64
alternatives: ["-6"]
type: boolean_true
description: |
Indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
- name: Additional output arguments
description: Affects how the output is written.
arguments:
- name: --compression
alternatives: ["-z"]
type: integer
description: |
Compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4.
example: 4
min: 1
max: 9
- name: --dont_overwrite
type: boolean_true
description: |
Don't overwrite existing files. Overwriting is allowed by default.
- name: Logging arguments
arguments:
- name: --verbose
alternatives: [-V]
type: boolean_true
description: Output verbose log information (e.g. when every 1M reads are processed).
- name: Processing arguments
arguments:
- name: --reads_to_process
type: long
description: |
Specify how many reads/pairs to be processed. Default 0 means process all reads.
example: 1000000
min: 0
- name: Deduplication arguments
arguments:
- name: --dedup
type: boolean_true
description: |
Enable deduplication to drop the duplicated reads/pairs
- name: --dup_calc_accuracy
type: integer
description: |
Accuracy level to calculate duplication (1~6). Higher level uses more memory (1G, 2G, 4G, 8G, 16G, 24G). Default 1 for no-dedup mode, and 3 for dedup mode.
example: 3
min: 1
max: 6
- name: --dont_eval_duplication
type: boolean_true
description: |
Don't evaluate duplication rate to save time and use less memory.
- name: PolyG tail trimming arguments
arguments:
- name: --trim_poly_g
alternatives: [-g]
type: boolean_true
description: |
Force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
- name: --poly_g_min_len
type: integer
description: |
The minimum length to detect polyG in the read tail. 10 by default.
example: 10
min: 1
- name: --disable_trim_poly_g
alternatives: [-G]
type: boolean_true
description: |
Disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
- name: PolyX tail trimming arguments
arguments:
- name: --trim_poly_x
alternatives: [-x]
type: boolean_true
description: |
Enable polyX trimming in 3' ends.
- name: --poly_x_min_len
type: integer
description: |
The minimum length to detect polyX in the read tail. 10 by default.
example: 10
min: 1
- name: Cut arguments
arguments:
- name: --cut_front
alternatives: ["-5"]
type: integer
description: |
Move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
- name: --cut_tail
alternatives: ["-3"]
type: integer
description: |
Move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
- name: --cut_right
alternatives: ["-r"]
type: integer
description: |
Move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.
- name: --cut_window_size
alternatives: ["-W"]
type: integer
description: |
The window size option shared by cut_front, cut_tail or cut_right. Range: 1~1000, default: 4.
example: 4
min: 1
- name: --cut_mean_quality
alternatives: ["-M"]
type: integer
description: |
The mean quality requirement option shared by cut_front, cut_tail or cut_right. Range: 1~36, default: 20 (Q20).
example: 20
min: 0
- name: --cut_front_window_size
type: integer
description: |
The window size option of cut_front, default to cut_window_size if not specified.
example: 4
min: 1
- name: --cut_front_mean_quality
type: integer
description: |
The mean quality requirement option of cut_front, default to cut_mean_quality if not specified.
example: 20
min: 0
- name: --cut_tail_window_size
type: integer
description: |
The window size option of cut_tail, default to cut_window_size if not specified.
example: 4
min: 1
- name: --cut_tail_mean_quality
type: integer
description: |
The mean quality requirement option of cut_tail, default to cut_mean_quality if not specified.
example: 20
min: 0
- name: --cut_right_window_size
type: integer
description: |
The window size option of cut_right, default to cut_window_size if not specified.
example: 4
min: 1
- name: --cut_right_mean_quality
type: integer
description: |
The mean quality requirement option of cut_right, default to cut_mean_quality if not specified.
example: 20
min: 0
- name: Quality filtering arguments
arguments:
- name: --disable_quality_filtering
alternatives: [-Q]
type: boolean_true
description: |
Quality filtering is enabled by default. If this option is specified, quality filtering is disabled.
- name: --qualified_quality_phred
alternatives: [-q]
type: integer
description: |
The quality value at or above which a base is considered qualified. Default 15 means phred quality >=Q15 is qualified.
example: 15
min: 0
- name: --unqualified_percent_limit
alternatives: [-u]
type: integer
description: |
The percentage of bases allowed to be unqualified (0~100). Default 40 means 40%.
example: 40
min: 0
max: 100
- name: --n_base_limit
alternatives: [-n]
type: integer
description: |
If a read contains more than n_base_limit N bases, then this read/pair is discarded. Default is 5.
example: 5
min: 0
- name: --average_qual
alternatives: [-e]
type: integer
description: |
If a read's average quality score is below average_qual, then this read/pair is discarded. Default 0 means no requirement.
example: 0
min: 0
- name: Length filtering arguments
arguments:
- name: --disable_length_filtering
alternatives: [-L]
type: boolean_true
description: |
Length filtering is enabled by default. If this option is specified, length filtering is disabled.
- name: --length_required
alternatives: [-l]
type: integer
description: |
Reads shorter than length_required will be discarded, default is 15.
example: 15
min: 0
- name: --length_limit
type: integer
description: |
Reads longer than length_limit will be discarded, default 0 means no limitation.
example: 0
min: 0
- name: Low complexity filtering arguments
arguments:
- name: --low_complexity_filter
alternatives: [-y]
type: boolean_true
description: |
Enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
- name: --complexity_threshold
alternatives: [-Y]
type: integer
description: |
The threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required.
example: 30
min: 0
- name: Index filtering arguments
arguments:
- name: --filter_by_index1
type: file
description: |
Specify a file containing a list of index1 barcodes to be filtered out, one barcode per line.
- name: --filter_by_index2
type: file
description: |
Specify a file containing a list of index2 barcodes to be filtered out, one barcode per line.
- name: --filter_by_index_threshold
type: integer
description: |
The allowed difference of index barcode for index filtering, default 0 means completely identical.
example: 0
min: 0
- name: Overlapped region correction
arguments:
- type: boolean_true
name: --correction
alternatives: [-c]
description: |
Enable base correction in overlapped regions (only for PE data), default is disabled.
- name: --overlap_len_require
type: integer
description: |
The minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default.
example: 30
min: 0
- name: --overlap_diff_limit
type: integer
description: |
The maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default.
example: 5
min: 0
- name: --overlap_diff_percent_limit
type: integer
description: |
The maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%.
example: 20
min: 0
max: 100
- name: UMI arguments
arguments:
- name: --umi
alternatives: [-U]
type: boolean_true
description: |
Enable unique molecular identifier (UMI) preprocessing.
- name: --umi_loc
type: string
description: |
Specify the location of the UMI. Can be one of index1/index2/read1/read2/per_index/per_read. Default is none.
choices: [index1, index2, read1, read2, per_index, per_read]
- name: --umi_len
type: integer
description: |
If the UMI is in read1/read2, its length should be provided.
example: 0
min: 0
- name: --umi_prefix
type: string
description: |
If specified, an underscore will be used to connect prefix and UMI (e.g. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default.
- name: --umi_skip
type: integer
description: |
If the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0.
example: 0
min: 0
- name: --umi_delim
type: string
description: |
Delimiter to use between the read name and the UMI. Default is `:`.
- name: Overrepresentation analysis arguments
arguments:
- name: --overrepresentation_analysis
alternatives: [-p]
type: boolean_true
description: |
Enable overrepresentation analysis.
- name: --overrepresentation_sampling
type: integer
description: |
One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20.
example: 20
min: 1
# # would need to set all outputs to multiple: true
# - name: Split arguments
# arguments:
# - name: --split
# alternatives: [-s]
# type: boolean_true
# description: |
# Split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default.
# - name: --split_by_lines
# alternatives: [-S]
# type: long
# description: |
# Split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default.
# - name: --split_prefix_digits
# type: integer
# description: |
# The digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding.
# example: 4
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/fastp:0.23.4--hadf994f_2
setup:
- type: docker
run: |
fastp --version 2>&1 | sed 's# #: "#;s#$#"#' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
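The Docker setup step in the config above pipes the version output through `sed` to produce a `name: "version"` line for `/var/software_versions.txt`. The sketch below applies the same `sed` expression to a sample string so the transformation can be checked without the container. The sample line `fastp 0.23.4` is an assumption about what `fastp --version` prints for this image.

```shell
#!/bin/bash
# Sample version line; assumed to match the output of `fastp --version`.
version_line="fastp 0.23.4"

# First substitution: replace the first space with `: "`.
# Second substitution: append a closing quote at the end of the line.
formatted=$(printf '%s\n' "$version_line" | sed 's# #: "#;s#$#"#')

echo "$formatted"
```

For the sample input this prints `fastp: "0.23.4"`. The `2>&1` in the setup step is there because the redirect has to capture the version string regardless of which stream the tool writes it to.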

93
src/fastp/help.txt Normal file

@@ -0,0 +1,93 @@
```bash
fastp --help
```
usage: fastp [options] ...
options:
-i, --in1 read1 input file name (string [=])
-o, --out1 read1 output file name (string [=])
-I, --in2 read2 input file name (string [=])
-O, --out2 read2 output file name (string [=])
--unpaired1 for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
--unpaired2 for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
--overlapped_out for each read pair, output the overlapped region if it has no any mismatched base. (string [=])
--failed_out specify the file to store reads that cannot pass the filters. (string [=])
-m, --merge for paired-end input, merge each pair of reads into a single read if they are overlapped. The merged reads will be written to the file given by --merged_out, the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.
--merged_out in the merging mode, specify the file name to store merged output, or specify --stdout to stream the merged output (string [=])
--include_unmerged in the merging mode, write the unmerged or unpaired reads to the file specified by --merge. Disabled by default.
-6, --phred64 indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
-z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])
--stdin input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
--stdout stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.
--interleaved_in indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
--dont_overwrite don't overwrite existing files. Overwritting is allowed by default.
--fix_mgi_id the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.
-V, --verbose output verbose log information (i.e. when every 1M reads are processed).
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])
--adapter_fasta specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])
--detect_adapter_for_pe by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.
-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])
-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])
-D, --dedup enable deduplication to drop the duplicated reads/pairs
--dup_calc_accuracy accuracy level to calculate duplication (1~6), higher level uses more memory (1G, 2G, 4G, 8G, 16G, 24G). Default 1 for no-dedup mode, and 3 for dedup mode. (int [=0])
--dont_eval_duplication don't evaluate duplication rate to save time and use less memory.
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
-x, --trim_poly_x enable polyX trimming in 3' ends.
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
-5, --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
-3, --cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.
-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified (int [=4])
--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])
--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified (int [=4])
--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])
--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified (int [=4])
--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement (int [=0])
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
--length_limit reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])
-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
--overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default. (int [=30])
--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])
--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])
-U, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])
--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
--umi_skip if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])
--umi_delim delimiter to use between the read name and the UMI, default is : (string [=:])
-p, --overrepresentation_analysis enable overrepresented sequence analysis.
-P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])
-j, --json the json format report file name (string [=fastp.json])
-h, --html the html format report file name (string [=fastp.html])
-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])
-w, --thread worker thread number, default is 3 (int [=3])
-s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
-S, --split_by_lines split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
--cut_by_quality5 DEPRECATED, use --cut_front instead.
--cut_by_quality3 DEPRECATED, use --cut_tail instead.
--cut_by_quality_aggressive DEPRECATED, use --cut_right instead.
--discard_unmerged DEPRECATED, no effect now, see the introduction for merging.
-?, --help print this message
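
As a worked illustration of the low complexity filter described above (`-y` / `-Y`), the sketch below computes the percentage of bases that differ from their next base. The function name `sequence_complexity` is hypothetical. Dividing by the number of adjacent pairs (length minus one) and rounding down are assumptions, so fastp's own implementation may differ in detail.

```bash
#!/bin/bash
# Hypothetical helper, not part of fastp or this component.
# Prints the percentage (integer, rounded down) of positions i where
# base[i] != base[i+1]. The denominator (len - 1) is an assumption.
sequence_complexity() {
  local seq="$1"
  local len=${#seq}
  local diff=0
  local i
  # A read with fewer than two bases has no adjacent pairs to compare.
  if (( len < 2 )); then
    echo 0
    return 0
  fi
  for (( i = 0; i < len - 1; i++ )); do
    if [[ "${seq:i:1}" != "${seq:i+1:1}" ]]; then
      diff=$(( diff + 1 ))
    fi
  done
  echo $(( diff * 100 / (len - 1) ))
}

sequence_complexity "ACGGCAT"
```

Under these assumptions, a read such as `ACGGCAT` (5 of 6 adjacent pairs differ) would clear the default threshold of 30, while a homopolymer would not.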

105
src/fastp/script.sh Normal file
View File

@@ -0,0 +1,105 @@
#!/bin/bash
## VIASH START
## VIASH END
# disable flags
[[ "$par_disable_adapter_trimming" == "false" ]] && unset par_disable_adapter_trimming
[[ "$par_detect_adapter_for_pe" == "false" ]] && unset par_detect_adapter_for_pe
[[ "$par_merge" == "false" ]] && unset par_merge
[[ "$par_include_unmerged" == "false" ]] && unset par_include_unmerged
[[ "$par_interleaved_in" == "false" ]] && unset par_interleaved_in
[[ "$par_fix_mgi_id" == "false" ]] && unset par_fix_mgi_id
[[ "$par_phred64" == "false" ]] && unset par_phred64
[[ "$par_dont_overwrite" == "false" ]] && unset par_dont_overwrite
[[ "$par_verbose" == "false" ]] && unset par_verbose
[[ "$par_dedup" == "false" ]] && unset par_dedup
[[ "$par_dont_eval_duplication" == "false" ]] && unset par_dont_eval_duplication
[[ "$par_trim_poly_g" == "false" ]] && unset par_trim_poly_g
[[ "$par_disable_trim_poly_g" == "false" ]] && unset par_disable_trim_poly_g
[[ "$par_trim_poly_x" == "false" ]] && unset par_trim_poly_x
[[ "$par_disable_quality_filtering" == "false" ]] && unset par_disable_quality_filtering
[[ "$par_disable_length_filtering" == "false" ]] && unset par_disable_length_filtering
[[ "$par_low_complexity_filter" == "false" ]] && unset par_low_complexity_filter
[[ "$par_umi" == "false" ]] && unset par_umi
[[ "$par_overrepresentation_analysis" == "false" ]] && unset par_overrepresentation_analysis
[[ "$par_correction" == "false" ]] && unset par_correction
# run command
fastp \
-i "$par_in1" \
-o "$par_out1" \
${par_in2:+--in2 "${par_in2}"} \
${par_out2:+--out2 "${par_out2}"} \
${par_unpaired1:+--unpaired1 "${par_unpaired1}"} \
${par_unpaired2:+--unpaired2 "${par_unpaired2}"} \
${par_failed_out:+--failed_out "${par_failed_out}"} \
${par_overlapped_out:+--overlapped_out "${par_overlapped_out}"} \
${par_json:+--json "${par_json}"} \
${par_html:+--html "${par_html}"} \
${par_report_title:+--report_title "${par_report_title}"} \
${par_disable_adapter_trimming:+--disable_adapter_trimming} \
${par_detect_adapter_for_pe:+--detect_adapter_for_pe} \
${par_adapter_sequence:+--adapter_sequence "${par_adapter_sequence}"} \
${par_adapter_sequence_r2:+--adapter_sequence_r2 "${par_adapter_sequence_r2}"} \
${par_adapter_fasta:+--adapter_fasta "${par_adapter_fasta}"} \
${par_trim_front1:+--trim_front1 "${par_trim_front1}"} \
${par_trim_tail1:+--trim_tail1 "${par_trim_tail1}"} \
${par_max_len1:+--max_len1 "${par_max_len1}"} \
${par_trim_front2:+--trim_front2 "${par_trim_front2}"} \
${par_trim_tail2:+--trim_tail2 "${par_trim_tail2}"} \
${par_max_len2:+--max_len2 "${par_max_len2}"} \
${par_merge:+--merge} \
${par_merged_out:+--merged_out "${par_merged_out}"} \
${par_include_unmerged:+--include_unmerged} \
${par_interleaved_in:+--interleaved_in} \
${par_fix_mgi_id:+--fix_mgi_id} \
${par_phred64:+--phred64} \
${par_compression:+--compression "${par_compression}"} \
${par_dont_overwrite:+--dont_overwrite} \
${par_verbose:+--verbose} \
${par_reads_to_process:+--reads_to_process "${par_reads_to_process}"} \
${par_dedup:+--dedup} \
${par_dup_calc_accuracy:+--dup_calc_accuracy "${par_dup_calc_accuracy}"} \
${par_dont_eval_duplication:+--dont_eval_duplication} \
${par_trim_poly_g:+--trim_poly_g} \
${par_poly_g_min_len:+--poly_g_min_len "${par_poly_g_min_len}"} \
${par_disable_trim_poly_g:+--disable_trim_poly_g} \
${par_trim_poly_x:+--trim_poly_x} \
${par_poly_x_min_len:+--poly_x_min_len "${par_poly_x_min_len}"} \
${par_cut_front:+--cut_front "${par_cut_front}"} \
${par_cut_tail:+--cut_tail "${par_cut_tail}"} \
${par_cut_right:+--cut_right "${par_cut_right}"} \
${par_cut_window_size:+--cut_window_size "${par_cut_window_size}"} \
${par_cut_mean_quality:+--cut_mean_quality "${par_cut_mean_quality}"} \
${par_cut_front_window_size:+--cut_front_window_size "${par_cut_front_window_size}"} \
${par_cut_front_mean_quality:+--cut_front_mean_quality "${par_cut_front_mean_quality}"} \
${par_cut_tail_window_size:+--cut_tail_window_size "${par_cut_tail_window_size}"} \
${par_cut_tail_mean_quality:+--cut_tail_mean_quality "${par_cut_tail_mean_quality}"} \
${par_cut_right_window_size:+--cut_right_window_size "${par_cut_right_window_size}"} \
${par_cut_right_mean_quality:+--cut_right_mean_quality "${par_cut_right_mean_quality}"} \
${par_disable_quality_filtering:+--disable_quality_filtering} \
${par_qualified_quality_phred:+--qualified_quality_phred "${par_qualified_quality_phred}"} \
${par_unqualified_percent_limit:+--unqualified_percent_limit "${par_unqualified_percent_limit}"} \
${par_n_base_limit:+--n_base_limit "${par_n_base_limit}"} \
${par_average_qual:+--average_qual "${par_average_qual}"} \
${par_disable_length_filtering:+--disable_length_filtering} \
${par_length_required:+--length_required "${par_length_required}"} \
${par_length_limit:+--length_limit "${par_length_limit}"} \
${par_low_complexity_filter:+--low_complexity_filter} \
${par_complexity_threshold:+--complexity_threshold "${par_complexity_threshold}"} \
${par_filter_by_index1:+--filter_by_index1 "${par_filter_by_index1}"} \
${par_filter_by_index2:+--filter_by_index2 "${par_filter_by_index2}"} \
${par_filter_by_index_threshold:+--filter_by_index_threshold "${par_filter_by_index_threshold}"} \
${par_correction:+--correction} \
${par_overlap_len_require:+--overlap_len_require "${par_overlap_len_require}"} \
${par_overlap_diff_limit:+--overlap_diff_limit "${par_overlap_diff_limit}"} \
${par_overlap_diff_percent_limit:+--overlap_diff_percent_limit "${par_overlap_diff_percent_limit}"} \
${par_umi:+--umi} \
${par_umi_loc:+--umi_loc "${par_umi_loc}"} \
${par_umi_len:+--umi_len "${par_umi_len}"} \
${par_umi_prefix:+--umi_prefix "${par_umi_prefix}"} \
${par_umi_skip:+--umi_skip "${par_umi_skip}"} \
${par_umi_delim:+--umi_delim "${par_umi_delim}"} \
${par_overrepresentation_analysis:+--overrepresentation_analysis} \
${par_overrepresentation_sampling:+--overrepresentation_sampling "${par_overrepresentation_sampling}"} \
${meta_cpus:+--thread "${meta_cpus}"}
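
The script above relies on two bash idioms: unsetting boolean parameters that arrive as the string `false`, and the `${var:+word}` expansion, which yields `word` only when `var` is set and non-empty. The sketch below isolates that pattern. The function `build_args` and its two positional inputs are hypothetical and exist only to make the behaviour observable; it does not call fastp.

```bash
#!/bin/bash
# Hypothetical helper, not part of the component.
# $1: value of a boolean flag ("true" or "false")
# $2: optional path for a second input file (may be empty)
# Prints "<number of arguments>:<arguments>".
build_args() {
  local par_merge="$1"
  local par_in2="$2"
  # A "false" flag is unset so that the expansion below emits nothing.
  [[ "$par_merge" == "false" ]] && unset par_merge
  local args=(
    ${par_in2:+--in2 "${par_in2}"}
    ${par_merge:+--merge}
  )
  echo "${#args[@]}:${args[*]}"
}

build_args true "my reads.fq"
build_args false ""
```

Because the inner `"${par_in2}"` is quoted, a path containing spaces stays a single argument even though the outer expansion is unquoted.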

74
src/fastp/test.sh Normal file
View File

@@ -0,0 +1,74 @@
#!/bin/bash
set -e
## VIASH START
meta_executable="target/docker/fastp/fastp"
meta_resources_dir="src/fastp"
## VIASH END
#########################################################################################
mkdir fastp_se
cd fastp_se
echo "> Run fastp on SE"
"$meta_executable" \
--in1 "$meta_resources_dir/test_data/se/a.fastq" \
--out1 "trimmed.fastq" \
--failed_out "failed.fastq" \
--json "report.json" \
--html "report.html" \
--adapter_sequence ACGGCTAGCTA
echo ">> Check if output exists"
[ ! -f "trimmed.fastq" ] && echo ">> trimmed.fastq does not exist" && exit 1
[ ! -f "failed.fastq" ] && echo ">> failed.fastq does not exist" && exit 1
[ ! -f "report.json" ] && echo ">> report.json does not exist" && exit 1
[ ! -f "report.html" ] && echo ">> report.html does not exist" && exit 1
#########################################################################################
cd ..
mkdir fastp_pe_minimal
cd fastp_pe_minimal
echo ">> Run fastp on PE with minimal parameters"
"$meta_executable" \
--in1 "$meta_resources_dir/test_data/pe/a.1.fastq" \
--in2 "$meta_resources_dir/test_data/pe/a.2.fastq" \
--out1 "trimmed_1.fastq" \
--out2 "trimmed_2.fastq"
echo ">> Check if output exists"
[ ! -f "trimmed_1.fastq" ] && echo ">> trimmed_1.fastq does not exist" && exit 1
[ ! -f "trimmed_2.fastq" ] && echo ">> trimmed_2.fastq does not exist" && exit 1
#########################################################################################
cd ..
mkdir fastp_pe_many
cd fastp_pe_many
echo ">> Run fastp on PE with many parameters"
"$meta_executable" \
--in1 "$meta_resources_dir/test_data/pe/a.1.fastq" \
--in2 "$meta_resources_dir/test_data/pe/a.2.fastq" \
--out1 "trimmed_1.fastq" \
--out2 "trimmed_2.fastq" \
--failed_out "failed.fastq" \
--json "report.json" \
--html "report.html" \
--adapter_sequence ACGGCTAGCTA \
--adapter_sequence_r2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
--merge \
--merged_out "merged.fastq"
echo ">> Check if output exists"
[ ! -f "trimmed_1.fastq" ] && echo ">> trimmed_1.fastq does not exist" && exit 1
[ ! -f "trimmed_2.fastq" ] && echo ">> trimmed_2.fastq does not exist" && exit 1
[ ! -f "failed.fastq" ] && echo ">> failed.fastq does not exist" && exit 1
[ ! -f "report.json" ] && echo ">> report.json does not exist" && exit 1
[ ! -f "report.html" ] && echo ">> report.html does not exist" && exit 1
[ ! -f "merged.fastq" ] && echo ">> merged.fastq does not exist" && exit 1
#########################################################################################
echo "> Test successful"
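
The repeated `[ ! -f ... ] && echo ... && exit 1` checks above could be folded into one function. This is a minimal sketch of that idea, not a change made in this commit; the name `assert_file_exists` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper, not part of the test script above.
# Returns non-zero and prints a message for the first missing file.
assert_file_exists() {
  local file
  for file in "$@"; do
    if [ ! -f "$file" ]; then
      echo ">> $file does not exist"
      return 1
    fi
  done
  return 0
}
```

In a test script running under `set -e`, a call such as `assert_file_exists "trimmed.fastq" "report.json"` would then stop the script at the first missing output.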

View File

@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!

View File

@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!

10
src/fastp/test_data/script.sh Executable file
View File

@@ -0,0 +1,10 @@
# fastp test data
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/fastp/test
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
cp -r /tmp/snakemake-wrappers/bio/fastp/test/reads/* src/fastp/test_data

View File

@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!

View File

@@ -0,0 +1,336 @@
name: featurecounts
description: |
featureCounts is a read summarization program for counting reads generated from either RNA or genomic DNA sequencing experiments by implementing highly efficient chromosome hashing and feature blocking techniques. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications.
keywords: ["Read counting", "Genomic features"]
links:
homepage: https://subread.sourceforge.net/
documentation: https://subread.sourceforge.net/SubreadUsersGuide.pdf
repository: https://github.com/ShiLab-Bioinformatics/subread
references:
doi: "10.1093/bioinformatics/btt656"
license: GPL-3.0
requirements:
commands: [ featureCounts ]
argument_groups:
- name: Inputs
arguments:
- name: --annotation
alternatives: ["-a"]
type: file
description: |
Name of an annotation file. GTF/GFF format by default. See '--format' option for more format information.
required: true
example: annotation.gtf
- name: --input
alternatives: ["-i"]
type: file
multiple: true
description: |
A list of SAM or BAM format files separated by semi-colon (;). They can be either name or location sorted. Location-sorted paired-end reads are automatically sorted by read names.
required: true
example: input_file1.bam
- name: Outputs
arguments:
- name: --counts
alternatives: ["-o"]
type: file
direction: output
description: |
Name of output file including read counts in tab delimited format.
required: true
example: features.tsv
- name: --summary
type: file
direction: output
description: |
Summary statistics of counting results in tab delimited format.
required: false
example: summary.tsv
- name: --junctions
type: file
direction: output
description: |
Count number of reads supporting each exon-exon junction. Junctions were identified from those exon-spanning reads in the input (containing 'N' in CIGAR string).
example: junctions.txt
required: false
- name: Annotation
arguments:
- name: --format
alternatives: ["-F"]
type: string
description: |
Specify format of the provided annotation file. Acceptable formats include 'GTF' (or compatible GFF format) and 'SAF'. 'GTF' by default.
choices: [GTF, GFF, SAF]
example: "GTF"
required: false
- name: --feature_type
alternatives: ["-t"]
type: string
description: |
Specify feature type(s) in a GTF annotation. If multiple types are provided, they should be separated by ';' with no space in between. 'exon' by default. Rows in the annotation with a matched feature will be extracted and used for read mapping.
example: "exon"
required: false
multiple: true
- name: --attribute_type
alternatives: ["-g"]
type: string
description: |
Specify attribute type in GTF annotation. 'gene_id' by default. Meta-features used for read counting will be extracted from annotation using the provided value.
example: "gene_id"
required: false
- name: --extra_attributes
type: string
description: |
Extract extra attribute types from the provided GTF annotation and include them in the counting output. These attribute types will not be used to group features. If more than one attribute type is provided they should be separated by semicolon (;).
required: false
multiple: true
- name: --chrom_alias
alternatives: ["-A"]
type: file
description: |
Provide a chromosome name alias file to match chr names in annotation with those in the reads. This should be a two-column comma-delimited text file. Its first column should include chr names in the annotation and its second column should include chr names in the reads. Chr names are case sensitive. No column header should be included in the file.
required: false
example: chrom_alias.csv
- name: Level of summarization
arguments:
- name: --feature_level
alternatives: ["-f"]
type: boolean_true
description: |
Perform read counting at feature level (eg. counting reads for exons rather than genes).
- name: Overlap between reads and features
arguments:
- name: --overlapping
alternatives: ["-O"]
type: boolean_true
description: |
Assign reads to all their overlapping meta-features (or features if '--feature_level' is specified).
- name: --min_overlap
type: integer
description: |
Minimum number of overlapping bases in a read that is required for read assignment. 1 by default. Number of overlapping bases is counted from both reads if paired end. If a negative value is provided, then a gap of up to specified size will be allowed between read and the feature that the read is assigned to.
required: false
example: 1
- name: --frac_overlap
type: double
description: |
Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be within range [0,1]. 0 by default. Number of overlapping bases is counted from both reads if paired end. Both this option and '--min_overlap' option need to be satisfied for read assignment.
required: false
min: 0
max: 1
example: 0
- name: --frac_overlap_feature
type: double
description: |
Minimum fraction of overlapping bases in a feature that is required for read assignment. Value should be within range [0,1]. 0 by default.
required: false
min: 0
max: 1
example: 0
- name: --largest_overlap
type: boolean_true
description: |
Assign reads to a meta-feature/feature that has the largest number of overlapping bases.
- name: --non_overlap
type: integer
description: |
Maximum number of non-overlapping bases in a read (or a read pair) that is allowed when being assigned to a feature. No limit is set by default.
required: false
- name: --non_overlap_feature
type: integer
description: |
Maximum number of non-overlapping bases in a feature that is allowed in read assignment. No limit is set by default.
required: false
- name: --read_extension5
type: integer
description: |
Reads are extended upstream by <int> bases from their 5' end.
required: false
- name: --read_extension3
type: integer
description: |
Reads are extended downstream by <int> bases from their 3' end.
required: false
- name: --read2pos
type: integer
description: |
Reduce reads to their 5' most base or 3' most base. Read counting is then performed based on the single base the read is reduced to.
required: false
choices: [3, 5]
- name: Multi-mapping reads
arguments:
- name: --multi_mapping
alternatives: ["-M"]
type: boolean_true
description: |
Multi-mapping reads will also be counted. For a multi-mapping read, all its reported alignments will be counted. The 'NH' tag in BAM/SAM input is used to detect multi-mapping reads.
- name: Fractional counting
arguments:
- name: --fraction
type: boolean_true
description: |
Assign fractional counts to features. This option must be used together with '--multi_mapping' or '--overlapping' or both. When '--multi_mapping' is specified, each reported alignment from a multi-mapping read (identified via 'NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When '--overlapping' is specified, each overlapping feature will receive a fractional count of 1/y, where y is the total number of features overlapping with the read. When both '--multi_mapping' and '--overlapping' are specified, each alignment will carry a fractional count of 1/(x*y).
- name: Read filtering
arguments:
- name: --min_map_quality
alternatives: ["-Q"]
type: integer
description: |
The minimum mapping quality score a read must satisfy in order to be counted. For paired-end reads, at least one end should satisfy this criterion. 0 by default.
required: false
example: 0
- name: --split_only
type: boolean_true
description: |
Count split alignments only (ie. alignments with CIGAR string containing 'N'). An example of split alignments is exon-spanning reads in RNA-seq data.
- name: --non_split_only
type: boolean_true
description: |
If specified, only non-split alignments (CIGAR strings do not contain letter 'N') will be counted. All the other alignments will be ignored.
- name: --primary
type: boolean_true
description: |
Count primary alignments only. Primary alignments are identified using bit 0x100 in SAM/BAM FLAG field.
- name: --ignore_dup
type: boolean_true
description: |
Ignore duplicate reads in read counting. Duplicate reads are identified using bit 0x400 in BAM/SAM FLAG field. The whole read pair is ignored if one of the reads is a duplicate read for paired end data.
- name: Strandedness
arguments:
- name: --strand
alternatives: ["-s"]
type: integer
description: |
Perform strand-specific read counting. A single integer value (applied to all input files) should be provided. Possible values include: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). Default value is 0 (ie. unstranded read counting carried out for all input files).
choices: [0, 1, 2]
example: 0
required: false
- name: Exon-exon junctions
arguments:
- name: --ref_fasta
alternatives: ["-G"]
type: file
description: |
Provide the name of a FASTA-format file that contains the reference sequences used in read mapping that produced the provided SAM/BAM files.
required: false
example: reference.fasta
- name: Parameters specific to paired end reads
arguments:
- name: --paired
alternatives: ["-p"]
type: boolean_true
description: |
Specify that input data contain paired-end reads. To perform fragment counting (ie. counting read pairs), the '--count_read_pairs' parameter should also be specified in addition to this parameter.
- name: --count_read_pairs
type: boolean_true
description: |
Count read pairs (fragments) instead of reads. This option is only applicable for paired-end reads.
- name: --both_aligned
alternatives: ["-B"]
type: boolean_true
description: |
Only count read pairs that have both ends aligned.
- name: --check_pe_dist
alternatives: ["-P"]
type: boolean_true
description: |
Check validity of paired-end distance when counting read pairs. Use '--min_length' and '--max_length' to set thresholds.
- name: --min_length
alternatives: ["-d"]
type: integer
description: |
Minimum fragment/template length, 50 by default.
required: false
example: 50
- name: --max_length
alternatives: ["-D"]
type: integer
description: |
Maximum fragment/template length, 600 by default.
required: false
example: 600
- name: --same_strand
alternatives: ["-C"]
type: boolean_true
description: |
Do not count read pairs that have their two ends mapping to different chromosomes or mapping to same chromosome but on different strands.
- name: --donotsort
type: boolean_true
description: |
Do not sort reads in BAM/SAM input. Note that reads from the same pair are required to be located next to each other in the input.
- name: Read groups
arguments:
- name: --by_read_group
type: boolean_true
description: |
Assign reads by read group. "RG" tag is required to be present in the input BAM/SAM files.
- name: Long reads
arguments:
- name: --long_reads
type: boolean_true
description: |
Count long reads such as Nanopore and PacBio reads. Long read counting can only run in one thread and only reads (not read-pairs) can be counted. There is no limitation on the number of 'M' operations allowed in a CIGAR string in long read counting.
- name: Assignment results for each read
arguments:
- name: --detailed_results
type: file
direction: output
description: |
Directory to save the detailed assignment results. Use `--detailed_results_format` to determine the format of the detailed results.
example: detailed_results/
required: false
- name: --detailed_results_format
alternatives: ["-R"]
type: string
description: |
Output detailed assignment results for each read or read-pair. Results are saved to a file that is in one of the following formats: CORE, SAM and BAM. See documentation for more info about these formats.
required: false
choices: [CORE, SAM, BAM]
- name: Miscellaneous
arguments:
- name: --max_M_op
type: integer
description: |
Maximum number of 'M' operations allowed in a CIGAR string. 10 by default. Both 'X' and '=' are treated as 'M' and adjacent 'M' operations are merged in the CIGAR string.
required: false
example: 10
- name: --verbose
type: boolean_true
description: |
Output verbose information for debugging, such as un-matched chromosome/contig names.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/subread:2.0.6--he4a0461_0
setup:
- type: docker
run: |
featureCounts -v 2>&1 | sed 's/featureCounts v\([0-9.]*\)/featureCounts: \1/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
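
Arguments declared with `multiple: true` (such as `--input`) reach the script as a single string joined by `;`, whereas featureCounts expects its input files as separate positional arguments. The sketch below shows one way such a value could be split in bash; the function name `split_multiple` is hypothetical and this is not necessarily how the component's own script does it.

```bash
#!/bin/bash
# Hypothetical helper, not part of the component.
# Splits a ';'-separated value and prints one item per line.
split_multiple() {
  local IFS=';'
  local parts
  # read -a splits on IFS only, so spaces inside an item are preserved.
  read -ra parts <<< "$1"
  printf '%s\n' "${parts[@]}"
}

split_multiple "sample1.bam;sample 2.bam"
```

Splitting into an array rather than relying on unquoted word splitting keeps file names that contain spaces intact.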

242
src/featurecounts/help.txt Normal file
View File

@@ -0,0 +1,242 @@
```bash
featureCounts
```
Version 2.0.3
Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
## Mandatory arguments:
-a <string> Name of an annotation file. GTF/GFF format by default. See
-F option for more format information. Inbuilt annotations
(SAF format) is available in 'annotation' directory of the
package. Gzipped file is also accepted.
-o <string> Name of output file including read counts. A separate file
including summary statistics of counting results is also
included in the output ('<string>.summary'). Both files
are in tab delimited format.
input_file1 [input_file2] ... A list of SAM or BAM format files. They can be
either name or location sorted. If no files provided,
<stdin> input is expected. Location-sorted paired-end reads
are automatically sorted by read names.
## Optional arguments:
# Annotation
-F <string> Specify format of the provided annotation file. Acceptable
formats include 'GTF' (or compatible GFF format) and
'SAF'. 'GTF' by default. For SAF format, please refer to
Users Guide.
-t <string> Specify feature type(s) in a GTF annotation. If multiple
types are provided, they should be separated by ',' with
no space in between. 'exon' by default. Rows in the
annotation with a matched feature will be extracted and
used for read mapping.
-g <string> Specify attribute type in GTF annotation. 'gene_id' by
default. Meta-features used for read counting will be
extracted from annotation using the provided value.
--extraAttributes Extract extra attribute types from the provided GTF
annotation and include them in the counting output. These
attribute types will not be used to group features. If
more than one attribute type is provided they should be
separated by comma.
-A <string> Provide a chromosome name alias file to match chr names in
annotation with those in the reads. This should be a two-
column comma-delimited text file. Its first column should
include chr names in the annotation and its second column
should include chr names in the reads. Chr names are case
sensitive. No column header should be included in the
file.
# Level of summarization
-f Perform read counting at feature level (eg. counting
reads for exons rather than genes).
# Overlap between reads and features
-O Assign reads to all their overlapping meta-features (or
features if -f is specified).
--minOverlap <int> Minimum number of overlapping bases in a read that is
required for read assignment. 1 by default. Number of
overlapping bases is counted from both reads if paired
end. If a negative value is provided, then a gap of up
to specified size will be allowed between read and the
feature that the read is assigned to.
--fracOverlap <float> Minimum fraction of overlapping bases in a read that is
required for read assignment. Value should be within range
[0,1]. 0 by default. Number of overlapping bases is
counted from both reads if paired end. Both this option
and '--minOverlap' option need to be satisfied for read
assignment.
--fracOverlapFeature <float> Minimum fraction of overlapping bases in a
feature that is required for read assignment. Value
should be within range [0,1]. 0 by default.
--largestOverlap Assign reads to a meta-feature/feature that has the
largest number of overlapping bases.
--nonOverlap <int> Maximum number of non-overlapping bases in a read (or a
read pair) that is allowed when being assigned to a
feature. No limit is set by default.
--nonOverlapFeature <int> Maximum number of non-overlapping bases in a feature
that is allowed in read assignment. No limit is set by
default.
--readExtension5 <int> Reads are extended upstream by <int> bases from their
5' end.
--readExtension3 <int> Reads are extended upstream by <int> bases from their
3' end.
--read2pos <5:3> Reduce reads to their 5' most base or 3' most base. Read
counting is then performed based on the single base the
read is reduced to.
# Multi-mapping reads
-M Multi-mapping reads will also be counted. For a multi-
mapping read, all its reported alignments will be
counted. The 'NH' tag in BAM/SAM input is used to detect
multi-mapping reads.
# Fractional counting
--fraction Assign fractional counts to features. This option must
be used together with '-M' or '-O' or both. When '-M' is
specified, each reported alignment from a multi-mapping
read (identified via 'NH' tag) will carry a fractional
count of 1/x, instead of 1 (one), where x is the total
number of alignments reported for the same read. When '-O'
is specified, each overlapping feature will receive a
fractional count of 1/y, where y is the total number of
features overlapping with the read. When both '-M' and
'-O' are specified, each alignment will carry a fractional
count of 1/(x*y).
# Read filtering
-Q <int> The minimum mapping quality score a read must satisfy in
order to be counted. For paired-end reads, at least one
end should satisfy this criteria. 0 by default.
--splitOnly Count split alignments only (ie. alignments with CIGAR
string containing 'N'). An example of split alignments is
exon-spanning reads in RNA-seq data.
--nonSplitOnly If specified, only non-split alignments (CIGAR strings do
not contain letter 'N') will be counted. All the other
alignments will be ignored.
--primary Count primary alignments only. Primary alignments are
identified using bit 0x100 in SAM/BAM FLAG field.
--ignoreDup Ignore duplicate reads in read counting. Duplicate reads
are identified using bit Ox400 in BAM/SAM FLAG field. The
whole read pair is ignored if one of the reads is a
duplicate read for paired end data.
# Strandness
-s <int or string> Perform strand-specific read counting. A single integer
value (applied to all input files) or a string of comma-
separated values (applied to each corresponding input
file) should be provided. Possible values include:
0 (unstranded), 1 (stranded) and 2 (reversely stranded).
Default value is 0 (ie. unstranded read counting carried
out for all input files).
# Exon-exon junctions
-J Count number of reads supporting each exon-exon junction.
Junctions were identified from those exon-spanning reads
in the input (containing 'N' in CIGAR string). Counting
results are saved to a file named '<output_file>.jcounts'
-G <string> Provide the name of a FASTA-format file that contains the
reference sequences used in read mapping that produced the
provided SAM/BAM files. This optional argument can be used
with '-J' option to improve read counting for junctions.
# Parameters specific to paired end reads
-p Specify that input data contain paired-end reads. To
perform fragment counting (ie. counting read pairs), the
'--countReadPairs' parameter should also be specified in
addition to this parameter.
--countReadPairs Count read pairs (fragments) instead of reads. This option
is only applicable for paired-end reads.
-B Only count read pairs that have both ends aligned.
-P Check validity of paired-end distance when counting read
pairs. Use -d and -D to set thresholds.
-d <int> Minimum fragment/template length, 50 by default.
-D <int> Maximum fragment/template length, 600 by default.
-C Do not count read pairs that have their two ends mapping
to different chromosomes or mapping to same chromosome
but on different strands.
--donotsort Do not sort reads in BAM/SAM input. Note that reads from
the same pair are required to be located next to each
other in the input.
# Number of CPU threads
-T <int> Number of the threads. 1 by default.
# Read groups
--byReadGroup Assign reads by read group. "RG" tag is required to be
present in the input BAM/SAM files.
# Long reads
-L Count long reads such as Nanopore and PacBio reads. Long
read counting can only run in one thread and only reads
(not read-pairs) can be counted. There is no limitation on
the number of 'M' operations allowed in a CIGAR string in
long read counting.
# Assignment results for each read
-R <format> Output detailed assignment results for each read or read-
pair. Results are saved to a file that is in one of the
following formats: CORE, SAM and BAM. See Users Guide for
more info about these formats.
--Rpath <string> Specify a directory to save the detailed assignment
results. If unspecified, the directory where counting
results are saved is used.
# Miscellaneous
--tmpDir <string> Directory under which intermediate files are saved (later
removed). By default, intermediate files will be saved to
the directory specified in '-o' argument.
--maxMOp <int> Maximum number of 'M' operations allowed in a CIGAR
string. 10 by default. Both 'X' and '=' are treated as 'M'
and adjacent 'M' operations are merged in the CIGAR
string.
--verbose Output verbose information for debugging, such as un-
matched chromosome/contig names.
-v Output version of the program.
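
As a minimal sketch of how the paired-end and strandness options above combine, the following builds a fragment-counting command. The file names (`annotation.gtf`, `sample.bam`, `counts.txt`) and the thread count are placeholders chosen for illustration, not values from this repository. The command is only executed when `featureCounts` and the input files are actually present.

```bash
#!/bin/bash
# Hypothetical placeholder file names; replace them with real paths.
annotation="annotation.gtf"
bam="sample.bam"
output="counts.txt"

# -p marks the input as paired-end, --countReadPairs counts fragments
# instead of reads, -B requires both ends to be aligned, -s 2 selects
# reversely stranded counting and -T sets the number of threads.
cmd=(featureCounts -p --countReadPairs -B -s 2 -T 4 -a "$annotation" -o "$output" "$bam")

# Print the command so it can be inspected before running it.
printf '%s ' "${cmd[@]}"; echo

# Only run when the tool and the input files are available.
if command -v featureCounts >/dev/null 2>&1 && [[ -f "$annotation" && -f "$bam" ]]; then
  "${cmd[@]}"
fi
```

Keeping the command in an array avoids quoting problems when file names contain spaces.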


@@ -0,0 +1,94 @@
#!/bin/bash
set -e
## VIASH START
## VIASH END
# create temporary directory
tmp_dir=$(mktemp -d -p "$meta_temp_dir" "${meta_functionality_name}_XXXXXX")
mkdir -p "$tmp_dir/temp"
# create detailed_results directory if variable is set and directory does not exist
if [[ ! -z "$par_detailed_results" ]] && [[ ! -d "$par_detailed_results" ]]; then
mkdir -p "$par_detailed_results"
fi
# featureCounts expects comma-separated lists, so convert the ';' separator to ','
par_feature_type=$(echo "$par_feature_type" | tr ';' ',')
par_extra_attributes=$(echo "$par_extra_attributes" | tr ';' ',')
# unset flag variables
[[ "$par_feature_level" == "false" ]] && unset par_feature_level
[[ "$par_overlapping" == "false" ]] && unset par_overlapping
[[ "$par_largest_overlap" == "false" ]] && unset par_largest_overlap
[[ "$par_multi_mapping" == "false" ]] && unset par_multi_mapping
[[ "$par_fraction" == "false" ]] && unset par_fraction
[[ "$par_split_only" == "false" ]] && unset par_split_only
[[ "$par_non_split_only" == "false" ]] && unset par_non_split_only
[[ "$par_primary" == "false" ]] && unset par_primary
[[ "$par_ignore_dup" == "false" ]] && unset par_ignore_dup
[[ "$par_paired" == "false" ]] && unset par_paired
[[ "$par_count_read_pairs" == "false" ]] && unset par_count_read_pairs
[[ "$par_both_aligned" == "false" ]] && unset par_both_aligned
[[ "$par_check_pe_dist" == "false" ]] && unset par_check_pe_dist
[[ "$par_same_strand" == "false" ]] && unset par_same_strand
[[ "$par_donotsort" == "false" ]] && unset par_donotsort
[[ "$par_by_read_group" == "false" ]] && unset par_by_read_group
[[ "$par_long_reads" == "false" ]] && unset par_long_reads
[[ "$par_verbose" == "false" ]] && unset par_verbose
# split the ';'-separated list of input files into an array
IFS=";" read -ra input <<< "$par_input"
featureCounts \
${par_format:+-F "${par_format}"} \
${par_feature_type:+-t "${par_feature_type}"} \
${par_attribute_type:+-g "${par_attribute_type}"} \
${par_extra_attributes:+--extraAttributes "${par_extra_attributes}"} \
${par_chrom_alias:+-A "${par_chrom_alias}"} \
${par_feature_level:+-f} \
${par_overlapping:+-O} \
${par_min_overlap:+--minOverlap "${par_min_overlap}"} \
${par_frac_overlap:+--fracOverlap "${par_frac_overlap}"} \
${par_frac_overlap_feature:+--fracOverlapFeature "${par_frac_overlap_feature}"} \
${par_largest_overlap:+--largestOverlap} \
${par_non_overlap:+--nonOverlap "${par_non_overlap}"} \
${par_non_overlap_feature:+--nonOverlapFeature "${par_non_overlap_feature}"} \
${par_read_extension5:+--readExtension5 "${par_read_extension5}"} \
${par_read_extension3:+--readExtension3 "${par_read_extension3}"} \
${par_read2pos:+--read2pos "${par_read2pos}"} \
${par_multi_mapping:+-M} \
${par_fraction:+--fraction} \
${par_min_map_quality:+-Q "${par_min_map_quality}"} \
${par_split_only:+--splitOnly} \
${par_non_split_only:+--nonSplitOnly} \
${par_primary:+--primary} \
${par_ignore_dup:+--ignoreDup} \
${par_strand:+-s "${par_strand}"} \
${par_junctions:+-J} \
${par_ref_fasta:+-G "${par_ref_fasta}"} \
${par_paired:+-p} \
${par_count_read_pairs:+--countReadPairs} \
${par_both_aligned:+-B} \
${par_check_pe_dist:+-P} \
${par_min_length:+-d "${par_min_length}"} \
${par_max_length:+-D "${par_max_length}"} \
${par_same_strand:+-C} \
${par_donotsort:+--donotsort} \
${par_by_read_group:+--byReadGroup} \
${par_long_reads:+-L} \
${par_detailed_results:+--Rpath "${par_detailed_results}"} \
${par_detailed_results_format:+-R "${par_detailed_results_format}"} \
${par_max_M_op:+--maxMOp "${par_max_M_op}"} \
${par_verbose:+--verbose} \
${meta_cpus:+-T "${meta_cpus}"} \
--tmpDir "$tmp_dir/temp" \
-a "$par_annotation" \
-o "$tmp_dir/output.txt" \
"${input[@]}"
[[ ! -z "$par_counts" ]] && mv "$tmp_dir/output.txt" "$par_counts"
[[ ! -z "$par_summary" ]] && mv "$tmp_dir/output.txt.summary" "$par_summary"
if [[ ! -z "$par_junctions" ]] && [[ -e "$tmp_dir/output.txt.jcounts" ]]; then
mv "$tmp_dir/output.txt.jcounts" "$par_junctions"
fi
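
The flag handling in the script above relies on the `${var:+word}` parameter expansion. The following is a small self-contained sketch of that idiom. The `par_*` values are made-up stand-ins for what viash would inject, and the `args` array is introduced here only for illustration.

```bash
#!/bin/bash
# Hypothetical values standing in for what viash would inject.
par_overlapping="false"
par_paired="true"
par_strand="2"
par_input="a.bam;b.bam"

# Boolean flags arrive as the strings "true"/"false"; unsetting the
# "false" ones lets ${var:+...} treat them as absent.
[[ "$par_overlapping" == "false" ]] && unset par_overlapping
[[ "$par_paired" == "false" ]] && unset par_paired

# Split the ';'-separated input list into an array.
IFS=";" read -ra input <<< "$par_input"

# ${var:+word} expands to word only if var is set and non-empty.
# "${input[@]}" yields one word per file, whereas "${input[*]}" would
# join all files into a single word.
args=(
  ${par_overlapping:+-O}
  ${par_paired:+-p}
  ${par_strand:+-s "$par_strand"}
  "${input[@]}"
)

# Print one argument per line to show the final word boundaries.
printf '<%s>\n' "${args[@]}"
```

With the values above this prints `<-p>`, `<-s>`, `<2>`, `<a.bam>` and `<b.bam>`, one per line: the unset `-O` flag disappears entirely instead of leaving an empty argument.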

59
src/featurecounts/test.sh Normal file

@@ -0,0 +1,59 @@
#!/bin/bash
set -e
dir_in="$meta_resources_dir/test_data"
echo "> Run featureCounts (with junctions)"
"$meta_executable" \
--input "$dir_in/a.bam" \
--annotation "$dir_in/annotation.gtf" \
--counts "features.tsv" \
--summary "summary.tsv" \
--junctions "junction_counts.txt" \
--ref_fasta "$dir_in/genome.fasta" \
--overlapping \
--frac_overlap 0.2 \
--paired \
--strand 0 \
--detailed_results detailed_results \
--detailed_results_format SAM
echo ">> Checking output"
[ ! -f "features.tsv" ] && echo "Output file features.tsv does not exist" && exit 1
[ ! -f "summary.tsv" ] && echo "Output file summary.tsv does not exist" && exit 1
[ ! -f "junction_counts.txt" ] && echo "Output file junction_counts.txt does not exist" && exit 1
[ ! -d "detailed_results" ] && echo "Output directory detailed_results does not exist" && exit 1
[ ! -f "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "features.tsv" ] && echo "Output file features.tsv is empty" && exit 1
[ ! -s "summary.tsv" ] && echo "Output file summary.tsv is empty" && exit 1
[ ! -s "junction_counts.txt" ] && echo "Output file junction_counts.txt is empty" && exit 1
[ ! -s "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam is empty" && exit 1
echo "> Run featureCounts (without junctions)"
"$meta_executable" \
--input "$dir_in/a.bam" \
--annotation "$dir_in/annotation.gtf" \
--counts "features.tsv" \
--summary "summary.tsv" \
--overlapping \
--frac_overlap 0.2 \
--paired \
--strand 0 \
--detailed_results detailed_results \
--detailed_results_format SAM
echo ">> Checking output"
[ ! -f "features.tsv" ] && echo "Output file features.tsv does not exist" && exit 1
[ ! -f "summary.tsv" ] && echo "Output file summary.tsv does not exist" && exit 1
[ ! -d "detailed_results" ] && echo "Output directory detailed_results does not exist" && exit 1
[ ! -f "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "features.tsv" ] && echo "Output file features.tsv is empty" && exit 1
[ ! -s "summary.tsv" ] && echo "Output file summary.tsv is empty" && exit 1
[ ! -s "detailed_results/a.bam.featureCounts.sam" ] && echo "Output file detailed_results/a.bam.featureCounts.sam is empty" && exit 1
echo "> Test successful"
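
The existence and non-emptiness checks in the test above repeat the same pattern for every output file. One possible way to factor them out is sketched below. The helper `assert_file_not_empty` is hypothetical and not part of the original test script, and the demo files are throwaway files created in a temporary directory.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original test script): report an
# error and return non-zero if the given file is missing or empty.
assert_file_not_empty() {
  local file="$1"
  [[ -f "$file" ]] || { echo "Output file $file does not exist"; return 1; }
  [[ -s "$file" ]] || { echo "Output file $file is empty"; return 1; }
}

# Demonstrate the helper on throwaway files in a temporary directory.
demo_dir=$(mktemp -d)
echo "data" > "$demo_dir/features.tsv"
: > "$demo_dir/empty.tsv"

assert_file_not_empty "$demo_dir/features.tsv" && echo "features.tsv OK"
assert_file_not_empty "$demo_dir/empty.tsv" || echo "empty.tsv rejected"
assert_file_not_empty "$demo_dir/missing.tsv" || echo "missing.tsv rejected"

rm -rf "$demo_dir"
```

In a real test script the helper would typically be followed by `|| exit 1` so that the first failing check aborts the run.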

Binary file not shown.


@@ -0,0 +1,6 @@
1 havana gene 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; gene_name "A"; gene_source "havana"; gene_biotype "gene";
1 havana transcript 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1 havana exon 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; exon_number "1"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000000"; exon_version "1"; tag "basic"; transcript_support_level "1";
2 havana gene 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; gene_name "B"; gene_source "havana"; gene_biotype "gene";
2 havana transcript 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
2 havana exon 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; exon_number "1"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000001"; exon_version "1"; tag "basic"; transcript_support_level "1";


@@ -0,0 +1,4 @@
>1
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


@@ -0,0 +1,9 @@
# featureCounts test data
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/subread/featurecounts/test
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
cp -r /tmp/snakemake-wrappers/bio/subread/featurecounts/test/* src/featurecounts/test_data

397
src/gffread/config.vsh.yaml Normal file

@@ -0,0 +1,397 @@
name: gffread
description: Validate, filter, convert and perform various other operations on GFF files.
keywords: [gff, conversion, validation, filtering]
links:
homepage: https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
documentation: https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
repository: https://github.com/gpertea/gffread
references:
doi: 10.12688/f1000research.23297.2
license: MIT
requirements:
commands: [ gffread ]
argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
direction: input
description: |
A reference file in either the GFF3, GFF2 or GTF format.
required: true
example: annotation.gff
- name: --chr_mapping
alternatives: -m
type: file
direction: input
description: |
<chr_replace> is a name mapping table for converting reference sequence names,
having this 2-column format: <original_ref_ID> <new_ref_ID>.
- name: --seq_info
alternatives: -s
type: file
direction: input
description: |
<seq_info.fsize> is a tab-delimited file providing this info for each of the mapped
sequences: <seq-name> <seq-length> <seq-description> (useful for --description option with
mRNA/EST/protein mappings).
- name: --genome
alternatives: -g
type: file
description: |
Full path to a multi-fasta file with the genomic sequences for all input mappings,
OR a directory with single-fasta files (one per genomic sequence, with file names
matching sequence names).
example: genome.fa
- name: Outputs
arguments:
- name: --outfile
alternatives: -o
type: file
direction: output
required: true
description: |
Write the output records into <outfile>.
default: output.gff
- name: --force_exons
type: boolean_true
description: |
Make sure that the lowest level GFF features are considered "exon" features.
- name: --gene2exon
type: boolean_true
description: |
For single-line genes not parenting any transcripts, add an exon feature spanning
the entire gene (treat it as a transcript).
- name: --t_adopt
type: boolean_true
description: |
Try to find a parent gene overlapping/containing a transcript that does not have
any explicit gene Parent.
- name: --decode
alternatives: -D
type: boolean_true
description: |
Decode url encoded characters within attributes.
- name: --merge_exons
alternatives: -Z
type: boolean_true
description: |
Merge very close exons into a single exon (when intron size<4).
- name: --junctions
alternatives: -j
type: boolean_true
description: |
Output the junctions and the corresponding transcripts.
- name: --spliced_exons
alternatives: -w
type: file
direction: output
must_exist: false
description: |
Write a fasta file with spliced exons for each transcript.
example: exons.fa
- name: --w_add
type: integer
description: |
For the --spliced_exons option, extract additional <N> bases both upstream and
downstream of the transcript boundaries.
- name: --w_nocds
type: boolean_true
description: |
For --spliced_exons, disable the output of CDS info in the FASTA file.
- name: --spliced_cds
alternatives: -x
type: file
must_exist: false
example: cds.fa
description: |
Write a fasta file with spliced CDS for each GFF transcript.
- name: --tr_cds
alternatives: -y
type: file
must_exist: false
example: tr_cds.fa
description: |
Write a protein fasta file with the translation of CDS for each record.
- name: --w_coords
alternatives: -W
type: boolean_true
description: |
For --spliced_exons, --spliced_cds and --tr_cds options, write in the FASTA defline
all the exon coordinates projected onto the spliced sequence.
- name: --stop_dot
alternatives: -S
type: boolean_true
description: |
For --tr_cds option, use '*' instead of '.' as stop codon translation.
- name: --id_version
alternatives: -L
type: boolean_true
description: |
Ensembl GTF to GFF3 conversion, adds version to IDs.
- name: --trackname
alternatives: -t
type: string
description: |
Use <trackname> in the 2nd column of each GFF/GTF output line.
- name: --gtf_output
alternatives: -T
type: boolean_true
description: |
Main output will be GTF instead of GFF3.
- name: --bed
type: boolean_true
description: |
Output records in BED format instead of default GFF3.
- name: --tlf
type: boolean_true
description: |
Output "transcript line format" which is like GFF but with exons and CDS related
features stored as GFF attributes in the transcript feature line, like this:
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
<exons> is a comma-delimited list of exon_start-exon_end coordinates;
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>.
- name: --table
type: string
multiple: true
multiple_sep: ","
description: |
Output a simple tab delimited format instead of GFF, with columns having the values
of GFF attributes given in <attrlist>; special pseudo-attributes (prefixed by @) are
recognized:
@id, @geneid, @chr, @start, @end, @strand, @numexons, @exons, @cds, @covlen, @cdslen
If any of --spliced_exons/--tr_cds/--spliced_cds FASTA output files are enabled, the
same fields (excluding @id) are appended to the definition line of corresponding FASTA
records.
- name: --expose_dups
type: boolean_true
alternatives: [-E, -v]
description: |
Expose (warn about) duplicate transcript IDs and other potential problems with the
given GFF/GTF records.
- name: Options
arguments:
- name: --ids
type: file
description: |
Discard records/transcripts if their IDs are not listed in <IDs.lst>.
- name: --nids
type: file
description: |
Discard records/transcripts if their IDs are listed in <IDs.lst>.
- name: --maxintron
alternatives: -i
type: integer
description: |
Discard transcripts having an intron larger than <maxintron>.
- name: --minlen
alternatives: -l
type: integer
description: |
Discard transcripts shorter than <minlen> bases.
- name: --range
alternatives: -r
type: string
description: |
Only show transcripts overlapping coordinate range <start>..<end> (on chromosome/contig
<chr>, strand <strand> if provided).
- name: --strict_range
alternatives: -R
type: boolean_true
description: |
For --range option, discard all transcripts that are not fully contained within the given
range.
- name: --jmatch
type: string
description: |
Only output transcripts matching the given junction.
- name: --no_single_exon
alternatives: -U
type: boolean_true
description: |
Discard single-exon transcripts.
- name: --coding
alternatives: -C
type: boolean_true
description: |
Coding only: discard mRNAs that have no CDS features.
- name: --nc
type: boolean_true
description: |
Non-coding only: discard mRNAs that have CDS features.
- name: --ignore_locus
type: boolean_true
description: |
Discard locus features and attributes found in the input.
- name: --description
alternatives: -A
type: boolean_true
description: |
Use the description field from <seq_info.fsize> and add it as the value for a 'descr'
attribute to the GFF record.
- name: Sorting
arguments:
- name: --sort_alpha
type: boolean_true
description: |
Chromosomes (reference sequences) are sorted alphabetically.
- name: --sort_by
type: file
must_exist: true
description: |
Sort the reference sequences by the order in which their names are given in the
<refseq.lst> file.
- name: Misc options
arguments:
- name: --keep_attrs
alternatives: -F
type: boolean_true
description: |
Keep all GFF attributes (for non-exon features).
- name: --keep_exon_attrs
type: boolean_true
description: |
For --keep_attrs option, do not attempt to reduce redundant exon/CDS attributes.
- name: --no_exon_attrs
alternatives: -G
type: boolean_true
description: |
Do not keep exon attributes, move them to the transcript feature (for GFF3 output).
- name: --attrs
type: string
description: |
Only output the GTF/GFF attributes listed in <attr-list>, which is a comma-delimited
list of attribute names.
- name: --keep_genes
type: boolean_true
description: |
In transcript-only mode (default), also preserve gene records.
- name: --keep_comments
type: boolean_true
description: |
For GFF3 input/output, try to preserve comments.
- name: --process_other
alternatives: -O
type: boolean_true
description: |
Process other non-transcript GFF records (by default non-transcript records are ignored).
- name: --rm_stop_codons
alternatives: -V
type: boolean_true
description: |
Discard any mRNAs with CDS having in-frame stop codons (requires --genome).
- name: --adj_cds_start
alternatives: -H
type: boolean_true
description: |
For --rm_stop_codons option, check and adjust the starting CDS phase if the original phase
leads to a translation with an in-frame stop codon.
- name: --opposite_strand
alternatives: -B
type: boolean_true
description: |
For --rm_stop_codons option, single-exon transcripts are also checked on the opposite strand (requires
--genome).
- name: --coding_status
alternatives: -P
type: boolean_true
description: |
Add transcript level GFF attributes about the coding status of each transcript, including
partialness or in-frame stop codons (requires --genome).
- name: --add_hasCDS
type: boolean_true
description: |
Add a "hasCDS" attribute with value "true" for transcripts that have CDS features.
- name: --adj_stop
type: boolean_true
description: |
Stop codon adjustment: enables --coding_status and performs automatic adjustment of the CDS stop
coordinate if premature or downstream.
- name: --rm_noncanon
alternatives: -N
type: boolean_true
description: |
Discard multi-exon mRNAs that have any intron with a non-canonical splice site consensus
(i.e. not GT-AG, GC-AG or AT-AC).
- name: --complete_cds
alternatives: -J
type: boolean_true
description: |
Discard any mRNAs that either lack initial START codon or the terminal STOP codon, or
have an in-frame stop codon (i.e. only print mRNAs with a complete CDS).
- name: --no_pseudo
type: boolean_true
description: |
Filter out records matching the 'pseudo' keyword.
- name: --in_bed
type: boolean_true
description: |
Input should be parsed as BED format (automatic if the input filename ends with .bed*).
- name: --in_tlf
type: boolean_true
description: |
Input GFF-like one-line-per-transcript format without exon/CDS features (see the --tlf
option); automatic if the input filename ends with .tlf.
- name: --stream
type: boolean_true
description: |
Fast processing of input GFF/BED transcripts as they are received (no sorting, exons must
be grouped by transcript in the input data).
- name: Clustering
arguments:
- name: --merge
alternatives: -M
type: boolean_true
description: |
Cluster the input transcripts into loci, discarding "redundant" transcripts (those with
the same exact introns and fully contained or equal boundaries).
- name: --dupinfo
alternatives: -d
type: file
description: |
For --merge option, write duplication info to file <dupinfo>.
- name: --cluster_only
type: boolean_true
description: |
Same as --merge but without discarding any of the "duplicate" transcripts, only create
"locus" features.
- name: --rm_redundant
alternatives: -K
type: boolean_true
description: |
For --merge option: also discard as redundant the shorter, fully contained transcripts (intron
chains matching a part of the container).
- name: --no_boundary
alternatives: -Q
type: boolean_true
description: |
For --merge option, no longer require boundary containment when assessing redundancy (can be
combined with --rm_redundant); only introns have to match for multi-exon transcripts, and >=80%
overlap for single-exon transcripts.
- name: --no_overlap
alternatives: -Y
type: boolean_true
description: |
For --merge option, enforce --no_boundary but also discard overlapping single-exon transcripts,
even on the opposite strand (can be combined with --rm_redundant).
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/gffread:0.12.7--hdcf5f25_3
setup:
- type: docker
run: |
echo "gffread: \"$(gffread --version 2>&1)\"" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

140
src/gffread/help.txt Normal file

@@ -0,0 +1,140 @@
```sh
gffread --help
```
gffread v0.12.7. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>]
[-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
[--jmatch <chr>:<start>-<end>] [--no-pseudo]
[-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
[-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
[--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
[<input_gff>]
Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
transcripts (exon or CDS) and more.
By default (i.e. without -O) only transcripts are processed, discarding any
other non-transcript features. Default output is a simplified GFF3 with only
the basic attributes.
Options:
--ids discard records/transcripts if their IDs are not listed in <IDs.lst>
--nids discard records/transcripts if their IDs are listed in <IDs.lst>
-i discard transcripts having an intron larger than <maxintron>
-l discard transcripts shorter than <minlen> bases
-r only show transcripts overlapping coordinate range <start>..<end>
(on chromosome/contig <chr>, strand <strand> if provided)
-R for -r option, discard all transcripts that are not fully
contained within the given range
--jmatch only output transcripts matching the given junction
-U discard single-exon transcripts
-C coding only: discard mRNAs that have no CDS features
--nc non-coding only: discard mRNAs that have CDS features
--ignore-locus : discard locus features and attributes found in the input
-A use the description field from <seq_info.fsize> and add it
as the value for a 'descr' attribute to the GFF record
-s <seq_info.fsize> is a tab-delimited file providing this info
for each of the mapped sequences:
<seq-name> <seq-length> <seq-description>
(useful for -A option with mRNA/EST/protein mappings)
Sorting: (by default, chromosomes are kept in the order they were found)
--sort-alpha : chromosomes (reference sequences) are sorted alphabetically
--sort-by : sort the reference sequences by the order in which their
names are given in the <refseq.lst> file
Misc options:
-F keep all GFF attributes (for non-exon features)
--keep-exon-attrs : for -F option, do not attempt to reduce redundant
exon/CDS attributes
-G do not keep exon attributes, move them to the transcript feature
(for GFF3 output)
--attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>
which is a comma delimited list of attribute names
--keep-genes : in transcript-only mode (default), also preserve gene records
--keep-comments: for GFF3 input/output, try to preserve comments
-O process other non-transcript GFF records (by default non-transcript
records are ignored)
-V discard any mRNAs with CDS having in-frame stop codons (requires -g)
-H for -V option, check and adjust the starting CDS phase
if the original phase leads to a translation with an
in-frame stop codon
-B for -V option, single-exon transcripts are also checked on the
opposite strand (requires -g)
-P add transcript level GFF attributes about the coding status of each
transcript, including partialness or in-frame stop codons (requires -g)
--add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
that have CDS features
--adj-stop stop codon adjustment: enables -P and performs automatic
adjustment of the CDS stop coordinate if premature or downstream
-N discard multi-exon mRNAs that have any intron with a non-canonical
splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
-J discard any mRNAs that either lack initial START codon
or the terminal STOP codon, or have an in-frame stop codon
(i.e. only print mRNAs with a complete CDS)
--no-pseudo: filter out records matching the 'pseudo' keyword
--in-bed: input should be parsed as BED format (automatic if the input
filename ends with .bed*)
--in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
features (see --tlf option below); automatic if the input
filename ends with .tlf
--stream: fast processing of input GFF/BED transcripts as they are received
(no sorting, exons must be grouped by transcript in the input data)
Clustering:
-M/--merge : cluster the input transcripts into loci, discarding
"redundant" transcripts (those with the same exact introns
and fully contained or equal boundaries)
-d <dupinfo> : for -M option, write duplication info to file <dupinfo>
--cluster-only: same as -M/--merge but without discarding any of the
"duplicate" transcripts, only create "locus" features
-K for -M option: also discard as redundant the shorter, fully contained
transcripts (intron chains matching a part of the container)
-Q for -M option, no longer require boundary containment when assessing
redundancy (can be combined with -K); only introns have to match for
multi-exon transcripts, and >=80% overlap for single-exon transcripts
-Y for -M option, enforce -Q but also discard overlapping single-exon
transcripts, even on the opposite strand (can be combined with -K)
Output options:
--force-exons: make sure that the lowest level GFF features are considered
"exon" features
--gene2exon: for single-line genes not parenting any transcripts, add an
exon feature spanning the entire gene (treat it as a transcript)
--t-adopt: try to find a parent gene overlapping/containing a transcript
that does not have any explicit gene Parent
-D decode url encoded characters within attributes
-Z merge very close exons into a single exon (when intron size<4)
-g full path to a multi-fasta file with the genomic sequences
for all input mappings, OR a directory with single-fasta files
(one per genomic sequence, with file names matching sequence names)
-j output the junctions and the corresponding transcripts
-w write a fasta file with spliced exons for each transcript
--w-add <N> for the -w option, extract additional <N> bases
both upstream and downstream of the transcript boundaries
--w-nocds for -w, disable the output of CDS info in the FASTA file
-x write a fasta file with spliced CDS for each GFF transcript
-y write a protein fasta file with the translation of CDS for each record
-W for -w, -x and -y options, write in the FASTA defline all the exon
coordinates projected onto the spliced sequence;
-S for -y option, use '*' instead of '.' as stop codon translation
-L Ensembl GTF to GFF3 conversion, adds version to IDs
-m <chr_replace> is a name mapping table for converting reference
sequence names, having this 2-column format:
<original_ref_ID> <new_ref_ID>
-t use <trackname> in the 2nd column of each GFF/GTF output line
-o write the output records into <outfile> instead of stdout
-T main output will be GTF instead of GFF3
--bed output records in BED format instead of default GFF3
--tlf output "transcript line format" which is like GFF
but with exons and CDS related features stored as GFF
attributes in the transcript feature line, like this:
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
<exons> is a comma-delimited list of exon_start-exon_end coordinates;
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
--table output a simple tab delimited format instead of GFF, with columns
having the values of GFF attributes given in <attrlist>; special
pseudo-attributes (prefixed by @) are recognized:
@id, @geneid, @chr, @start, @end, @strand, @numexons, @exons,
@cds, @covlen, @cdslen
If any of -w/-y/-x FASTA output files are enabled, the same fields
(excluding @id) are appended to the definition line of corresponding
FASTA records
-v,-E expose (warn about) duplicate transcript IDs and other potential
problems with the given GFF/GTF records
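
Two common uses of the options listed above are format conversion and sequence extraction. The following is a minimal sketch that builds both commands. The file names (`annotation.gff`, `genome.fa`, `annotation.gtf`, `transcripts.fa`) are placeholders, and the commands are only executed when `gffread` and the input files are actually present.

```bash
#!/bin/bash
# Hypothetical placeholder file names; replace them with real paths.
gff="annotation.gff"
genome="genome.fa"

# Convert GFF3 to GTF: -T switches the main output to GTF, -o names the file.
to_gtf=(gffread "$gff" -T -o annotation.gtf)

# Write spliced exon sequences for each transcript with -w; the genomic
# sequences are supplied with -g.
to_fasta=(gffread "$gff" -g "$genome" -w transcripts.fa)

# Print both commands so they can be inspected before running them.
printf '%s ' "${to_gtf[@]}"; echo
printf '%s ' "${to_fasta[@]}"; echo

# Only run when the tool and the input files are available.
if command -v gffread >/dev/null 2>&1 && [[ -f "$gff" && -f "$genome" ]]; then
  "${to_gtf[@]}"
  "${to_fasta[@]}"
fi
```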

119
src/gffread/script.sh Normal file

@@ -0,0 +1,119 @@
#!/bin/bash
## VIASH START
## VIASH END
# unset flags
[[ "$par_coding" == "false" ]] && unset par_coding
[[ "$par_strict_range" == "false" ]] && unset par_strict_range
[[ "$par_no_single_exon" == "false" ]] && unset par_no_single_exon
[[ "$par_no_exon_attrs" == "false" ]] && unset par_no_exon_attrs
[[ "$par_nc" == "false" ]] && unset par_nc
[[ "$par_ignore_locus" == "false" ]] && unset par_ignore_locus
[[ "$par_description" == "false" ]] && unset par_description
[[ "$par_sort_alpha" == "false" ]] && unset par_sort_alpha
[[ "$par_keep_genes" == "false" ]] && unset par_keep_genes
[[ "$par_keep_attrs" == "false" ]] && unset par_keep_attrs
[[ "$par_keep_exon_attrs" == "false" ]] && unset par_keep_exon_attrs
[[ "$par_keep_comments" == "false" ]] && unset par_keep_comments
[[ "$par_process_other" == "false" ]] && unset par_process_other
[[ "$par_rm_stop_codons" == "false" ]] && unset par_rm_stop_codons
[[ "$par_adj_cds_start" == "false" ]] && unset par_adj_cds_start
[[ "$par_opposite_strand" == "false" ]] && unset par_opposite_strand
[[ "$par_coding_status" == "false" ]] && unset par_coding_status
[[ "$par_add_hasCDS" == "false" ]] && unset par_add_hasCDS
[[ "$par_adj_stop" == "false" ]] && unset par_adj_stop
[[ "$par_rm_noncanon" == "false" ]] && unset par_rm_noncanon
[[ "$par_complete_cds" == "false" ]] && unset par_complete_cds
[[ "$par_no_pseudo" == "false" ]] && unset par_no_pseudo
[[ "$par_in_bed" == "false" ]] && unset par_in_bed
[[ "$par_in_tlf" == "false" ]] && unset par_in_tlf
[[ "$par_stream" == "false" ]] && unset par_stream
[[ "$par_merge" == "false" ]] && unset par_merge
[[ "$par_rm_redundant" == "false" ]] && unset par_rm_redundant
[[ "$par_no_boundary" == "false" ]] && unset par_no_boundary
[[ "$par_no_overlap" == "false" ]] && unset par_no_overlap
[[ "$par_force_exons" == "false" ]] && unset par_force_exons
[[ "$par_gene2exon" == "false" ]] && unset par_gene2exon
[[ "$par_t_adopt" == "false" ]] && unset par_t_adopt
[[ "$par_decode" == "false" ]] && unset par_decode
[[ "$par_merge_exons" == "false" ]] && unset par_merge_exons
[[ "$par_junctions" == "false" ]] && unset par_junctions
[[ "$par_w_nocds" == "false" ]] && unset par_w_nocds
[[ "$par_tr_cds" == "false" ]] && unset par_tr_cds
[[ "$par_w_coords" == "false" ]] && unset par_w_coords
[[ "$par_stop_dot" == "false" ]] && unset par_stop_dot
[[ "$par_id_version" == "false" ]] && unset par_id_version
[[ "$par_gtf_output" == "false" ]] && unset par_gtf_output
[[ "$par_bed" == "false" ]] && unset par_bed
[[ "$par_tlf" == "false" ]] && unset par_tlf
[[ "$par_expose_dups" == "false" ]] && unset par_expose_dups
[[ "$par_cluster_only" == "false" ]] && unset par_cluster_only
$(which gffread) \
"$par_input" \
${par_chr_mapping:+-m "$par_chr_mapping"} \
${par_seq_info:+-s "$par_seq_info"} \
-o "$par_outfile" \
${par_force_exons:+--force-exons} \
${par_gene2exon:+--gene2exon} \
${par_t_adopt:+--t-adopt} \
${par_decode:+-D} \
${par_merge_exons:+-Z} \
${par_genome:+-g "$par_genome"} \
${par_junctions:+-j} \
${par_spliced_exons:+-w "$par_spliced_exons"} \
${par_w_add:+--w-add "$par_w_add"} \
${par_w_nocds:+--w-nocds} \
${par_spliced_cds:+-x "$par_spliced_cds"} \
${par_tr_cds:+-y "$par_tr_cds"} \
${par_w_coords:+-W} \
${par_stop_dot:+-S} \
${par_id_version:+-L} \
${par_trackname:+-t "$par_trackname"} \
${par_gtf_output:+-T} \
${par_bed:+--bed} \
${par_tlf:+--tlf} \
${par_table:+--table "$par_table"} \
${par_expose_dups:+-E} \
${par_ids:+--ids "$par_ids"} \
${par_nids:+--nids "$par_nids"} \
${par_maxintron:+-i "$par_maxintron"} \
${par_minlen:+-l "$par_minlen"} \
${par_range:+-r "$par_range"} \
${par_strict_range:+-R} \
${par_jmatch:+--jmatch "$par_jmatch"} \
${par_no_single_exon:+-U} \
${par_coding:+-C} \
${par_nc:+--nc} \
${par_ignore_locus:+--ignore-locus} \
${par_description:+-A} \
${par_sort_alpha:+--sort-alpha} \
${par_sort_by:+--sort-by "$par_sort_by"} \
${par_keep_attrs:+-F} \
${par_keep_exon_attrs:+--keep-exon-attrs} \
${par_no_exon_attrs:+-G} \
${par_attrs:+--attrs "$par_attrs"} \
${par_keep_genes:+--keep-genes} \
${par_keep_comments:+--keep-comments} \
${par_process_other:+-O} \
${par_rm_stop_codons:+-V} \
${par_adj_cds_start:+-H} \
${par_opposite_strand:+-B} \
${par_coding_status:+-P} \
${par_add_hasCDS:+--add-hasCDS} \
${par_adj_stop:+--adj-stop} \
${par_rm_noncanon:+-N} \
${par_complete_cds:+-J} \
${par_no_pseudo:+--no-pseudo} \
${par_in_bed:+--in-bed} \
${par_in_tlf:+--in-tlf} \
${par_stream:+--stream} \
${par_merge:+-M} \
${par_dupinfo:+-d "$par_dupinfo"} \
${par_cluster_only:+--cluster-only} \
${par_rm_redundant:+-K} \
${par_no_boundary:+-Q} \
${par_no_overlap:+-Y}

111
src/gffread/test.sh Executable file

@@ -0,0 +1,111 @@
#!/bin/bash
## VIASH START
## VIASH END
set -e
test_output_dir="${meta_resources_dir}/test_data/test_output"
test_dir="${meta_resources_dir}/test_data"
expected_output_dir="${meta_resources_dir}/test_data/output"
mkdir -p "$test_output_dir"
################################################################################
echo "> Test 1 - Read annotation file, output GFF"
"$meta_executable" \
--expose_dups \
--outfile "$test_output_dir/ann_simple.gff" \
--input "$test_dir/sequence.gff3"
echo ">> Check if output exists"
[ ! -f "$test_output_dir/ann_simple.gff" ] \
&& echo "Output file test_output/ann_simple.gff does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "$test_output_dir/ann_simple.gff" ] \
&& echo "Output file test_output/ann_simple.gff is empty" && exit 1
echo ">> Compare output to expected output"
# compare files, except for lines starting with "#"
diff <(grep -v "^#" "$expected_output_dir/ann_simple.gff") \
<(grep -v "^#" "$test_output_dir/ann_simple.gff") || \
(echo "Output file ann_simple.gff does not match expected output" && exit 1)
################################################################################
echo "> Test 2 - Read annotation file, output GTF"
"$meta_executable" \
--gtf_output \
--outfile "$test_output_dir/annotation.gtf" \
--input "$test_dir/sequence.gff3"
echo ">> Check if output exists"
[ ! -f "$test_output_dir/annotation.gtf" ] \
&& echo "Output file test_output/annotation.gtf does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "$test_output_dir/annotation.gtf" ] \
&& echo "Output file test_output/annotation.gtf is empty" && exit 1
echo ">> Compare output to expected output"
diff "$expected_output_dir/annotation.gtf" "$test_output_dir/annotation.gtf" || \
(echo "Output file annotation.gtf does not match expected output" && exit 1)
################################################################################
echo "> Test 3 - Generate fasta file from annotation file"
"$meta_executable" \
--genome "$test_dir/sequence.fasta" \
--spliced_exons "$test_output_dir/transcripts.fa" \
--outfile "$test_output_dir/output.gff" \
--input "$test_dir/sequence.gff3"
echo ">> Check if output exists"
[ ! -f "$test_output_dir/transcripts.fa" ] \
&& echo "Output file transcripts.fa does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "$test_output_dir/transcripts.fa" ] \
&& echo "Output file transcripts.fa is empty" && exit 1
echo ">> Compare output to expected output"
diff "$expected_output_dir/transcripts.fa" "$test_output_dir/transcripts.fa" || \
(echo "Output file transcripts.fa does not match expected output" && exit 1)
################################################################################
echo "> Test 4 - Generate table from GFF annotation file"
"$meta_executable" \
--table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
--outfile "$test_output_dir/annotation.tbl" \
--input "$test_dir/sequence.gff3"
echo ">> Check if output exists"
[ ! -f "$test_output_dir/annotation.tbl" ] \
&& echo "Output file test_output/annotation.tbl does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "$test_output_dir/annotation.tbl" ] \
&& echo "Output file test_output/annotation.tbl is empty" && exit 1
echo ">> Compare output to expected output"
diff "$expected_output_dir/annotation.tbl" "$test_output_dir/annotation.tbl" || \
(echo "Output file annotation.tbl does not match expected output" && exit 1)
################################################################################
rm -r "$test_output_dir"
echo "> All tests successful"
exit 0


@@ -0,0 +1,38 @@
## GffRead usage examples
GffRead can be used to simply read an annotation file in a GFF format, and print it in either GFF3 (default) or
GTF2 format (with the -T option), while discarding any non-transcript features and optional attributes.
It can also report some potential issues found in the input GFF records. The command line for such a quick GFF/GTF
file cleanup would be:
```
gffread -E annotation.gff -o ann_simple.gff
```
This will create a minimalist GFF3 re-formatting of the transcript records found in the input file (`annotation.gff` in this example).
The -E option directs GffRead to "expose" (display warnings about) any potential formatting issues
encountered while parsing the input file.
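The GFF3 file written by GffRead starts with comment lines that record the program version and the command line (see `output/ann_simple.gff`), so a byte-for-byte comparison with a reference file can fail even when the records agree. A minimal sketch of a comparison that ignores those lines follows; the function name `same_gff_records` is hypothetical and not part of GffRead:

```bash
# Hypothetical helper (not part of GffRead): succeed when two GFF files hold
# the same records, ignoring every line that starts with "#".
same_gff_records() {
  # $1, $2: the GFF files to compare
  _recs_a=$(grep -v '^#' "$1" || true)
  _recs_b=$(grep -v '^#' "$2" || true)
  [ "$_recs_a" = "$_recs_b" ]
}
```

For example, `same_gff_records output/ann_simple.gff ann_simple.gff && echo "records match"`.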
In order to obtain the GTF2 version of the same transcript records, the `-T` option should be added:
```
gffread annotation.gff -T -o annotation.gtf
```
GffRead can be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation
a fasta file with the genomic sequences has to be provided as well. This can be accomplished with a command line like this:
```
gffread -w transcripts.fa -g genome.fa annotation.gff
```
The file `genome.fa` in this example would be a multi-fasta file with the chromosome/contig sequences of the target genome.
This also requires that every contig or chromosome name found in the 1st column of the input GFF file
(`annotation.gff` in this example) must have a corresponding sequence entry in the `genome.fa` file.
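One way to check this requirement before extracting sequences is to compare the sequence names in the two files. The sketch below is not part of GffRead; the function name `missing_contigs` is hypothetical, and it assumes an uncompressed, tab-delimited GFF file and a non-empty FASTA file whose sequence name is the first word after `>`:

```bash
# Hypothetical helper (not part of GffRead): print each contig name that
# appears in column 1 of a GFF file but has no sequence entry in a FASTA file.
missing_contigs() {
  # $1: GFF file, $2: FASTA file (read first, to collect the sequence names)
  awk -F '\t' '
    NR == FNR {
      if (/^>/) { sub(/^>/, "", $0); split($0, w, /[ \t]/); seen[w[1]] = 1 }
      next
    }
    /^#/ || NF == 0 { next }                       # skip comments and blank lines
    !($1 in seen) && !reported[$1]++ { print $1 }  # report each name once
  ' "$2" "$1"
}
```

Running `missing_contigs annotation.gff genome.fa` prints nothing when every contig in the annotation has a matching sequence.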
```
gffread --table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
-o annotation.tbl annotation.gff
```
This shows how the `--table` option can make a tab delimited table out of a GFF3 input.
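The reference table in the `output` directory has no header line. If a header is wanted, it can be derived from the same field list that was passed to `--table`. This is a sketch only; the function name `with_table_header` is hypothetical and not a GffRead feature:

```bash
# Hypothetical helper (not part of GffRead): print a tab-delimited header row
# built from the --table field list, followed by the table itself.
with_table_header() {
  # $1: comma-separated field list given to --table, $2: the table file
  printf '%s\n' "$1" | tr -d '@' | tr ',' '\t'
  cat "$2"
}
```

For example, `with_table_header '@id,@chr,@start,@end,@strand,@exons,Name,gene,product' annotation.tbl > annotation_with_header.tbl`.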
The `output` directory contains all the output files that should be generated by the above examples.


@@ -0,0 +1,5 @@
##gff-version 3
# gffread v0.12.7
# gffread -E -o output/ann_simple.gff sequence.gff3
NM_141699.3 RefSeq gene 22 795 . + . ID=gene-Dmel_CG16905;gene_name=eloF
NM_141699.3 RefSeq CDS 22 795 . + 0 Parent=gene-Dmel_CG16905


@@ -0,0 +1,2 @@
NM_141699.3 RefSeq transcript 22 795 . + . transcript_id "gene-Dmel_CG16905"; gene_id "gene-Dmel_CG16905"; gene_name "eloF"
NM_141699.3 RefSeq CDS 22 795 . + 0 transcript_id "gene-Dmel_CG16905"; gene_name "eloF";


@@ -0,0 +1 @@
gene-Dmel_CG16905 NM_141699.3 22 795 + 22-795 eloF eloF elongase F


@@ -0,0 +1,13 @@
>gene-Dmel_CG16905 CDS=1-774
ATGTTCGCTCCGATAGATCCTGTAAAGATACCCGTTGTAAGCAATCCATGGATAACCATGGGCACATTGA
TTGGCTATCTGCTGTTTGTGCTCAAGCTGGGCCCCAAAATCATGGAGCACCGAAAGCCCTTCCATTTGAA
TGGCGTCATCAGGATCTACAACATATTCCAGATCCTTTACAATGGTCTAATACTCGTTTTAGGAGTTCAC
TTCCTGTTTGTCCTGAAAGCCTACCAAATCAGTTGCATTGTTAGCCTGCCGATGGATCACAAATATAAGG
ATAGAGAGCGTTTGATTTGCACTTTGTACCTGGTGAACAAATTCGTAGACCTTGTGGAAACCATTTTCTT
TGTGCTCCGCAAAAAGGACAGACAGATATCCTTCCTGCACGTCTTCCATCATTTTGCGATGGCATTTTTT
GGATATCTCTACTACTGCTTCCACGGATACGGTGGCGTTGCCTTTCCACAGTGCCTGCTAAACACCGCCG
TCCACGTGATTATGTACGCCTACTACTATCTATCCTCGATCAGCAAGGAGGTGCAGAGAAGTCTCTGGTG
GAAGAAATACATCACAATTGCTCAGCTGGTCCAGTTCGCCATTATTCTGCTCCACTGTACCATCACGCTG
GCACAGCCCAACTGCGCGGTCAACAGACCCTTGACCTACGGATGCGGATCGCTTTCAGCGTTTTTTGCAG
TGATATTTAGCCAATTTTATTACCACAACTACATAAAGCCAGGAAAGAAGTCAGCGAAACAAAACAAAAA
TTAA


@@ -0,0 +1,9 @@
#!/bin/bash
# clone repo
if [ ! -d /tmp/gffread_source ]; then
git clone --depth 2 --single-branch --branch master https://github.com/gpertea/gffread.git /tmp/gffread_source
fi
# copy test data
cp -r /tmp/gffread_source/examples/* src/gffread/test_data


@@ -0,0 +1,16 @@
>NM_141699.3 Drosophila melanogaster elongase F (eloF), mRNA
CACAACTCGATTAGATTCGCCATGTTCGCTCCGATAGATCCTGTAAAGATACCCGTTGTAAGCAATCCAT
GGATAACCATGGGCACATTGATTGGCTATCTGCTGTTTGTGCTCAAGCTGGGCCCCAAAATCATGGAGCA
CCGAAAGCCCTTCCATTTGAATGGCGTCATCAGGATCTACAACATATTCCAGATCCTTTACAATGGTCTA
ATACTCGTTTTAGGAGTTCACTTCCTGTTTGTCCTGAAAGCCTACCAAATCAGTTGCATTGTTAGCCTGC
CGATGGATCACAAATATAAGGATAGAGAGCGTTTGATTTGCACTTTGTACCTGGTGAACAAATTCGTAGA
CCTTGTGGAAACCATTTTCTTTGTGCTCCGCAAAAAGGACAGACAGATATCCTTCCTGCACGTCTTCCAT
CATTTTGCGATGGCATTTTTTGGATATCTCTACTACTGCTTCCACGGATACGGTGGCGTTGCCTTTCCAC
AGTGCCTGCTAAACACCGCCGTCCACGTGATTATGTACGCCTACTACTATCTATCCTCGATCAGCAAGGA
GGTGCAGAGAAGTCTCTGGTGGAAGAAATACATCACAATTGCTCAGCTGGTCCAGTTCGCCATTATTCTG
CTCCACTGTACCATCACGCTGGCACAGCCCAACTGCGCGGTCAACAGACCCTTGACCTACGGATGCGGAT
CGCTTTCAGCGTTTTTTGCAGTGATATTTAGCCAATTTTATTACCACAACTACATAAAGCCAGGAAAGAA
GTCAGCGAAACAAAACAAAAATTAACTAAATTTAAACTAAATCATGAGTACAAAGCCTAAAGATTCGTGA
AGCAACAATAGCCACAGCCTATTTTTGAATATTTCATATATGATTTTATGGGGTAAATGAATTAAAAAAC
ATTTGTTTTCTTGGCGTCAAACT


@@ -0,0 +1,9 @@
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NM_141699.3 1 933
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7227
NM_141699.3 RefSeq region 1 933 . + . ID=NM_141699.3:1..933;Dbxref=taxon:7227;Name=3R;chromosome=3R;gbkey=Src;genome=chromosome;genotype=y[1]%3B Gr22b[1] Gr22d[1] cn[1] CG33964[R4.2] bw[1] sp[1]%3B LysC[1] MstProx[1] GstD5[1] Rh6[1];mol_type=mRNA
NM_141699.3 RefSeq gene 1 933 . + . ID=gene-Dmel_CG16905;Dbxref=FLYBASE:FBgn0037762,GeneID:41211;Name=eloF;cyt_map=85E10-85E10;description=elongase F;gbkey=Gene;gen_map=3-49 cM;gene=eloF;gene_synonym=CG16905,Dmel\CG16905,EloF;locus_tag=Dmel_CG16905
NM_141699.3 RefSeq CDS 22 795 . + 0 ID=cds-NP_649956.1;Parent=gene-Dmel_CG16905;Dbxref=FLYBASE:FBpp0081622,GeneID:41211,GenBank:NP_649956.1,FLYBASE:FBgn0037762;Name=NP_649956.1;gbkey=CDS;gene=eloF;locus_tag=Dmel_CG16905;orig_transcript_id=gnl|FlyBase|CG16905-RA;product=elongase F;protein_id=NP_649956.1


@@ -0,0 +1,250 @@
name: lofreq_call
namespace: lofreq
description: |
Call variants from a BAM file.
LoFreq* (i.e. LoFreq version 2) is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. It makes full use of base-call qualities and other sources of errors inherent in sequencing (e.g. mapping or base/indel alignment uncertainty), which are usually ignored by other methods or only used for filtering.
LoFreq* can run on almost any type of aligned sequencing data (e.g. Illumina, IonTorrent or Pacbio) since no machine- or sequencing-technology dependent thresholds are used. It automatically adapts to changes in coverage and sequencing quality and can therefore be applied to a variety of data-sets e.g. viral/quasispecies, bacterial, metagenomics or somatic data.
LoFreq* is very sensitive; most notably, it is able to predict variants below the average base-call quality (i.e. sequencing error rate). Each variant call is assigned a p-value which allows for rigorous false positive control. Even though it uses no approximations or heuristics, it is very efficient due to several runtime optimizations and also provides a (pseudo-)parallel implementation. LoFreq* is generic and fast enough to be applied to high-coverage data and large genomes. On a single processor it takes a minute to analyze Dengue genome sequencing data with nearly 4000X coverage, roughly one hour to call SNVs on a 600X coverage E.coli genome and also roughly an hour to run on a 100X coverage human exome dataset.
keywords: [ "variant calling", "low frequency variant calling", "lofreq", "lofreq/call"]
links:
homepage: https://csb5.github.io/lofreq/
documentation: https://csb5.github.io/lofreq/commands/
references:
doi: 10.1093/nar/gks918
license: "MIT"
requirements:
commands: [ lofreq ]
argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: |
Input BAM file.
required: true
example: "normal.bam"
- name: --input_bai
type: file
description: |
Index file for the input BAM file.
required: true
example: "normal.bai"
- name: --ref
alternatives: -f
type: file
description: |
Indexed reference fasta file (gzip supported). Default: none.
required: true
example: "reference.fasta"
- name: Outputs
arguments:
- name: --out
alternatives: -o
type: file
description: |
Vcf output file. Default: stdout.
required: true
direction: output
example: "output.vcf"
- name: Arguments
arguments:
- name: --region
alternatives: -r
type: string
description: |
Limit calls to this region (chrom:start-end). Default: none.
required: false
example: "chr1:1000-2000"
- name: --bed
alternatives: -l
type: file
description: |
List of positions (chr pos) or regions (BED). Default: none.
required: false
example: "regions.bed"
- name: --min_bq
alternatives: -q
type: integer
description: |
Skip any base with baseQ smaller than INT. Default: 6.
required: false
example: 6
- name: --min_alt_bq
alternatives: -Q
type: integer
description: |
Skip alternate bases with baseQ smaller than INT. Default: 6.
required: false
example: 6
- name: --def_alt_bq
alternatives: -R
type: integer
description: |
Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep). Default: 0.
required: false
example: 0
- name: --min_jq
alternatives: -j
type: integer
description: |
Skip any base with joinedQ smaller than INT. Default: 0.
example: 0
- name: --min_alt_jq
alternatives: -J
type: integer
description: |
Skip alternate bases with joinedQ smaller than INT. Default: 0.
required: false
example: 0
- name: --def_alt_jq
alternatives: -K
type: integer
description: |
Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep). Default: 0.
required: false
example: 0
- name: --no_baq
alternatives: -B
type: boolean_true
description: |
Disable use of base-alignment quality (BAQ).
- name: --no_idaq
alternatives: -A
type: boolean_true
description: |
Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging).
- name: --del_baq
alternatives: -D
type: boolean_true
description: |
Delete pre-existing BAQ values, i.e. compute even if already present in BAM.
- name: --no_ext_baq
alternatives: -e
type: boolean_true
description: |
Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag).
- name: --min_mq
alternatives: -m
type: integer
description: |
Skip reads with mapping quality smaller than INT. Default: 0.
required: false
example: 0
- name: --max_mq
alternatives: -M
type: integer
description: |
Cap mapping quality at INT. Default: 255.
required: false
example: 255
- name: --no_mq
alternatives: -N
type: boolean_true
description: |
Don't merge mapping quality in LoFreq's model.
- name: --call_indels
type: boolean_true
description: |
Enable indel calls (note: preprocess your file to include indel alignment qualities!).
- name: --only_indels
type: boolean_true
description: |
Only call indels; no SNVs.
- name: --src_qual
alternatives: -s
type: boolean_true
description: |
Enable computation of source quality.
- name: --ign_vcf
alternatives: -S
type: file
description: |
Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas.
required: false
example: "variants.vcf"
- name: --def_nm_q
alternatives: -T
type: integer
description: |
If >= 0, then replace non-match base qualities with this default value. Default: -1.
required: false
example: -1
- name: --sig
alternatives: -a
type: double
description: |
P-Value cutoff / significance level. Default: 0.010000.
required: false
example: 0.01
- name: --bonf
alternatives: -b
type: string
description: |
Bonferroni factor. 'dynamic' (increase per actually performed test) or INT. Default: dynamic.
required: false
example: "dynamic"
- name: --min_cov
alternatives: -C
type: integer
description: |
Test only positions having at least this coverage. Default: 1.
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done).
required: false
example: 1
- name: --max_depth
alternatives: -d
type: integer
description: |
Cap coverage at this depth. Default: 1000000.
required: false
example: 1000000
- name: --illumina_13
type: boolean_true
description: |
Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded.
- name: --use_orphan
type: boolean_true
description: |
Count anomalous read pairs (i.e. where mate is not aligned properly).
- name: --plp_summary_only
type: boolean_true
description: |
No variant calling. Just output pileup summary per column.
- name: --no_default_filter
type: boolean_true
description: |
Don't run default 'lofreq filter' automatically after calling variants.
- name: --force_overwrite
type: boolean_true
description: |
Overwrite any existing output.
- name: --verbose
type: boolean_true
description: |
Be verbose.
- name: --debug
type: boolean_true
description: |
Enable debugging.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/lofreq:2.1.5--py38h794fc9e_10
setup:
- type: docker
run: |
version=$(lofreq version | grep 'version' | sed 's/version: //') && \
echo "lofreq: $version" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

49
src/lofreq/call/help.txt Normal file

@@ -0,0 +1,49 @@
lofreq call: call variants from BAM file
Usage: lofreq call [options] in.bam
Options:
- Reference:
-f | --ref FILE Indexed reference fasta file (gzip supported) [null]
- Output:
-o | --out FILE Vcf output file [- = stdout]
- Regions:
-r | --region STR Limit calls to this region (chrom:start-end) [null]
-l | --bed FILE List of positions (chr pos) or regions (BED) [null]
- Base-call quality:
-q | --min-bq INT Skip any base with baseQ smaller than INT [6]
-Q | --min-alt-bq INT Skip alternate bases with baseQ smaller than INT [6]
-R | --def-alt-bq INT Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep) [0]
-j | --min-jq INT Skip any base with joinedQ smaller than INT [0]
-J | --min-alt-jq INT Skip alternate bases with joinedQ smaller than INT [0]
-K | --def-alt-jq INT Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep) [0]
- Base-alignment (BAQ) and indel-aligment (IDAQ) qualities:
-B | --no-baq Disable use of base-alignment quality (BAQ)
-A | --no-idaq Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging)
-D | --del-baq Delete pre-existing BAQ values, i.e. compute even if already present in BAM
-e | --no-ext-baq Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag)
- Mapping quality:
-m | --min-mq INT Skip reads with mapping quality smaller than INT [0]
-M | --max-mq INT Cap mapping quality at INT [255]
-N | --no-mq Don't merge mapping quality in LoFreq's model
- Indels:
--call-indels Enable indel calls (note: preprocess your file to include indel alignment qualities!)
--only-indels Only call indels; no SNVs
- Source quality:
-s | --src-qual Enable computation of source quality
-S | --ign-vcf FILE Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas
-T | --def-nm-q INT If >= 0, then replace non-match base qualities with this default value [-1]
- P-values:
-a | --sig P-Value cutoff / significance level [0.010000]
-b | --bonf Bonferroni factor. 'dynamic' (increase per actually performed test) or INT ['dynamic']
- Misc.:
-C | --min-cov INT Test only positions having at least this coverage [1]
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done)
-d | --max-depth INT Cap coverage at this depth [1000000]
--illumina-1.3 Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded
--use-orphan Count anomalous read pairs (i.e. where mate is not aligned properly)
--plp-summary-only No variant calling. Just output pileup summary per column
--no-default-filter Don't run default 'lofreq filter' automatically after calling variants
--force-overwrite Overwrite any existing output
--verbose Be verbose
--debug Enable debugging

57
src/lofreq/call/script.sh Normal file

@@ -0,0 +1,57 @@
#!/bin/bash
## VIASH START
## VIASH END
# Unset all parameters that are set to "false"
[[ "$par_no_baq" == "false" ]] && unset par_no_baq
[[ "$par_no_idaq" == "false" ]] && unset par_no_idaq
[[ "$par_del_baq" == "false" ]] && unset par_del_baq
[[ "$par_no_ext_baq" == "false" ]] && unset par_no_ext_baq
[[ "$par_no_mq" == "false" ]] && unset par_no_mq
[[ "$par_call_indels" == "false" ]] && unset par_call_indels
[[ "$par_only_indels" == "false" ]] && unset par_only_indels
[[ "$par_src_qual" == "false" ]] && unset par_src_qual
[[ "$par_illumina_13" == "false" ]] && unset par_illumina_13
[[ "$par_use_orphan" == "false" ]] && unset par_use_orphan
[[ "$par_plp_summary_only" == "false" ]] && unset par_plp_summary_only
[[ "$par_no_default_filter" == "false" ]] && unset par_no_default_filter
[[ "$par_force_overwrite" == "false" ]] && unset par_force_overwrite
[[ "$par_verbose" == "false" ]] && unset par_verbose
[[ "$par_debug" == "false" ]] && unset par_debug
# Run lofreq call
lofreq call \
-f "$par_ref" \
-o "$par_out" \
${par_region:+-r "${par_region}"} \
${par_bed:+-l "${par_bed}"} \
${par_min_bq:+-q "${par_min_bq}"} \
${par_min_alt_bq:+-Q "${par_min_alt_bq}"} \
${par_def_alt_bq:+-R "${par_def_alt_bq}"} \
${par_min_jq:+-j "${par_min_jq}"} \
${par_min_alt_jq:+-J "${par_min_alt_jq}"} \
${par_def_alt_jq:+-K "${par_def_alt_jq}"} \
${par_no_baq:+-B} \
${par_no_idaq:+-A} \
${par_del_baq:+-D} \
${par_no_ext_baq:+-e} \
${par_min_mq:+-m "${par_min_mq}"} \
${par_max_mq:+-M "${par_max_mq}"} \
${par_no_mq:+-N} \
${par_call_indels:+--call-indels} \
${par_only_indels:+--only-indels} \
${par_src_qual:+-s} \
${par_ign_vcf:+-S "${par_ign_vcf}"} \
${par_def_nm_q:+-T "${par_def_nm_q}"} \
${par_sig:+-a "${par_sig}"} \
${par_bonf:+-b "${par_bonf}"} \
${par_min_cov:+-C "${par_min_cov}"} \
${par_max_depth:+-d "${par_max_depth}"} \
${par_illumina_13:+--illumina-1.3} \
${par_use_orphan:+--use-orphan} \
${par_plp_summary_only:+--plp-summary-only} \
${par_no_default_filter:+--no-default-filter} \
${par_force_overwrite:+--force-overwrite} \
${par_verbose:+--verbose} \
${par_debug:+--debug} \
"$par_input"

20
src/lofreq/call/test.sh Normal file

@@ -0,0 +1,20 @@
#!/bin/bash
set -e
dir_in="${meta_resources_dir%/}/test_data"
echo "> Run lofreq call"
"$meta_executable" \
--input "$dir_in/a.bam" \
--input_bai "$dir_in/a.bai" \
--ref "$dir_in/genome.fasta" \
--out "output.vcf"
echo ">> Checking output"
[ ! -f "output.vcf" ] && echo "Output file output.vcf does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "output.vcf" ] && echo "Output file output.vcf is empty" && exit 1
echo "> Test successful"

Binary file not shown.

Binary file not shown.


@@ -0,0 +1,8 @@
>SheilaA
GCTAGCTCAGAAAAAAAAAA
>SheilaB
GCTAGCTCAGAAAAAAAAAA
>SheilaC
GCTAGCTCAGAAAAAAAAAA
>SheilaD
GCTAGCTCAGAAAAAAAAAA


@@ -0,0 +1,4 @@
SheilaA 20 9 20 21
SheilaB 20 39 20 21
SheilaC 20 69 20 21
SheilaD 20 99 20 21


@@ -0,0 +1,10 @@
# lofreq call test data
# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/lofreq/call/test/data
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
cp -r /tmp/snakemake-wrappers/bio/lofreq/call/test/data/* src/lofreq/call/test_data


@@ -0,0 +1,82 @@
name: lofreq_indelqual
namespace: lofreq
description: |
Insert indel qualities into BAM file (required for indel predictions).
The preferred way of inserting indel qualities should be via GATK's BQSR (>=2). If that's not possible, use this subcommand.
The command has two modes: 'uniform' and 'dindel':
- 'uniform' will assign a given value uniformly, whereas
- 'dindel' will insert indel qualities based on Dindel (PMID 20980555).
Both will overwrite any existing values.
Do not realign your BAM file afterwards!
keywords: [ "bam", "indel", "qualities", "indelqual", "lofreq", "lofreq/indelqual"]
links:
homepage: https://csb5.github.io/lofreq/
documentation: https://csb5.github.io/lofreq/commands/
references:
doi: 10.1093/nar/gks918
license: "MIT"
requirements:
commands: [ lofreq ]
argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: |
Input BAM file.
required: true
example: "normal.bam"
- name: --ref
alternatives: -f
type: file
description: |
Reference sequence used for mapping (Only required for --dindel).
required: false
example: "reference.fasta"
- name: Outputs
arguments:
- name: --out
alternatives: -o
type: file
description: |
Output BAM file.
required: true
direction: output
example: "output.bam"
- name: Arguments
arguments:
- name: --uniform
alternatives: -u
type: string
description: |
Add this indel quality uniformly to all bases. Use two comma separated values to specify insertion and deletion quality separately. (clashes with --dindel).
required: false
example: "50,50"
- name: --dindel
type: boolean_true
description: |
Add Dindel's indel qualities (Illumina specific) (clashes with -u; needs --ref).
- name: --verbose
type: boolean_true
description: |
Be verbose.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/lofreq:2.1.5--py38h794fc9e_10
setup:
- type: docker
run: |
version=$(lofreq version | grep 'version' | sed 's/version: //') && \
echo "lofreq: $version" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,21 @@
lofreq indelqual: Insert indel qualities into BAM file (required for indel predictions)
Usage: lofreq indelqual [options] in.bam
Options:
-u | --uniform INT[,INT] Add this indel quality uniformly to all bases.
Use two comma separated values to specify
insertion and deletion quality separately.
(clashes with --dindel)
--dindel Add Dindel's indel qualities (Illumina specific)
(clashes with -u; needs --ref)
-f | --ref Reference sequence used for mapping
(Only required for --dindel)
-o | --out FILE Output BAM file [- = stdout = default]
--verbose Be verbose
The preferred way of inserting indel qualities should be via GATK's BQSR (>=2) If that's not possible, use this subcommand.
The command has two modes: 'uniform' and 'dindel':
- 'uniform' will assign a given value uniformly, whereas
- 'dindel' will insert indel qualities based on Dindel (PMID 20980555).
Both will overwrite any existing values.
Do not realign your BAM file afterwards!


@@ -0,0 +1,17 @@
#!/bin/bash
## VIASH START
## VIASH END
# Unset all parameters that are set to "false"
[[ "$par_dindel" == "false" ]] && unset par_dindel
[[ "$par_verbose" == "false" ]] && unset par_verbose
# run lofreq indelqual
lofreq indelqual \
-o "$par_out" \
${par_uniform:+-u "${par_uniform}"} \
${par_dindel:+--dindel} \
${par_ref:+-f "${par_ref}"} \
${par_verbose:+--verbose} \
"$par_input"


@@ -0,0 +1,46 @@
#!/bin/bash
set -e
dir_in="${meta_resources_dir%/}/test_data"
#############################################
mkdir uniform
cd uniform
echo "> Run lofreq indelqual uniform"
"$meta_executable" \
--input "$dir_in/test.bam" \
-u 15 \
--out "uniform.bam"
echo ">> Checking output"
[ ! -f "uniform.bam" ] && echo "Output file uniform.bam does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "uniform.bam" ] && echo "Output file uniform.bam is empty" && exit 1
cd ..
#############################################
mkdir dindel
cd dindel
echo "> run lofreq indelqual dindel"
"$meta_executable" \
--input "$dir_in/test.bam" \
--ref "$dir_in/test.fa" \
--dindel \
--out "dindel.bam"
echo ">> Checking output"
[ ! -f "dindel.bam" ] && echo "Output file dindel.bam does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "dindel.bam" ] && echo "Output file dindel.bam is empty" && exit 1
cd ..
#############################################
echo "> Test successful"


@@ -0,0 +1,44 @@
#!/bin/bash
set -e
TMPDIR=$(mktemp -d)
trap "rm -rf $TMPDIR" EXIT
### Step 1: Generate Test Reference FASTA File (`test.fa`)
cat > $TMPDIR/test.fa <<EOF
>chr1
AACTCTCCGTGCTGTCCGGGGTCACTGTGATGCCAGTGCCGTCGACGGACCACAGGAGCGCCGCCAATTACGATTTATA
GGCGGCCCGGCCGATTATATCTTTGGCGGTCCCCTAGGCTCTCTAGGGGCCCGCACTGAAGAGGGCAACTCTGCAAGGA
CACGAATCTGACTCCTTAATAAAGGTGTGAAATCTGTCCGGTCGTCTCCTAATATGGGGCTTCATCATCTCAGGCGAAA
TCAGCGCCCGACGGGCCATAGTAAGCGGTGTTGTGGCATAGGTGCAGGTGGCCACCGATTATAACAGGATGACATACGC
GGAATTCGGGGTATGATGCTCTCCCGACACTTTGAGACAATAAATAGTTTAGTGTCCTGATGGTCTAAACCGAAGTCAT
TCAAAATAGCTAAGTGTAGTCTTCCCGTTCTAGGGATAGTCTAGGACATGCCCTATATTGGTTTTCTCTTACCGCGGAC
TACTCCCGCGCCCTCGGAGGTGTCTCAATTCATCCATGTTGATCCTTCAAATCGGGGCAGCGACGGGGGCACGGAGGGG
GTACGATAACCGCTAAATTGACCACCACCATCGATGATTCTACCATCTCTATCCATCCAACCCTTTTTTTGTTTATTTC
CTCTATGGGTTACAGCTA
EOF
### Step 2: Index the Reference FASTA File
samtools faidx $TMPDIR/test.fa
### Step 3: Generate Test Reads with `wgsim`
wgsim -N 100 -1 70 -2 70 $TMPDIR/test.fa $TMPDIR/reads1.fq $TMPDIR/reads2.fq
### Step 4: Align Reads to Generate BAM File
bwa index $TMPDIR/test.fa
bwa mem $TMPDIR/test.fa $TMPDIR/reads1.fq $TMPDIR/reads2.fq > $TMPDIR/aligned_reads.sam
### Step 5: Convert SAM to BAM (the BAM file is neither sorted nor indexed here)
samtools view -Sb $TMPDIR/aligned_reads.sam > $TMPDIR/test.bam
### Step 6: Copy output
cp $TMPDIR/test.bam src/lofreq/indelqual/test_data/test.bam
cp $TMPDIR/test.fa src/lofreq/indelqual/test_data/test.fa

Binary file not shown.


@@ -0,0 +1,10 @@
>chr1
AACTCTCCGTGCTGTCCGGGGTCACTGTGATGCCAGTGCCGTCGACGGACCACAGGAGCGCCGCCAATTACGATTTATA
GGCGGCCCGGCCGATTATATCTTTGGCGGTCCCCTAGGCTCTCTAGGGGCCCGCACTGAAGAGGGCAACTCTGCAAGGA
CACGAATCTGACTCCTTAATAAAGGTGTGAAATCTGTCCGGTCGTCTCCTAATATGGGGCTTCATCATCTCAGGCGAAA
TCAGCGCCCGACGGGCCATAGTAAGCGGTGTTGTGGCATAGGTGCAGGTGGCCACCGATTATAACAGGATGACATACGC
GGAATTCGGGGTATGATGCTCTCCCGACACTTTGAGACAATAAATAGTTTAGTGTCCTGATGGTCTAAACCGAAGTCAT
TCAAAATAGCTAAGTGTAGTCTTCCCGTTCTAGGGATAGTCTAGGACATGCCCTATATTGGTTTTCTCTTACCGCGGAC
TACTCCCGCGCCCTCGGAGGTGTCTCAATTCATCCATGTTGATCCTTCAAATCGGGGCAGCGACGGGGGCACGGAGGGG
GTACGATAACCGCTAAATTGACCACCACCATCGATGATTCTACCATCTCTATCCATCCAACCCTTTTTTTGTTTATTTC
CTCTATGGGTTACAGCTA

225
src/multiqc/config.vsh.yaml Normal file

@@ -0,0 +1,225 @@
name: "multiqc"
description: |
MultiQC aggregates results from bioinformatics analyses across many samples into a single report.
It searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.
info:
keywords: [QC, html report, aggregate analysis]
links:
homepage: https://multiqc.info/
documentation: https://multiqc.info/docs/
repository: https://github.com/MultiQC/MultiQC
references:
doi: 10.1093/bioinformatics/btw354
licence: GPL v3 or later
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
multiple: true
required: true
example: data/results/
description: |
File paths to be searched for analysis results to be included in the report.
- name: "Output"
arguments:
- name: "--output_report"
type: file
direction: output
must_exist: false
example: multiqc_report.html
description: |
Filepath of the generated report.
- name: "--output_data"
type: file
required: false
direction: output
example: multiqc_data
must_exist: false
description: |
Output directory for parsed data files. If not provided, parsed data will not be published.
- name: "--output_plots"
type: file
required: false
direction: output
must_exist: false
example: multiqc_plots
description: |
Output directory for generated plots. If not provided, plots will not be published.
- name: "Modules and analyses to run"
arguments:
- name: "--include_modules"
type: string
multiple: true
example: [fastqc, cutadapt]
description: Use only these modules
- name: "--exclude_modules"
type: string
multiple: true
example: [fastqc, cutadapt]
description: Do not use these modules
- name: "--ignore_analysis"
type: string
multiple: true
example: [run_one/*, run_two/*]
description: Ignore analysis files matching these glob expressions
- name: "--ignore_samples"
type: string
multiple: true
example: [sample_1*, sample_3*]
description: Ignore sample names matching these glob expressions
- name: "--ignore_symlinks"
type: boolean_true
description: Ignore symlinked directories and files
- name: "Sample name handling"
arguments:
- name: "--dirs"
type: boolean_true
description: Prepend directory to sample names to avoid clashing filenames
- name: "--dirs_depth"
type: integer
description: Prepend n directories to sample names. Negative number to take from start of path.
- name: "--full_names"
type: boolean_true
description: Do not clean the sample names (leave as full file name)
- name: "--fn_as_s_name"
type: boolean_true
description: Use the log filename as the sample name
- name: "--replace_names"
type: file
example: replace_names.tsv
description: TSV file to rename sample names during report generation
- name: "Report Customisation"
arguments:
- name: "--title"
type: string
description: |
Report title. Printed as page header, used for filename if not otherwise specified.
- name: "--comment"
type: string
description: |
Custom comment, will be printed at the top of the report.
- name: "--template"
type: string
choices: [default, gathered, geo, highcharts, sections, simple]
description: |
Report template to use.
- name: "--sample_names"
type: file
description: |
TSV file containing alternative sample names for renaming buttons in the report.
example: sample_names.tsv
- name: "--sample_filters"
type: file
description: |
TSV file containing show/hide patterns for the report
example: sample_filters.tsv
- name: "--custom_css_file"
type: file
description: |
Custom CSS file to add to the final report
example: custom_style_sheet.css
- name: "--profile_runtime"
type: boolean_true
description: |
Add analysis of how long MultiQC takes to run to the report
- name: "MultiQC behaviour"
arguments:
- name: "--verbose"
type: boolean_true
description: |
Increase output verbosity.
- name: "--quiet"
type: boolean_true
description: |
Only show log warnings
- name: "--strict"
type: boolean_true
description: |
Don't catch exceptions, run additional code checks to help development.
- name: "--development"
type: boolean_true
description: |
Development mode. Do not compress and minimise JS, export uncompressed plot data.
- name: "--require_logs"
type: boolean_true
description: |
Require all explicitly requested modules to have log files. If not, MultiQC will exit with an error.
- name: "--no_megaqc_upload"
type: boolean_true
description: |
Don't upload generated report to MegaQC, even if MegaQC options are found.
- name: "--no_ansi"
type: boolean_true
description: |
Disable coloured log output.
- name: "--cl_config"
type: string
required: false
description: |
YAML-formatted string for customising MultiQC behaviour, such as input file detection.
example: "qualimap_config: { general_stats_coverage: [20,40,200] }"
- name: "Output format"
arguments:
- name: "--flat"
type: boolean_true
description: |
Use only flat plots (static images).
- name: "--interactive"
type: boolean_true
description: |
Use only interactive plots (in-browser Javascript).
- name: "--data_dir"
type: boolean_true
description: |
Force the parsed data directory to be created.
- name: "--no_data_dir"
type: boolean_true
description: |
Prevent the parsed data directory from being created.
- name: "--zip_data_dir"
type: boolean_true
description: |
Compress the data directory.
- name: "--data_format"
type: string
choices: [tsv, csv, json, yaml]
description: |
Output parsed data in a different format than the default 'txt'.
- name: "--pdf"
type: boolean_true
description: |
Creates PDF report with the 'simple' template. Requires Pandoc to be installed.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/multiqc:1.21--pyhdfd78af_0
setup:
- type: docker
run: |
multiqc --version | sed 's/multiqc, version\s\(.*\)/multiqc: "\1"/' > /var/software_versions.txt
test_setup:
- type: apt
packages:
- jq
runners:
- type: executable
- type: nextflow

67
src/multiqc/help.txt Normal file

@@ -0,0 +1,67 @@
```bash
multiqc --help
```
/// MultiQC 🔍 | v1.20
Usage: multiqc [OPTIONS] [ANALYSIS DIRECTORY]
MultiQC aggregates results from bioinformatics analyses across many samples into a single report.
It searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.
To run, supply with one or more directory to scan for analysis results. For example, to run in the current working directory, use 'multiqc .'
╭─ Main options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --force -f Overwrite any existing reports │
│ --config -c Specific config file to load, after those in MultiQC dir / home dir / working dir. (PATH) │
│ --cl-config Specify MultiQC config YAML on the command line (TEXT) │
│ --filename -n Report filename. Use 'stdout' to print to standard out. (TEXT) │
│ --outdir -o Create report in the specified output directory. (TEXT) │
│ --ignore -x Ignore analysis files (GLOB EXPRESSION) │
│ --ignore-samples Ignore sample names (GLOB EXPRESSION) │
│ --ignore-symlinks Ignore symlinked directories and files │
│ --file-list -l Supply a file containing a list of file paths to be searched, one per row │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Choosing modules to run ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --module -m Use only this module. Can specify multiple times. (MODULE NAME) │
│ --exclude -e Do not use this module. Can specify multiple times. (MODULE NAME) │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Sample handling ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --dirs -d Prepend directory to sample names │
│ --dirs-depth -dd Prepend n directories to sample names. Negative number to take from start of path. (INTEGER) │
│ --fullnames -s Do not clean the sample names (leave as full file name) │
│ --fn_as_s_name Use the log filename as the sample name │
│ --replace-names TSV file to rename sample names during report generation (PATH) │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Report customisation ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --title -i Report title. Printed as page header, used for filename if not otherwise specified. (TEXT) │
│ --comment -b Custom comment, will be printed at the top of the report. (TEXT) │
│ --template -t Report template to use. (default|gathered|geo|highcharts|sections|simple) │
│ --sample-names TSV file containing alternative sample names for renaming buttons in the report (PATH) │
│ --sample-filters TSV file containing show/hide patterns for the report (PATH) │
│ --custom-css-file Custom CSS file to add to the final report (PATH) │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Output files ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --flat -fp Use only flat plots (static images) │
│ --interactive -ip Use only interactive plots (in-browser Javascript) │
│ --export -p Export plots as static images in addition to the report │
│ --data-dir Force the parsed data directory to be created. │
│ --no-data-dir Prevent the parsed data directory from being created. │
│ --data-format -k Output parsed data in a different format. (tsv|csv|json|yaml) │
│ --zip-data-dir -z Compress the data directory. │
│ --no-report Do not generate a report, only export data and plots │
│ --pdf Creates PDF report with the 'simple' template. Requires Pandoc to be installed. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ MultiQC behaviour ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --verbose -v Increase output verbosity. (INTEGER RANGE) │
│ --quiet -q Only show log warnings │
│ --strict Don't catch exceptions, run additional code checks to help development. │
│ --development,--dev Development mode. Do not compress and minimise JS, export uncompressed plot data │
│ --require-logs Require all explicitly requested modules to have log files. If not, MultiQC will exit with an error. │
│ --profile-runtime Add analysis of how long MultiQC takes to run to the report │
│ --no-megaqc-upload Don't upload generated report to MegaQC, even if MegaQC options are found │
│ --no-ansi Disable coloured log output │
│ --version Show the version and exit. │
│ --help -h Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
See http://multiqc.info for more details.

129
src/multiqc/script.sh Executable file

@@ -0,0 +1,129 @@
#!/bin/bash
# disable flags
[[ "$par_ignore_symlinks" == "false" ]] && unset par_ignore_symlinks
[[ "$par_dirs" == "false" ]] && unset par_dirs
[[ "$par_full_names" == "false" ]] && unset par_full_names
[[ "$par_fn_as_s_name" == "false" ]] && unset par_fn_as_s_name
[[ "$par_profile_runtime" == "false" ]] && unset par_profile_runtime
[[ "$par_verbose" == "false" ]] && unset par_verbose
[[ "$par_quiet" == "false" ]] && unset par_quiet
[[ "$par_strict" == "false" ]] && unset par_strict
[[ "$par_development" == "false" ]] && unset par_development
[[ "$par_require_logs" == "false" ]] && unset par_require_logs
[[ "$par_no_megaqc_upload" == "false" ]] && unset par_no_megaqc_upload
[[ "$par_no_ansi" == "false" ]] && unset par_no_ansi
[[ "$par_flat" == "false" ]] && unset par_flat
[[ "$par_interactive" == "false" ]] && unset par_interactive
[[ "$par_static_plot_export" == "false" ]] && unset par_static_plot_export
[[ "$par_data_dir" == "false" ]] && unset par_data_dir
[[ "$par_no_data_dir" == "false" ]] && unset par_no_data_dir
[[ "$par_zip_data_dir" == "false" ]] && unset par_zip_data_dir
[[ "$par_pdf" == "false" ]] && unset par_pdf
# handle inputs
out_dir=$(dirname "$par_output_report")
output_report_file=$(basename "$par_output_report")
report_name="${output_report_file%.*}"
# handle outputs
[[ -z "$par_output_report" ]] && no_report=true
[[ -z "$par_output_data" ]] && no_data_dir=true
[[ ! -z "$par_output_data" ]] && data_dir=true
[[ ! -z "$par_output_plots" ]] && export=true
# handle multiples
IFS=";" read -ra inputs <<< $par_input
if [[ -n "$par_include_modules" ]]; then
include_modules=""
IFS=";" read -ra incl_modules <<< $par_include_modules
for i in "${incl_modules[@]}"; do
include_modules+="--module $i "
done
unset IFS
fi
if [[ -n "$par_exclude_modules" ]]; then
exclude_modules=""
IFS=";" read -ra excl_modules <<< $par_exclude_modules
for i in "${excl_modules[@]}"; do
exclude_modules+="--exclude $i "
done
unset IFS
fi
if [[ -n "$par_ignore_analysis" ]]; then
ignore=""
IFS=";" read -ra ignore_analysis <<< $par_ignore_analysis
for i in "${ignore_analysis[@]}"; do
ignore+="--ignore $i "
done
unset IFS
fi
if [[ -n "$par_ignore_samples" ]]; then
ignore_samples=""
IFS=";" read -ra ign_samples <<< $par_ignore_samples
for i in "${ign_samples[@]}"; do
ignore_samples+="--ignore-samples $i "
done
unset IFS
fi
# run multiqc
multiqc \
${par_output_report:+--filename "$report_name"} \
${out_dir:+--outdir "$out_dir"} \
${no_report:+--no-report} \
${no_data_dir:+--no-data-dir} \
${data_dir:+--data-dir} \
${export:+--export} \
${par_title:+--title "$par_title"} \
${par_comment:+--comment "$par_comment"} \
${par_template:+--template "$par_template"} \
${par_sample_names:+--sample-names "$par_sample_names"} \
${par_sample_filters:+--sample-filters "$par_sample_filters"} \
${par_custom_css_file:+--custom-css-file "$par_custom_css_file"} \
${par_profile_runtime:+--profile-runtime} \
${par_dirs:+--dirs} \
${par_dirs_depth:+--dirs-depth "$par_dirs_depth"} \
${par_full_names:+--fullnames} \
${par_fn_as_s_name:+--fn_as_s_name} \
${par_replace_names:+--replace-names "$par_replace_names"} \
${par_ignore_symlinks:+--ignore-symlinks} \
${ignore_samples} \
${ignore} \
${exclude_modules} \
${include_modules} \
${par_data_format:+--data-format "$par_data_format"} \
${par_cl_config:+--cl-config "$par_cl_config"} \
${par_zip_data_dir:+--zip-data-dir} \
${par_pdf:+--pdf} \
${par_interactive:+--interactive} \
${par_flat:+--flat} \
${par_verbose:+--verbose} \
${par_quiet:+--quiet} \
${par_strict:+--strict} \
${par_no_megaqc_upload:+--no-megaqc-upload} \
${par_no_ansi:+--no-ansi} \
${par_require_logs:+--require-logs} \
${par_development:+--development} \
--force \
"${inputs[@]}"
# Move outputs
if [[ -n "$par_output_data" ]] && [[ -d "${out_dir}/${report_name}_data" ]]; then
mv "${out_dir}/${report_name}_data" "$par_output_data"
elif [[ -n "$par_output_data" ]] && [[ ! -d "${out_dir}/${report_name}_data" ]]; then
echo "WARNING: Data could not be saved because data folder was not generated by multiqc. This could be due to filtering out of modules or samples."
fi
if [[ -n "$par_output_plots" ]] && [[ -d "${out_dir}/${report_name}_plots" ]]; then
mv "${out_dir}/${report_name}_plots" "$par_output_plots"
elif [[ -n "$par_output_plots" ]] && [[ ! -d "${out_dir}/${report_name}_plots" ]]; then
echo "WARNING: Plots could not be saved because plots folder was not generated by multiqc. This could be due to filtering out of modules or samples."
fi
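
The multi-value handling in the script above can also be sketched with bash arrays instead of string concatenation. This is only an illustration, not the component's actual code: the helper name `build_repeated_flag` and the array `REPEATED_ARGS` are hypothetical, and the sketch assumes values are separated by `;` as in the script. Arrays keep every flag and value as a separate word, so a missing trailing space cannot fuse two flags together, and patterns such as `run_one/*` are not glob-expanded when the command line is assembled.

```bash
#!/bin/bash

# Hypothetical helper: split a ';'-separated string and store
# "<flag> <value>" pairs in the global array REPEATED_ARGS.
build_repeated_flag() {
  local flag="$1" joined="$2" values=() value
  REPEATED_ARGS=()
  # Nothing to do when the parameter was not provided.
  [[ -z "$joined" ]] && return 0
  IFS=";" read -ra values <<< "$joined"
  for value in "${values[@]}"; do
    REPEATED_ARGS+=("$flag" "$value")
  done
}

# Example values standing in for par_exclude_modules and par_ignore_analysis.
build_repeated_flag --exclude "fastqc;cutadapt"
exclude_args=("${REPEATED_ARGS[@]}")

build_repeated_flag --ignore "run_one/*;run_two/*"
ignore_args=("${REPEATED_ARGS[@]}")

# Print one word per line, exactly as they would be passed to a command.
printf '%s\n' "${exclude_args[@]}" "${ignore_args[@]}"
```

In a command invocation the arrays would be expanded quoted, for example `"${exclude_args[@]}"`, in place of the unquoted `${exclude_modules}` string.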

44
src/multiqc/test.sh Normal file

@@ -0,0 +1,44 @@
#!/bin/bash
echo ">>> Testing input/output handling"
"$meta_executable" \
--input "$meta_resources_dir/test_data/" \
--output_report test1.html \
--output_data data1 \
--output_plots plots1 \
--quiet
[ ! -f test1.html ] && echo "MultiQC report does not exist!" && exit 1
[ ! -d data1 ] && echo "MultiQC data directory does not exist!" && exit 1
[ ! -d plots1 ] && echo "MultiQC plots directory does not exist!" && exit 1
echo ">>> Testing module exclusion"
"$meta_executable" \
--input "$meta_resources_dir/test_data/" \
--output_report test2.html \
--output_data data2 \
--output_plots plots2 \
--exclude_modules samtools \
--quiet
[ -f test2.html ] && echo "MultiQC report should not exist!" && exit 1
[ -d data2 ] && echo "MultiQC data directory should not exist!" && exit 1
[ -d plots2 ] && echo "MultiQC plots directory should not exist!" && exit 1
echo ">>> Testing sample exclusion"
"$meta_executable" \
--input "$meta_resources_dir/test_data/" \
--output_report test3.html \
--output_data data3 \
--ignore_samples a \
--quiet
key_to_check=".report_general_stats_data[0].a"
json_file="data3/multiqc_data.json"
[[ $(jq -r "$key_to_check" "$json_file") != null ]] && echo "$key_to_check should not be present in $json_file" && exit 1
echo "All tests succeeded!"
exit 0

1504
src/multiqc/test_data/a.txt Normal file

File diff suppressed because it is too large

1505
src/multiqc/test_data/b.txt Normal file

File diff suppressed because it is too large


@@ -0,0 +1,9 @@
# multiqc test data
# Test data from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/multiqc/test
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
cp -r /tmp/snakemake-wrappers/bio/multiqc/test/samtools_stats/* src/multiqc/test_data

161
src/pear/config.vsh.yaml Normal file

@@ -0,0 +1,161 @@
name: pear
description: |
PEAR is an ultrafast, memory-efficient and highly accurate pair-end read merger. It is fully parallelized and can run with as low as just a few kilobytes of memory.
PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, it can merge millions of paired-end reads within a couple of minutes on a standard desktop computer.
keywords: [ "pair-end", "read", "merge" ]
links:
homepage: https://cme.h-its.org/exelixis/web/software/pear
repository: https://github.com/tseemann/PEAR
documentation: https://cme.h-its.org/exelixis/web/software/pear/doc.html
references:
doi: 10.1093/bioinformatics/btt593
license: "CC-BY-NC-SA-3.0"
requirements:
commands: [ pear , gzip ]
argument_groups:
- name: Inputs
arguments:
- name: --forward_fastq
alternatives: -f
type: file
description: Forward paired-end FASTQ file
required: true
example: "forward.fastq"
- name: --reverse_fastq
alternatives: -r
type: file
description: Reverse paired-end FASTQ file
required: true
example: "reverse.fastq"
- name: Outputs
arguments:
- name: --assembled
type: file
description: The output file containing assembled reads. Can be compressed with gzip.
required: true
direction: output
- name: --unassembled_forward
type: file
description: The output file containing forward reads that could not be assembled. Can be compressed with gzip.
required: true
direction: output
- name: --unassembled_reverse
type: file
description: The output file containing reverse reads that could not be assembled. Can be compressed with gzip.
required: true
direction: output
- name: --discarded
type: file
description: The output file containing reads that were discarded due to too low quality or too many uncalled bases. Can be compressed with gzip.
required: true
direction: output
- name: Arguments
arguments:
- name: --p_value
alternatives: -p
type: double
description: |
Specify a p-value for the statistical test. If the computed p-value of a possible assembly exceeds the specified p-value then paired-end read will not be assembled. Valid options are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 disables the test.
example: 0.01
required: false
- name: --min_overlap
alternatives: -v
type: integer
description: |
Specify the minimum overlap size. The minimum overlap may be set to 1 when the statistical test is used. However, further restricting the minimum overlap size to a proper value may reduce false-positive assembles.
required: false
example: 10
- name: --max_assembly_length
alternatives: -m
type: integer
description: |
Specify the maximum possible length of the assembled sequences. Setting this value to 0 disables the restriction and assembled sequences may be arbitrary long.
required: false
example: 0
- name: --min_assembly_length
alternatives: -n
type: integer
description: |
Specify the minimum possible length of the assembled sequences. Setting this value to 0 disables the restriction and assembled sequences may be arbitrary short.
required: false
example: 0
- name: --min_trim_length
alternatives: -t
type: integer
description: |
Specify the minimum length of reads after trimming the low quality part (see option -q)
required: false
example: 1
- name: --quality_threshold
alternatives: -q
type: integer
description: |
Specify the quality threshold for trimming the low quality part of a read. If the quality scores of two consecutive bases are strictly less than the specified threshold, the rest of the read will be trimmed.
required: false
example: 0
- name: --max_uncalled_base
alternatives: -u
type: double
description: |
Specify the maximal proportion of uncalled bases in a read. Setting this value to 0 will cause PEAR to discard all reads containing uncalled bases. The other extreme setting is 1 which causes PEAR to process all reads independent on the number of uncalled bases.
example: 1.0
required: false
- name: --test_method
alternatives: -g
type: integer
description: |
Specify the type of statistical test. Two options are available. 1: Given the minimum allowed overlap, test using the highest OES. Note that due to its discrete nature, this test usually yields a lower p-value for the assembled read than the cut-off (specified by -p). For example, setting the cut-off to 0.05 using this test, the assembled reads might have an actual p-value of 0.02.
2: Use the acceptance probability (m.a.p). This test method computes the same probability as test method 1. However, it assumes that the minimal overlap is the observed overlap with the highest OES, instead of the one specified by -v. Therefore, this is not a valid statistical test and the 'p-value' is in fact the maximal probability for accepting the assembly. Nevertheless, we observed in practice that when the actual overlap sizes are relatively small, test 2 can correctly assemble more reads with only a slightly higher false-positive rate.
required: false
example: 1
- name: --emperical_freqs
alternatives: -e
type: boolean_true
description: |
Disable empirical base frequencies.
- name: --score_method
alternatives: -s
type: integer
description: |
Specify the scoring method. 1. OES with +1 for match and -1 for mismatch. 2: Assembly score (AS). Use +1 for match and -1 for mismatch multiplied by base quality scores. 3: Ignore quality scores and use +1 for a match and -1 for a mismatch.
required: false
example: 2
- name: --phred_base
alternatives: -b
type: integer
description: |
Base PHRED quality score.
required: false
example: 33
- name: --cap
alternatives: -c
type: integer
description: |
Specify the upper bound for the resulting quality score. If set to zero, capping is disabled.
required: false
example: 40
- name: --nbase
alternatives: -z
type: boolean_true
description: |
When merging a base-pair that consists of two non-equal bases out of which none is degenerate, set the merged base to N and use the highest quality score of the two bases
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/pear:0.9.6--h9d449c0_10
setup:
- type: docker
run: |
version=$(pear -h | grep 'PEAR v' | sed 's/PEAR v//' | sed 's/ .*//') && \
echo "pear: $version" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

91
src/pear/help.txt Normal file

@@ -0,0 +1,91 @@
```bash
pear -h
```
____ _____ _ ____
| _ \| ____| / \ | _ \
| |_) | _| / _ \ | |_) |
| __/| |___ / ___ \| _ <
|_| |_____/_/ \_\_| \_\
PEAR v0.9.6 [January 15, 2015] - [+bzlib +zlib]
Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593
License: Creative Commons Licence
Bug-reports and requests to: Tomas.Flouri@h-its.org and Jiajie.Zhang@h-its.org
Usage: pear <options>
Standard (mandatory):
-f, --forward-fastq <str> Forward paired-end FASTQ file.
-r, --reverse-fastq <str> Reverse paired-end FASTQ file.
-o, --output <str> Output filename.
Optional:
-p, --p-value <float> Specify a p-value for the statistical test. If the computed
p-value of a possible assembly exceeds the specified p-value
then paired-end read will not be assembled. Valid options
are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 disables
the test. (default: 0.01)
-v, --min-overlap <int> Specify the minimum overlap size. The minimum overlap may be
set to 1 when the statistical test is used. However, further
restricting the minimum overlap size to a proper value may
reduce false-positive assembles. (default: 10)
-m, --max-assembly-length <int> Specify the maximum possible length of the assembled
sequences. Setting this value to 0 disables the restriction
and assembled sequences may be arbitrary long. (default: 0)
-n, --min-assembly-length <int> Specify the minimum possible length of the assembled
sequences. Setting this value to 0 disables the restriction
and assembled sequences may be arbitrary short. (default:
50)
-t, --min-trim-length <int> Specify the minimum length of reads after trimming the low
quality part (see option -q). (default: 1)
-q, --quality-threshold <int> Specify the quality score threshold for trimming the low
quality part of a read. If the quality scores of two
consecutive bases are strictly less than the specified
threshold, the rest of the read will be trimmed. (default:
0)
-u, --max-uncalled-base <float> Specify the maximal proportion of uncalled bases in a read.
Setting this value to 0 will cause PEAR to discard all reads
containing uncalled bases. The other extreme setting is 1
which causes PEAR to process all reads independent on the
number of uncalled bases. (default: 1)
-g, --test-method <int> Specify the type of statistical test. Two options are
available. (default: 1)
1: Given the minimum allowed overlap, test using the highest
OES. Note that due to its discrete nature, this test usually
yields a lower p-value for the assembled read than the cut-
off (specified by -p). For example, setting the cut-off to
0.05 using this test, the assembled reads might have an
actual p-value of 0.02.
2. Use the acceptance probability (m.a.p). This test methods
computes the same probability as test method 1. However, it
assumes that the minimal overlap is the observed overlap
with the highest OES, instead of the one specified by -v.
Therefore, this is not a valid statistical test and the
'p-value' is in fact the maximal probability for accepting
the assembly. Nevertheless, we observed in practice that for
the case the actual overlap sizes are relatively small, test
2 can correctly assemble more reads with only slightly
higher false-positive rate.
-e, --empirical-freqs Disable empirical base frequencies. (default: use empirical
base frequencies)
-s, --score-method <int> Specify the scoring method. (default: 2)
1. OES with +1 for match and -1 for mismatch.
2: Assembly score (AS). Use +1 for match and -1 for mismatch
multiplied by base quality scores.
3: Ignore quality scores and use +1 for a match and -1 for a
mismatch.
-b, --phred-base <int> Base PHRED quality score. (default: 33)
-y, --memory <str> Specify the amount of memory to be used. The number may be
followed by one of the letters K, M, or G denoting
Kilobytes, Megabytes and Gigabytes, respectively. Bytes are
assumed in case no letter is specified.
-c, --cap <int> Specify the upper bound for the resulting quality score. If
set to zero, capping is disabled. (default: 40)
-j, --threads <int> Number of threads to use
-z, --nbase When merging a base-pair that consists of two non-equal
bases out of which none is degenerate, set the merged base
to N and use the highest quality score of the two bases
-h, --help This help screen.

65
src/pear/script.sh Normal file

@@ -0,0 +1,65 @@
#!/bin/bash
set -eo pipefail
## VIASH START
## VIASH END
[[ "$par_emperical_freqs" == "false" ]] && unset par_emperical_freqs
[[ "$par_nbase" == "false" ]] && unset par_nbase
# decompress gzipped inputs (quoted so paths with spaces survive)
if [[ "${par_forward_fastq##*.}" == "gz" ]]; then
gunzip "$par_forward_fastq"
par_forward_fastq="${par_forward_fastq%.*}"
fi
if [[ "${par_reverse_fastq##*.}" == "gz" ]]; then
gunzip "$par_reverse_fastq"
par_reverse_fastq="${par_reverse_fastq%.*}"
fi
output_dir=$(mktemp -d -p "$meta_temp_dir" "pear.XXXXXX")
pear \
-f "$par_forward_fastq" \
-r "$par_reverse_fastq" \
-o "$output_dir" \
${par_p_value:+-p "${par_p_value}"} \
${par_min_overlap:+-v "${par_min_overlap}"} \
${par_max_assembly_length:+-m "${par_max_assembly_length}"} \
${par_min_assembly_length:+-n "${par_min_assembly_length}"} \
${par_min_trim_length:+-t "${par_min_trim_length}"} \
${par_quality_threshold:+-q "${par_quality_threshold}"} \
${par_max_uncalled_base:+-u "${par_max_uncalled_base}"} \
${par_test_method:+-g "${par_test_method}"} \
${par_score_method:+-s "${par_score_method}"} \
${par_phred_base:+-b "${par_phred_base}"} \
${meta_memory_mb:+--memory "${meta_memory_mb}M"} \
${par_cap:+-c "${par_cap}"} \
${meta_cpus:+-j "${meta_cpus}"} \
${par_emperical_freqs:+-e} \
${par_nbase:+-z}
# publish outputs: gzip when the requested path ends in .gz, otherwise move
if [[ "${par_assembled##*.}" == "gz" ]]; then
gzip -9 -c "${output_dir}.assembled.fastq" > "${par_assembled}"
else
mv "${output_dir}.assembled.fastq" "${par_assembled}"
fi
if [[ "${par_unassembled_forward##*.}" == "gz" ]]; then
gzip -9 -c "${output_dir}.unassembled.forward.fastq" > "${par_unassembled_forward}"
else
mv "${output_dir}.unassembled.forward.fastq" "${par_unassembled_forward}"
fi
if [[ "${par_unassembled_reverse##*.}" == "gz" ]]; then
gzip -9 -c "${output_dir}.unassembled.reverse.fastq" > "${par_unassembled_reverse}"
else
mv "${output_dir}.unassembled.reverse.fastq" "${par_unassembled_reverse}"
fi
if [[ "${par_discarded##*.}" == "gz" ]]; then
gzip -9 -c "${output_dir}.discarded.fastq" > "${par_discarded}"
else
mv "${output_dir}.discarded.fastq" "${par_discarded}"
fi
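
The four output blocks above repeat one pattern: compress when the destination ends in `.gz`, otherwise move the file. A minimal sketch of that pattern as a function follows. The name `publish_fastq` is hypothetical and not part of the component, and the demo files are placeholders created in a temporary directory rather than real PEAR output.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper: write SRC to DEST, gzip-compressing when DEST ends in .gz.
publish_fastq() {
  local src="$1" dest="$2"
  if [[ "${dest##*.}" == "gz" ]]; then
    gzip -9 -c "$src" > "$dest"
  else
    mv "$src" "$dest"
  fi
}

# Demo with placeholder FASTQ files in a scratch directory.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/a.fastq"
cp "$demo_dir/a.fastq" "$demo_dir/b.fastq"

# Destination ends in .gz, so the content is compressed.
publish_fastq "$demo_dir/a.fastq" "$demo_dir/assembled.fastq.gz"
# Destination is plain FASTQ, so the file is moved.
publish_fastq "$demo_dir/b.fastq" "$demo_dir/discarded.fastq"
```

With such a helper, each output in the script would reduce to a single call, for example `publish_fastq "${output_dir}.assembled.fastq" "$par_assembled"`.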

Some files were not shown because too many files have changed in this diff