Build branch main with version main (da414e7)

Build pipeline: viash-hub.biobox.main-7dwhr

Source commit: da414e72c6

Source message: Add star solo component (#62)

* add star solo component

* change arguments from camelCase to snake_case

* get rid of multiple_sep

* drop star_solo component and just add arguments to star_align_reads

* Update src/star/star_align_reads/script.py

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

---------

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
This commit is contained in:
CI
2024-07-29 08:14:56 +00:00
parent ae3fef8eec
commit 02c1c9ebea
309 changed files with 58822 additions and 4148 deletions

1
.gitignore vendored
View File

@@ -3,6 +3,7 @@
# IDE ignores
.idea/
.vscode/
# R specific ignores
.Rhistory

View File

@@ -1,25 +1,60 @@
# biobox x.x.x
## BUG FIXES
* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
## MINOR CHANGES
* `busco` components: update BUSCO to `5.7.1`.
# biobox 0.1.0
## BREAKING CHANGES
* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
splitting up certain file paths.
* `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
* `star/star_genome_generate`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
## NEW FUNCTIONALITY
* `star/star_align_reads`: Add star solo related arguments (PR #62).
* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).
* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
* `seqtk`:
- `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
- `seqtk/seqtk_subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
based on a provided sequence IDs or region coordinates file (PR #85).
* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
## MINOR CHANGES
* `busco` components: update BUSCO to `5.7.1` (PR #72).
* Update CI to reusable workflow in `viash-io/viash-actions` (PR #86).
## DOCUMENTATION
* Extend the contributing guidelines (PR #82):
- Update format to Viash 0.9.
- Descriptions should be formatted in markdown.
- Add defaults to descriptions, not as a default of the argument.
- Explain parameter expansion.
- Mention that the contents of the output of components in tests should be checked.
* Add authorship to existing components (PR #88).
## BUG FIXES
* `pear`: fix component not exiting with the correct exitcode when PEAR fails (PR #70).
* `cutadapt`: fix `--par_quality_cutoff_r2` argument (PR #69).
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode` (PR #69).
* `multiqc`: update multiple separator to `;` (PR #81).
# biobox 0.1.0
## NEW FEATURES
@@ -66,6 +101,10 @@
- `samtools/samtools_collate`: Shuffles and groups reads in SAM/BAM/CRAM files together by their names (PR #42).
- `samtools/samtools_view`: Views and converts SAM/BAM/CRAM files (PR #48).
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).
* `umi_tools`:
-`umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).
@@ -93,4 +132,4 @@
* Add escaping character before leading hashtag in the description field of the config file (PR #50).
* Format URL in biobase/bcl_convert description (PR #55).
* Format URL in biobase/bcl_convert description (PR #55).

View File

@@ -65,22 +65,21 @@ runners:
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
```yaml
functionality:
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
```
### Step 4: Find a suitable container
@@ -162,7 +161,7 @@ argument_groups:
type: file
description: |
File in SAM/BAM/CRAM format with main alignments as generated by STAR
(Aligned.out.sam). Arriba extracts candidate reads from this file.
(`Aligned.out.sam`). Arriba extracts candidate reads from this file.
required: true
example: Aligned.out.bam
```
@@ -175,7 +174,7 @@ Several notes:
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
* The description should be formatted in markdown.
### Step 8: Add arguments for the output files
@@ -220,7 +219,7 @@ argument_groups:
Note:
* Preferably, these outputs should not be directores but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
### Step 9: Add arguments for the other arguments
@@ -230,6 +229,8 @@ Finally, add all other arguments to the config file. There are a few exceptions:
* Arguments related to printing the information such as printing the version (`-v`, `--version`) or printing the help (`-h`, `--help`) should not be added to the config file.
* If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`
### Step 10: Add a Docker engine
@@ -275,10 +276,13 @@ Next, we need to write a runner script that runs the tool with the input argumen
## VIASH START
## VIASH END
# unset flags
[[ "$par_option" == "false" ]] && unset par_option
xxx \
--input "$par_input" \
--output "$par_output" \
$([ "$par_option" = "true" ] && echo "--option")
${par_option:+--option}
```
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
@@ -291,6 +295,11 @@ As an example, this is what the Bash script for the `arriba` component looks lik
## VIASH START
## VIASH END
# unset flags
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
arriba \
-x "$par_bam" \
-a "$par_genome" \
@@ -298,26 +307,30 @@ arriba \
-o "$par_fusions" \
${par_known_fusions:+-k "${par_known_fusions}"} \
${par_blacklist:+-b "${par_blacklist}"} \
${par_structural_variants:+-d "${par_structural_variants}"} \
$([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
$([ "$par_extra_information" = "true" ] && echo "-X") \
$([ "$par_fill_gaps" = "true" ] && echo "-I")
# ...
${par_extra_information:+-X} \
${par_fill_gaps:+-I}
```
Notes:
* If your arguments can contain special variables (e.g. `$`), you can use quoting (need to find a documentation page for this) to make sure you can use the string as input. Example: `-x ${par_bam@Q}`.
* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`
* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
### Step 12: Create test script
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
```yaml
functionality:
# ...
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
```
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
@@ -325,48 +338,64 @@ Create a test script at `src/xxx/test.sh` that runs the component with the test
```bash
#!/bin/bash
set -e
## VIASH START
## VIASH END
echo "> Run xxx with test data"
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
grep -q -E "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
echo "> Run $meta_name with test data"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.txt" \
--input "$meta_resources_dir/test_data/reads_R1.fastq" \
--output "output.txt" \
--option
echo ">> Checking output"
[ ! -f "output.txt" ] && echo "Output file output.txt does not exist" && exit 1
```
For example, this is what the test script for the `arriba` component looks like:
```bash
#!/bin/bash
## VIASH START
## VIASH END
echo "> Run arriba with blacklist"
"$meta_executable" \
--bam "$meta_resources_dir/test_data/A.bam" \
--genome "$meta_resources_dir/test_data/genome.fasta" \
--gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
--blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
--fusions "fusions.tsv" \
--fusions_discarded "fusions_discarded.tsv" \
--interesting_contigs "1,2"
echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
echo ">> Check if output exists"
assert_file_exists "output.txt"
echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
assert_file_not_empty "output.txt"
echo ">> Check if output is correct"
assert_file_contains "output.txt" "some expected output"
echo "> All tests succeeded!"
```
### Step 12: Create a `/var/software_versions.txt` file
Notes:
* Do always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.
* If possible, generate your own test data instead of copying it from an external resource.
### Step 13: Create a `/var/software_versions.txt` file
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
@@ -378,6 +407,8 @@ engines:
image: quay.io/biocontainers/xxx:0.1.0--py_0
setup:
- type: docker
# note: /var/software_versions.txt should contain:
# arriba: "2.4.0"
run: |
echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```

View File

@@ -0,0 +1,14 @@
name: Angela Oliveira Pisco
info:
role: Contributor
links:
github: aopisco
orcid: "0000-0003-0142-2355"
linkedin: aopisco
organizations:
- name: Insitro
href: https://insitro.com
role: Director of Computational Biology
- name: Open Problems
href: https://openproblems.bio
role: Core Member

View File

@@ -0,0 +1,10 @@
name: Dorien Roosen
info:
links:
email: dorien@data-intuitive.com
github: dorien-er
linkedin: dorien-roosen
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist

View File

@@ -0,0 +1,11 @@
name: Dries Schaumont
info:
links:
email: dries@data-intuitive.com
github: DriesSchaumont
orcid: "0000-0002-4389-0440"
linkedin: dries-schaumont
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist

View File

@@ -0,0 +1,10 @@
name: Emma Rousseau
info:
links:
email: emma@data-intuitive.com
github: emmarousseau
linkedin: emmarousseau1
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician

View File

@@ -0,0 +1,10 @@
name: Jakub Majercik
info:
links:
email: jakub@data-intuitive.com
github: jakubmajercik
linkedin: jakubmajercik
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatics Engineer

View File

@@ -0,0 +1,14 @@
name: Kai Waldrant
info:
links:
email: kai@data-intuitive.com
github: KaiWaldrant
orcid: "0009-0003-8555-1361"
linkedin: kaiwaldrant
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician
- name: Open Problems
href: https://openproblems.bio
role: Contributor

View File

@@ -0,0 +1,10 @@
name: Leïla Paquay
info:
links:
email: leila@data-intuitive.com
github: Leila011
linkedin: leilapaquay
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Software Developer

View File

@@ -0,0 +1,14 @@
name: Robrecht Cannoodt
info:
links:
email: robrecht@data-intuitive.com
github: rcannood
orcid: "0000-0003-3641-729X"
linkedin: robrechtcannoodt
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Science Engineer
- name: Open Problems
href: https://openproblems.bio
role: Core Member

View File

@@ -0,0 +1,10 @@
name: Sai Nirmayi Yasa
info:
links:
email: nirmayi@data-intuitive.com
github: sainirmayi
linkedin: sai-nirmayi-yasa
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Junior Bioinformatics Researcher

View File

@@ -0,0 +1,10 @@
name: Theodoro Gasperin Terra Camargo
info:
links:
email: theodorogtc@gmail.com
github: tgaspe
linkedin: theodoro-gasperin-terra-camargo
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician

View File

@@ -0,0 +1,9 @@
name: Toni Verbeiren
info:
links:
github: tverbeiren
linkedin: verbeiren
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist and CEO

View File

@@ -0,0 +1,5 @@
name: Weiwei Schultz
info:
organizations:
- name: Janssen R&D US
role: Associate Director Data Sciences

View File

@@ -0,0 +1,94 @@
name: agat_convert_sp_gff2gtf
namespace: agat
description: |
The script aims to convert any GTF/GFF file into a proper GTF file. Full
information about the format can be found here:
https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending the
version selected the script will filter out the features that are not
accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
pseudogene) will be converted into gene feature and every level2 feature
(e.g mRNA ncRNA) will be converted into transcript feature. Using the
"relax" option you will produce a GTF-like output keeping all original
feature types (3rd column). No modification will occur e.g. mRNA to
transcript.
To be fully GTF compliant all feature have a gene_id and a transcript_id
attribute. The gene_id is unique identifier for the genomic source of
the transcript, which is used to group transcripts into genes. The
transcript_id is a unique identifier for the predicted transcript, which
is used to group features into transcripts.
keywords: [gene annotations, GTF conversion]
links:
homepage: https://github.com/NBISweden/AGAT
documentation: https://agat.readthedocs.io/
issue_tracker: https://github.com/NBISweden/AGAT/issues
repository: https://github.com/NBISweden/AGAT
references:
doi: 10.5281/zenodo.3552717
license: GPL-3.0
authors:
- __merge__: /src/_authors/leila_paquay.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --gff
alternatives: [-i]
description: Input GFF/GTF file that will be read
type: file
required: true
direction: input
example: input.gff
- name: Outputs
arguments:
- name: --output
alternatives: [-o, --out, --outfile, --gtf]
description: Output GTF file. If no output file is specified, the output will be written to STDOUT.
type: file
direction: output
required: true
example: output.gtf
- name: Arguments
arguments:
- name: --gtf_version
description: |
Version of the GTF output (1,2,2.1,2.2,2.5,3 or relax). Default value from AGAT config file (relax for the default config). The script option has the higher priority.
* relax: all feature types are accepted.
* GTF3 (9 feature types accepted): gene, transcript, exon, CDS, Selenocysteine, start_codon, stop_codon, three_prime_utr and five_prime_utr.
* GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS, UTR, start_codon, stop_codon, Selenocysteine.
* GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon, 5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon.
* GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon, exon, 5UTR, 3UTR.
* GTF2 (4 feature types accepted): CDS, start_codon, stop_codon, exon.
* GTF1 (5 feature types accepted): CDS, start_codon, stop_codon, exon, intron.
type: string
choices: [relax, "1", "2", "2.1", "2.2", "2.5", "3"]
required: false
example: "3"
- name: --config
alternatives: [-c]
description: |
Input agat config file. By default AGAT takes as input agat_config.yaml file from the working directory if any, otherwise it takes the orignal agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose". The --config option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
type: file
required: false
example: custom_agat_config.yaml
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
setup:
- type: docker
run: |
agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,102 @@
```sh
agat_convert_sp_gff2gtf.pl --help
```
------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------
Name:
agat_convert_sp_gff2gtf.pl
Description:
The script aims to convert any GTF/GFF file into a proper GTF file. Full
information about the format can be found here:
https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending the
version selected the script will filter out the features that are not
accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
pseudogene) will be converted into gene feature and every level2 feature
(e.g mRNA ncRNA) will be converted into transcript feature. Using the
"relax" option you will produce a GTF-like output keeping all original
feature types (3rd column). No modification will occur e.g. mRNA to
transcript.
To be fully GTF compliant all feature have a gene_id and a transcript_id
attribute. The gene_id is unique identifier for the genomic source of
the transcript, which is used to group transcripts into genes. The
transcript_id is a unique identifier for the predicted transcript, which
is used to group features into transcripts.
Usage:
agat_convert_sp_gff2gtf.pl --gff infile.gff [ -o outfile ]
agat_convert_sp_gff2gtf -h
Options:
--gff, --gtf or -i
Input GFF/GTF file that will be read
--gtf_version version of the GTF output (1,2,2.1,2.2,2.5,3 or relax).
Default value from AGAT config file (relax for the default config). The
script option has the higher priority.
relax: all feature types are accepted.
GTF3 (9 feature types accepted): gene, transcript, exon, CDS,
Selenocysteine, start_codon, stop_codon, three_prime_utr and
five_prime_utr
GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS,
UTR, start_codon, stop_codon, Selenocysteine
GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon,
5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon
GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon,
exon, 5UTR, 3UTR
GTF2 (4 feature types accepted): CDS, start_codon, stop_codon,
exon
GTF1 (5 feature types accepted): CDS, start_codon, stop_codon,
exon, intron
-o , --output , --out , --outfile or --gtf
Output GTF file. If no output file is specified, the output will
be written to STDOUT.
-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the orignal agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).
-h or --help
Display this helpful text.
Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:
https://github.com/NBISweden/AGAT/issues
Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.
Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md

View File

@@ -0,0 +1,10 @@
#!/bin/bash
## VIASH START
## VIASH END
agat_convert_sp_gff2gtf.pl \
-i "$par_gff" \
-o "$par_output" \
${par_gtf_version:+--gtf_version "${par_gtf_version}"} \
${par_config:+--config "${par_config}"}

View File

@@ -0,0 +1,37 @@
#!/bin/bash
## VIASH START
## VIASH END
test_dir="${meta_resources_dir}/test_data"
echo "> Run $meta_name with test data"
"$meta_executable" \
--gff "$test_dir/0_test.gff" \
--output "output.gtf"
echo ">> Checking output"
[ ! -f "output.gtf" ] && echo "Output file output.gtf does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "output.gtf" ] && echo "Output file output.gtf is empty" && exit 1
echo ">> Check if the conversion resulted in the right GTF format"
idGFF=$(head -n 2 "$test_dir/0_test.gff" | grep -o 'ID=[^;]*' | cut -d '=' -f 2-)
expectedGTF="gene_id \"$idGFF\"; ID \"$idGFF\";"
extractedGTF=$(head -n 3 "output.gtf" | grep -o 'gene_id "[^"]*"; ID "[^"]*";')
[ "$extractedGTF" != "$expectedGTF" ] && echo "Output file output.gtf does not have the right format" && exit 1
rm output.gtf
echo "> Run $meta_name with test data and GTF version 2.5"
"$meta_executable" \
--gff "$test_dir/0_test.gff" \
--output "output.gtf" \
--gtf_version "2.5"
echo ">> Check if the output file header display the right GTF version"
grep -q "##gtf-version 2.5" "output.gtf"
[ $? -ne 0 ] && echo "Output file output.gtf header does not display the right GTF version" && exit 1
echo "> Test successful"

View File

@@ -0,0 +1,36 @@
##gff-version 3
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon:1404;Parent=CLUHART00000008717
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon:1405;Parent=CLUHART00000008717
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon:1406;Parent=CLUHART00000008717
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon:1407;Parent=CLUHART00000008717
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker five_prime_UTR 337818 337914 . + . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625 maker three_prime_UTR 343034 343277 . + . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
scaffold789 maker gene 558184 564780 . + . ID=CLUHARG00000003852;Name=PF11_0240
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006146;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006146:exon:995;Parent=CLUHART00000006146
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006146:exon:996;Parent=CLUHART00000006146
scaffold789 maker exon 564171 564235 . + . ID=CLUHART00000006146:exon:997;Parent=CLUHART00000006146
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006146:exon:998;Parent=CLUHART00000006146
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564171 564235 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006146:five_prime_utr;Parent=CLUHART00000006146
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006146:three_prime_utr;Parent=CLUHART00000006146
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006147;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006147:exon:997;Parent=CLUHART00000006147
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006147:exon:998;Parent=CLUHART00000006147
scaffold789 maker exon 562057 562121 . + . ID=CLUHART00000006147:exon:999;Parent=CLUHART00000006147
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006147:exon:1000;Parent=CLUHART00000006147
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 562057 562121 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006147:five_prime_utr;Parent=CLUHART00000006147
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006147:three_prime_utr;Parent=CLUHART00000006147

View File

@@ -0,0 +1,9 @@
#!/bin/bash
# clone repo
if [ ! -d /tmp/agat_source ]; then
git clone --depth 1 --single-branch --branch master https://github.com/NBISweden/AGAT /tmp/agat_source
fi
# copy test data
cp -r /tmp/agat_source/t/gff_syntax/in/0_test.gff src/agat/agat_convert_sp_gff2gtf/test_data

View File

@@ -11,6 +11,9 @@ license: MIT
requirements:
cpus: 1
commands: [ arriba ]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -4,6 +4,17 @@ description: |
Information about upgrading from bcl2fastq via
[Upgrading from bcl2fastq to BCL Convert](https://emea.support.illumina.com/bulletins/2020/10/upgrading-from-bcl2fastq-to-bcl-convert.html)
and [BCL Convert Compatible Products](https://support.illumina.com/sequencing/sequencing_software/bcl-convert/compatibility.html)
keywords: [demultiplex, fastq, bcl, illumina]
links:
homepage: https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html
documentation: https://support.illumina.com/downloads/bcl-convert-user-guide.html
license: Proprietary
authors:
- __merge__: /src/_authors/toni_verbeiren.yaml
roles: [ author, maintainer ]
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author ]
argument_groups:
- name: Input arguments
arguments:

View File

@@ -0,0 +1,143 @@
name: bd_rhapsody_make_reference
namespace: bd_rhapsody
description: |
The Reference Files Generator creates an archive containing Genome Index
and Transcriptome annotation files needed for the BD Rhapsody Sequencing
Analysis Pipeline. The app takes as input one or more FASTA and GTF files
and produces a compressed archive in the form of a tar.gz file. The
archive contains:
- STAR index
- Filtered GTF file
keywords: [genome, reference, index, align]
links:
repository: https://bitbucket.org/CRSwDev/cwl/src/master/v2.2.1/Extra_Utilities/
documentation: https://bd-rhapsody-bioinfo-docs.genomics.bd.com/resources/extra_utilities.html#make-rhapsody-reference
license: Unknown
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
- __merge__: /src/_authors/weiwei_schultz.yaml
roles: [ contributor ]
argument_groups:
- name: Inputs
arguments:
- type: file
name: --genome_fasta
required: true
description: Reference genome file in FASTA or FASTA.GZ format. The BD Rhapsody Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
example: genome_sequence.fa.gz
multiple: true
info:
config_key: Genome_fasta
- type: file
name: --gtf
required: true
description: |
File path to the transcript annotation files in GTF or GTF.GZ format. The Sequence Analysis Pipeline requires the 'gene_name' or
'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons
must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or
'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.
example: transcriptome_annotation.gtf.gz
multiple: true
info:
config_key: Gtf
- type: file
name: --extra_sequences
description: |
File path to additional sequences in FASTA format to use when building the STAR index. (e.g. transgenes or CRISPR guide barcodes).
GTF lines for these sequences will be automatically generated and combined with the main GTF.
required: false
multiple: true
info:
config_key: Extra_sequences
- name: Outputs
arguments:
- type: file
name: --reference_archive
direction: output
required: true
description: |
A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an
input in the BD Rhapsody Sequencing Analysis Pipeline.
example: star_index.tar.gz
- name: Arguments
arguments:
- type: string
name: --mitochondrial_contigs
description: |
Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are
identified as 'nuclear fragments' in the ATACseq analysis pipeline.
required: false
multiple: true
default: [chrM, chrMT, M, MT]
info:
config_key: Mitochondrial_contigs
- type: boolean_true
name: --filtering_off
description: |
By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features
having the following attribute values are kept:
- protein_coding
- lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
- IG_LV_gene
- IG_V_gene
- IG_V_pseudogene
- IG_D_gene
- IG_J_gene
- IG_J_pseudogene
- IG_C_gene
- IG_C_pseudogene
- TR_V_gene
- TR_V_pseudogene
- TR_D_gene
- TR_J_gene
- TR_J_pseudogene
- TR_C_gene
If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
info:
config_key: Filtering_off
- type: boolean_true
name: --wta_only_index
description: Build a WTA only index, otherwise builds a WTA + ATAC index.
info:
config_key: Wta_Only
- type: string
name: --extra_star_params
description: Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
example: --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
required: false
info:
config_key: Extra_STAR_params
resources:
- type: python_script
path: script.py
- path: make_rhap_reference_2.2.1_nodocker.cwl
test_resources:
- type: bash_script
path: test.sh
- path: test_data
requirements:
commands: [ "cwl-runner" ]
engines:
- type: docker
image: bdgenomics/rhapsody:2.2.1
setup:
- type: apt
packages: [procps]
- type: python
packages: [cwlref-runner, cwl-runner]
- type: docker
run: |
echo "bdgenomics/rhapsody: 2.2.1" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,66 @@
```bash
cwl-runner src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl --help
```
usage: src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
[-h] [--Archive_prefix ARCHIVE_PREFIX]
[--Extra_STAR_params EXTRA_STAR_PARAMS]
[--Extra_sequences EXTRA_SEQUENCES] [--Filtering_off] --Genome_fasta
GENOME_FASTA --Gtf GTF [--Maximum_threads MAXIMUM_THREADS]
[--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS] [--WTA_Only]
[job_order]
The Reference Files Generator creates an archive containing Genome Index and
Transcriptome annotation files needed for the BD Rhapsodyâ„¢ Sequencing
Analysis Pipeline. The app takes as input one or more FASTA and GTF files and
produces a compressed archive in the form of a tar.gz file. The archive
contains:\n - STAR index\n - Filtered GTF file
positional arguments:
job_order Job input json file
options:
-h, --help show this help message and exit
--Archive_prefix ARCHIVE_PREFIX
A prefix for naming the compressed archive file
containing the Reference genome index and annotation
files. The default value is constructed based on the
input Reference files.
--Extra_STAR_params EXTRA_STAR_PARAMS
Additional parameters to pass to STAR when building
the genome index. Specify exactly like how you would
on the command line. Example: --limitGenomeGenerateRAM
48000 --genomeSAindexNbases 11
--Extra_sequences EXTRA_SEQUENCES
Additional sequences in FASTA format to use when
building the STAR index. (E.g. phiX genome)
--Filtering_off By default the input Transcript Annotation files are
filtered based on the gene_type/gene_biotype
attribute. Only features having the following
attribute values are are kept: - protein_coding -
lncRNA (lincRNA and antisense for Gencode <
v31/M22/Ensembl97) - IG_LV_gene - IG_V_gene -
IG_V_pseudogene - IG_D_gene - IG_J_gene -
IG_J_pseudogene - IG_C_gene - IG_C_pseudogene -
TR_V_gene - TR_V_pseudogene - TR_D_gene - TR_J_gene -
TR_J_pseudogene - TR_C_gene If you have already pre-
filtered the input Annotation files and/or wish to
turn-off the filtering, please set this option to
True.
--Genome_fasta GENOME_FASTA
Reference genome file in FASTA format. The BD
Rhapsodyâ„¢ Sequencing Analysis Pipeline uses GRCh38
for Human and GRCm39 for Mouse.
--Gtf GTF Transcript annotation files in GTF format. The BD
Rhapsodyâ„¢ Sequencing Analysis Pipeline uses Gencode
v42 for Human and M31 for Mouse.
--Maximum_threads MAXIMUM_THREADS
The maximum number of threads to use in the pipeline.
By default, all available cores are used.
--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS
Names of the Mitochondrial contigs in the provided
Reference Genome. Fragments originating from contigs
other than these are identified as 'nuclear fragments'
in the ATACseq analysis pipeline.
--WTA_Only Build a WTA only index, otherwise builds a WTA + ATAC
index.

View File

@@ -0,0 +1,115 @@
requirements:
InlineJavascriptRequirement: {}
class: CommandLineTool
label: Reference Files Generator for BD Rhapsodyâ„¢ Sequencing Analysis Pipeline
cwlVersion: v1.2
doc: >-
The Reference Files Generator creates an archive containing Genome Index and Transcriptome annotation files needed for the BD Rhapsodyâ„¢ Sequencing Analysis Pipeline. The app takes as input one or more FASTA and GTF files and produces a compressed archive in the form of a tar.gz file. The archive contains:\n - STAR index\n - Filtered GTF file
baseCommand: run_reference_generator.sh
inputs:
Genome_fasta:
type: File[]
label: Reference Genome
doc: |-
Reference genome file in FASTA format. The BD Rhapsodyâ„¢ Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
inputBinding:
prefix: --reference-genome
shellQuote: false
Gtf:
type: File[]
label: Transcript Annotations
doc: |-
Transcript annotation files in GTF format. The BD Rhapsodyâ„¢ Sequencing Analysis Pipeline uses Gencode v42 for Human and M31 for Mouse.
inputBinding:
prefix: --gtf
shellQuote: false
Extra_sequences:
type: File[]?
label: Extra Sequences
doc: |-
Additional sequences in FASTA format to use when building the STAR index. (E.g. phiX genome)
inputBinding:
prefix: --extra-sequences
shellQuote: false
Mitochondrial_Contigs:
type: string[]?
default: ["chrM", "chrMT", "M", "MT"]
label: Mitochondrial Contig Names
doc: |-
Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are identified as 'nuclear fragments' in the ATACseq analysis pipeline.
inputBinding:
prefix: --mitochondrial-contigs
shellQuote: false
Filtering_off:
type: boolean?
label: Turn off filtering
doc: |-
By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features having the following attribute values are are kept:
- protein_coding
- lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
- IG_LV_gene
- IG_V_gene
- IG_V_pseudogene
- IG_D_gene
- IG_J_gene
- IG_J_pseudogene
- IG_C_gene
- IG_C_pseudogene
- TR_V_gene
- TR_V_pseudogene
- TR_D_gene
- TR_J_gene
- TR_J_pseudogene
- TR_C_gene
If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
inputBinding:
prefix: --filtering-off
shellQuote: false
WTA_Only:
type: boolean?
label: WTA only index
doc: Build a WTA only index, otherwise builds a WTA + ATAC index.
inputBinding:
prefix: --wta-only-index
shellQuote: false
Archive_prefix:
type: string?
label: Archive Prefix
doc: |-
A prefix for naming the compressed archive file containing the Reference genome index and annotation files. The default value is constructed based on the input Reference files.
inputBinding:
prefix: --archive-prefix
shellQuote: false
Extra_STAR_params:
type: string?
label: Extra STAR Params
doc: |-
Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
Example:
--limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
inputBinding:
prefix: --extra-star-params
shellQuote: true
Maximum_threads:
type: int?
label: Maximum Number of Threads
doc: |-
The maximum number of threads to use in the pipeline. By default, all available cores are used.
inputBinding:
prefix: --maximum-threads
shellQuote: false
outputs:
Archive:
type: File
doc: |-
A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an input in the BD Rhapsodyâ„¢ Sequencing Analysis Pipeline.
id: Reference_Archive
label: Reference Files Archive
outputBinding:
glob: '*.tar.gz'

View File

@@ -0,0 +1,161 @@
import os
import re
import subprocess
import tempfile
from typing import Any
import yaml
import shutil
## VIASH START
par = {
"genome_fasta": [],
"gtf": [],
"extra_sequences": [],
"mitochondrial_contigs": ["chrM", "chrMT", "M", "MT"],
"filtering_off": False,
"wta_only_index": False,
"extra_star_params": None,
"reference_archive": "output.tar.gz",
}
meta = {
"config": "target/nextflow/reference/build_bdrhap_2_reference/.config.vsh.yaml",
"resources_dir": os.path.abspath("src/reference/build_bdrhap_2_reference"),
"temp_dir": os.getenv("VIASH_TEMP"),
"memory_mb": None,
"cpus": None
}
## VIASH END
def clean_arg(argument):
argument["clean_name"] = re.sub("^-*", "", argument["name"])
return argument
def read_config(path: str) -> dict[str, Any]:
with open(path, "r") as f:
config = yaml.safe_load(f)
config["all_arguments"] = [
clean_arg(arg)
for grp in config["argument_groups"]
for arg in grp["arguments"]
]
return config
def strip_margin(text: str) -> str:
return re.sub("(\n?)[ \t]*\|", "\\1", text)
def process_params(par: dict[str, Any], config) -> str:
# check input parameters
assert par["genome_fasta"], "Pass at least one set of inputs to --genome_fasta."
assert par["gtf"], "Pass at least one set of inputs to --gtf."
assert par["reference_archive"].endswith(".tar.gz"), "Output reference_archive must end with .tar.gz."
# make paths absolute
for argument in config["all_arguments"]:
if par[argument["clean_name"]] and argument["type"] == "file":
if isinstance(par[argument["clean_name"]], list):
par[argument["clean_name"]] = [ os.path.abspath(f) for f in par[argument["clean_name"]] ]
else:
par[argument["clean_name"]] = os.path.abspath(par[argument["clean_name"]])
return par
def generate_config(par: dict[str, Any], meta, config) -> str:
content_list = [strip_margin(f"""\
|#!/usr/bin/env cwl-runner
|
|""")]
config_key_value_pairs = []
for argument in config["all_arguments"]:
config_key = (argument.get("info") or {}).get("config_key")
arg_type = argument["type"]
par_value = par[argument["clean_name"]]
if par_value and config_key:
config_key_value_pairs.append((config_key, arg_type, par_value))
if meta["cpus"]:
config_key_value_pairs.append(("Maximum_threads", "integer", meta["cpus"]))
# print(config_key_value_pairs)
for config_key, arg_type, par_value in config_key_value_pairs:
if arg_type == "file":
str = strip_margin(f"""\
|{config_key}:
|""")
if isinstance(par_value, list):
for file in par_value:
str += strip_margin(f"""\
| - class: File
| location: "{file}"
|""")
else:
str += strip_margin(f"""\
| class: File
| location: "{par_value}"
|""")
content_list.append(str)
else:
content_list.append(strip_margin(f"""\
|{config_key}: {par_value}
|"""))
## Write config to file
return "".join(content_list)
def get_cwl_file(meta: dict[str, Any]) -> str:
# create cwl file (if need be)
cwl_file=os.path.join(meta["resources_dir"], "make_rhap_reference_2.2.1_nodocker.cwl")
return cwl_file
def main(par: dict[str, Any], meta: dict[str, Any]):
config = read_config(meta["config"])
# Preprocess params
par = process_params(par, config)
# fetch cwl file
cwl_file = get_cwl_file(meta)
# Create output dir if not exists
outdir = os.path.dirname(par["reference_archive"])
if not os.path.exists(outdir):
os.makedirs(outdir)
## Run pipeline
with tempfile.TemporaryDirectory(prefix="cwl-bd_rhapsody_wta-", dir=meta["temp_dir"]) as temp_dir:
# Create params file
config_file = os.path.join(temp_dir, "config.yml")
config_content = generate_config(par, meta, config)
with open(config_file, "w") as f:
f.write(config_content)
cmd = [
"cwl-runner",
"--no-container",
"--preserve-entire-environment",
"--outdir",
temp_dir,
cwl_file,
config_file
]
env = dict(os.environ)
env["TMPDIR"] = temp_dir
print("> " + " ".join(cmd), flush=True)
_ = subprocess.check_call(
cmd,
cwd=os.path.dirname(config_file),
env=env
)
shutil.move(os.path.join(temp_dir, "Rhap_reference.tar.gz"), par["reference_archive"])
if __name__ == "__main__":
main(par, meta)

View File

@@ -0,0 +1,65 @@
#!/bin/bash
set -e
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
in_fa="$meta_resources_dir/test_data/reference_small.fa"
in_gtf="$meta_resources_dir/test_data/reference_small.gtf"
echo "#############################################"
echo "> Simple run"
mkdir simple_run
cd simple_run
out_tar="myreference.tar.gz"
echo "> Running $meta_name."
$meta_executable \
--genome_fasta "$in_fa" \
--gtf "$in_gtf" \
--reference_archive "$out_tar" \
--extra_star_params "--genomeSAindexNbases 6" \
---cpus 2
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
assert_file_exists "$out_tar"
assert_file_not_empty "$out_tar"
echo ">> Checking whether output contains the expected files"
tar -xvf "$out_tar" > /dev/null
assert_file_exists "BD_Rhapsody_Reference_Files/star_index/genomeParameters.txt"
assert_file_exists "BD_Rhapsody_Reference_Files/bwa-mem2_index/reference_small.ann"
assert_file_exists "BD_Rhapsody_Reference_Files/reference_small-processed.gtf"
assert_file_exists "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt"
assert_file_contains "BD_Rhapsody_Reference_Files/reference_small-processed.gtf" "chr1.*HAVANA.*ENSG00000243485"
assert_file_contains "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt" 'chrMT'
cd ..
echo "#############################################"
echo "> Tests succeeded!"

View File

@@ -0,0 +1,27 @@
>chr1 1
TGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGT
GATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTA
ATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTT
TTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTT
TAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCT
GGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAA
AAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAA
ATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGTTGTTTATC
TGAGATTCAGAATTAAGCATTTTATATTTTATTTGCTGCCTCTGGCCACCCTACTCTCTT
CCTAACACTCTCTCCCTCTCCCAGTTTTGTCCGCCTTCCCTGCCTCCTCTTCTGGGGGAG
TTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCT
TACCAGAGGTAAAGAAGAGATGGATCTCCACTCATGTTGTAGACAGAATGTTTATGTCCT
CTCCAAATGCTTATGTTGAAACCCTAACCCCTAATGTGATGGTATGTGGAGATGGGCCTT
TGGTAGGTAATTACGGTTAGATGAGGTCATGGGGTGGGGCCCTCATTATAGATCTGGTAA
GAAAAGAGAGCATTGTCTCTGTGTCTCCCTCTCTCTCTCTCTCTCTCTCTCTCATTTCTC
TCTATCTCATTTCTCTCTCTCTCGCTATCTCATTTTTCTCTCTCTCTCTTTCTCTCCTCT
GTCTTTTCCCACCAAGTGAGGATGCGAAGAGAAGGTGGCTGTCTGCAAACCAGGAAGAGA
GCCCTCACCGGGAACCCGTCCAGCTGCCACCTTGAACTTGGACTTCCAAGCCTCCAGAAC
TGTGAGGGATAAATGTATGATTTTAAAGTCGCCCAGTGTGTGGTATTTTGTTTTGACTAA
TACAACCTGAAAACATTTTCCCCTCACTCCACCTGAGCAATATCTGAGTGGCTTAAGGTA
CTCAGGACACAACAAAGGAGAAATGTCCCATGCACAAGGTGCACCCATGCCTGGGTAAAG
CAGCCTGGCACAGAGGGAAGCACACAGGCTCAGGGATCTGCTATTCATTCTTTGTGTGAC
CCTGGGCAAGCCATGAATGGAGCTTCAGTCACCCCATTTGTAATGGGATTTAATTGTGCT
TGCCCTGCCTCCTTTTGAGGGCTGTAGAGAAAAGATGTCAAAGTATTTTGTAATCTGGCT
GGGCGTGGTGGCTCATGCCTGTAATCCTAGCACTTTGGTAGGCTGACGCGAGAGGACTGC
T

View File

@@ -0,0 +1,8 @@
chr1 HAVANA exon 565 668 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 977 1098 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA transcript 268 1110 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
chr1 HAVANA exon 268 668 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 1; exon_id "ENSE00001841699.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
chr1 HAVANA exon 977 1110 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 2; exon_id "ENSE00001890064.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
chr1 ENSEMBL gene 367 504 . + . gene_id "ENSG00000284332.1"; gene_type "miRNA"; gene_name "MIR1302-2"; level 3; hgnc_id "HGNC:35294";
chr1 ENSEMBL transcript 367 504 . + . gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";
chr1 ENSEMBL exon 367 504 . + . gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; exon_number 1; exon_id "ENSE00003695741.1"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";

View File

@@ -0,0 +1,47 @@
#!/bin/bash
TMP_DIR=/tmp/bd_rhapsody_make_reference
OUT_DIR=src/bd_rhapsody/bd_rhapsody_make_reference/test_data
# check if seqkit is installed
if ! command -v seqkit &> /dev/null; then
echo "seqkit could not be found"
exit 1
fi
# create temporary directory and clean up on exit
mkdir -p $TMP_DIR
function clean_up {
rm -rf "$TMP_DIR"
}
trap clean_up EXIT
# fetch reference
ORIG_FA=$TMP_DIR/reference.fa.gz
if [ ! -f $ORIG_FA ]; then
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/GRCh38.primary_assembly.genome.fa.gz \
-O $ORIG_FA
fi
ORIG_GTF=$TMP_DIR/reference.gtf.gz
if [ ! -f $ORIG_GTF ]; then
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz \
-O $ORIG_GTF
fi
# create small reference
START=30000
END=31500
CHR=chr1
# subset to small region
seqkit grep -r -p "^$CHR\$" "$ORIG_FA" | \
seqkit subseq -r "$START:$END" > $OUT_DIR/reference_small.fa
zcat "$ORIG_GTF" | \
awk -v FS='\t' -v OFS='\t' "
\$1 == \"$CHR\" && \$4 >= $START && \$5 <= $END {
\$4 = \$4 - $START + 1;
\$5 = \$5 - $START + 1;
print;
}" > $OUT_DIR/reference_small.gtf

View File

@@ -10,6 +10,9 @@ references:
license: GPL-2.0
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/dries_schaumont.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Input arguments

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Outputs
arguments:

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.14806/ej.17.1.200
license: MIT
authors:
- __merge__: /src/_authors/toni_verbeiren.yaml
roles: [ author, maintainer ]
argument_groups:
####################################################################
- name: Specify Adapters for R1

View File

@@ -6,25 +6,25 @@ set -eo pipefail
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
mkdir test_multiple_output
cd test_multiple_output

View File

@@ -9,6 +9,9 @@ references:
license: GPL-3.0
requirements:
commands: [falco]
authors:
- __merge__: /src/_authors/toni_verbeiren.yaml
roles: [ author, maintainer ]
# Notes:
# - falco as arguments similar to -subsample and we update those to --subsample

View File

@@ -26,6 +26,9 @@ links:
references:
doi: "10.1093/bioinformatics/bty560"
license: MIT
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
description: |

View File

@@ -11,7 +11,9 @@ references:
license: GPL-3.0
requirements:
commands: [ featureCounts ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -8,8 +8,9 @@ links:
references:
doi: 10.12688/f1000research.23297.2
license: MIT
requirements:
commands: [ gffread ]
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
@@ -52,7 +53,7 @@ argument_groups:
required: true
description: |
Write the output records into <outfile>.
default: output.gff
example: output.gff
- name: --force_exons
type: boolean_true
description: |
@@ -154,7 +155,6 @@ argument_groups:
- name: --table
type: string
multiple: true
multiple_sep: ","
description: |
Output a simple tab delimited format instead of GFF, with columns having the values
of GFF attributes given in <attrlist>; special pseudo-attributes (prefixed by @) are

View File

@@ -50,6 +50,8 @@
[[ "$par_expose_dups" == "false" ]] && unset par_expose_dups
[[ "$par_cluster_only" == "false" ]] && unset par_cluster_only
# if par_table is not empty, replace ";" with ","
par_table=$(echo "$par_table" | tr ';' ',')
$(which gffread) \
"$par_input" \

View File

@@ -86,7 +86,7 @@ diff "$expected_output_dir/transcripts.fa" "$test_output_dir/transcripts.fa" ||
echo "> Test 4 - Generate table from GFF annotation file"
"$meta_executable" \
--table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
--table "@id;@chr;@start;@end;@strand;@exons;Name;gene;product" \
--outfile "$test_output_dir/annotation.tbl" \
--input "$test_dir/sequence.gff3"

View File

@@ -17,6 +17,9 @@ references:
license: "MIT"
requirements:
commands: [ lofreq ]
authors:
- __merge__: /src/_authors/kai_waldrant.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -18,6 +18,9 @@ references:
license: "MIT"
requirements:
commands: [ lofreq ]
authors:
- __merge__: /src/_authors/kai_waldrant.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -11,7 +11,9 @@ info:
references:
doi: 10.1093/bioinformatics/btw354
licence: GPL v3 or later
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: "Input"
arguments:
@@ -54,25 +56,21 @@ argument_groups:
- name: "--include_modules"
type: string
multiple: true
multiple_sep: ","
example: fastqc,cutadapt
example: [fastqc, cutadapt]
description: Use only these module
- name: "--exclude_modules"
type: string
multiple: true
multiple_sep: ","
example: fastqc,cutadapt
example: [fastqc, cutadapt]
description: Do not use only these modules
- name: "--ignore_analysis"
type: string
multiple: true
multiple_sep: ","
example: run_one/*,run_two/*
example: [run_one/*, run_two/*]
- name: "--ignore_samples"
type: string
multiple: true
multiple_sep: ","
example: sample_1*,sample_3*
example: [sample_1*, sample_3*]
- name: "--ignore_symlinks"
type: boolean_true
description: Ignore symlinked directories and files

View File

@@ -38,7 +38,7 @@ IFS=";" read -ra inputs <<< $par_input
if [[ -n "$par_include_modules" ]]; then
include_modules=""
IFS="," read -ra incl_modules <<< $par_include_modules
IFS=";" read -ra incl_modules <<< $par_include_modules
for i in "${incl_modules[@]}"; do
include_modules+="--include $i "
done
@@ -47,7 +47,7 @@ fi
if [[ -n "$par_exclude_modules" ]]; then
exclude_modules=""
IFS="," read -ra excl_modules <<< $par_exclude_modules
IFS=";" read -ra excl_modules <<< $par_exclude_modules
for i in "${excl_modules[@]}"; do
exclude_modules+="--exclude $i"
done
@@ -56,7 +56,7 @@ fi
if [[ -n "$par_ignore_analysis" ]]; then
ignore=""
IFS="," read -ra ignore_analysis <<< $par_ignore_analysis
IFS=";" read -ra ignore_analysis <<< $par_ignore_analysis
for i in "${ignore_analysis[@]}"; do
ignore+="--ignore $i "
done
@@ -65,7 +65,7 @@ fi
if [[ -n "$par_ignore_samples" ]]; then
ignore_samples=""
IFS="," read -ra ign_samples <<< $par_ignore_samples
IFS=";" read -ra ign_samples <<< $par_ignore_samples
for i in "${ign_samples[@]}"; do
ignore_samples+="--ignore-samples $i"
done

View File

@@ -12,7 +12,10 @@ references:
doi: 10.1093/bioinformatics/btt593
license: "CC-BY-NC-SA-3.0"
requirements:
commands: [ pear , gzip ]
commands: [ pear, gzip ]
authors:
- __merge__: /src/_authors/kai_waldrant.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -12,14 +12,16 @@ references:
license: GPL-3.0
requirements:
commands: [ salmon ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --genome
type: file
description: |
Genome of the organism to prepare the set of decoy sequences. Required to build decoy-aware transccriptome.
Genome of the organism to prepare the set of decoy sequences. Required to build decoy-aware transcriptome.
required: false
example: genome.fasta
- name: --transcripts
@@ -110,4 +112,4 @@ engines:
salmon index -v 2>&1 | sed 's/salmon \([0-9.]*\)/salmon: \1/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
- type: nextflow

View File

@@ -12,7 +12,9 @@ references:
license: GPL-3.0
requirements:
commands: [ salmon ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Common input options
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -0,0 +1,193 @@
name: samtools_fasta
namespace: samtools
description: Converts a SAM, BAM or CRAM to FASTA format.
keywords: [fasta, bam, sam, cram]
links:
homepage: https://www.htslib.org/
documentation: https://www.htslib.org/doc/samtools-fasta.html
repository: https://github.com/samtools/samtools
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: input SAM/BAM/CRAM file
required: true
- name: Outputs
arguments:
- name: --output
type: file
description: output FASTA file
required: true
direction: output
- name: Options
arguments:
- name: --no_suffix
alternatives: -n
type: boolean_true
description: |
By default, either '/1' or '/2' is added to the end of read names where the corresponding
READ1 or READ2 FLAG bit is set. Using -n causes read names to be left as they are.
- name: --suffix
alternatives: -N
type: boolean_true
description: |
Always add either '/1' or '/2' to the end of read names even when put into different files.
- name: --use_oq
alternatives: -O
type: boolean_true
description: |
Use quality values from OQ tags in preference to standard quality string if available.
- name: --singleton
alternatives: -s
type: file
description: write singleton reads to FILE.
- name: --copy_tags
alternatives: -t
type: boolean_true
description: |
Copy RG, BC and QT tags to the FASTA header line, if they exist.
- name: --copy_tags_list
alternatives: -T
type: string
description: |
Specify a comma-separated list of tags to copy to the FASTA header line, if they exist.
TAGLIST can be blank or `*` to indicate all tags should be copied to the output. If using `*`,
be careful to quote it to avoid unwanted shell expansion.
- name: --read1
alternatives: -1
type: file
description: |
Write reads with the READ1 FLAG set (and READ2 not set) to FILE instead of outputting them.
If the -s option is used, only paired reads will be written to this file.
direction: output
- name: --read2
alternatives: -2
type: file
description: |
Write reads with the READ2 FLAG set (and READ1 not set) to FILE instead of outputting them.
If the -s option is used, only paired reads will be written to this file.
direction: output
- name: --output_reads
alternatives: -o
type: file
description: |
Write reads with either READ1 FLAG or READ2 flag set to FILE instead of outputting them to stdout.
This is equivalent to -1 FILE -2 FILE.
direction: output
- name: --output_reads_both
alternatives: -0
type: file
description: |
Write reads where the READ1 and READ2 FLAG bits set are either both set or both unset to FILE
instead of outputting them.
direction: output
- name: --filter_flags
alternatives: -f
type: integer
description: |
Only output alignments with all bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
(i.e. /^0[0-7]+/). Default: `0`.
example: 0
- name: --excl_flags
alternatives: -F
type: string
description: |
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
(i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and
supplementary alignments. Default: `0x900`.
example: "0x900"
- name: --incl_flags
alternatives: --rf
type: string
description: |
Only output alignments with any bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with '0'
(i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of
flag names. Default: `0`.
example: 0
- name: --excl_flags_all
alternatives: -G
type: integer
description: |
Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/).
Default: `0`.
example: 0
- name: --aux_tag
alternatives: -d
type: string
description: |
Only output alignments containing an auxiliary tag matching both TAG and VAL. If VAL is omitted
then any value is accepted. The tag types supported are i, f, Z, A and H. "B" arrays are not
supported. This is comparable to the method used in samtools view --tag. The option may be specified
multiple times and is equivalent to using the --aux_tag_file option.
- name: --aux_tag_file
alternatives: -D
type: string
description: |
Only output alignments containing an auxiliary tag matching TAG and having a value listed in FILE.
The format of the file is one line per value. This is equivalent to specifying --aux_tag multiple times.
- name: --casava
alternatives: -i
type: boolean_true
description: add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
- name: --compression
alternatives: -c
type: integer
description: set compression level when writing gz or bgzf fasta files.
example: 0
- name: --index1
alternatives: --i1
type: file
description: write first index reads to FILE.
- name: --index2
alternatives: --i2
type: file
description: write second index reads to FILE.
- name: --barcode_tag
type: string
description: |
Auxiliary tag to find index reads in. Default: `BC`.
example: "BC"
- name: --quality_tag
type: string
description: |
Auxiliary tag to find index quality in. Default: `QT`.
example: "QT"
- name: --index_format
type: string
description: |
string to describe how to parse the barcode and quality tags. For example:
* `i14i8`: the first 14 characters are index 1, the next 8 characters are index 2.
* `n8i14`: ignore the first 8 characters, and use the next 14 characters for index 1.
If the tag contains a separator, then the numeric part can be replaced with`*` to mean
'read until the separator or end of tag', for example: `n*i*`.
resources:
- type: bash_script
path: ../samtools_fastq/script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
setup:
- type: docker
run: |
samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \
sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,80 @@
```
samtools fastq
```
Usage: samtools fastq [options...] <in.bam>
Description:
Converts a SAM, BAM or CRAM to FASTQ format.
Options:
-0 FILE write reads designated READ_OTHER to FILE
-1 FILE write reads designated READ1 to FILE
-2 FILE write reads designated READ2 to FILE
-o FILE write reads designated READ1 or READ2 to FILE
note: if a singleton file is specified with -s, only
paired reads will be written to the -1 and -2 files.
-d, --tag TAG[:VAL]
only include reads containing TAG, optionally with value VAL
-f, --require-flags INT
only include reads with all of the FLAGs in INT present [0]
-F, --excl[ude]-flags INT
only include reads with none of the FLAGs in INT present [0x900]
--rf, --incl[ude]-flags INT
only include reads with any of the FLAGs in INT present [0]
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
-n don't append /1 and /2 to the read name
-N always append /1 and /2 to the read name
-O output quality in the OQ tag if present
-s FILE write singleton reads designated READ1 or READ2 to FILE
-t copy RG, BC and QT tags to the FASTQ header line
-T TAGLIST copy arbitrary tags to the FASTQ header line, '*' for all
-v INT default quality score if not given in file [1]
-i add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
-c INT compression level [0..9] to use when writing bgzf files [1]
--i1 FILE write first index reads to FILE
--i2 FILE write second index reads to FILE
--barcode-tag TAG
Barcode tag [BC]
--quality-tag TAG
Quality tag [QT]
--index-format STR
How to parse barcode and quality tags
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--verbosity INT
Set level of verbosity
The files will be automatically compressed if the file names have a .gz
or .bgzf extension. The input to this program must be collated by name.
Run 'samtools collate' or 'samtools sort -n' to achieve this.
Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.
Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.
Otherwise reads are designated READ_OTHER (both flags set or both flags unset).
Run 'samtools flags' for more information on flag codes and meanings.
The index-format string describes how to parse the barcode and quality tags.
It is made up of 'i' or 'n' followed by a length or '*'. For example:
i14i8 The first 14 characters are index 1, the next 8 are index 2
n8i14 Ignore the first 8 characters, and use the next 14 for index 1
If the tag contains a separator, then the numeric part can be replaced with
'*' to mean 'read until the separator or end of tag', for example:
i*i* Break the tag at the separator into index 1 and index 2
n*i* Ignore the left part of the tag until the separator,
then use the second part of the tag as index 1
Examples:
To get just the paired reads in separate files, use:
samtools fastq -1 pair1.fq -2 pair2.fq -0 /dev/null -s /dev/null -n in.bam
To get all non-supplementary/secondary reads in a single file, redirect
the output:
samtools fastq in.bam > all_reads.fq

View File

@@ -0,0 +1,96 @@
#!/bin/bash
test_dir="${meta_resources_dir}/test_data"
out_dir="${meta_resources_dir}/out_data"
############################################################################################
echo ">>> Test 1: Convert all reads from a bam file to fasta format"
"$meta_executable" \
--input "$test_dir/a.bam" \
--output "$out_dir/a.fa"
echo ">>> Check if output file exists"
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1
echo ">>> Check if output is empty"
[ ! -s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1
echo ">>> Check if output matches expected output"
diff "$out_dir/a.fa" "$test_dir/a.fa" ||
(echo "Output file a.fa does not match expected output" && exit 1)
rm "$out_dir/a.fa"
############################################################################################
echo ">>> Test 2: Convert all reads from a sam file to fasta format"
"$meta_executable" \
--input "$test_dir/a.sam" \
--output "$out_dir/a.fa"
echo ">>> Check if output file exists"
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1
echo ">>> Check if output is empty"
[ ! -s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1
echo ">>> Check if output matches expected output"
diff "$out_dir/a.fa" "$test_dir/a.fa" ||
(echo "Output file a.fa does not match expected output" && exit 1)
rm "$out_dir/a.fa"
############################################################################################
echo ">>> Test 3: Output reads from bam file to separate files"
"$meta_executable" \
--input "$test_dir/a.bam" \
--read1 "$out_dir/a.1.fa" \
--read2 "$out_dir/a.2.fa" \
--output "$out_dir/a.fa"
echo ">>> Check if output files exist"
[ ! -f "$out_dir/a.1.fa" ] && echo "Output file a.1.fa does not exist" && exit 1
[ ! -f "$out_dir/a.2.fa" ] && echo "Output file a.2.fa does not exist" && exit 1
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1
echo ">>> Check if output files are empty"
[ ! -s "$out_dir/a.1.fa" ] && echo "Output file a.1.fa is empty" && exit 1
[ ! -s "$out_dir/a.2.fa" ] && echo "Output file a.2.fa is empty" && exit 1
# output should be empty since input has no singleton reads
echo ">>> Check if output files match expected output"
diff "$out_dir/a.1.fa" "$test_dir/a.1.fa" ||
(echo "Output file a.1.fa does not match expected output" && exit 1)
diff "$out_dir/a.2.fa" "$test_dir/a.2.fa" ||
(echo "Output file a.2.fa does not match expected output" && exit 1)
rm "$out_dir/a.1.fa" "$out_dir/a.2.fa" "$out_dir/a.fa"
############################################################################################
echo ">>> Test 4: Output only forward reads from bam file to fasta format"
"$meta_executable" \
--input "$test_dir/a.sam" \
--excl_flags "0x80" \
--output "$out_dir/half.fa"
echo ">>> Check if output file exists"
[ ! -f "$out_dir/half.fa" ] && echo "Output file half.fa does not exist" && exit 1
echo ">>> Check if output is empty"
[ ! -s "$out_dir/half.fa" ] && echo "Output file half.fa is empty" && exit 1
echo ">>> Check if output matches expected output"
diff "$out_dir/half.fa" "$test_dir/half.fa" ||
(echo "Output file half.fa does not match expected output" && exit 1)
rm "$out_dir/half.fa"
############################################################################################
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,6 @@
>a1
AAAAAAAAAA
>b1
AAAAAAAAAA
>c1
AAAAAAAAAA

View File

@@ -0,0 +1,6 @@
>a1
AAAAAAAAAA
>b1
AAAAAAAAAA
>c1
AAAAAAAAAA

Binary file not shown.

View File

@@ -0,0 +1,12 @@
>a1/1
AAAAAAAAAA
>b1/1
AAAAAAAAAA
>c1/1
AAAAAAAAAA
>a1/2
AAAAAAAAAA
>b1/2
AAAAAAAAAA
>c1/2
AAAAAAAAAA

View File

@@ -0,0 +1,7 @@
@SQ SN:xx LN:20
a1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
b1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
c1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
b1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
c1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********

View File

@@ -0,0 +1,6 @@
>a1/1
AAAAAAAAAA
>b1/1
AAAAAAAAAA
>c1/1
AAAAAAAAAA

View File

@@ -0,0 +1,11 @@
#!/bin/bash
# dowload test data from snakemake wrapper
if [ ! -d /tmp/fastq_source ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers.git /tmp/fastq_source
fi
cp -r /tmp/fastq_source/bio/samtools/fastx/test/*.sam src/samtools/samtools_fastq/test_data/
cp -r /tmp/fastq_source/bio/samtools/fastq/interleaved/test/mapped/*.bam src/samtools/samtools_fastq/test_data/
cp -r /tmp/fastq_source/bio/samtools/fastq/interleaved/test/reads/*.fq src/samtools/samtools_fastq/test_data/
cp -r /tmp/fastq_source/bio/samtools/fastq/separate/test/reads/*.fq src/samtools/samtools_fastq/test_data/

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
@@ -56,7 +58,7 @@ argument_groups:
type: string
description: |
Specify a comma-separated list of tags to copy to the FASTQ header line, if they exist.
TAGLIST can be blank or * to indicate all tags should be copied to the output. If using *,
TAGLIST can be blank or `*` to indicate all tags should be copied to the output. If using `*`,
be careful to quote it to avoid unwanted shell expansion.
- name: --read1
alternatives: -1
@@ -91,35 +93,35 @@ argument_groups:
type: integer
description: |
Only output alignments with all bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
(i.e. /^0[0-7]+/).
default: 0
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
(i.e. /^0[0-7]+/). Default: `0`.
example: 0
- name: --excl_flags
alternatives: -F
type: string
description: |
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
(i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and
supplementary alignments.
default: 0x900
supplementary alignments. Default: `0x900`.
example: "0x900"
- name: --incl_flags
alternatives: --rf
type: string
description: |
Only output alignments with any bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0'
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with '0'
(i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of
flag names.
default: 0
flag names. Default: `0`.
example: 0
- name: --excl_flags_all
alternatives: -G
type: integer
description: |
Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
(i.e. /^0[0-7]+/).
default: 0
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/).
Default: `0`.
example: 0
- name: --aux_tag
alternatives: -d
type: string
@@ -137,12 +139,13 @@ argument_groups:
- name: --casava
alternatives: -i
type: boolean_true
description: add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
description: |
Add Illumina Casava 1.8 format entry to header, for example: `1:N:0:ATCACG`.
- name: --compression
alternatives: -c
type: integer
description: set compression level when writing gz or bgzf fastq files.
default: 0
example: 0
- name: --index1
alternatives: --i1
type: file
@@ -153,20 +156,22 @@ argument_groups:
description: write second index reads to FILE.
- name: --barcode_tag
type: string
description: Auxiliary tag to find index reads in.
default: BC
description: |
Auxiliary tag to find index reads in. Default: `BC`.
example: "BC"
- name: --quality_tag
type: string
description: Auxiliary tag to find index quality in.
default: QT
description: |
Auxiliary tag to find index quality in. Default: `QT`.
example: QT
- name: --index_format
type: string
description: |
string to describe how to parse the barcode and quality tags. For example:
[i14i8]: the first 14 characters are index 1, the next 8 characters are index 2.
[n8i14]: ignore the first 8 characters, and use the next 14 characters for index 1.
* `i14i8`: the first 14 characters are index 1, the next 8 characters are index 2.
* `n8i14`: ignore the first 8 characters, and use the next 14 characters for index 1.
If the tag contains a separator, then the numeric part can be replaced with '*' to mean
'read until the separator or end of tag', for example: [n*i*].
'read until the separator or end of tag', for example: `n*i*`.
resources:
- type: bash_script

View File

@@ -11,7 +11,14 @@ set -e
[[ "$par_copy_tags" == "false" ]] && unset par_copy_tags
[[ "$par_casava" == "false" ]] && unset par_casava
samtools fastq \
if [[ "$meta_name" == "samtools_fasta" ]]; then
subcommand=fasta
elif [[ "$meta_name" == "samtools_fastq" ]]; then
subcommand=fastq
else
echo "Unrecognized component name" && exit 1
fi
samtools "$subcommand" \
${par_no_suffix:+-n} \
${par_suffix:+-N} \
${par_use_oq:+-O} \

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
@@ -30,10 +32,10 @@ argument_groups:
- name: --coverage
alternatives: -c
type: integer
description: |
Coverage distribution min,max,step [1,1000,1].
multiple: true
multiple_sep: ','
description: |
Coverage distribution min;max;step. Default: [1, 1000, 1].
example: [1, 1000, 1]
- name: --remove_dups
alternatives: -d
type: boolean_true
@@ -48,25 +50,25 @@ argument_groups:
alternatives: -f
type: string
description: |
Required flag, 0 for unset. See also `samtools flags`.
default: "0"
Required flag, 0 for unset. See also `samtools flags`. Default: `"0"`.
example: "0"
- name: --filtering_flag
alternatives: -F
type: string
description: |
Filtering flag, 0 for unset. See also `samtools flags`.
default: "0"
Filtering flag, 0 for unset. See also `samtools flags`. Default: `0`.
example: "0"
- name: --GC_depth
type: double
description: |
The size of GC-depth bins (decreasing bin size increases memory requirement).
default: 20000.0
The size of GC-depth bins (decreasing bin size increases memory requirement). Default: `20000`.
example: 20000.0
- name: --insert_size
alternatives: -i
type: integer
description: |
Maximum insert size.
default: 8000
Maximum insert size. Default: `8000`.
example: 8000
- name: --id
alternatives: -I
type: string
@@ -76,14 +78,14 @@ argument_groups:
alternatives: -l
type: integer
description: |
Include in the statistics only reads with the given read length.
default: -1
Include in the statistics only reads with the given read length. Default: `-1`.
example: -1
- name: --most_inserts
alternatives: -m
type: double
description: |
Report only the main part of inserts.
default: 0.99
Report only the main part of inserts. Default: `0.99`.
example: 0.99
- name: --split_prefix
alternatives: -P
type: string
@@ -93,8 +95,8 @@ argument_groups:
alternatives: -q
type: integer
description: |
The BWA trimming parameter.
default: 0
The BWA trimming parameter. Default: `0`.
example: 0
- name: --ref_seq
alternatives: -r
type: file
@@ -124,8 +126,8 @@ argument_groups:
alternatives: -g
type: integer
description: |
Only bases with coverage above this value will be included in the target percentage computation.
default: 0
Only bases with coverage above this value will be included in the target percentage computation. Default: `0`.
example: 0
- name: --input_fmt_option
type: string
description: |
@@ -141,7 +143,7 @@ argument_groups:
type: file
description: |
Output file.
default: "out.txt"
example: "out.txt"
required: true
direction: output

View File

@@ -10,6 +10,9 @@ set -e
[[ "$par_sparse" == "false" ]] && unset par_sparse
[[ "$par_remove_overlaps" == "false" ]] && unset par_remove_overlaps
# change the coverage input from X;X;X to X,X,X
par_coverage=$(echo "$par_coverage" | tr ';' ',')
samtools stats \
${par_coverage:+-c "$par_coverage"} \
${par_remove_dups:+-d} \

View File

@@ -17,7 +17,7 @@ echo ">>> Checking whether output is non-empty"
[ ! -s "$test_dir/test.paired_end.sorted.txt" ] && echo "File 'test.paired_end.sorted.txt' is empty!" && exit 1
echo ">>> Checking whether output is correct"
# compare using diff, ignoring the line stating the command that was passed.
# compare using diff, ignoring the line stating the command that was passed.
diff <(grep -v "^# The command" "$test_dir/test.paired_end.sorted.txt") \
<(grep -v "^# The command" "$test_dir/ref.paired_end.sorted.txt") || \
(echo "Output file ref.paired_end.sorted.txt does not match expected output" && exit 1)

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -0,0 +1,57 @@
name: seqtk_sample
namespace: seqtk
description: Subsamples sequences from FASTA/Q files.
keywords: [sample, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT
authors:
- __merge__: /src/_authors/jakub_majercik.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: The input FASTA/Q file.
required: true
- name: Outputs
arguments:
- name: --output
type: file
description: The output FASTA/Q file.
required: true
direction: output
- name: Options
arguments:
- name: --seed
type: integer
description: Seed for random generator.
example: 42
- name: --fraction_number
type: double
description: Fraction or number of sequences to sample.
required: true
example: 0.1
- name: --two_pass_mode
type: boolean_true
description: Twice as slow but with much reduced memory
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: ../test_data
engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,9 @@
```
seqtk_subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.

View File

@@ -0,0 +1,11 @@
#!/bin/bash
## VIASH START
## VIASH END
seqtk sample \
${par_two_pass_mode:+-2} \
${par_seed:+-s "$par_seed"} \
"$par_input" \
"$par_fraction_number" \
> "$par_output"

View File

@@ -0,0 +1,104 @@
#!/bin/bash
set -e
## VIASH START
meta_executable="target/executable/seqtk/seqtk_sample"
meta_resources_dir="src/seqtk"
## VIASH END
#########################################################################################
mkdir seqtk_sample_se
cd seqtk_sample_se
echo "> Run seqtk_sample on fastq SE"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled.fastq"
echo ">> Check if output exists"
if [ ! -f "sampled.fastq" ]; then
echo ">> sampled.fastq does not exist"
exit 1
fi
echo ">> Count number of samples"
num_samples=$(grep -c '^@' sampled.fastq)
if [ "$num_samples" -ne 3 ]; then
echo ">> sampled.fastq does not contain 3 samples"
exit 1
fi
#########################################################################################
cd ..
mkdir seqtk_sample_pe_number
cd seqtk_sample_pe_number
echo ">> Run seqtk_sample on fastq.gz PE with number of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_1.fastq"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_2.fastq"
echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi
echo ">> Compare reads"
# Extract headers
headers1=$(grep '^@' sampled_1.fastq | sed -e's/ 1$//' | sort)
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
echo ">> Count number of samples"
num_headers=$(echo "$headers1" | wc -l)
if [ "$num_headers" -ne 3 ]; then
echo ">> sampled_1.fastq does not contain 3 headers"
exit 1
fi
#########################################################################################
cd ..
mkdir seqtk_sample_pe_fraction
cd seqtk_sample_pe_fraction
echo ">> Run seqtk_sample on fastq.gz PE with fraction of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_1.fastq"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_2.fastq"
echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi
echo ">> Compare reads"
# Extract headers
headers1=$(grep '^@' sampled_1.fastq | sed -e's/ 1$//' | sort)
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }

View File

@@ -0,0 +1,78 @@
name: seqtk_subseq
namespace: seqtk
description: |
Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence ids file) or a reg.bed (genomic regions file).
keywords: [subseq, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: "--input"
type: file
direction: input
description: The input FASTA/Q file.
required: true
example: input.fa
- name: "--name_list"
type: file
direction: input
description: |
List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
required: true
example: list.lst
- name: Outputs
arguments:
- name: "--output"
alternatives: -o
type: file
direction: output
description: The output FASTA/Q file.
required: true
default: output.fa
- name: Options
arguments:
- name: "--tab"
alternatives: -t
type: boolean_true
description: TAB delimited output.
- name: "--strand_aware"
alternatives: -s
type: boolean_true
description: Strand aware.
- name: "--sequence_line_length"
alternatives: -l
type: integer
description: |
Sequence line length of input fasta file. Default: 0.
example: 0
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
setup:
- type: docker
run: |
echo $(echo $(seqtk 2>&1) | sed -n 's/.*\(Version: [^ ]*\).*/\1/p') > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,9 @@
```bash
seqtk subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.

View File

@@ -0,0 +1,15 @@
#!/bin/bash
## VIASH START
## VIASH END
[[ "$par_tab" == "false" ]] && unset par_tab
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware
seqtk subseq \
${par_tab:+-t} \
${par_strand_aware:+-s} \
${par_sequence_line_length:+-l "$par_sequence_line_length"} \
"$par_input" \
"$par_name_list" \
> "$par_output"

View File

@@ -0,0 +1,182 @@
#!/bin/bash
# exit on error
set -e
## VIASH START
meta_executable="target/executable/seqtk/seqtk_subseq"
meta_resources_dir="src/seqtk"
## VIASH END
# Create directories for tests
echo "Creating Test Data..."
mkdir test_data
# Create and populate input.fasta
cat > "test_data/input.fasta" <<EOL
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
EOL
# Update id.list with new entries
cat > "test_data/id.list" <<EOL
KU562861.1
MH150936.1
EOL
# Create and populate reg.bed
cat > "test_data/reg.bed" <<EOL
KU562861.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")+$(echo -e "\n")
MH150936.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")-
EOL
#########################################################################################
# Run basic test
mkdir test1
cd test1
echo "> Run seqtk_subseq on FASTA/Q file"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/id.list" \
--output "sub_sample.fq"
expected_output_basic=">KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
output_basic=$(cat sub_sample.fq)
if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi
#########################################################################################
# Run reg.bed as name list input test
cd ..
mkdir test2
cd test2
echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_basic=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
TGAAAACTTT"
output_basic=$(cat sub_sample.fq)
if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi
#########################################################################################
# Run tab option output test
cd ..
mkdir test3
cd test3
echo "> Run seqtk_subseq with TAB option"
"$meta_executable" \
--tab \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG\nMH150936.1\t11\tTGAAAACTTT'
output_tabular=$(cat sub_sample.fq)
if [ "$output_tabular" != "$expected_output_tabular" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_tabular"
echo "Got:"
echo "$output_tabular"
exit 1
fi
#########################################################################################
# Run line option output test
cd ..
mkdir test4
cd test4
echo "> Run seqtk_subseq with line length option"
"$meta_executable" \
--sequence_line_length 5 \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_wrapped=">KU562861.1:11-20
AGTGT
TCGAG
>MH150936.1:11-20
TGAAA
ACTTT"
output_wrapped=$(cat sub_sample.fq)
if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi
#########################################################################################
# Run Strand Aware option output test
cd ..
mkdir test5
cd test5
echo "> Run seqtk_subseq with strand aware option"
"$meta_executable" \
--strand_aware \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_wrapped=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
AAAGTTTTCA"
output_wrapped=$(cat sub_sample.fq)
if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi
echo "All tests succeeded!"

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!

Binary file not shown.

View File

@@ -0,0 +1 @@
1

9
src/seqtk/test_data/script.sh Executable file
View File

@@ -0,0 +1,9 @@
# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
# copy test data
cp -r /tmp/snakemake-wrappers/bio/seqtk/test/* src/seqtk/test_data
rm src/seqtk/test_data/Snakefile

File diff suppressed because it is too large Load Diff

View File

@@ -11,6 +11,11 @@ references:
license: MIT
requirements:
commands: [ STAR, python, ps, zcat, bzcat ]
authors:
- __merge__: /src/_authors/angela_o_pisco.yaml
roles: [ author ]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
# manually taking care of the main input and output arguments
argument_groups:
- name: Inputs
@@ -113,6 +118,8 @@ engines:
rm -rf /tmp/STAR-${STAR_VERSION} /tmp/${STAR_VERSION}.zip && \
apt-get --purge autoremove -y ${PACKAGES} && \
apt-get clean
- type: python
packages: [ pyyaml ]
- type: docker
run: |
STAR --version | sed 's#\(.*\)#star: "\1"#' > /var/software_versions.txt

View File

@@ -2,6 +2,7 @@ import tempfile
import subprocess
import shutil
from pathlib import Path
import yaml
## VIASH START
par = {
@@ -18,10 +19,20 @@ par = {
}
meta = {
"cpus": 8,
"temp_dir": "/tmp"
"temp_dir": "/tmp",
"config": "target/executable/star/star_align_reads/.config.vsh.yaml",
}
## VIASH END
# read config
with open(meta["config"], 'r') as stream:
config = yaml.safe_load(stream)
all_arguments = {
arg["name"].lstrip('-'): arg
for argument_group in config["argument_groups"]
for arg in argument_group["arguments"]
}
##################################################
# check and process SE / PE R1 input files
input_r1 = par["input"]
@@ -87,8 +98,13 @@ with tempfile.TemporaryDirectory(prefix="star-", dir=meta["temp_dir"], ignore_cl
cmd_args = [ "STAR" ]
for name, value in par.items():
if value is not None:
if name in all_arguments:
arg_info = all_arguments[name].get("info", {})
cli_name = arg_info.get("orig_name", f"--{name}")
else:
cli_name = f"--{name}"
val_to_add = value if isinstance(value, list) else [value]
cmd_args.extend([f"--{name}"] + [str(x) for x in val_to_add])
cmd_args.extend([cli_name] + [str(x) for x in val_to_add])
print("", flush=True)
# run command

View File

@@ -7,35 +7,34 @@ meta_executable="target/docker/star/star_align_reads/star_align_reads"
meta_resources_dir="src/star/star_align_reads"
## VIASH END
#########################################################################################
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
grep -q -E "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
grep -q -E "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
#########################################################################################
echo "> Prepare test data"
cat > reads_R1.fastq <<'EOF'
@@ -89,14 +88,14 @@ cd star_align_reads_se
echo "> Run star_align_reads on SE"
"$meta_executable" \
--input "../reads_R1.fastq" \
--genomeDir "../index/" \
--genome_dir "../index/" \
--aligned_reads "output.sam" \
--log "log.txt" \
--outReadsUnmapped "Fastx" \
--out_reads_unmapped "Fastx" \
--unmapped "unmapped.sam" \
--quantMode "TranscriptomeSAM;GeneCounts" \
--quant_mode "TranscriptomeSAM;GeneCounts" \
--reads_per_gene "reads_per_gene.tsv" \
--outSJtype Standard \
--out_sj_type Standard \
--splice_junctions "splice_junctions.tsv" \
--reads_aligned_to_transcriptome "transcriptome_aligned.bam" \
${meta_cpus:+---cpus $meta_cpus}
@@ -144,10 +143,10 @@ echo ">> Run star_align_reads on PE"
"$meta_executable" \
--input ../reads_R1.fastq \
--input_r2 ../reads_R2.fastq \
--genomeDir ../index/ \
--genome_dir ../index/ \
--aligned_reads output.bam \
--log log.txt \
--outReadsUnmapped Fastx \
--out_reads_unmapped Fastx \
--unmapped unmapped_r1.bam \
--unmapped_r2 unmapped_r2.bam \
${meta_cpus:+---cpus $meta_cpus}

View File

@@ -14,6 +14,14 @@ param_txt <- iconv(param_txt, "UTF-8", "ASCII//TRANSLIT")
dev_begin <- grep("#####UnderDevelopment_begin", param_txt)
dev_end <- grep("#####UnderDevelopment_end", param_txt)
camel_case_to_snake_case <- function(x) {
x %>%
str_replace_all("([A-Z][A-Z][A-Z]*)", "_\\1_") %>%
str_replace_all("([a-z])([A-Z])", "\\1_\\2") %>%
str_to_lower() %>%
str_replace_all("_$", "")
}
# strip development sections
nondev_ix <- unlist(map2(c(1, dev_end + 1), c(dev_begin - 1, length(param_txt)), function(i, j) {
if (i >= 1 && i < j) {
@@ -128,9 +136,8 @@ out2 <- out %>%
# remove arguments that are related to a different runmode
filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description)) %>%
filter(!grepl("--runMode", group_name) | grepl("--runMode alignReads", group_name)) %>%
filter(!grepl("STARsolo", group_name)) %>%
mutate(
viash_arg = paste0("--", name),
viash_arg = paste0("--", camel_case_to_snake_case(name)),
type_step1 = type %>%
str_replace_all(".*(int, string|string|int|real|double)\\(?(s?).*", "\\1\\2"),
viash_type = type_map[gsub("(int, string|string|int|real|double).*", "\\1", type_step1)],
@@ -155,28 +162,41 @@ out2 <- out %>%
group_name = gsub(" - .*", "", group_name),
required = ifelse(name %in% required_args, TRUE, NA)
)
print(out2, n = 200)
out2 %>% mutate(i = row_number()) %>%
# filter(is.na(default_step1) != is.na(viash_default)) %>%
# change references to argument names
out3 <- out2
for (i in seq_len(nrow(out2))) {
orig_name <- paste0("--", out2$name[[i]])
new_name <- out2$viash_arg[[i]]
out3$description <- str_replace_all(out3$description, orig_name, new_name)
}
# sanity checks
out3 %>% select(name, viash_arg) %>% as.data.frame()
print(out3, n = 200)
out3 %>%
mutate(i = row_number()) %>%
select(-group_name, -description)
out3 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))
out2 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))
argument_groups <- map(unique(out2$group_name), function(group_name) {
args <- out2 %>%
# create argument groups
argument_groups <- map(unique(out3$group_name), function(group_name) {
args <- out3 %>%
filter(group_name == !!group_name) %>%
pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, ...) {
li <- lst(
pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, name, ...) {
li <- list(
name = viash_arg,
type = viash_type,
description = description
description = description,
info = list(
orig_name = paste0("--", name)
)
)
if (all(!is.na(viash_default))) {
li$example <- viash_default
}
if (!is.na(multiple) && multiple) {
li$multiple <- multiple
li$multiple_sep <- ";"
}
if (!is.na(required) && required) {
li$required <- required
@@ -186,4 +206,10 @@ argument_groups <- map(unique(out2$group_name), function(group_name) {
list(name = group_name, arguments = args)
})
yaml::write_yaml(list(argument_groups = argument_groups), yaml_file)
yaml::write_yaml(
list(argument_groups = argument_groups),
yaml_file,
handlers = list(
logical = yaml::verbatim_logical
)
)

View File

@@ -11,75 +11,74 @@ references:
license: MIT
requirements:
commands: [ STAR ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: "Input"
arguments:
- name: "--genomeFastaFiles"
- name: "--genome_fasta_files"
type: file
description: |
Path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped.
required: true
multiple: yes
multiple_sep: ;
- name: "--sjdbGTFfile"
multiple: true
- name: "--sjdb_gtf_file"
type: file
description: Path to the GTF file with annotations
- name: --sjdbOverhang
- name: --sjdb_overhang
type: integer
description: Length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
example: 100
- name: --sjdbGTFchrPrefix
- name: --sjdb_gtf_chr_prefix
type: string
description: Prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSMEBL annotations with UCSC genomes)
- name: --sjdbGTFfeatureExon
- name: --sjdb_gtf_feature_exon
type: string
description: Feature type in GTF file to be used as exons for building transcripts
example: exon
- name: --sjdbGTFtagExonParentTranscript
- name: --sjdb_gtf_tag_exon_parent_transcript
type: string
description: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)
example: transcript_id
- name: --sjdbGTFtagExonParentGene
- name: --sjdb_gtf_tag_exon_parent_gene
type: string
description: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)
example: gene_id
- name: --sjdbGTFtagExonParentGeneName
- name: --sjdb_gtf_tag_exon_parent_gene_name
type: string
description: GTF attribute name for parent gene name
example: gene_name
multiple: yes
multiple_sep: ;
- name: --sjdbGTFtagExonParentGeneType
multiple: true
- name: --sjdb_gtf_tag_exon_parent_gene_type
type: string
description: GTF attribute name for parent gene type
example:
- gene_type
- gene_biotype
multiple: yes
multiple_sep: ;
- name: --limitGenomeGenerateRAM
multiple: true
- name: --limit_genome_generate_ram
type: long
description: Maximum available RAM (bytes) for genome generation
example: '31000000000'
- name: --genomeSAindexNbases
example: 31000000000
- name: --genome_sa_index_nbases
type: integer
description: Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, this parameter must be scaled down to min(14, log2(GenomeLength)/2 - 1).
example: 14
- name: --genomeChrBinNbits
- name: --genome_chr_bin_nbits
type: integer
description: Defined as log2(chrBin), where chrBin is the size of the bins for genome storage. Each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).
example: 18
- name: --genomeSAsparseD
- name: --genome_sa_sparse_d
type: integer
min: 0
example: 1
description: Suffux array sparsity, i.e. distance between indices. Use bigger numbers to decrease needed RAM at the cost of mapping speed reduction.
- name: --genomeSuffixLengthMax
- name: --genome_suffix_length_max
type: integer
description: Maximum length of the suffixes, has to be longer than read length. Use -1 for infinite length.
example: -1
- name: --genomeTransformType
- name: --genome_transform_type
type: string
description: |
Type of genome transformation
@@ -87,7 +86,7 @@ argument_groups:
Haploid ... replace reference alleles with alternative alleles from VCF file (e.g. consensus allele)
Diploid ... create two haplotypes for each chromosome listed in VCF file, for genotypes 1|2, assumes perfect phasing (e.g. personal genome)
example: None
- name: --genomeTransformVCF
- name: --genome_transform_vcf
type: file
description: path to VCF file for genome transformation

View File

@@ -10,20 +10,20 @@ mkdir -p $par_index
STAR \
--runMode genomeGenerate \
--genomeDir $par_index \
--genomeFastaFiles $par_genomeFastaFiles \
--genomeFastaFiles $par_genome_fasta_files \
${meta_cpus:+--runThreadN "${meta_cpus}"} \
${par_sjdbGTFfile:+--sjdbGTFfile "${par_sjdbGTFfile}"} \
${par_sjdb_gtf_file:+--sjdbGTFfile "${par_sjdb_gtf_file}"} \
${par_sjdbOverhang:+--sjdbOverhang "${par_sjdbOverhang}"} \
${par_genomeSAindexNbases:+--genomeSAindexNbases "${par_genomeSAindexNbases}"} \
${par_sjdbGTFchrPrefix:+--sjdbGTFchrPrefix "${par_sjdbGTFchrPrefix}"} \
${par_sjdbGTFfeatureExon:+--sjdbGTFfeatureExon "${par_sjdbGTFfeatureExon}"} \
${par_sjdbGTFtagExonParentTranscript:+--sjdbGTFtagExonParentTranscript "${par_sjdbGTFtagExonParentTranscript}"} \
${par_sjdbGTFtagExonParentGene:+--sjdbGTFtagExonParentGene "${par_sjdbGTFtagExonParentGene}"} \
${par_sjdbGTFtagExonParentGeneName:+--sjdbGTFtagExonParentGeneName "${par_sjdbGTFtagExonParentGeneName}"} \
${par_sjdbGTFtagExonParentGeneType:+--sjdbGTFtagExonParentGeneType "${sjdbGTFtagExonParentGeneType}"} \
${par_limitGenomeGenerateRAM:+--limitGenomeGenerateRAM "${par_limitGenomeGenerateRAM}"} \
${par_genomeChrBinNbits:+--genomeChrBinNbits "${par_genomeChrBinNbits}"} \
${par_genomeSAsparseD:+--genomeSAsparseD "${par_genomeSAsparseD}"} \
${par_genomeSuffixLengthMax:+--genomeSuffixLengthMax "${par_genomeSuffixLengthMax}"} \
${par_genomeTransformType:+--genomeTransformType "${par_genomeTransformType}"} \
${par_genomeTransformVCF:+--genomeTransformVCF "${par_genomeTransformVCF}"} \
${par_genome_sa_index_nbases:+--genomeSAindexNbases "${par_genome_sa_index_nbases}"} \
${par_sjdb_gtf_chr_prefix:+--sjdbGTFchrPrefix "${par_sjdb_gtf_chr_prefix}"} \
${par_sjdb_gtf_feature_exon:+--sjdbGTFfeatureExon "${par_sjdb_gtf_feature_exon}"} \
${par_sjdb_gtf_tag_exon_parent_transcript:+--sjdbGTFtag_exon_parent_transcript "${par_sjdb_gtf_tag_exon_parent_transcript}"} \
${par_sjdb_gtf_tag_exon_parent_gene:+--sjdbGTFtag_exon_parent_gene "${par_sjdb_gtf_tag_exon_parent_gene}"} \
${par_sjdb_gtf_tag_exon_parent_geneName:+--sjdbGTFtag_exon_parent_geneName "${par_sjdb_gtf_tag_exon_parent_geneName}"} \
${par_sjdb_gtf_tag_exon_parent_geneType:+--sjdbGTFtag_exon_parent_geneType "${sjdbGTFtag_exon_parent_geneType}"} \
${par_limit_genome_generate_ram:+--limitGenomeGenerateRAM "${par_limit_genome_generate_ram}"} \
${par_genome_chr_bin_nbits:+--genomeChrBinNbits "${par_genome_chr_bin_nbits}"} \
${par_genome_sa_sparse_d:+--genomeSAsparseD "${par_genome_sa_sparse_d}"} \
${par_genome_suffix_length_max:+--genomeSuffixLengthMax "${par_genome_suffix_length_max}"} \
${par_genome_transform_type:+--genomeTransformType "${par_genome_transform_type}"} \
${par_genome_transform_vcf:+--genomeTransformVCF "${par_genome_transform_vCF}"} \

View File

@@ -27,9 +27,9 @@ echo "> Generate index"
"$meta_executable" \
${meta_cpus:+---cpus $meta_cpus} \
--index "star_index/" \
--genomeFastaFiles "genome.fasta" \
--sjdbGTFfile "genes.gtf" \
--genomeSAindexNbases 2
--genome_fasta_files "genome.fasta" \
--sjdb_gtf_file "genes.gtf" \
--genome_sa_index_nbases 4
files=("Genome" "Log.out" "SA" "SAindex" "chrLength.txt" "chrName.txt" "chrNameLength.txt" "chrStart.txt" "exonGeTrInfo.tab" "exonInfo.tab" "geneInfo.tab" "genomeParameters.txt" "sjdbInfo.txt" "sjdbList.fromGTF.out.tab" "sjdbList.out.tab" "transcriptInfo.tab")

View File

@@ -0,0 +1,305 @@
name: umi_tools_dedup
namespace: umi_tools
description: |
Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read.
keywords: [umi_tools, deduplication, dedup]
links:
homepage: https://umi-tools.readthedocs.io/en/latest/
documentation: https://umi-tools.readthedocs.io/en/latest/reference/dedup.html
repository: https://github.com/CGATOxford/UMI-tools
references:
doi: 10.1101/gr.209601.116
license: MIT
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: --stdin
type: file
description: Input BAM or SAM file. Use --in_sam to specify SAM format.
required: true
- name: --in_sam
type: boolean_true
description: |
By default, inputs are assumed to be in BAM format. Use this options to specify the use of SAM
format for input.
- name: --bai
type: file
description: BAM index
- name: --random_seed
type: integer
description: Random seed to initialize number generator with.
- name: Outputs
arguments:
- name: --output
alternatives: --stdout
type: file
description: Deduplicated BAM file.
required: true
direction: output
- name: --out_sam
type: boolean_true
description: |
By default, outputa are written in BAM format. Use this options to specify the use of SAM format
for output.
- name: --paired
type: boolean_true
description: |
BAM is paired end - output both read pairs. This will also force the use of the template length
to determine reads with the same mapping coordinates.
- name: --output_stats
type: string
description: |
Generate files containing UMI based deduplication statistics files with this prefix in the file names.
- name: --extract_umi_method
type: string
choices: [read_id, tag, umis]
description: |
Specify the method by which the barcodes were encoded in the read.
The options are:
* read_id (default)
* tag
* umis
example: "read_id"
- name: --umi_tag
type: string
description: |
The tag containing the UMI sequence. This is only required if the extract_umi_method is set to tag.
- name: --umi_separator
type: string
description: |
The separator used to separate the UMI from the read sequence. This is only required if the
extract_umi_method is set to id_read. Default: `_`.
example: '_'
- name: --umi_tag_split
type: string
description: Separate the UMI in tag by <SPLIT> and take the first element.
- name: --umi_tag_delimiter
type: string
description: Separate the UMI in by <DELIMITER> and concatenate the elements.
- name: --cell_tag
type: string
description: |
The tag containing the cell barcode sequence. This is only required if the extract_umi_method
is set to tag.
- name: --cell_tag_split
type: string
description: Separate the cell barcode in tag by <SPLIT> and take the first element.
- name: --cell_tag_delimiter
type: string
description: Separate the cell barcode in by <DELIMITER> and concatenate the elements.
- name: Grouping Options
arguments:
- name: --method
type: string
choices: [unique, percentile, cluster, adjacency, directional]
description: |
The method to use for grouping reads.
The options are:
* unique
* percentile
* cluster
* adjacency
* directional (default)
example: "directional"
- name: --edit_distance_threshold
type: integer
description: |
For the adjacency and cluster methods the threshold for the edit distance to connect two
UMIs in the network can be increased. The default value of 1 works best unless the UMI is
very long (>14bp). Default: `1`.
example: 1
- name: --spliced_is_unique
type: boolean_true
description: |
Causes two reads that start in the same position on the same strand and having the same UMI
to be considered unique if one is spliced and the other is not. (Uses the 'N' cigar operation
to test for splicing).
- name: --soft_clip_threshold
type: integer
description: |
Mappers that soft clip will sometimes do so rather than mapping a spliced read if there is only
a small overhang over the exon junction. By setting this option, you can treat reads with at
least this many bases soft-clipped at the 3' end as spliced. Default: `4`.
example: 4
- name: --multimapping_detection_method
type: string
description: |
If the sam/bam contains tags to identify multimapping reads, you can specify for use when selecting
the best read at a given loci. Supported tags are `NH`, `X0` and `XT`. If not specified, the read
with the highest mapping quality will be selected.
- name: --read_length
type: boolean_true
description: Use the read length as a criteria when deduping, for e.g. sRNA-Seq.
- name: Single-cell RNA-Seq Options
arguments:
- name: --per_gene
type: boolean_true
description: |
Reads will be grouped together if they have the same gene. This is useful if your library prep
generates PCR duplicates with non identical alignment positions such as CEL-Seq. Note this option
is hardcoded to be on with the count command. I.e. counting is always performed per-gene. Must be
combined with either --gene_tag or --per_contig option.
- name: --gene_tag
type: string
description: |
Deduplicate per gene. The gene information is encoded in the bam read tag specified.
- name: --assigned_status_tag
type: string
description: |
BAM tag which describes whether a read is assigned to a gene. Defaults to the same value as given
for --gene_tag.
- name: --skip_tags_regex
type: string
description: |
Use in conjunction with the --assigned_status_tag option to skip any reads where the tag matches
this regex. Default ("^[__|Unassigned]") matches anything which starts with "__" or "Unassigned".
- name: --per_contig
type: boolean_true
description: |
Deduplicate per contig (field 3 in BAM; RNAME). All reads with the sam contig will be considered to
have the same alignment position. This is useful if you have aligned to a reference transcriptome
with one transcript per gene. If you have aligned to a transcriptome with more than one transcript
per gene, you can supply a map between transcripts and gene using the --gene_transcript_map option.
- name: --gene_transcript_map
type: file
description: |
A file containing a mapping between gene names and transcript names. The file should be tab
separated with the gene name in the first column and the transcript name in the second column.
- name: --per_cell
type: boolean_true
description: |
Reads will only be grouped together if they have the same cell barcode. Can be combined with
--per_gene.
- name: SAM/BAM Options
arguments:
- name: --mapping_quality
type: integer
description: |
Minimium mapping quality (MAPQ) for a read to be retained. Default: `0`.
example: 0
- name: --unmapped_reads
type: string
description: |
How unmapped reads should be handled.
The options are:
* "discard": Discard all unmapped reads. (default)
* "use": If read2 is unmapped, deduplicate using read1 only. Requires --paired.
* "output": Output unmapped reads/read pairs without UMI grouping/deduplication. Only available in umi_tools group.
example: "discard"
- name: --chimeric_pairs
type: string
choices: [discard, use, output]
description: |
How chimeric pairs should be handled.
The options are:
* "discard": Discard all chimeric read pairs.
* "use": Deduplicate using read1 only. (default)
* "output": Output chimeric pairs without UMI grouping/deduplication. Only available in
umi_tools group.
example: "use"
- name: --unpaired_reads
type: string
choices: [discard, use, output]
description: |
How unpaired reads should be handled.
The options are:
* "discard": Discard all unmapped reads.
* "use": If read2 is unmapped, deduplicate using read1 only. Requires --paired. (default)
* "output": Output unmapped reads/read pairs without UMI grouping/deduplication. Only available
in umi_tools group.
example: "use"
- name: --ignore_umi
type: boolean_true
description: Ignore the UMI and group reads using mapping coordinates only.
- name: --subset
type: double
description: |
Only consider a fraction of the reads, chosen at random. This is useful for doing saturation
analyses.
- name: --chrom
type: string
description: Only consider a single chromosome. This is useful for debugging/testing purposes.
- name: Group/Dedup Options
arguments:
- name: --no_sort_output
type: boolean_true
description: |
By default, output is sorted. This involves the use of a temporary unsorted file (saved in
--temp_dir). Use this option to turn off sorting.
- name: --buffer_whole_contig
type: boolean_true
description: |
Forces dedup to parse an entire contig before yielding any reads for deduplication. This is the
only way to absolutely guarantee that all reads with the same start position are grouped together
for deduplication since dedup uses the start position of the read, not the alignment coordinate on
which the reads are sorted. However, by default, dedup reads for another 1000bp before outputting
read groups which will avoid any reads being missed with short read sequencing (<1000bp).
- name: Common Options
arguments:
- name: --log
alternatives: -L
type: file
description: File with logging information.
- name: --log2stderr
type: boolean_true
description: Send logging information to stderr.
- name: --verbose
alternatives: -v
type: integer
description: |
Log level. The higher, the more output. Default: `0`.
example: 0
- name: --error
alternatives: -E
type: file
description: File with error information.
- name: --temp_dir
type: string
description: |
Directory for temporary files. If not set, the bash environmental variable TMPDIR is used.
- name: --compresslevel
type: integer
description: |
Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default.
Default: `6`.
example: 6
- name: --timeit
type: file
description: Store timing information in file.
- name: --timeit_name
type: string
description: |
Name in timing file for this class of jobs. Default: `all`.
example: "all"
- name: --timeit_header
type: string
description: Add header for timing information.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/umi_tools:1.1.5--py39hf95cd2a_1
setup:
- type: docker
run: |
umi_tools -v | sed 's/ version//g' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,113 @@
'''
Generated from the following UMI-tools documentation:
https://umi-tools.readthedocs.io/en/latest/common_options.html#common-options
https://umi-tools.readthedocs.io/en/latest/reference/dedup.html
'''
dedup - Deduplicate reads using UMI and mapping coordinates
Usage: umi_tools dedup [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM]
note: If --stdout is ommited, standard out is output. To
generate a valid BAM file on standard out, please
redirect log with --log=LOGFILE or --log2stderr
Common UMI-tools Options:
-S, --stdout File where output is to go [default = stdout].
-L, --log File with logging information [default = stdout].
--log2stderr Send logging information to stderr [default = False].
-v, --verbose Log level. The higher, the more output [default = 1].
-E, --error File with error information [default = stderr].
--temp-dir Directory for temporary files. If not set, the bash environmental variable TMPDIR is used[default = None].
--compresslevel Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default (which is 9)
profiling and debugging options:
--timeit Store timing information in file [default=none].
--timeit-name Name in timing file for this class of jobs [default=all].
--timeit-header Add header for timing information [default=none].
--random-seed Random seed to initialize number generator with [default=none].
Dedup Options:
--output-stats=<prefix> One can use the edit distance between UMIs at the same position as an quality control for the
deduplication process by comparing with a null expectation of random sampling. For the random
sampling, the observed frequency of UMIs is used to more reasonably model the null expectation.
Use this option to generate a stats outfiles called:
[PREFIX]_stats_edit_distance.tsv
Reports the (binned) average edit distance between the UMIs at each position.
In addition, this option will trigger reporting of further summary statistics for the UMIs which
may be informative for selecting the optimal deduplication method or debugging.
Each unique UMI sequence may be observed [0-many] times at multiple positions in the BAM. The
following files report the distribution for the frequencies of each UMI.
[PREFIX]_stats_per_umi_per_position.tsv
Tabulates the counts for unique combinations of UMI and position.
[PREFIX]_stats_per_umi_per.tsv
The _stats_per_umi_per.tsv table provides UMI-level summary statistics.
--extract-umi-method=<method> How are the barcodes encoded in the read?
Options are: read_id (default), tag, umis
--umi-separator=<separator> Separator between read id and UMI. See --extract-umi-method above. Default=_
--umi-tag=<tag> Tag which contains UMI. See --extract-umi-method above
--umi-tag-split=<split> Separate the UMI in tag by SPLIT and take the first element
--umi-tag-delimiter=<delimiter> Separate the UMI in by DELIMITER and concatenate the elements
--cell-tag=<tag> Tag which contains cell barcode. See --extract-umi-method above
--cell-tag-split=<split> Separate the cell barcode in tag by SPLIT and take the first element
--cell-tag-delimiter=<delimiter> Separate the cell barcode in by DELIMITER and concatenate the elements
--method=<method> What method to use to identify group of reads with the same (or similar) UMI(s)?
All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI.
The network-based methods, cluster, adjacency and directional, build networks where
nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1).
The groups of reads are then defined from the network in a method-specific manner.
For all the network-based methods, each read group is equivalent to one read count for the gene.
--edit-distance-threshold=<threshold> For the adjacency and cluster methods the threshold for the edit distance to connect
two UMIs in the network can be increased. The default value of 1 works best unless
the UMI is very long (>14bp).
--spliced-is-unique Causes two reads that start in the same position on the same strand and having the
same UMI to be considered unique if one is spliced and the other is not.
(Uses the 'N' cigar operation to test for splicing).
--soft-clip-threshold=<threshold> Mappers that soft clip will sometimes do so rather than mapping a spliced read if
there is only a small overhang over the exon junction. By setting this option, you
can treat reads with at least this many bases soft-clipped at the 3' end as spliced.
Default=4.
--multimapping-detection-method=<method> If the sam/bam contains tags to identify multimapping reads, you can specify
for use when selecting the best read at a given loci. Supported tags are "NH",
"X0" and "XT". If not specified, the read with the highest mapping quality will be selected.
--read-length Use the read length as a criteria when deduping, for e.g sRNA-Seq.
--per-gene Reads will be grouped together if they have the same gene. This is useful if your
library prep generates PCR duplicates with non identical alignment positions such as CEL-Seq.
Note this option is hardcoded to be on with the count command. I.e counting is always
performed per-gene. Must be combined with either --gene-tag or --per-contig option.
--gene-tag=<tag> Deduplicate per gene. The gene information is encoded in the bam read tag specified
--assigned-status-tag=<tag> BAM tag which describes whether a read is assigned to a gene. Defaults to the same value
as given for --gene-tag
--skip-tags-regex=<regex> Use in conjunction with the --assigned-status-tag option to skip any reads where the
tag matches this regex. Default ("^[__|Unassigned]") matches anything which starts with "__"
or "Unassigned":
--per-contig Deduplicate per contig (field 3 in BAM; RNAME). All reads with the same contig will be
considered to have the same alignment position. This is useful if you have aligned to a
reference transcriptome with one transcript per gene. If you have aligned to a transcriptome
with more than one transcript per gene, you can supply a map between transcripts and gene
using the --gene-transcript-map option
--gene-transcript-map=<file> File mapping genes to transcripts (tab separated)
--per-cell Reads will only be grouped together if they have the same cell barcode. Can be combined with --per-gene.
--mapping-quality=<quality> Minimium mapping quality (MAPQ) for a read to be retained. Default is 0.
--unmapped-reads=<option> How should unmapped reads be handled.
--chimeric-pairs=<option> How should chimeric read pairs be handled.
--unpaired-reads=<option> How should unpaired reads be handled.
--ignore-umi Ignore the UMI and group reads using mapping coordinates only
--subset=<fraction> Only consider a fraction of the reads, chosen at random. This is useful for doing saturation analyses.
--chrom=<chromosome> Only consider a single chromosome. This is useful for debugging/testing purposes
--in-sam Input is in SAM format
--out-sam Output is in SAM format
--paired BAM is paired end - output both read pairs. This will also force the use of the template
length to determine reads with the same mapping coordinates.
--no-sort-output By default, output is sorted. This involves the use of a temporary unsorted file since
reads are considered in the order of their start position which may not be the same as
their alignment coordinate due to soft-clipping and reverse alignments. The temp file
will be saved (in --temp-dir) and deleted when it has been sorted to the outfile. Use
this option to turn off sorting.
--buffer-whole-contig Forces dedup to parse an entire contig before yielding any reads for deduplication.
This is the only way to absolutely guarantee that all reads with the same start position
are grouped together for deduplication since dedup uses the start position of the read,
not the alignment coordinate on which the reads are

View File

@@ -0,0 +1,72 @@
#!/bin/bash
## VIASH START
## VIASH END
set -e
test_dir="${metal_executable}/test_data"
[[ "$par_paired" == "false" ]] && unset par_paired
[[ "$par_in_sam" == "false" ]] && unset par_in_sam
[[ "$par_out_sam" == "false" ]] && unset par_out_sam
[[ "$par_spliced_is_unique" == "false" ]] && unset par_spliced_is_unique
[[ "$par_per_gene" == "false" ]] && unset par_per_gene
[[ "$par_per_contig" == "false" ]] && unset par_per_contig
[[ "$par_per_cell" == "false" ]] && unset par_per_cell
[[ "$par_no_sort_output" == "false" ]] && unset par_no_sort_output
[[ "$par_buffer_whole_contig" == "false" ]] && unset par_buffer_whole_contig
[[ "$par_ignore_umi" == "false" ]] && unset par_ignore_umi
[[ "$par_subset" == "false" ]] && unset par_subset
[[ "$par_log2stderr" == "false" ]] && unset par_log2stderr
[[ "$par_read_length" == "false" ]] && unset par_read_length
umi_tools dedup \
--stdin "$par_input" \
${par_in_sam:+--in-sam} \
--stdout "$par_output" \
${par_out_sam:+--out-sam} \
${par_paired:+--paired} \
${par_output_stats:+--output-stats "$par_output_stats"} \
${par_extract_umi_method:+--extract-umi-method "$par_extract_umi_method"} \
${par_umi_tag:+--umi-tag "$par_umi_tag"} \
${par_umi_separator:+--umi-separator "$par_umi_separator"} \
${par_umi_tag_split:+--umi-tag-split "$par_umi_tag_split"} \
${par_umi_tag_delimiter:+--umi-tag-delimiter "$par_umi_tag_delimiter"} \
${par_cell_tag:+--cell-tag "$par_cell_tag"} \
${par_cell_tag_split:+--cell-tag-split "$par_cell_tag_split"} \
${par_cell_tag_delimiter:+--cell-tag-delimiter "$par_cell_tag_delimiter"} \
${par_method:+--method "$par_method"} \
${par_edit_distance_threshold:+--edit-distance-threshold "$par_edit_distance_threshold"} \
${par_spliced_is_unique:+--spliced-is-unique} \
${par_soft_clip_threshold:+--soft-clip-threshold "$par_soft_clip_threshold"} \
${par_multimapping_detection_method:+--multimapping-detection-method "$par_multimapping_detection_method"} \
${par_read_length:+--read-length} \
${par_per_gene:+--per-gene} \
${par_gene_tag:+--gene-tag "$par_gene_tag"} \
${par_assigned_status_tag:+--assigned-status-tag "$par_assigned_status_tag"} \
${par_skip_tags_regex:+--skip-tags-regex "$par_skip_tags_regex"} \
${par_per_contig:+--per-contig} \
${par_gene_transcript_map:+--gene-transcript-map "$par_gene_transcript_map"} \
${par_per_cell:+--per-cell} \
${par_mapping_quality:+--mapping-quality "$par_mapping_quality"} \
${par_unmapped_reads:+--unmapped-reads "$par_unmapped_reads"} \
${par_chimeric_pairs:+--chimeric-pairs "$par_chimeric_pairs"} \
${par_unpaired_reads:+--unpaired-reads "$par_unpaired_reads"} \
${par_ignore_umi:+--ignore-umi} \
${par_subset:+--subset "$par_subset"} \
${par_chrom:+--chrom "$par_chrom"} \
${par_no_sort_output:+--no-sort-output} \
${par_buffer_whole_contig:+--buffer-whole-contig} \
${par_log:+-L "$par_log"} \
${par_log2stderr:+--log2stderr} \
${par_verbose:+-v "$par_verbose"} \
${par_error:+-E "$par_error"} \
${par_temp_dir:+--temp-dir "$par_temp_dir"} \
${par_compresslevel:+--compresslevel "$par_compresslevel"} \
${par_timeit:+--timeit "$par_timeit"} \
${par_timeit_name:+--timeit-name "$par_timeit_name"} \
${par_timeit_header:+--timeit-header "$par_timeit_header"} \
${par_random_seed:+--random-seed "$par_random_seed"}
exit 0

View File

@@ -0,0 +1,87 @@
#!/bin/bash
test_dir="${meta_resources_dir}/test_data"
out_dir="${meta_resources_dir}/out"
mkdir -p "$out_dir"
############################################################################################
echo ">>> Test 1: Basic usage of $meta_functionality_name with statistics output"
"$meta_executable" \
--paired \
--input "$test_dir/sample.bam" \
--bai "$test_dir/sample.bam.bai" \
--output "$out_dir/deduped.sam" \
--out_sam \
--output_stats "$out_dir/dedup" \
--random_seed 1
echo ">>> Checking whether output exists"
[ ! -f "$out_dir/deduped.sam" ] && echo "File 'deduped.sam' does not exist!" && exit 1
[ ! -f "$out_dir/dedup_edit_distance.tsv" ] && echo "File 'dedup_edit_distance.tsv' does not exist!" && exit 1
echo ">>> Checking whether output is non-empty"
[ ! -s "$out_dir/deduped.sam" ] && echo "File 'deduped.sam' is empty!" && exit 1
[ ! -s "$out_dir/dedup_edit_distance.tsv" ] && echo "File 'dedup_edit_distance.tsv' is empty!" && exit 1
echo ">>> Checking whether output is correct"
diff "$out_dir/deduped.sam" "$test_dir/deduped.sam" || \
(echo "Output file deduped.sam does not match expected output" && exit 1)
diff "$out_dir/dedup_edit_distance.tsv" "$test_dir/dedup_edit_distance.tsv" || \
(echo "Output file dedup_edit_distance.tsv does not match expected output" && exit 1)
############################################################################################
echo ">>> Test 2: $meta_functionality_name with random subset selection"
"$meta_executable" \
--paired \
--input "$test_dir/sample.bam" \
--bai "$test_dir/sample.bam.bai" \
--output "$out_dir/deduped_fraction.sam" \
--out_sam \
--subset 0.5 \
--random_seed 1
echo ">>> Checking whether output exists"
[ ! -f "$out_dir/deduped_fraction.sam" ] && echo "File 'deduped_fraction.sam' does not exist!" && exit 1
echo ">>> Checking whether output is non-empty"
[ ! -s "$out_dir/deduped_fraction.sam" ] && echo "File 'deduped_fraction.sam' is empty!" && exit 1
echo ">>> Checking whether output is correct"
diff "$out_dir/deduped_fraction.sam" "$test_dir/deduped_fraction.sam" || \
(echo "Output file deduped_fraction.sam does not match expected output" && exit 1)
############################################################################################
echo ">>> Test 3: $meta_functionality_name with --method unique"
"$meta_executable" \
--paired \
--input "$test_dir/sample.bam" \
--bai "$test_dir/sample.bam.bai" \
--output "$out_dir/deduped_unique.sam" \
--out_sam \
--method "unique" \
--random_seed 1
echo ">>> Checking whether output exists"
[ ! -f "$out_dir/deduped_unique.sam" ] && echo "File 'deduped_unique.sam' does not exist!" && exit 1
echo ">>> Checking whether output is non-empty"
[ ! -s "$out_dir/deduped_unique.sam" ] && echo "File 'deduped_unique.sam' is empty!" && exit 1
echo ">>> Checking whether output is correct"
diff "$out_dir/deduped_unique.sam" "$test_dir/deduped_unique.sam" || \
(echo "Output file deduped_unique.sam does not match expected output" && exit 1)
############################################################################################
rm -rf "$out_dir"
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,5 @@
unique unique_null directional directional_null edit_distance
3 3 4 4 Single_UMI
0 1 0 0 0
1 0 0 0 1
0 0 0 0 2
1 unique unique_null directional directional_null edit_distance
2 3 3 4 4 Single_UMI
3 0 1 0 0 0
4 1 0 0 0 1
5 0 0 0 0 2

View File

@@ -0,0 +1,6 @@
UMI median_counts_pre times_observed_pre total_counts_pre median_counts_post times_observed_post total_counts_post
ACCGGTTTA 74 1 74 74 1 74
ACTGGTTTC 48 1 48 49 1 49
AGCGGTTAC 1 1 1 1 1 1
CCAGGTTCT 1 1 1 1 1 1
TCTGGTTTC 1 1 1 0 0 0
1 UMI median_counts_pre times_observed_pre total_counts_pre median_counts_post times_observed_post total_counts_post
2 ACCGGTTTA 74 1 74 74 1 74
3 ACTGGTTTC 48 1 48 49 1 49
4 AGCGGTTAC 1 1 1 1 1 1
5 CCAGGTTCT 1 1 1 1 1 1
6 TCTGGTTTC 1 1 1 0 0 0

Some files were not shown because too many files have changed in this diff Show More