Build branch main with version main (da414e7)
Build pipeline: viash-hub.biobox.main-7dwhr
Source commit: da414e72c6
Source message: Add star solo component (#62)
* add star solo component
* change arguments from camelCase to snake_case
* get rid of multiple_sep
* drop star_solo component and just add arguments to star_align_reads
* Update src/star/star_align_reads/script.py
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
---------
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
This commit is contained in:
1
.gitignore
vendored
1
.gitignore
vendored
@@ -3,6 +3,7 @@
|
||||
|
||||
# IDE ignores
|
||||
.idea/
|
||||
.vscode/
|
||||
|
||||
# R specific ignores
|
||||
.Rhistory
|
||||
|
||||
75
CHANGELOG.md
75
CHANGELOG.md
@@ -1,25 +1,60 @@
|
||||
# biobox x.x.x
|
||||
|
||||
## BUG FIXES
|
||||
|
||||
* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
|
||||
|
||||
* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
|
||||
|
||||
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
|
||||
|
||||
## MINOR CHANGES
|
||||
|
||||
* `busco` components: update BUSCO to `5.7.1`.
|
||||
|
||||
# biobox 0.1.0
|
||||
|
||||
## BREAKING CHANGES
|
||||
|
||||
* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
|
||||
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
|
||||
splitting up certain file paths.
|
||||
* `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
|
||||
|
||||
* `star/star_genome_generate`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
|
||||
|
||||
## NEW FUNCTIONALITY
|
||||
|
||||
* `star/star_align_reads`: Add star solo related arguments (PR #62).
|
||||
|
||||
* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).
|
||||
|
||||
* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
|
||||
|
||||
* `seqtk`:
|
||||
- `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
|
||||
- `seqtk/seqtk_subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
|
||||
based on a provided sequence IDs or region coordinates file (PR #85).
|
||||
|
||||
* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
|
||||
|
||||
## MINOR CHANGES
|
||||
|
||||
* `busco` components: update BUSCO to `5.7.1` (PR #72).
|
||||
|
||||
* Update CI to reusable workflow in `viash-io/viash-actions` (PR #86).
|
||||
|
||||
## DOCUMENTATION
|
||||
|
||||
* Extend the contributing guidelines (PR #82):
|
||||
|
||||
- Update format to Viash 0.9.
|
||||
|
||||
- Descriptions should be formatted in markdown.
|
||||
|
||||
- Add defaults to descriptions, not as a default of the argument.
|
||||
|
||||
- Explain parameter expansion.
|
||||
|
||||
- Mention that the contents of the output of components in tests should be checked.
|
||||
|
||||
* Add authorship to existing components (PR #88).
|
||||
|
||||
## BUG FIXES
|
||||
|
||||
* `pear`: fix component not exiting with the correct exitcode when PEAR fails (PR #70).
|
||||
|
||||
* `cutadapt`: fix `--par_quality_cutoff_r2` argument (PR #69).
|
||||
|
||||
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode` (PR #69).
|
||||
|
||||
* `multiqc`: update multiple separator to `;` (PR #81).
|
||||
|
||||
|
||||
# biobox 0.1.0
|
||||
|
||||
## NEW FEATURES
|
||||
|
||||
@@ -66,6 +101,10 @@
|
||||
- `samtools/samtools_collate`: Shuffles and groups reads in SAM/BAM/CRAM files together by their names (PR #42).
|
||||
- `samtools/samtools_view`: Views and converts SAM/BAM/CRAM files (PR #48).
|
||||
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
|
||||
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).
|
||||
|
||||
* `umi_tools`:
|
||||
-`umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
|
||||
|
||||
* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).
|
||||
|
||||
@@ -93,4 +132,4 @@
|
||||
|
||||
* Add escaping character before leading hashtag in the description field of the config file (PR #50).
|
||||
|
||||
* Format URL in biobase/bcl_convert description (PR #55).
|
||||
* Format URL in biobase/bcl_convert description (PR #55).
|
||||
|
||||
157
CONTRIBUTING.md
157
CONTRIBUTING.md
@@ -65,22 +65,21 @@ runners:
|
||||
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
|
||||
|
||||
```yaml
|
||||
functionality:
|
||||
name: arriba
|
||||
description: Detect gene fusions from RNA-Seq data
|
||||
keywords: [Gene fusion, RNA-Seq]
|
||||
links:
|
||||
homepage: https://arriba.readthedocs.io/en/latest/
|
||||
documentation: https://arriba.readthedocs.io/en/latest/
|
||||
repository: https://github.com/suhrig/arriba
|
||||
issue_tracker: https://github.com/suhrig/arriba/issues
|
||||
references:
|
||||
doi: 10.1101/gr.257246.119
|
||||
bibtex: |
|
||||
@article{
|
||||
... a bibtex entry in case the doi is not available ...
|
||||
}
|
||||
license: MIT
|
||||
name: arriba
|
||||
description: Detect gene fusions from RNA-Seq data
|
||||
keywords: [Gene fusion, RNA-Seq]
|
||||
links:
|
||||
homepage: https://arriba.readthedocs.io/en/latest/
|
||||
documentation: https://arriba.readthedocs.io/en/latest/
|
||||
repository: https://github.com/suhrig/arriba
|
||||
issue_tracker: https://github.com/suhrig/arriba/issues
|
||||
references:
|
||||
doi: 10.1101/gr.257246.119
|
||||
bibtex: |
|
||||
@article{
|
||||
... a bibtex entry in case the doi is not available ...
|
||||
}
|
||||
license: MIT
|
||||
```
|
||||
|
||||
### Step 4: Find a suitable container
|
||||
@@ -162,7 +161,7 @@ argument_groups:
|
||||
type: file
|
||||
description: |
|
||||
File in SAM/BAM/CRAM format with main alignments as generated by STAR
|
||||
(Aligned.out.sam). Arriba extracts candidate reads from this file.
|
||||
(`Aligned.out.sam`). Arriba extracts candidate reads from this file.
|
||||
required: true
|
||||
example: Aligned.out.bam
|
||||
```
|
||||
@@ -175,7 +174,7 @@ Several notes:
|
||||
|
||||
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
|
||||
|
||||
|
||||
* The description should be formatted in markdown.
|
||||
|
||||
### Step 8: Add arguments for the output files
|
||||
|
||||
@@ -220,7 +219,7 @@ argument_groups:
|
||||
|
||||
Note:
|
||||
|
||||
* Preferably, these outputs should not be directores but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
|
||||
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
|
||||
|
||||
### Step 9: Add arguments for the other arguments
|
||||
|
||||
@@ -230,6 +229,8 @@ Finally, add all other arguments to the config file. There are a few exceptions:
|
||||
|
||||
* Arguments related to printing the information such as printing the version (`-v`, `--version`) or printing the help (`-h`, `--help`) should not be added to the config file.
|
||||
|
||||
* If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`
|
||||
|
||||
|
||||
### Step 10: Add a Docker engine
|
||||
|
||||
@@ -275,10 +276,13 @@ Next, we need to write a runner script that runs the tool with the input argumen
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# unset flags
|
||||
[[ "$par_option" == "false" ]] && unset par_option
|
||||
|
||||
xxx \
|
||||
--input "$par_input" \
|
||||
--output "$par_output" \
|
||||
$([ "$par_option" = "true" ] && echo "--option")
|
||||
${par_option:+--option}
|
||||
```
|
||||
|
||||
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
|
||||
@@ -291,6 +295,11 @@ As an example, this is what the Bash script for the `arriba` component looks lik
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
# unset flags
|
||||
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
|
||||
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
|
||||
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
|
||||
|
||||
arriba \
|
||||
-x "$par_bam" \
|
||||
-a "$par_genome" \
|
||||
@@ -298,26 +307,30 @@ arriba \
|
||||
-o "$par_fusions" \
|
||||
${par_known_fusions:+-k "${par_known_fusions}"} \
|
||||
${par_blacklist:+-b "${par_blacklist}"} \
|
||||
${par_structural_variants:+-d "${par_structural_variants}"} \
|
||||
$([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
|
||||
$([ "$par_extra_information" = "true" ] && echo "-X") \
|
||||
$([ "$par_fill_gaps" = "true" ] && echo "-I")
|
||||
# ...
|
||||
${par_extra_information:+-X} \
|
||||
${par_fill_gaps:+-I}
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
* If your arguments can contain special variables (e.g. `$`), you can use quoting (need to find a documentation page for this) to make sure you can use the string as input. Example: `-x ${par_bam@Q}`.
|
||||
|
||||
* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`
|
||||
|
||||
* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
|
||||
|
||||
|
||||
### Step 12: Create test script
|
||||
|
||||
|
||||
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
|
||||
|
||||
```yaml
|
||||
functionality:
|
||||
# ...
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
```
|
||||
|
||||
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
|
||||
@@ -325,48 +338,64 @@ Create a test script at `src/xxx/test.sh` that runs the component with the test
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
echo "> Run xxx with test data"
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_doesnt_exist() {
|
||||
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_empty() {
|
||||
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains() {
|
||||
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_contains_regex() {
|
||||
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains_regex() {
|
||||
grep -q -E "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
#############################################
|
||||
|
||||
echo "> Run $meta_name with test data"
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/input.txt" \
|
||||
--input "$meta_resources_dir/test_data/reads_R1.fastq" \
|
||||
--output "output.txt" \
|
||||
--option
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "output.txt" ] && echo "Output file output.txt does not exist" && exit 1
|
||||
```
|
||||
|
||||
|
||||
For example, this is what the test script for the `arriba` component looks like:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
echo "> Run arriba with blacklist"
|
||||
"$meta_executable" \
|
||||
--bam "$meta_resources_dir/test_data/A.bam" \
|
||||
--genome "$meta_resources_dir/test_data/genome.fasta" \
|
||||
--gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
|
||||
--blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
|
||||
--fusions "fusions.tsv" \
|
||||
--fusions_discarded "fusions_discarded.tsv" \
|
||||
--interesting_contigs "1,2"
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
|
||||
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
|
||||
echo ">> Check if output exists"
|
||||
assert_file_exists "output.txt"
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
|
||||
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
|
||||
assert_file_not_empty "output.txt"
|
||||
|
||||
echo ">> Check if output is correct"
|
||||
assert_file_contains "output.txt" "some expected output"
|
||||
|
||||
echo "> All tests succeeded!"
|
||||
```
|
||||
|
||||
### Step 12: Create a `/var/software_versions.txt` file
|
||||
Notes:
|
||||
|
||||
* Do always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.
|
||||
|
||||
* If possible, generate your own test data instead of copying it from an external resource.
|
||||
|
||||
### Step 13: Create a `/var/software_versions.txt` file
|
||||
|
||||
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
|
||||
|
||||
@@ -378,6 +407,8 @@ engines:
|
||||
image: quay.io/biocontainers/xxx:0.1.0--py_0
|
||||
setup:
|
||||
- type: docker
|
||||
# note: /var/software_versions.txt should contain:
|
||||
# arriba: "2.4.0"
|
||||
run: |
|
||||
echo "xxx: \"0.1.0\"" > /var/software_versions.txt
|
||||
```
|
||||
|
||||
14
src/_authors/angela_o_pisco.yaml
Normal file
14
src/_authors/angela_o_pisco.yaml
Normal file
@@ -0,0 +1,14 @@
|
||||
name: Angela Oliveira Pisco
|
||||
info:
|
||||
role: Contributor
|
||||
links:
|
||||
github: aopisco
|
||||
orcid: "0000-0003-0142-2355"
|
||||
linkedin: aopisco
|
||||
organizations:
|
||||
- name: Insitro
|
||||
href: https://insitro.com
|
||||
role: Director of Computational Biology
|
||||
- name: Open Problems
|
||||
href: https://openproblems.bio
|
||||
role: Core Member
|
||||
10
src/_authors/dorien_roosen.yaml
Normal file
10
src/_authors/dorien_roosen.yaml
Normal file
@@ -0,0 +1,10 @@
|
||||
name: Dorien Roosen
|
||||
info:
|
||||
links:
|
||||
email: dorien@data-intuitive.com
|
||||
github: dorien-er
|
||||
linkedin: dorien-roosen
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Data Scientist
|
||||
11
src/_authors/dries_schaumont.yaml
Normal file
11
src/_authors/dries_schaumont.yaml
Normal file
@@ -0,0 +1,11 @@
|
||||
name: Dries Schaumont
|
||||
info:
|
||||
links:
|
||||
email: dries@data-intuitive.com
|
||||
github: DriesSchaumont
|
||||
orcid: "0000-0002-4389-0440"
|
||||
linkedin: dries-schaumont
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Data Scientist
|
||||
10
src/_authors/emma_rousseau.yaml
Normal file
10
src/_authors/emma_rousseau.yaml
Normal file
@@ -0,0 +1,10 @@
|
||||
name: Emma Rousseau
|
||||
info:
|
||||
links:
|
||||
email: emma@data-intuitive.com
|
||||
github: emmarousseau
|
||||
linkedin: emmarousseau1
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Bioinformatician
|
||||
10
src/_authors/jakub_majercik.yaml
Normal file
10
src/_authors/jakub_majercik.yaml
Normal file
@@ -0,0 +1,10 @@
|
||||
name: Jakub Majercik
|
||||
info:
|
||||
links:
|
||||
email: jakub@data-intuitive.com
|
||||
github: jakubmajercik
|
||||
linkedin: jakubmajercik
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Bioinformatics Engineer
|
||||
14
src/_authors/kai_waldrant.yaml
Normal file
14
src/_authors/kai_waldrant.yaml
Normal file
@@ -0,0 +1,14 @@
|
||||
name: Kai Waldrant
|
||||
info:
|
||||
links:
|
||||
email: kai@data-intuitive.com
|
||||
github: KaiWaldrant
|
||||
orcid: "0009-0003-8555-1361"
|
||||
linkedin: kaiwaldrant
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Bioinformatician
|
||||
- name: Open Problems
|
||||
href: https://openproblems.bio
|
||||
role: Contributor
|
||||
10
src/_authors/leila_paquay.yaml
Normal file
10
src/_authors/leila_paquay.yaml
Normal file
@@ -0,0 +1,10 @@
|
||||
name: Leïla Paquay
|
||||
info:
|
||||
links:
|
||||
email: leila@data-intuitive.com
|
||||
github: Leila011
|
||||
linkedin: leilapaquay
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Software Developer
|
||||
14
src/_authors/robrecht_cannoodt.yaml
Normal file
14
src/_authors/robrecht_cannoodt.yaml
Normal file
@@ -0,0 +1,14 @@
|
||||
name: Robrecht Cannoodt
|
||||
info:
|
||||
links:
|
||||
email: robrecht@data-intuitive.com
|
||||
github: rcannood
|
||||
orcid: "0000-0003-3641-729X"
|
||||
linkedin: robrechtcannoodt
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Data Science Engineer
|
||||
- name: Open Problems
|
||||
href: https://openproblems.bio
|
||||
role: Core Member
|
||||
10
src/_authors/sai_nirmayi_yasa.yaml
Normal file
10
src/_authors/sai_nirmayi_yasa.yaml
Normal file
@@ -0,0 +1,10 @@
|
||||
name: Sai Nirmayi Yasa
|
||||
info:
|
||||
links:
|
||||
email: nirmayi@data-intuitive.com
|
||||
github: sainirmayi
|
||||
linkedin: sai-nirmayi-yasa
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Junior Bioinformatics Researcher
|
||||
10
src/_authors/theodoro_gasperin.yaml
Normal file
10
src/_authors/theodoro_gasperin.yaml
Normal file
@@ -0,0 +1,10 @@
|
||||
name: Theodoro Gasperin Terra Camargo
|
||||
info:
|
||||
links:
|
||||
email: theodorogtc@gmail.com
|
||||
github: tgaspe
|
||||
linkedin: theodoro-gasperin-terra-camargo
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Bioinformatician
|
||||
9
src/_authors/toni_verbeiren.yaml
Normal file
9
src/_authors/toni_verbeiren.yaml
Normal file
@@ -0,0 +1,9 @@
|
||||
name: Toni Verbeiren
|
||||
info:
|
||||
links:
|
||||
github: tverbeiren
|
||||
linkedin: verbeiren
|
||||
organizations:
|
||||
- name: Data Intuitive
|
||||
href: https://www.data-intuitive.com
|
||||
role: Data Scientist and CEO
|
||||
5
src/_authors/weiwei_schultz.yaml
Normal file
5
src/_authors/weiwei_schultz.yaml
Normal file
@@ -0,0 +1,5 @@
|
||||
name: Weiwei Schultz
|
||||
info:
|
||||
organizations:
|
||||
- name: Janssen R&D US
|
||||
role: Associate Director Data Sciences
|
||||
94
src/agat/agat_convert_sp_gff2gtf/config.vsh.yaml
Normal file
94
src/agat/agat_convert_sp_gff2gtf/config.vsh.yaml
Normal file
@@ -0,0 +1,94 @@
|
||||
name: agat_convert_sp_gff2gtf
|
||||
namespace: agat
|
||||
description: |
|
||||
The script aims to convert any GTF/GFF file into a proper GTF file. Full
|
||||
information about the format can be found here:
|
||||
https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
|
||||
different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending the
|
||||
version selected the script will filter out the features that are not
|
||||
accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
|
||||
pseudogene) will be converted into gene feature and every level2 feature
|
||||
(e.g mRNA ncRNA) will be converted into transcript feature. Using the
|
||||
"relax" option you will produce a GTF-like output keeping all original
|
||||
feature types (3rd column). No modification will occur e.g. mRNA to
|
||||
transcript.
|
||||
|
||||
To be fully GTF compliant all feature have a gene_id and a transcript_id
|
||||
attribute. The gene_id is unique identifier for the genomic source of
|
||||
the transcript, which is used to group transcripts into genes. The
|
||||
transcript_id is a unique identifier for the predicted transcript, which
|
||||
is used to group features into transcripts.
|
||||
keywords: [gene annotations, GTF conversion]
|
||||
links:
|
||||
homepage: https://github.com/NBISweden/AGAT
|
||||
documentation: https://agat.readthedocs.io/
|
||||
issue_tracker: https://github.com/NBISweden/AGAT/issues
|
||||
repository: https://github.com/NBISweden/AGAT
|
||||
references:
|
||||
doi: 10.5281/zenodo.3552717
|
||||
license: GPL-3.0
|
||||
authors:
|
||||
- __merge__: /src/_authors/leila_paquay.yaml
|
||||
roles: [ author, maintainer ]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --gff
|
||||
alternatives: [-i]
|
||||
description: Input GFF/GTF file that will be read
|
||||
type: file
|
||||
required: true
|
||||
direction: input
|
||||
example: input.gff
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: [-o, --out, --outfile, --gtf]
|
||||
description: Output GTF file. If no output file is specified, the output will be written to STDOUT.
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
example: output.gtf
|
||||
- name: Arguments
|
||||
arguments:
|
||||
- name: --gtf_version
|
||||
description: |
|
||||
Version of the GTF output (1,2,2.1,2.2,2.5,3 or relax). Default value from AGAT config file (relax for the default config). The script option has the higher priority.
|
||||
|
||||
* relax: all feature types are accepted.
|
||||
* GTF3 (9 feature types accepted): gene, transcript, exon, CDS, Selenocysteine, start_codon, stop_codon, three_prime_utr and five_prime_utr.
|
||||
* GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS, UTR, start_codon, stop_codon, Selenocysteine.
|
||||
* GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon, 5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon.
|
||||
* GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon, exon, 5UTR, 3UTR.
|
||||
* GTF2 (4 feature types accepted): CDS, start_codon, stop_codon, exon.
|
||||
* GTF1 (5 feature types accepted): CDS, start_codon, stop_codon, exon, intron.
|
||||
type: string
|
||||
choices: [relax, "1", "2", "2.1", "2.2", "2.5", "3"]
|
||||
required: false
|
||||
example: "3"
|
||||
- name: --config
|
||||
alternatives: [-c]
|
||||
description: |
|
||||
Input agat config file. By default AGAT takes as input agat_config.yaml file from the working directory if any, otherwise it takes the orignal agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose". The --config option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
|
||||
type: file
|
||||
required: false
|
||||
example: custom_agat_config.yaml
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
102
src/agat/agat_convert_sp_gff2gtf/help.txt
Normal file
102
src/agat/agat_convert_sp_gff2gtf/help.txt
Normal file
@@ -0,0 +1,102 @@
|
||||
```sh
|
||||
agat_convert_sp_gff2gtf.pl --help
|
||||
```
|
||||
------------------------------------------------------------------------------
|
||||
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
|
||||
| https://github.com/NBISweden/AGAT |
|
||||
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
Name:
|
||||
agat_convert_sp_gff2gtf.pl
|
||||
|
||||
Description:
|
||||
The script aims to convert any GTF/GFF file into a proper GTF file. Full
|
||||
information about the format can be found here:
|
||||
https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
|
||||
different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending the
|
||||
version selected the script will filter out the features that are not
|
||||
accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
|
||||
pseudogene) will be converted into gene feature and every level2 feature
|
||||
(e.g mRNA ncRNA) will be converted into transcript feature. Using the
|
||||
"relax" option you will produce a GTF-like output keeping all original
|
||||
feature types (3rd column). No modification will occur e.g. mRNA to
|
||||
transcript.
|
||||
|
||||
To be fully GTF compliant all feature have a gene_id and a transcript_id
|
||||
attribute. The gene_id is unique identifier for the genomic source of
|
||||
the transcript, which is used to group transcripts into genes. The
|
||||
transcript_id is a unique identifier for the predicted transcript, which
|
||||
is used to group features into transcripts.
|
||||
|
||||
Usage:
|
||||
agat_convert_sp_gff2gtf.pl --gff infile.gff [ -o outfile ]
|
||||
agat_convert_sp_gff2gtf -h
|
||||
|
||||
Options:
|
||||
--gff, --gtf or -i
|
||||
Input GFF/GTF file that will be read
|
||||
|
||||
--gtf_version version of the GTF output (1,2,2.1,2.2,2.5,3 or relax).
|
||||
Default value from AGAT config file (relax for the default config). The
|
||||
script option has the higher priority.
|
||||
relax: all feature types are accepted.
|
||||
|
||||
GTF3 (9 feature types accepted): gene, transcript, exon, CDS,
|
||||
Selenocysteine, start_codon, stop_codon, three_prime_utr and
|
||||
five_prime_utr
|
||||
|
||||
GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS,
|
||||
UTR, start_codon, stop_codon, Selenocysteine
|
||||
|
||||
GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon,
|
||||
5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon
|
||||
|
||||
GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon,
|
||||
exon, 5UTR, 3UTR
|
||||
|
||||
GTF2 (4 feature types accepted): CDS, start_codon, stop_codon,
|
||||
exon
|
||||
|
||||
GTF1 (5 feature types accepted): CDS, start_codon, stop_codon,
|
||||
exon, intron
|
||||
|
||||
-o , --output , --out , --outfile or --gtf
|
||||
Output GTF file. If no output file is specified, the output will
|
||||
be written to STDOUT.
|
||||
|
||||
-c or --config
|
||||
String - Input agat config file. By default AGAT takes as input
|
||||
agat_config.yaml file from the working directory if any,
|
||||
otherwise it takes the orignal agat_config.yaml shipped with
|
||||
AGAT. To get the agat_config.yaml locally type: "agat config
|
||||
--expose". The --config option gives you the possibility to use
|
||||
your own AGAT config file (located elsewhere or named
|
||||
differently).
|
||||
|
||||
-h or --help
|
||||
Display this helpful text.
|
||||
|
||||
Feedback:
|
||||
Did you find a bug?:
|
||||
Do not hesitate to report bugs to help us keep track of the bugs and
|
||||
their resolution. Please use the GitHub issue tracking system available
|
||||
at this address:
|
||||
|
||||
https://github.com/NBISweden/AGAT/issues
|
||||
|
||||
Ensure that the bug was not already reported by searching under Issues.
|
||||
If you're unable to find an (open) issue addressing the problem, open a new one.
|
||||
Try as much as possible to include in the issue when relevant:
|
||||
- a clear description,
|
||||
- as much relevant information as possible,
|
||||
- the command used,
|
||||
- a data sample,
|
||||
- an explanation of the expected behaviour that is not occurring.
|
||||
|
||||
Do you want to contribute?:
|
||||
You are very welcome, visit this address for the Contributing
|
||||
guidelines:
|
||||
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
|
||||
|
||||
10
src/agat/agat_convert_sp_gff2gtf/script.sh
Normal file
10
src/agat/agat_convert_sp_gff2gtf/script.sh
Normal file
@@ -0,0 +1,10 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
agat_convert_sp_gff2gtf.pl \
|
||||
-i "$par_gff" \
|
||||
-o "$par_output" \
|
||||
${par_gtf_version:+--gtf_version "${par_gtf_version}"} \
|
||||
${par_config:+--config "${par_config}"}
|
||||
37
src/agat/agat_convert_sp_gff2gtf/test.sh
Normal file
37
src/agat/agat_convert_sp_gff2gtf/test.sh
Normal file
@@ -0,0 +1,37 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
test_dir="${meta_resources_dir}/test_data"
|
||||
|
||||
echo "> Run $meta_name with test data"
|
||||
"$meta_executable" \
|
||||
--gff "$test_dir/0_test.gff" \
|
||||
--output "output.gtf"
|
||||
|
||||
echo ">> Checking output"
|
||||
[ ! -f "output.gtf" ] && echo "Output file output.gtf does not exist" && exit 1
|
||||
|
||||
echo ">> Check if output is empty"
|
||||
[ ! -s "output.gtf" ] && echo "Output file output.gtf is empty" && exit 1
|
||||
|
||||
echo ">> Check if the conversion resulted in the right GTF format"
|
||||
idGFF=$(head -n 2 "$test_dir/0_test.gff" | grep -o 'ID=[^;]*' | cut -d '=' -f 2-)
|
||||
expectedGTF="gene_id \"$idGFF\"; ID \"$idGFF\";"
|
||||
extractedGTF=$(head -n 3 "output.gtf" | grep -o 'gene_id "[^"]*"; ID "[^"]*";')
|
||||
[ "$extractedGTF" != "$expectedGTF" ] && echo "Output file output.gtf does not have the right format" && exit 1
|
||||
|
||||
rm output.gtf
|
||||
|
||||
echo "> Run $meta_name with test data and GTF version 2.5"
|
||||
"$meta_executable" \
|
||||
--gff "$test_dir/0_test.gff" \
|
||||
--output "output.gtf" \
|
||||
--gtf_version "2.5"
|
||||
|
||||
echo ">> Check if the output file header display the right GTF version"
|
||||
grep -q "##gtf-version 2.5" "output.gtf"
|
||||
[ $? -ne 0 ] && echo "Output file output.gtf header does not display the right GTF version" && exit 1
|
||||
|
||||
echo "> Test successful"
|
||||
36
src/agat/agat_convert_sp_gff2gtf/test_data/0_test.gff
Normal file
36
src/agat/agat_convert_sp_gff2gtf/test_data/0_test.gff
Normal file
@@ -0,0 +1,36 @@
|
||||
##gff-version 3
|
||||
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
|
||||
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
|
||||
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon:1404;Parent=CLUHART00000008717
|
||||
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon:1405;Parent=CLUHART00000008717
|
||||
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon:1406;Parent=CLUHART00000008717
|
||||
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon:1407;Parent=CLUHART00000008717
|
||||
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
|
||||
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
|
||||
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
|
||||
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
|
||||
scaffold625 maker five_prime_UTR 337818 337914 . + . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
|
||||
scaffold625 maker three_prime_UTR 343034 343277 . + . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
|
||||
scaffold789 maker gene 558184 564780 . + . ID=CLUHARG00000003852;Name=PF11_0240
|
||||
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006146;Parent=CLUHARG00000003852
|
||||
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006146:exon:995;Parent=CLUHART00000006146
|
||||
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006146:exon:996;Parent=CLUHART00000006146
|
||||
scaffold789 maker exon 564171 564235 . + . ID=CLUHART00000006146:exon:997;Parent=CLUHART00000006146
|
||||
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006146:exon:998;Parent=CLUHART00000006146
|
||||
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
|
||||
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
|
||||
scaffold789 maker CDS 564171 564235 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
|
||||
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
|
||||
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006146:five_prime_utr;Parent=CLUHART00000006146
|
||||
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006146:three_prime_utr;Parent=CLUHART00000006146
|
||||
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006147;Parent=CLUHARG00000003852
|
||||
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006147:exon:997;Parent=CLUHART00000006147
|
||||
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006147:exon:998;Parent=CLUHART00000006147
|
||||
scaffold789 maker exon 562057 562121 . + . ID=CLUHART00000006147:exon:999;Parent=CLUHART00000006147
|
||||
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006147:exon:1000;Parent=CLUHART00000006147
|
||||
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
|
||||
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
|
||||
scaffold789 maker CDS 562057 562121 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
|
||||
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
|
||||
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006147:five_prime_utr;Parent=CLUHART00000006147
|
||||
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006147:three_prime_utr;Parent=CLUHART00000006147
|
||||
9
src/agat/agat_convert_sp_gff2gtf/test_data/script.sh
Executable file
9
src/agat/agat_convert_sp_gff2gtf/test_data/script.sh
Executable file
@@ -0,0 +1,9 @@
|
||||
#!/bin/bash
|
||||
|
||||
# clone repo
|
||||
if [ ! -d /tmp/agat_source ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/NBISweden/AGAT /tmp/agat_source
|
||||
fi
|
||||
|
||||
# copy test data
|
||||
cp -r /tmp/agat_source/t/gff_syntax/in/0_test.gff src/agat/agat_convert_sp_gff2gtf/test_data
|
||||
@@ -11,6 +11,9 @@ license: MIT
|
||||
requirements:
|
||||
cpus: 1
|
||||
commands: [ arriba ]
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -4,6 +4,17 @@ description: |
|
||||
Information about upgrading from bcl2fastq via
|
||||
[Upgrading from bcl2fastq to BCL Convert](https://emea.support.illumina.com/bulletins/2020/10/upgrading-from-bcl2fastq-to-bcl-convert.html)
|
||||
and [BCL Convert Compatible Products](https://support.illumina.com/sequencing/sequencing_software/bcl-convert/compatibility.html)
|
||||
keywords: [demultiplex, fastq, bcl, illumina]
|
||||
links:
|
||||
homepage: https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html
|
||||
documentation: https://support.illumina.com/downloads/bcl-convert-user-guide.html
|
||||
license: Proprietary
|
||||
authors:
|
||||
- __merge__: /src/_authors/toni_verbeiren.yaml
|
||||
roles: [ author, maintainer ]
|
||||
- __merge__: /src/_authors/dorien_roosen.yaml
|
||||
roles: [ author ]
|
||||
|
||||
argument_groups:
|
||||
- name: Input arguments
|
||||
arguments:
|
||||
|
||||
143
src/bd_rhapsody/bd_rhapsody_make_reference/config.vsh.yaml
Normal file
143
src/bd_rhapsody/bd_rhapsody_make_reference/config.vsh.yaml
Normal file
@@ -0,0 +1,143 @@
|
||||
name: bd_rhapsody_make_reference
|
||||
namespace: bd_rhapsody
|
||||
description: |
|
||||
The Reference Files Generator creates an archive containing Genome Index
|
||||
and Transcriptome annotation files needed for the BD Rhapsody Sequencing
|
||||
Analysis Pipeline. The app takes as input one or more FASTA and GTF files
|
||||
and produces a compressed archive in the form of a tar.gz file. The
|
||||
archive contains:
|
||||
|
||||
- STAR index
|
||||
- Filtered GTF file
|
||||
keywords: [genome, reference, index, align]
|
||||
links:
|
||||
repository: https://bitbucket.org/CRSwDev/cwl/src/master/v2.2.1/Extra_Utilities/
|
||||
documentation: https://bd-rhapsody-bioinfo-docs.genomics.bd.com/resources/extra_utilities.html#make-rhapsody-reference
|
||||
license: Unknown
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [ author, maintainer ]
|
||||
- __merge__: /src/_authors/weiwei_schultz.yaml
|
||||
roles: [ contributor ]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- type: file
|
||||
name: --genome_fasta
|
||||
required: true
|
||||
description: Reference genome file in FASTA or FASTA.GZ format. The BD Rhapsody Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
|
||||
example: genome_sequence.fa.gz
|
||||
multiple: true
|
||||
info:
|
||||
config_key: Genome_fasta
|
||||
- type: file
|
||||
name: --gtf
|
||||
required: true
|
||||
description: |
|
||||
File path to the transcript annotation files in GTF or GTF.GZ format. The Sequence Analysis Pipeline requires the 'gene_name' or
|
||||
'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons
|
||||
must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or
|
||||
'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.
|
||||
example: transcriptome_annotation.gtf.gz
|
||||
multiple: true
|
||||
info:
|
||||
config_key: Gtf
|
||||
- type: file
|
||||
name: --extra_sequences
|
||||
description: |
|
||||
File path to additional sequences in FASTA format to use when building the STAR index. (e.g. transgenes or CRISPR guide barcodes).
|
||||
GTF lines for these sequences will be automatically generated and combined with the main GTF.
|
||||
required: false
|
||||
multiple: true
|
||||
info:
|
||||
config_key: Extra_sequences
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- type: file
|
||||
name: --reference_archive
|
||||
direction: output
|
||||
required: true
|
||||
description: |
|
||||
A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an
|
||||
input in the BD Rhapsody Sequencing Analysis Pipeline.
|
||||
example: star_index.tar.gz
|
||||
- name: Arguments
|
||||
arguments:
|
||||
- type: string
|
||||
name: --mitochondrial_contigs
|
||||
description: |
|
||||
Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are
|
||||
identified as 'nuclear fragments' in the ATACseq analysis pipeline.
|
||||
required: false
|
||||
multiple: true
|
||||
default: [chrM, chrMT, M, MT]
|
||||
info:
|
||||
config_key: Mitochondrial_contigs
|
||||
- type: boolean_true
|
||||
name: --filtering_off
|
||||
description: |
|
||||
By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features
|
||||
having the following attribute values are kept:
|
||||
|
||||
- protein_coding
|
||||
- lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
|
||||
- IG_LV_gene
|
||||
- IG_V_gene
|
||||
- IG_V_pseudogene
|
||||
- IG_D_gene
|
||||
- IG_J_gene
|
||||
- IG_J_pseudogene
|
||||
- IG_C_gene
|
||||
- IG_C_pseudogene
|
||||
- TR_V_gene
|
||||
- TR_V_pseudogene
|
||||
- TR_D_gene
|
||||
- TR_J_gene
|
||||
- TR_J_pseudogene
|
||||
- TR_C_gene
|
||||
|
||||
If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
|
||||
info:
|
||||
config_key: Filtering_off
|
||||
- type: boolean_true
|
||||
name: --wta_only_index
|
||||
description: Build a WTA only index, otherwise builds a WTA + ATAC index.
|
||||
info:
|
||||
config_key: Wta_Only
|
||||
- type: string
|
||||
name: --extra_star_params
|
||||
description: Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
|
||||
example: --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
|
||||
required: false
|
||||
info:
|
||||
config_key: Extra_STAR_params
|
||||
|
||||
resources:
|
||||
- type: python_script
|
||||
path: script.py
|
||||
- path: make_rhap_reference_2.2.1_nodocker.cwl
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: test_data
|
||||
|
||||
requirements:
|
||||
commands: [ "cwl-runner" ]
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: bdgenomics/rhapsody:2.2.1
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [procps]
|
||||
- type: python
|
||||
packages: [cwlref-runner, cwl-runner]
|
||||
- type: docker
|
||||
run: |
|
||||
echo "bdgenomics/rhapsody: 2.2.1" > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
66
src/bd_rhapsody/bd_rhapsody_make_reference/help.txt
Normal file
66
src/bd_rhapsody/bd_rhapsody_make_reference/help.txt
Normal file
@@ -0,0 +1,66 @@
|
||||
```bash
|
||||
cwl-runner src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl --help
|
||||
```
|
||||
|
||||
usage: src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
|
||||
[-h] [--Archive_prefix ARCHIVE_PREFIX]
|
||||
[--Extra_STAR_params EXTRA_STAR_PARAMS]
|
||||
[--Extra_sequences EXTRA_SEQUENCES] [--Filtering_off] --Genome_fasta
|
||||
GENOME_FASTA --Gtf GTF [--Maximum_threads MAXIMUM_THREADS]
|
||||
[--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS] [--WTA_Only]
|
||||
[job_order]
|
||||
|
||||
The Reference Files Generator creates an archive containing Genome Index and
|
||||
Transcriptome annotation files needed for the BD Rhapsodyâ„¢ Sequencing
|
||||
Analysis Pipeline. The app takes as input one or more FASTA and GTF files and
|
||||
produces a compressed archive in the form of a tar.gz file. The archive
|
||||
contains:\n - STAR index\n - Filtered GTF file
|
||||
|
||||
positional arguments:
|
||||
job_order Job input json file
|
||||
|
||||
options:
|
||||
-h, --help show this help message and exit
|
||||
--Archive_prefix ARCHIVE_PREFIX
|
||||
A prefix for naming the compressed archive file
|
||||
containing the Reference genome index and annotation
|
||||
files. The default value is constructed based on the
|
||||
input Reference files.
|
||||
--Extra_STAR_params EXTRA_STAR_PARAMS
|
||||
Additional parameters to pass to STAR when building
|
||||
the genome index. Specify exactly like how you would
|
||||
on the command line. Example: --limitGenomeGenerateRAM
|
||||
48000 --genomeSAindexNbases 11
|
||||
--Extra_sequences EXTRA_SEQUENCES
|
||||
Additional sequences in FASTA format to use when
|
||||
building the STAR index. (E.g. phiX genome)
|
||||
--Filtering_off By default the input Transcript Annotation files are
|
||||
filtered based on the gene_type/gene_biotype
|
||||
attribute. Only features having the following
|
||||
attribute values are are kept: - protein_coding -
|
||||
lncRNA (lincRNA and antisense for Gencode <
|
||||
v31/M22/Ensembl97) - IG_LV_gene - IG_V_gene -
|
||||
IG_V_pseudogene - IG_D_gene - IG_J_gene -
|
||||
IG_J_pseudogene - IG_C_gene - IG_C_pseudogene -
|
||||
TR_V_gene - TR_V_pseudogene - TR_D_gene - TR_J_gene -
|
||||
TR_J_pseudogene - TR_C_gene If you have already pre-
|
||||
filtered the input Annotation files and/or wish to
|
||||
turn-off the filtering, please set this option to
|
||||
True.
|
||||
--Genome_fasta GENOME_FASTA
|
||||
Reference genome file in FASTA format. The BD
|
||||
Rhapsodyâ„¢ Sequencing Analysis Pipeline uses GRCh38
|
||||
for Human and GRCm39 for Mouse.
|
||||
--Gtf GTF Transcript annotation files in GTF format. The BD
|
||||
Rhapsodyâ„¢ Sequencing Analysis Pipeline uses Gencode
|
||||
v42 for Human and M31 for Mouse.
|
||||
--Maximum_threads MAXIMUM_THREADS
|
||||
The maximum number of threads to use in the pipeline.
|
||||
By default, all available cores are used.
|
||||
--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS
|
||||
Names of the Mitochondrial contigs in the provided
|
||||
Reference Genome. Fragments originating from contigs
|
||||
other than these are identified as 'nuclear fragments'
|
||||
in the ATACseq analysis pipeline.
|
||||
--WTA_Only Build a WTA only index, otherwise builds a WTA + ATAC
|
||||
index.
|
||||
@@ -0,0 +1,115 @@
|
||||
requirements:
|
||||
InlineJavascriptRequirement: {}
|
||||
class: CommandLineTool
|
||||
label: Reference Files Generator for BD Rhapsodyâ„¢ Sequencing Analysis Pipeline
|
||||
cwlVersion: v1.2
|
||||
doc: >-
|
||||
The Reference Files Generator creates an archive containing Genome Index and Transcriptome annotation files needed for the BD Rhapsodyâ„¢ Sequencing Analysis Pipeline. The app takes as input one or more FASTA and GTF files and produces a compressed archive in the form of a tar.gz file. The archive contains:\n - STAR index\n - Filtered GTF file
|
||||
|
||||
|
||||
baseCommand: run_reference_generator.sh
|
||||
inputs:
|
||||
Genome_fasta:
|
||||
type: File[]
|
||||
label: Reference Genome
|
||||
doc: |-
|
||||
Reference genome file in FASTA format. The BD Rhapsodyâ„¢ Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
|
||||
inputBinding:
|
||||
prefix: --reference-genome
|
||||
shellQuote: false
|
||||
Gtf:
|
||||
type: File[]
|
||||
label: Transcript Annotations
|
||||
doc: |-
|
||||
Transcript annotation files in GTF format. The BD Rhapsodyâ„¢ Sequencing Analysis Pipeline uses Gencode v42 for Human and M31 for Mouse.
|
||||
inputBinding:
|
||||
prefix: --gtf
|
||||
shellQuote: false
|
||||
Extra_sequences:
|
||||
type: File[]?
|
||||
label: Extra Sequences
|
||||
doc: |-
|
||||
Additional sequences in FASTA format to use when building the STAR index. (E.g. phiX genome)
|
||||
inputBinding:
|
||||
prefix: --extra-sequences
|
||||
shellQuote: false
|
||||
Mitochondrial_Contigs:
|
||||
type: string[]?
|
||||
default: ["chrM", "chrMT", "M", "MT"]
|
||||
label: Mitochondrial Contig Names
|
||||
doc: |-
|
||||
Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are identified as 'nuclear fragments' in the ATACseq analysis pipeline.
|
||||
inputBinding:
|
||||
prefix: --mitochondrial-contigs
|
||||
shellQuote: false
|
||||
Filtering_off:
|
||||
type: boolean?
|
||||
label: Turn off filtering
|
||||
doc: |-
|
||||
By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features having the following attribute values are are kept:
|
||||
- protein_coding
|
||||
- lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
|
||||
- IG_LV_gene
|
||||
- IG_V_gene
|
||||
- IG_V_pseudogene
|
||||
- IG_D_gene
|
||||
- IG_J_gene
|
||||
- IG_J_pseudogene
|
||||
- IG_C_gene
|
||||
- IG_C_pseudogene
|
||||
- TR_V_gene
|
||||
- TR_V_pseudogene
|
||||
- TR_D_gene
|
||||
- TR_J_gene
|
||||
- TR_J_pseudogene
|
||||
- TR_C_gene
|
||||
If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
|
||||
inputBinding:
|
||||
prefix: --filtering-off
|
||||
shellQuote: false
|
||||
WTA_Only:
|
||||
type: boolean?
|
||||
label: WTA only index
|
||||
doc: Build a WTA only index, otherwise builds a WTA + ATAC index.
|
||||
inputBinding:
|
||||
prefix: --wta-only-index
|
||||
shellQuote: false
|
||||
Archive_prefix:
|
||||
type: string?
|
||||
label: Archive Prefix
|
||||
doc: |-
|
||||
A prefix for naming the compressed archive file containing the Reference genome index and annotation files. The default value is constructed based on the input Reference files.
|
||||
inputBinding:
|
||||
prefix: --archive-prefix
|
||||
shellQuote: false
|
||||
Extra_STAR_params:
|
||||
type: string?
|
||||
label: Extra STAR Params
|
||||
doc: |-
|
||||
Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
|
||||
Example:
|
||||
--limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
|
||||
inputBinding:
|
||||
prefix: --extra-star-params
|
||||
shellQuote: true
|
||||
|
||||
Maximum_threads:
|
||||
type: int?
|
||||
label: Maximum Number of Threads
|
||||
doc: |-
|
||||
The maximum number of threads to use in the pipeline. By default, all available cores are used.
|
||||
inputBinding:
|
||||
prefix: --maximum-threads
|
||||
shellQuote: false
|
||||
|
||||
outputs:
|
||||
|
||||
Archive:
|
||||
type: File
|
||||
doc: |-
|
||||
A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an input in the BD Rhapsodyâ„¢ Sequencing Analysis Pipeline.
|
||||
id: Reference_Archive
|
||||
label: Reference Files Archive
|
||||
outputBinding:
|
||||
glob: '*.tar.gz'
|
||||
|
||||
161
src/bd_rhapsody/bd_rhapsody_make_reference/script.py
Normal file
161
src/bd_rhapsody/bd_rhapsody_make_reference/script.py
Normal file
@@ -0,0 +1,161 @@
|
||||
import os
|
||||
import re
|
||||
import subprocess
|
||||
import tempfile
|
||||
from typing import Any
|
||||
import yaml
|
||||
import shutil
|
||||
|
||||
## VIASH START
|
||||
par = {
|
||||
"genome_fasta": [],
|
||||
"gtf": [],
|
||||
"extra_sequences": [],
|
||||
"mitochondrial_contigs": ["chrM", "chrMT", "M", "MT"],
|
||||
"filtering_off": False,
|
||||
"wta_only_index": False,
|
||||
"extra_star_params": None,
|
||||
"reference_archive": "output.tar.gz",
|
||||
}
|
||||
meta = {
|
||||
"config": "target/nextflow/reference/build_bdrhap_2_reference/.config.vsh.yaml",
|
||||
"resources_dir": os.path.abspath("src/reference/build_bdrhap_2_reference"),
|
||||
"temp_dir": os.getenv("VIASH_TEMP"),
|
||||
"memory_mb": None,
|
||||
"cpus": None
|
||||
}
|
||||
## VIASH END
|
||||
|
||||
def clean_arg(argument):
|
||||
argument["clean_name"] = re.sub("^-*", "", argument["name"])
|
||||
return argument
|
||||
|
||||
def read_config(path: str) -> dict[str, Any]:
|
||||
with open(path, "r") as f:
|
||||
config = yaml.safe_load(f)
|
||||
|
||||
config["all_arguments"] = [
|
||||
clean_arg(arg)
|
||||
for grp in config["argument_groups"]
|
||||
for arg in grp["arguments"]
|
||||
]
|
||||
|
||||
return config
|
||||
|
||||
def strip_margin(text: str) -> str:
|
||||
return re.sub("(\n?)[ \t]*\|", "\\1", text)
|
||||
|
||||
def process_params(par: dict[str, Any], config) -> str:
|
||||
# check input parameters
|
||||
assert par["genome_fasta"], "Pass at least one set of inputs to --genome_fasta."
|
||||
assert par["gtf"], "Pass at least one set of inputs to --gtf."
|
||||
assert par["reference_archive"].endswith(".tar.gz"), "Output reference_archive must end with .tar.gz."
|
||||
|
||||
# make paths absolute
|
||||
for argument in config["all_arguments"]:
|
||||
if par[argument["clean_name"]] and argument["type"] == "file":
|
||||
if isinstance(par[argument["clean_name"]], list):
|
||||
par[argument["clean_name"]] = [ os.path.abspath(f) for f in par[argument["clean_name"]] ]
|
||||
else:
|
||||
par[argument["clean_name"]] = os.path.abspath(par[argument["clean_name"]])
|
||||
|
||||
return par
|
||||
|
||||
def generate_config(par: dict[str, Any], meta, config) -> str:
|
||||
content_list = [strip_margin(f"""\
|
||||
|#!/usr/bin/env cwl-runner
|
||||
|
|
||||
|""")]
|
||||
|
||||
|
||||
config_key_value_pairs = []
|
||||
for argument in config["all_arguments"]:
|
||||
config_key = (argument.get("info") or {}).get("config_key")
|
||||
arg_type = argument["type"]
|
||||
par_value = par[argument["clean_name"]]
|
||||
if par_value and config_key:
|
||||
config_key_value_pairs.append((config_key, arg_type, par_value))
|
||||
|
||||
if meta["cpus"]:
|
||||
config_key_value_pairs.append(("Maximum_threads", "integer", meta["cpus"]))
|
||||
|
||||
# print(config_key_value_pairs)
|
||||
|
||||
for config_key, arg_type, par_value in config_key_value_pairs:
|
||||
if arg_type == "file":
|
||||
str = strip_margin(f"""\
|
||||
|{config_key}:
|
||||
|""")
|
||||
if isinstance(par_value, list):
|
||||
for file in par_value:
|
||||
str += strip_margin(f"""\
|
||||
| - class: File
|
||||
| location: "{file}"
|
||||
|""")
|
||||
else:
|
||||
str += strip_margin(f"""\
|
||||
| class: File
|
||||
| location: "{par_value}"
|
||||
|""")
|
||||
content_list.append(str)
|
||||
else:
|
||||
content_list.append(strip_margin(f"""\
|
||||
|{config_key}: {par_value}
|
||||
|"""))
|
||||
|
||||
## Write config to file
|
||||
return "".join(content_list)
|
||||
|
||||
def get_cwl_file(meta: dict[str, Any]) -> str:
|
||||
# create cwl file (if need be)
|
||||
cwl_file=os.path.join(meta["resources_dir"], "make_rhap_reference_2.2.1_nodocker.cwl")
|
||||
|
||||
return cwl_file
|
||||
|
||||
def main(par: dict[str, Any], meta: dict[str, Any]):
|
||||
config = read_config(meta["config"])
|
||||
|
||||
# Preprocess params
|
||||
par = process_params(par, config)
|
||||
|
||||
# fetch cwl file
|
||||
cwl_file = get_cwl_file(meta)
|
||||
|
||||
# Create output dir if not exists
|
||||
outdir = os.path.dirname(par["reference_archive"])
|
||||
if not os.path.exists(outdir):
|
||||
os.makedirs(outdir)
|
||||
|
||||
## Run pipeline
|
||||
with tempfile.TemporaryDirectory(prefix="cwl-bd_rhapsody_wta-", dir=meta["temp_dir"]) as temp_dir:
|
||||
# Create params file
|
||||
config_file = os.path.join(temp_dir, "config.yml")
|
||||
config_content = generate_config(par, meta, config)
|
||||
with open(config_file, "w") as f:
|
||||
f.write(config_content)
|
||||
|
||||
|
||||
cmd = [
|
||||
"cwl-runner",
|
||||
"--no-container",
|
||||
"--preserve-entire-environment",
|
||||
"--outdir",
|
||||
temp_dir,
|
||||
cwl_file,
|
||||
config_file
|
||||
]
|
||||
|
||||
env = dict(os.environ)
|
||||
env["TMPDIR"] = temp_dir
|
||||
|
||||
print("> " + " ".join(cmd), flush=True)
|
||||
_ = subprocess.check_call(
|
||||
cmd,
|
||||
cwd=os.path.dirname(config_file),
|
||||
env=env
|
||||
)
|
||||
|
||||
shutil.move(os.path.join(temp_dir, "Rhap_reference.tar.gz"), par["reference_archive"])
|
||||
|
||||
if __name__ == "__main__":
|
||||
main(par, meta)
|
||||
65
src/bd_rhapsody/bd_rhapsody_make_reference/test.sh
Normal file
65
src/bd_rhapsody/bd_rhapsody_make_reference/test.sh
Normal file
@@ -0,0 +1,65 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_doesnt_exist() {
|
||||
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_empty() {
|
||||
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains() {
|
||||
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
#############################################
|
||||
|
||||
in_fa="$meta_resources_dir/test_data/reference_small.fa"
|
||||
in_gtf="$meta_resources_dir/test_data/reference_small.gtf"
|
||||
|
||||
echo "#############################################"
|
||||
echo "> Simple run"
|
||||
|
||||
mkdir simple_run
|
||||
cd simple_run
|
||||
|
||||
out_tar="myreference.tar.gz"
|
||||
|
||||
echo "> Running $meta_name."
|
||||
$meta_executable \
|
||||
--genome_fasta "$in_fa" \
|
||||
--gtf "$in_gtf" \
|
||||
--reference_archive "$out_tar" \
|
||||
--extra_star_params "--genomeSAindexNbases 6" \
|
||||
---cpus 2
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
assert_file_exists "$out_tar"
|
||||
assert_file_not_empty "$out_tar"
|
||||
|
||||
echo ">> Checking whether output contains the expected files"
|
||||
tar -xvf "$out_tar" > /dev/null
|
||||
assert_file_exists "BD_Rhapsody_Reference_Files/star_index/genomeParameters.txt"
|
||||
assert_file_exists "BD_Rhapsody_Reference_Files/bwa-mem2_index/reference_small.ann"
|
||||
assert_file_exists "BD_Rhapsody_Reference_Files/reference_small-processed.gtf"
|
||||
assert_file_exists "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt"
|
||||
assert_file_contains "BD_Rhapsody_Reference_Files/reference_small-processed.gtf" "chr1.*HAVANA.*ENSG00000243485"
|
||||
assert_file_contains "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt" 'chrMT'
|
||||
|
||||
cd ..
|
||||
|
||||
echo "#############################################"
|
||||
|
||||
echo "> Tests succeeded!"
|
||||
@@ -0,0 +1,27 @@
|
||||
>chr1 1
|
||||
TGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGT
|
||||
GATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTA
|
||||
ATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTT
|
||||
TTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTT
|
||||
TAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCT
|
||||
GGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAA
|
||||
AAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAA
|
||||
ATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGTTGTTTATC
|
||||
TGAGATTCAGAATTAAGCATTTTATATTTTATTTGCTGCCTCTGGCCACCCTACTCTCTT
|
||||
CCTAACACTCTCTCCCTCTCCCAGTTTTGTCCGCCTTCCCTGCCTCCTCTTCTGGGGGAG
|
||||
TTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCT
|
||||
TACCAGAGGTAAAGAAGAGATGGATCTCCACTCATGTTGTAGACAGAATGTTTATGTCCT
|
||||
CTCCAAATGCTTATGTTGAAACCCTAACCCCTAATGTGATGGTATGTGGAGATGGGCCTT
|
||||
TGGTAGGTAATTACGGTTAGATGAGGTCATGGGGTGGGGCCCTCATTATAGATCTGGTAA
|
||||
GAAAAGAGAGCATTGTCTCTGTGTCTCCCTCTCTCTCTCTCTCTCTCTCTCTCATTTCTC
|
||||
TCTATCTCATTTCTCTCTCTCTCGCTATCTCATTTTTCTCTCTCTCTCTTTCTCTCCTCT
|
||||
GTCTTTTCCCACCAAGTGAGGATGCGAAGAGAAGGTGGCTGTCTGCAAACCAGGAAGAGA
|
||||
GCCCTCACCGGGAACCCGTCCAGCTGCCACCTTGAACTTGGACTTCCAAGCCTCCAGAAC
|
||||
TGTGAGGGATAAATGTATGATTTTAAAGTCGCCCAGTGTGTGGTATTTTGTTTTGACTAA
|
||||
TACAACCTGAAAACATTTTCCCCTCACTCCACCTGAGCAATATCTGAGTGGCTTAAGGTA
|
||||
CTCAGGACACAACAAAGGAGAAATGTCCCATGCACAAGGTGCACCCATGCCTGGGTAAAG
|
||||
CAGCCTGGCACAGAGGGAAGCACACAGGCTCAGGGATCTGCTATTCATTCTTTGTGTGAC
|
||||
CCTGGGCAAGCCATGAATGGAGCTTCAGTCACCCCATTTGTAATGGGATTTAATTGTGCT
|
||||
TGCCCTGCCTCCTTTTGAGGGCTGTAGAGAAAAGATGTCAAAGTATTTTGTAATCTGGCT
|
||||
GGGCGTGGTGGCTCATGCCTGTAATCCTAGCACTTTGGTAGGCTGACGCGAGAGGACTGC
|
||||
T
|
||||
@@ -0,0 +1,8 @@
|
||||
chr1 HAVANA exon 565 668 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
|
||||
chr1 HAVANA exon 977 1098 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
|
||||
chr1 HAVANA transcript 268 1110 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
|
||||
chr1 HAVANA exon 268 668 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 1; exon_id "ENSE00001841699.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
|
||||
chr1 HAVANA exon 977 1110 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 2; exon_id "ENSE00001890064.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
|
||||
chr1 ENSEMBL gene 367 504 . + . gene_id "ENSG00000284332.1"; gene_type "miRNA"; gene_name "MIR1302-2"; level 3; hgnc_id "HGNC:35294";
|
||||
chr1 ENSEMBL transcript 367 504 . + . gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";
|
||||
chr1 ENSEMBL exon 367 504 . + . gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; exon_number 1; exon_id "ENSE00003695741.1"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";
|
||||
@@ -0,0 +1,47 @@
|
||||
#!/bin/bash
|
||||
|
||||
TMP_DIR=/tmp/bd_rhapsody_make_reference
|
||||
OUT_DIR=src/bd_rhapsody/bd_rhapsody_make_reference/test_data
|
||||
|
||||
# check if seqkit is installed
|
||||
if ! command -v seqkit &> /dev/null; then
|
||||
echo "seqkit could not be found"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# create temporary directory and clean up on exit
|
||||
mkdir -p $TMP_DIR
|
||||
function clean_up {
|
||||
rm -rf "$TMP_DIR"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
# fetch reference
|
||||
ORIG_FA=$TMP_DIR/reference.fa.gz
|
||||
if [ ! -f $ORIG_FA ]; then
|
||||
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/GRCh38.primary_assembly.genome.fa.gz \
|
||||
-O $ORIG_FA
|
||||
fi
|
||||
|
||||
ORIG_GTF=$TMP_DIR/reference.gtf.gz
|
||||
if [ ! -f $ORIG_GTF ]; then
|
||||
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz \
|
||||
-O $ORIG_GTF
|
||||
fi
|
||||
|
||||
# create small reference
|
||||
START=30000
|
||||
END=31500
|
||||
CHR=chr1
|
||||
|
||||
# subset to small region
|
||||
seqkit grep -r -p "^$CHR\$" "$ORIG_FA" | \
|
||||
seqkit subseq -r "$START:$END" > $OUT_DIR/reference_small.fa
|
||||
|
||||
zcat "$ORIG_GTF" | \
|
||||
awk -v FS='\t' -v OFS='\t' "
|
||||
\$1 == \"$CHR\" && \$4 >= $START && \$5 <= $END {
|
||||
\$4 = \$4 - $START + 1;
|
||||
\$5 = \$5 - $START + 1;
|
||||
print;
|
||||
}" > $OUT_DIR/reference_small.gtf
|
||||
@@ -10,6 +10,9 @@ references:
|
||||
license: GPL-2.0
|
||||
requirements:
|
||||
commands: [bedtools]
|
||||
authors:
|
||||
- __merge__: /src/_authors/dries_schaumont.yaml
|
||||
roles: [ author, maintainer ]
|
||||
|
||||
argument_groups:
|
||||
- name: Input arguments
|
||||
|
||||
@@ -9,6 +9,9 @@ links:
|
||||
references:
|
||||
doi: 10.1007/978-1-4939-9173-0_14
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/dorien_roosen.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -9,6 +9,9 @@ links:
|
||||
references:
|
||||
doi: 10.1007/978-1-4939-9173-0_14
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/dorien_roosen.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Outputs
|
||||
arguments:
|
||||
|
||||
@@ -9,6 +9,9 @@ links:
|
||||
references:
|
||||
doi: 10.1007/978-1-4939-9173-0_14
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/dorien_roosen.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -9,6 +9,9 @@ links:
|
||||
references:
|
||||
doi: 10.14806/ej.17.1.200
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/toni_verbeiren.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
####################################################################
|
||||
- name: Specify Adapters for R1
|
||||
|
||||
@@ -6,25 +6,25 @@ set -eo pipefail
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_doesnt_exist() {
|
||||
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
|
||||
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_empty() {
|
||||
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
|
||||
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains() {
|
||||
grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
|
||||
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
|
||||
#############################################
|
||||
|
||||
mkdir test_multiple_output
|
||||
cd test_multiple_output
|
||||
|
||||
|
||||
@@ -9,6 +9,9 @@ references:
|
||||
license: GPL-3.0
|
||||
requirements:
|
||||
commands: [falco]
|
||||
authors:
|
||||
- __merge__: /src/_authors/toni_verbeiren.yaml
|
||||
roles: [ author, maintainer ]
|
||||
|
||||
# Notes:
|
||||
# - falco as arguments similar to -subsample and we update those to --subsample
|
||||
|
||||
@@ -26,6 +26,9 @@ links:
|
||||
references:
|
||||
doi: "10.1093/bioinformatics/bty560"
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
description: |
|
||||
|
||||
@@ -11,7 +11,9 @@ references:
|
||||
license: GPL-3.0
|
||||
requirements:
|
||||
commands: [ featureCounts ]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -8,8 +8,9 @@ links:
|
||||
references:
|
||||
doi: 10.12688/f1000research.23297.2
|
||||
license: MIT
|
||||
requirements:
|
||||
commands: [ gffread ]
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
@@ -52,7 +53,7 @@ argument_groups:
|
||||
required: true
|
||||
description: |
|
||||
Write the output records into <outfile>.
|
||||
default: output.gff
|
||||
example: output.gff
|
||||
- name: --force_exons
|
||||
type: boolean_true
|
||||
description: |
|
||||
@@ -154,7 +155,6 @@ argument_groups:
|
||||
- name: --table
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
description: |
|
||||
Output a simple tab delimited format instead of GFF, with columns having the values
|
||||
of GFF attributes given in <attrlist>; special pseudo-attributes (prefixed by @) are
|
||||
|
||||
@@ -50,6 +50,8 @@
|
||||
[[ "$par_expose_dups" == "false" ]] && unset par_expose_dups
|
||||
[[ "$par_cluster_only" == "false" ]] && unset par_cluster_only
|
||||
|
||||
# if par_table is not empty, replace ";" with ","
|
||||
par_table=$(echo "$par_table" | tr ';' ',')
|
||||
|
||||
$(which gffread) \
|
||||
"$par_input" \
|
||||
|
||||
@@ -86,7 +86,7 @@ diff "$expected_output_dir/transcripts.fa" "$test_output_dir/transcripts.fa" ||
|
||||
echo "> Test 4 - Generate table from GFF annotation file"
|
||||
|
||||
"$meta_executable" \
|
||||
--table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
|
||||
--table "@id;@chr;@start;@end;@strand;@exons;Name;gene;product" \
|
||||
--outfile "$test_output_dir/annotation.tbl" \
|
||||
--input "$test_dir/sequence.gff3"
|
||||
|
||||
|
||||
@@ -17,6 +17,9 @@ references:
|
||||
license: "MIT"
|
||||
requirements:
|
||||
commands: [ lofreq ]
|
||||
authors:
|
||||
- __merge__: /src/_authors/kai_waldrant.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -18,6 +18,9 @@ references:
|
||||
license: "MIT"
|
||||
requirements:
|
||||
commands: [ lofreq ]
|
||||
authors:
|
||||
- __merge__: /src/_authors/kai_waldrant.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -11,7 +11,9 @@ info:
|
||||
references:
|
||||
doi: 10.1093/bioinformatics/btw354
|
||||
licence: GPL v3 or later
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/dorien_roosen.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
@@ -54,25 +56,21 @@ argument_groups:
|
||||
- name: "--include_modules"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: fastqc,cutadapt
|
||||
example: [fastqc, cutadapt]
|
||||
description: Use only these module
|
||||
- name: "--exclude_modules"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: fastqc,cutadapt
|
||||
example: [fastqc, cutadapt]
|
||||
description: Do not use only these modules
|
||||
- name: "--ignore_analysis"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: run_one/*,run_two/*
|
||||
example: [run_one/*, run_two/*]
|
||||
- name: "--ignore_samples"
|
||||
type: string
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
example: sample_1*,sample_3*
|
||||
example: [sample_1*, sample_3*]
|
||||
- name: "--ignore_symlinks"
|
||||
type: boolean_true
|
||||
description: Ignore symlinked directories and files
|
||||
|
||||
@@ -38,7 +38,7 @@ IFS=";" read -ra inputs <<< $par_input
|
||||
|
||||
if [[ -n "$par_include_modules" ]]; then
|
||||
include_modules=""
|
||||
IFS="," read -ra incl_modules <<< $par_include_modules
|
||||
IFS=";" read -ra incl_modules <<< $par_include_modules
|
||||
for i in "${incl_modules[@]}"; do
|
||||
include_modules+="--include $i "
|
||||
done
|
||||
@@ -47,7 +47,7 @@ fi
|
||||
|
||||
if [[ -n "$par_exclude_modules" ]]; then
|
||||
exclude_modules=""
|
||||
IFS="," read -ra excl_modules <<< $par_exclude_modules
|
||||
IFS=";" read -ra excl_modules <<< $par_exclude_modules
|
||||
for i in "${excl_modules[@]}"; do
|
||||
exclude_modules+="--exclude $i"
|
||||
done
|
||||
@@ -56,7 +56,7 @@ fi
|
||||
|
||||
if [[ -n "$par_ignore_analysis" ]]; then
|
||||
ignore=""
|
||||
IFS="," read -ra ignore_analysis <<< $par_ignore_analysis
|
||||
IFS=";" read -ra ignore_analysis <<< $par_ignore_analysis
|
||||
for i in "${ignore_analysis[@]}"; do
|
||||
ignore+="--ignore $i "
|
||||
done
|
||||
@@ -65,7 +65,7 @@ fi
|
||||
|
||||
if [[ -n "$par_ignore_samples" ]]; then
|
||||
ignore_samples=""
|
||||
IFS="," read -ra ign_samples <<< $par_ignore_samples
|
||||
IFS=";" read -ra ign_samples <<< $par_ignore_samples
|
||||
for i in "${ign_samples[@]}"; do
|
||||
ignore_samples+="--ignore-samples $i"
|
||||
done
|
||||
|
||||
@@ -12,7 +12,10 @@ references:
|
||||
doi: 10.1093/bioinformatics/btt593
|
||||
license: "CC-BY-NC-SA-3.0"
|
||||
requirements:
|
||||
commands: [ pear , gzip ]
|
||||
commands: [ pear, gzip ]
|
||||
authors:
|
||||
- __merge__: /src/_authors/kai_waldrant.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -12,14 +12,16 @@ references:
|
||||
license: GPL-3.0
|
||||
requirements:
|
||||
commands: [ salmon ]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --genome
|
||||
type: file
|
||||
description: |
|
||||
Genome of the organism to prepare the set of decoy sequences. Required to build decoy-aware transccriptome.
|
||||
Genome of the organism to prepare the set of decoy sequences. Required to build decoy-aware transcriptome.
|
||||
required: false
|
||||
example: genome.fasta
|
||||
- name: --transcripts
|
||||
@@ -110,4 +112,4 @@ engines:
|
||||
salmon index -v 2>&1 | sed 's/salmon \([0-9.]*\)/salmon: \1/' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
- type: nextflow
|
||||
|
||||
@@ -12,7 +12,9 @@ references:
|
||||
license: GPL-3.0
|
||||
requirements:
|
||||
commands: [ salmon ]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Common input options
|
||||
arguments:
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
193
src/samtools/samtools_fasta/config.vsh.yaml
Normal file
193
src/samtools/samtools_fasta/config.vsh.yaml
Normal file
@@ -0,0 +1,193 @@
|
||||
name: samtools_fasta
|
||||
namespace: samtools
|
||||
description: Converts a SAM, BAM or CRAM to FASTA format.
|
||||
keywords: [fasta, bam, sam, cram]
|
||||
links:
|
||||
homepage: https://www.htslib.org/
|
||||
documentation: https://www.htslib.org/doc/samtools-fasta.html
|
||||
repository: https://github.com/samtools/samtools
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
type: file
|
||||
description: input SAM/BAM/CRAM file
|
||||
required: true
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
description: output FASTA file
|
||||
required: true
|
||||
direction: output
|
||||
- name: Options
|
||||
arguments:
|
||||
- name: --no_suffix
|
||||
alternatives: -n
|
||||
type: boolean_true
|
||||
description: |
|
||||
By default, either '/1' or '/2' is added to the end of read names where the corresponding
|
||||
READ1 or READ2 FLAG bit is set. Using -n causes read names to be left as they are.
|
||||
- name: --suffix
|
||||
alternatives: -N
|
||||
type: boolean_true
|
||||
description: |
|
||||
Always add either '/1' or '/2' to the end of read names even when put into different files.
|
||||
- name: --use_oq
|
||||
alternatives: -O
|
||||
type: boolean_true
|
||||
description: |
|
||||
Use quality values from OQ tags in preference to standard quality string if available.
|
||||
- name: --singleton
|
||||
alternatives: -s
|
||||
type: file
|
||||
description: write singleton reads to FILE.
|
||||
- name: --copy_tags
|
||||
alternatives: -t
|
||||
type: boolean_true
|
||||
description: |
|
||||
Copy RG, BC and QT tags to the FASTA header line, if they exist.
|
||||
- name: --copy_tags_list
|
||||
alternatives: -T
|
||||
type: string
|
||||
description: |
|
||||
Specify a comma-separated list of tags to copy to the FASTA header line, if they exist.
|
||||
TAGLIST can be blank or `*` to indicate all tags should be copied to the output. If using `*`,
|
||||
be careful to quote it to avoid unwanted shell expansion.
|
||||
- name: --read1
|
||||
alternatives: -1
|
||||
type: file
|
||||
description: |
|
||||
Write reads with the READ1 FLAG set (and READ2 not set) to FILE instead of outputting them.
|
||||
If the -s option is used, only paired reads will be written to this file.
|
||||
direction: output
|
||||
- name: --read2
|
||||
alternatives: -2
|
||||
type: file
|
||||
description: |
|
||||
Write reads with the READ2 FLAG set (and READ1 not set) to FILE instead of outputting them.
|
||||
If the -s option is used, only paired reads will be written to this file.
|
||||
direction: output
|
||||
- name: --output_reads
|
||||
alternatives: -o
|
||||
type: file
|
||||
description: |
|
||||
Write reads with either READ1 FLAG or READ2 flag set to FILE instead of outputting them to stdout.
|
||||
This is equivalent to -1 FILE -2 FILE.
|
||||
direction: output
|
||||
- name: --output_reads_both
|
||||
alternatives: -0
|
||||
type: file
|
||||
description: |
|
||||
Write reads where the READ1 and READ2 FLAG bits set are either both set or both unset to FILE
|
||||
instead of outputting them.
|
||||
direction: output
|
||||
- name: --filter_flags
|
||||
alternatives: -f
|
||||
type: integer
|
||||
description: |
|
||||
Only output alignments with all bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
|
||||
(i.e. /^0[0-7]+/). Default: `0`.
|
||||
example: 0
|
||||
- name: --excl_flags
|
||||
alternatives: -F
|
||||
type: string
|
||||
description: |
|
||||
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
|
||||
(i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and
|
||||
supplementary alignments. Default: `0x900`.
|
||||
example: "0x900"
|
||||
- name: --incl_flags
|
||||
alternatives: --rf
|
||||
type: string
|
||||
description: |
|
||||
Only output alignments with any bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with '0'
|
||||
(i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of
|
||||
flag names. Default: `0`.
|
||||
example: 0
|
||||
- name: --excl_flags_all
|
||||
alternatives: -G
|
||||
type: integer
|
||||
description: |
|
||||
Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/).
|
||||
Default: `0`.
|
||||
example: 0
|
||||
- name: --aux_tag
|
||||
alternatives: -d
|
||||
type: string
|
||||
description: |
|
||||
Only output alignments containing an auxiliary tag matching both TAG and VAL. If VAL is omitted
|
||||
then any value is accepted. The tag types supported are i, f, Z, A and H. "B" arrays are not
|
||||
supported. This is comparable to the method used in samtools view --tag. The option may be specified
|
||||
multiple times and is equivalent to using the --aux_tag_file option.
|
||||
- name: --aux_tag_file
|
||||
alternatives: -D
|
||||
type: string
|
||||
description: |
|
||||
Only output alignments containing an auxiliary tag matching TAG and having a value listed in FILE.
|
||||
The format of the file is one line per value. This is equivalent to specifying --aux_tag multiple times.
|
||||
- name: --casava
|
||||
alternatives: -i
|
||||
type: boolean_true
|
||||
description: add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
|
||||
- name: --compression
|
||||
alternatives: -c
|
||||
type: integer
|
||||
description: set compression level when writing gz or bgzf fasta files.
|
||||
example: 0
|
||||
- name: --index1
|
||||
alternatives: --i1
|
||||
type: file
|
||||
description: write first index reads to FILE.
|
||||
- name: --index2
|
||||
alternatives: --i2
|
||||
type: file
|
||||
description: write second index reads to FILE.
|
||||
- name: --barcode_tag
|
||||
type: string
|
||||
description: |
|
||||
Auxiliary tag to find index reads in. Default: `BC`.
|
||||
example: "BC"
|
||||
- name: --quality_tag
|
||||
type: string
|
||||
description: |
|
||||
Auxiliary tag to find index quality in. Default: `QT`.
|
||||
example: "QT"
|
||||
- name: --index_format
|
||||
type: string
|
||||
description: |
|
||||
string to describe how to parse the barcode and quality tags. For example:
|
||||
* `i14i8`: the first 14 characters are index 1, the next 8 characters are index 2.
|
||||
* `n8i14`: ignore the first 8 characters, and use the next 14 characters for index 1.
|
||||
If the tag contains a separator, then the numeric part can be replaced with`*` to mean
|
||||
'read until the separator or end of tag', for example: `n*i*`.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: ../samtools_fastq/script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \
|
||||
sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
80
src/samtools/samtools_fasta/help.txt
Normal file
80
src/samtools/samtools_fasta/help.txt
Normal file
@@ -0,0 +1,80 @@
|
||||
```
|
||||
samtools fastq
|
||||
```
|
||||
|
||||
Usage: samtools fastq [options...] <in.bam>
|
||||
|
||||
Description:
|
||||
Converts a SAM, BAM or CRAM to FASTQ format.
|
||||
|
||||
Options:
|
||||
-0 FILE write reads designated READ_OTHER to FILE
|
||||
-1 FILE write reads designated READ1 to FILE
|
||||
-2 FILE write reads designated READ2 to FILE
|
||||
-o FILE write reads designated READ1 or READ2 to FILE
|
||||
note: if a singleton file is specified with -s, only
|
||||
paired reads will be written to the -1 and -2 files.
|
||||
-d, --tag TAG[:VAL]
|
||||
only include reads containing TAG, optionally with value VAL
|
||||
-f, --require-flags INT
|
||||
only include reads with all of the FLAGs in INT present [0]
|
||||
-F, --excl[ude]-flags INT
|
||||
only include reads with none of the FLAGs in INT present [0x900]
|
||||
--rf, --incl[ude]-flags INT
|
||||
only include reads with any of the FLAGs in INT present [0]
|
||||
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
|
||||
-n don't append /1 and /2 to the read name
|
||||
-N always append /1 and /2 to the read name
|
||||
-O output quality in the OQ tag if present
|
||||
-s FILE write singleton reads designated READ1 or READ2 to FILE
|
||||
-t copy RG, BC and QT tags to the FASTQ header line
|
||||
-T TAGLIST copy arbitrary tags to the FASTQ header line, '*' for all
|
||||
-v INT default quality score if not given in file [1]
|
||||
-i add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
|
||||
-c INT compression level [0..9] to use when writing bgzf files [1]
|
||||
--i1 FILE write first index reads to FILE
|
||||
--i2 FILE write second index reads to FILE
|
||||
--barcode-tag TAG
|
||||
Barcode tag [BC]
|
||||
--quality-tag TAG
|
||||
Quality tag [QT]
|
||||
--index-format STR
|
||||
How to parse barcode and quality tags
|
||||
|
||||
--input-fmt-option OPT[=VAL]
|
||||
Specify a single input file format option in the form
|
||||
of OPTION or OPTION=VALUE
|
||||
--reference FILE
|
||||
Reference sequence FASTA FILE [null]
|
||||
-@, --threads INT
|
||||
Number of additional threads to use [0]
|
||||
--verbosity INT
|
||||
Set level of verbosity
|
||||
|
||||
The files will be automatically compressed if the file names have a .gz
|
||||
or .bgzf extension. The input to this program must be collated by name.
|
||||
Run 'samtools collate' or 'samtools sort -n' to achieve this.
|
||||
|
||||
Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.
|
||||
Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.
|
||||
Otherwise reads are designated READ_OTHER (both flags set or both flags unset).
|
||||
Run 'samtools flags' for more information on flag codes and meanings.
|
||||
|
||||
The index-format string describes how to parse the barcode and quality tags.
|
||||
It is made up of 'i' or 'n' followed by a length or '*'. For example:
|
||||
i14i8 The first 14 characters are index 1, the next 8 are index 2
|
||||
n8i14 Ignore the first 8 characters, and use the next 14 for index 1
|
||||
|
||||
If the tag contains a separator, then the numeric part can be replaced with
|
||||
'*' to mean 'read until the separator or end of tag', for example:
|
||||
i*i* Break the tag at the separator into index 1 and index 2
|
||||
n*i* Ignore the left part of the tag until the separator,
|
||||
then use the second part of the tag as index 1
|
||||
|
||||
Examples:
|
||||
To get just the paired reads in separate files, use:
|
||||
samtools fastq -1 pair1.fq -2 pair2.fq -0 /dev/null -s /dev/null -n in.bam
|
||||
|
||||
To get all non-supplementary/secondary reads in a single file, redirect
|
||||
the output:
|
||||
samtools fastq in.bam > all_reads.fq
|
||||
96
src/samtools/samtools_fasta/test.sh
Normal file
96
src/samtools/samtools_fasta/test.sh
Normal file
@@ -0,0 +1,96 @@
|
||||
#!/bin/bash
|
||||
|
||||
test_dir="${meta_resources_dir}/test_data"
|
||||
out_dir="${meta_resources_dir}/out_data"
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo ">>> Test 1: Convert all reads from a bam file to fasta format"
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/a.bam" \
|
||||
--output "$out_dir/a.fa"
|
||||
|
||||
echo ">>> Check if output file exists"
|
||||
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1
|
||||
|
||||
echo ">>> Check if output is empty"
|
||||
[ ! -s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1
|
||||
|
||||
echo ">>> Check if output matches expected output"
|
||||
diff "$out_dir/a.fa" "$test_dir/a.fa" ||
|
||||
(echo "Output file a.fa does not match expected output" && exit 1)
|
||||
|
||||
rm "$out_dir/a.fa"
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo ">>> Test 2: Convert all reads from a sam file to fasta format"
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/a.sam" \
|
||||
--output "$out_dir/a.fa"
|
||||
|
||||
echo ">>> Check if output file exists"
|
||||
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1
|
||||
|
||||
echo ">>> Check if output is empty"
|
||||
[ ! -s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1
|
||||
|
||||
echo ">>> Check if output matches expected output"
|
||||
diff "$out_dir/a.fa" "$test_dir/a.fa" ||
|
||||
(echo "Output file a.fa does not match expected output" && exit 1)
|
||||
|
||||
rm "$out_dir/a.fa"
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo ">>> Test 3: Output reads from bam file to separate files"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/a.bam" \
|
||||
--read1 "$out_dir/a.1.fa" \
|
||||
--read2 "$out_dir/a.2.fa" \
|
||||
--output "$out_dir/a.fa"
|
||||
|
||||
echo ">>> Check if output files exist"
|
||||
[ ! -f "$out_dir/a.1.fa" ] && echo "Output file a.1.fa does not exist" && exit 1
|
||||
[ ! -f "$out_dir/a.2.fa" ] && echo "Output file a.2.fa does not exist" && exit 1
|
||||
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1
|
||||
|
||||
echo ">>> Check if output files are empty"
|
||||
[ ! -s "$out_dir/a.1.fa" ] && echo "Output file a.1.fa is empty" && exit 1
|
||||
[ ! -s "$out_dir/a.2.fa" ] && echo "Output file a.2.fa is empty" && exit 1
|
||||
# output should be empty since input has no singleton reads
|
||||
|
||||
echo ">>> Check if output files match expected output"
|
||||
diff "$out_dir/a.1.fa" "$test_dir/a.1.fa" ||
|
||||
(echo "Output file a.1.fa does not match expected output" && exit 1)
|
||||
diff "$out_dir/a.2.fa" "$test_dir/a.2.fa" ||
|
||||
(echo "Output file a.2.fa does not match expected output" && exit 1)
|
||||
|
||||
rm "$out_dir/a.1.fa" "$out_dir/a.2.fa" "$out_dir/a.fa"
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo ">>> Test 4: Output only forward reads from bam file to fasta format"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$test_dir/a.sam" \
|
||||
--excl_flags "0x80" \
|
||||
--output "$out_dir/half.fa"
|
||||
|
||||
echo ">>> Check if output file exists"
|
||||
[ ! -f "$out_dir/half.fa" ] && echo "Output file half.fa does not exist" && exit 1
|
||||
|
||||
echo ">>> Check if output is empty"
|
||||
[ ! -s "$out_dir/half.fa" ] && echo "Output file half.fa is empty" && exit 1
|
||||
|
||||
echo ">>> Check if output matches expected output"
|
||||
diff "$out_dir/half.fa" "$test_dir/half.fa" ||
|
||||
(echo "Output file half.fa does not match expected output" && exit 1)
|
||||
|
||||
rm "$out_dir/half.fa"
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
6
src/samtools/samtools_fasta/test_data/a.1.fa
Normal file
6
src/samtools/samtools_fasta/test_data/a.1.fa
Normal file
@@ -0,0 +1,6 @@
|
||||
>a1
|
||||
AAAAAAAAAA
|
||||
>b1
|
||||
AAAAAAAAAA
|
||||
>c1
|
||||
AAAAAAAAAA
|
||||
6
src/samtools/samtools_fasta/test_data/a.2.fa
Normal file
6
src/samtools/samtools_fasta/test_data/a.2.fa
Normal file
@@ -0,0 +1,6 @@
|
||||
>a1
|
||||
AAAAAAAAAA
|
||||
>b1
|
||||
AAAAAAAAAA
|
||||
>c1
|
||||
AAAAAAAAAA
|
||||
BIN
src/samtools/samtools_fasta/test_data/a.bam
Normal file
BIN
src/samtools/samtools_fasta/test_data/a.bam
Normal file
Binary file not shown.
12
src/samtools/samtools_fasta/test_data/a.fa
Normal file
12
src/samtools/samtools_fasta/test_data/a.fa
Normal file
@@ -0,0 +1,12 @@
|
||||
>a1/1
|
||||
AAAAAAAAAA
|
||||
>b1/1
|
||||
AAAAAAAAAA
|
||||
>c1/1
|
||||
AAAAAAAAAA
|
||||
>a1/2
|
||||
AAAAAAAAAA
|
||||
>b1/2
|
||||
AAAAAAAAAA
|
||||
>c1/2
|
||||
AAAAAAAAAA
|
||||
7
src/samtools/samtools_fasta/test_data/a.sam
Normal file
7
src/samtools/samtools_fasta/test_data/a.sam
Normal file
@@ -0,0 +1,7 @@
|
||||
@SQ SN:xx LN:20
|
||||
a1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
|
||||
b1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
|
||||
c1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
|
||||
a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
|
||||
b1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
|
||||
c1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
|
||||
6
src/samtools/samtools_fasta/test_data/half.fa
Normal file
6
src/samtools/samtools_fasta/test_data/half.fa
Normal file
@@ -0,0 +1,6 @@
|
||||
>a1/1
|
||||
AAAAAAAAAA
|
||||
>b1/1
|
||||
AAAAAAAAAA
|
||||
>c1/1
|
||||
AAAAAAAAAA
|
||||
11
src/samtools/samtools_fasta/test_data/script.sh
Executable file
11
src/samtools/samtools_fasta/test_data/script.sh
Executable file
@@ -0,0 +1,11 @@
|
||||
#!/bin/bash
|
||||
|
||||
# dowload test data from snakemake wrapper
|
||||
if [ ! -d /tmp/fastq_source ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers.git /tmp/fastq_source
|
||||
fi
|
||||
|
||||
cp -r /tmp/fastq_source/bio/samtools/fastx/test/*.sam src/samtools/samtools_fastq/test_data/
|
||||
cp -r /tmp/fastq_source/bio/samtools/fastq/interleaved/test/mapped/*.bam src/samtools/samtools_fastq/test_data/
|
||||
cp -r /tmp/fastq_source/bio/samtools/fastq/interleaved/test/reads/*.fq src/samtools/samtools_fastq/test_data/
|
||||
cp -r /tmp/fastq_source/bio/samtools/fastq/separate/test/reads/*.fq src/samtools/samtools_fastq/test_data/
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
@@ -56,7 +58,7 @@ argument_groups:
|
||||
type: string
|
||||
description: |
|
||||
Specify a comma-separated list of tags to copy to the FASTQ header line, if they exist.
|
||||
TAGLIST can be blank or * to indicate all tags should be copied to the output. If using *,
|
||||
TAGLIST can be blank or `*` to indicate all tags should be copied to the output. If using `*`,
|
||||
be careful to quote it to avoid unwanted shell expansion.
|
||||
- name: --read1
|
||||
alternatives: -1
|
||||
@@ -91,35 +93,35 @@ argument_groups:
|
||||
type: integer
|
||||
description: |
|
||||
Only output alignments with all bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
|
||||
(i.e. /^0[0-7]+/).
|
||||
default: 0
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
|
||||
(i.e. /^0[0-7]+/). Default: `0`.
|
||||
example: 0
|
||||
- name: --excl_flags
|
||||
alternatives: -F
|
||||
type: string
|
||||
description: |
|
||||
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
|
||||
(i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and
|
||||
supplementary alignments.
|
||||
default: 0x900
|
||||
supplementary alignments. Default: `0x900`.
|
||||
example: "0x900"
|
||||
- name: --incl_flags
|
||||
alternatives: --rf
|
||||
type: string
|
||||
description: |
|
||||
Only output alignments with any bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0'
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with '0'
|
||||
(i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of
|
||||
flag names.
|
||||
default: 0
|
||||
flag names. Default: `0`.
|
||||
example: 0
|
||||
- name: --excl_flags_all
|
||||
alternatives: -G
|
||||
type: integer
|
||||
description: |
|
||||
Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified
|
||||
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
|
||||
(i.e. /^0[0-7]+/).
|
||||
default: 0
|
||||
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/).
|
||||
Default: `0`.
|
||||
example: 0
|
||||
- name: --aux_tag
|
||||
alternatives: -d
|
||||
type: string
|
||||
@@ -137,12 +139,13 @@ argument_groups:
|
||||
- name: --casava
|
||||
alternatives: -i
|
||||
type: boolean_true
|
||||
description: add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
|
||||
description: |
|
||||
Add Illumina Casava 1.8 format entry to header, for example: `1:N:0:ATCACG`.
|
||||
- name: --compression
|
||||
alternatives: -c
|
||||
type: integer
|
||||
description: set compression level when writing gz or bgzf fastq files.
|
||||
default: 0
|
||||
example: 0
|
||||
- name: --index1
|
||||
alternatives: --i1
|
||||
type: file
|
||||
@@ -153,20 +156,22 @@ argument_groups:
|
||||
description: write second index reads to FILE.
|
||||
- name: --barcode_tag
|
||||
type: string
|
||||
description: Auxiliary tag to find index reads in.
|
||||
default: BC
|
||||
description: |
|
||||
Auxiliary tag to find index reads in. Default: `BC`.
|
||||
example: "BC"
|
||||
- name: --quality_tag
|
||||
type: string
|
||||
description: Auxiliary tag to find index quality in.
|
||||
default: QT
|
||||
description: |
|
||||
Auxiliary tag to find index quality in. Default: `QT`.
|
||||
example: QT
|
||||
- name: --index_format
|
||||
type: string
|
||||
description: |
|
||||
string to describe how to parse the barcode and quality tags. For example:
|
||||
[i14i8]: the first 14 characters are index 1, the next 8 characters are index 2.
|
||||
[n8i14]: ignore the first 8 characters, and use the next 14 characters for index 1.
|
||||
* `i14i8`: the first 14 characters are index 1, the next 8 characters are index 2.
|
||||
* `n8i14`: ignore the first 8 characters, and use the next 14 characters for index 1.
|
||||
If the tag contains a separator, then the numeric part can be replaced with '*' to mean
|
||||
'read until the separator or end of tag', for example: [n*i*].
|
||||
'read until the separator or end of tag', for example: `n*i*`.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
|
||||
@@ -11,7 +11,14 @@ set -e
|
||||
[[ "$par_copy_tags" == "false" ]] && unset par_copy_tags
|
||||
[[ "$par_casava" == "false" ]] && unset par_casava
|
||||
|
||||
samtools fastq \
|
||||
if [[ "$meta_name" == "samtools_fasta" ]]; then
|
||||
subcommand=fasta
|
||||
elif [[ "$meta_name" == "samtools_fastq" ]]; then
|
||||
subcommand=fastq
|
||||
else
|
||||
echo "Unrecognized component name" && exit 1
|
||||
fi
|
||||
samtools "$subcommand" \
|
||||
${par_no_suffix:+-n} \
|
||||
${par_suffix:+-N} \
|
||||
${par_use_oq:+-O} \
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
@@ -30,10 +32,10 @@ argument_groups:
|
||||
- name: --coverage
|
||||
alternatives: -c
|
||||
type: integer
|
||||
description: |
|
||||
Coverage distribution min,max,step [1,1000,1].
|
||||
multiple: true
|
||||
multiple_sep: ','
|
||||
description: |
|
||||
Coverage distribution min;max;step. Default: [1, 1000, 1].
|
||||
example: [1, 1000, 1]
|
||||
- name: --remove_dups
|
||||
alternatives: -d
|
||||
type: boolean_true
|
||||
@@ -48,25 +50,25 @@ argument_groups:
|
||||
alternatives: -f
|
||||
type: string
|
||||
description: |
|
||||
Required flag, 0 for unset. See also `samtools flags`.
|
||||
default: "0"
|
||||
Required flag, 0 for unset. See also `samtools flags`. Default: `"0"`.
|
||||
example: "0"
|
||||
- name: --filtering_flag
|
||||
alternatives: -F
|
||||
type: string
|
||||
description: |
|
||||
Filtering flag, 0 for unset. See also `samtools flags`.
|
||||
default: "0"
|
||||
Filtering flag, 0 for unset. See also `samtools flags`. Default: `0`.
|
||||
example: "0"
|
||||
- name: --GC_depth
|
||||
type: double
|
||||
description: |
|
||||
The size of GC-depth bins (decreasing bin size increases memory requirement).
|
||||
default: 20000.0
|
||||
The size of GC-depth bins (decreasing bin size increases memory requirement). Default: `20000`.
|
||||
example: 20000.0
|
||||
- name: --insert_size
|
||||
alternatives: -i
|
||||
type: integer
|
||||
description: |
|
||||
Maximum insert size.
|
||||
default: 8000
|
||||
Maximum insert size. Default: `8000`.
|
||||
example: 8000
|
||||
- name: --id
|
||||
alternatives: -I
|
||||
type: string
|
||||
@@ -76,14 +78,14 @@ argument_groups:
|
||||
alternatives: -l
|
||||
type: integer
|
||||
description: |
|
||||
Include in the statistics only reads with the given read length.
|
||||
default: -1
|
||||
Include in the statistics only reads with the given read length. Default: `-1`.
|
||||
example: -1
|
||||
- name: --most_inserts
|
||||
alternatives: -m
|
||||
type: double
|
||||
description: |
|
||||
Report only the main part of inserts.
|
||||
default: 0.99
|
||||
Report only the main part of inserts. Default: `0.99`.
|
||||
example: 0.99
|
||||
- name: --split_prefix
|
||||
alternatives: -P
|
||||
type: string
|
||||
@@ -93,8 +95,8 @@ argument_groups:
|
||||
alternatives: -q
|
||||
type: integer
|
||||
description: |
|
||||
The BWA trimming parameter.
|
||||
default: 0
|
||||
The BWA trimming parameter. Default: `0`.
|
||||
example: 0
|
||||
- name: --ref_seq
|
||||
alternatives: -r
|
||||
type: file
|
||||
@@ -124,8 +126,8 @@ argument_groups:
|
||||
alternatives: -g
|
||||
type: integer
|
||||
description: |
|
||||
Only bases with coverage above this value will be included in the target percentage computation.
|
||||
default: 0
|
||||
Only bases with coverage above this value will be included in the target percentage computation. Default: `0`.
|
||||
example: 0
|
||||
- name: --input_fmt_option
|
||||
type: string
|
||||
description: |
|
||||
@@ -141,7 +143,7 @@ argument_groups:
|
||||
type: file
|
||||
description: |
|
||||
Output file.
|
||||
default: "out.txt"
|
||||
example: "out.txt"
|
||||
required: true
|
||||
direction: output
|
||||
|
||||
|
||||
@@ -10,6 +10,9 @@ set -e
|
||||
[[ "$par_sparse" == "false" ]] && unset par_sparse
|
||||
[[ "$par_remove_overlaps" == "false" ]] && unset par_remove_overlaps
|
||||
|
||||
# change the coverage input from X;X;X to X,X,X
|
||||
par_coverage=$(echo "$par_coverage" | tr ';' ',')
|
||||
|
||||
samtools stats \
|
||||
${par_coverage:+-c "$par_coverage"} \
|
||||
${par_remove_dups:+-d} \
|
||||
|
||||
@@ -17,7 +17,7 @@ echo ">>> Checking whether output is non-empty"
|
||||
[ ! -s "$test_dir/test.paired_end.sorted.txt" ] && echo "File 'test.paired_end.sorted.txt' is empty!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output is correct"
|
||||
# compare using diff, ignoring the line stating the command that was passed.
|
||||
# compare using diff, ignoring the line stating the command that was passed.
|
||||
diff <(grep -v "^# The command" "$test_dir/test.paired_end.sorted.txt") \
|
||||
<(grep -v "^# The command" "$test_dir/ref.paired_end.sorted.txt") || \
|
||||
(echo "Output file ref.paired_end.sorted.txt does not match expected output" && exit 1)
|
||||
|
||||
@@ -9,7 +9,9 @@ links:
|
||||
references:
|
||||
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
|
||||
license: MIT/Expat
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
|
||||
57
src/seqtk/seqtk_sample/config.vsh.yaml
Normal file
57
src/seqtk/seqtk_sample/config.vsh.yaml
Normal file
@@ -0,0 +1,57 @@
|
||||
name: seqtk_sample
|
||||
namespace: seqtk
|
||||
description: Subsamples sequences from FASTA/Q files.
|
||||
keywords: [sample, FASTA, FASTQ]
|
||||
links:
|
||||
repository: https://github.com/lh3/seqtk/tree/v1.4
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/jakub_majercik.yaml
|
||||
roles: [ author, maintainer ]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
type: file
|
||||
description: The input FASTA/Q file.
|
||||
required: true
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
type: file
|
||||
description: The output FASTA/Q file.
|
||||
required: true
|
||||
direction: output
|
||||
|
||||
- name: Options
|
||||
arguments:
|
||||
- name: --seed
|
||||
type: integer
|
||||
description: Seed for random generator.
|
||||
example: 42
|
||||
- name: --fraction_number
|
||||
type: double
|
||||
description: Fraction or number of sequences to sample.
|
||||
required: true
|
||||
example: 0.1
|
||||
- name: --two_pass_mode
|
||||
type: boolean_true
|
||||
description: Twice as slow but with much reduced memory
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: ../test_data
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
9
src/seqtk/seqtk_sample/help.txt
Normal file
9
src/seqtk/seqtk_sample/help.txt
Normal file
@@ -0,0 +1,9 @@
|
||||
```
|
||||
seqtk_subseq
|
||||
```
|
||||
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
|
||||
Options:
|
||||
-t TAB delimited output
|
||||
-s strand aware
|
||||
-l INT sequence line length [0]
|
||||
Note: Use 'samtools faidx' if only a few regions are intended.
|
||||
11
src/seqtk/seqtk_sample/script.sh
Normal file
11
src/seqtk/seqtk_sample/script.sh
Normal file
@@ -0,0 +1,11 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
seqtk sample \
|
||||
${par_two_pass_mode:+-2} \
|
||||
${par_seed:+-s "$par_seed"} \
|
||||
"$par_input" \
|
||||
"$par_fraction_number" \
|
||||
> "$par_output"
|
||||
104
src/seqtk/seqtk_sample/test.sh
Normal file
104
src/seqtk/seqtk_sample/test.sh
Normal file
@@ -0,0 +1,104 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
## VIASH START
|
||||
meta_executable="target/executable/seqtk/seqtk_sample"
|
||||
meta_resources_dir="src/seqtk"
|
||||
## VIASH END
|
||||
|
||||
#########################################################################################
|
||||
mkdir seqtk_sample_se
|
||||
cd seqtk_sample_se
|
||||
|
||||
echo "> Run seqtk_sample on fastq SE"
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
|
||||
--seed 42 \
|
||||
--fraction_number 3 \
|
||||
--output "sampled.fastq"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
if [ ! -f "sampled.fastq" ]; then
|
||||
echo ">> sampled.fastq does not exist"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo ">> Count number of samples"
|
||||
num_samples=$(grep -c '^@' sampled.fastq)
|
||||
if [ "$num_samples" -ne 3 ]; then
|
||||
echo ">> sampled.fastq does not contain 3 samples"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
#########################################################################################
|
||||
cd ..
|
||||
mkdir seqtk_sample_pe_number
|
||||
cd seqtk_sample_pe_number
|
||||
|
||||
echo ">> Run seqtk_sample on fastq.gz PE with number of reads"
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
|
||||
--seed 42 \
|
||||
--fraction_number 3 \
|
||||
--output "sampled_1.fastq"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
|
||||
--seed 42 \
|
||||
--fraction_number 3 \
|
||||
--output "sampled_2.fastq"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
|
||||
echo ">> One or both output files do not exist"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo ">> Compare reads"
|
||||
# Extract headers
|
||||
headers1=$(grep '^@' sampled_1.fastq | sed -e's/ 1$//' | sort)
|
||||
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
|
||||
|
||||
# Compare headers
|
||||
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
|
||||
|
||||
echo ">> Count number of samples"
|
||||
num_headers=$(echo "$headers1" | wc -l)
|
||||
if [ "$num_headers" -ne 3 ]; then
|
||||
echo ">> sampled_1.fastq does not contain 3 headers"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
#########################################################################################
|
||||
cd ..
|
||||
mkdir seqtk_sample_pe_fraction
|
||||
cd seqtk_sample_pe_fraction
|
||||
|
||||
echo ">> Run seqtk_sample on fastq.gz PE with fraction of reads"
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
|
||||
--seed 42 \
|
||||
--fraction_number 0.5 \
|
||||
--output "sampled_1.fastq"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
|
||||
--seed 42 \
|
||||
--fraction_number 0.5 \
|
||||
--output "sampled_2.fastq"
|
||||
|
||||
echo ">> Check if output exists"
|
||||
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
|
||||
echo ">> One or both output files do not exist"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo ">> Compare reads"
|
||||
# Extract headers
|
||||
headers1=$(grep '^@' sampled_1.fastq | sed -e's/ 1$//' | sort)
|
||||
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
|
||||
|
||||
# Compare headers
|
||||
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
|
||||
|
||||
78
src/seqtk/seqtk_subseq/config.vsh.yaml
Normal file
78
src/seqtk/seqtk_subseq/config.vsh.yaml
Normal file
@@ -0,0 +1,78 @@
|
||||
name: seqtk_subseq
|
||||
namespace: seqtk
|
||||
description: |
|
||||
Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence ids file) or a reg.bed (genomic regions file).
|
||||
keywords: [subseq, FASTA, FASTQ]
|
||||
links:
|
||||
repository: https://github.com/lh3/seqtk/tree/v1.4
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/theodoro_gasperin.yaml
|
||||
roles: [ author, maintainer ]
|
||||
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
direction: input
|
||||
description: The input FASTA/Q file.
|
||||
required: true
|
||||
example: input.fa
|
||||
|
||||
- name: "--name_list"
|
||||
type: file
|
||||
direction: input
|
||||
description: |
|
||||
List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
|
||||
required: true
|
||||
example: list.lst
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: "--output"
|
||||
alternatives: -o
|
||||
type: file
|
||||
direction: output
|
||||
description: The output FASTA/Q file.
|
||||
required: true
|
||||
default: output.fa
|
||||
|
||||
- name: Options
|
||||
arguments:
|
||||
- name: "--tab"
|
||||
alternatives: -t
|
||||
type: boolean_true
|
||||
description: TAB delimited output.
|
||||
|
||||
- name: "--strand_aware"
|
||||
alternatives: -s
|
||||
type: boolean_true
|
||||
description: Strand aware.
|
||||
|
||||
- name: "--sequence_line_length"
|
||||
alternatives: -l
|
||||
type: integer
|
||||
description: |
|
||||
Sequence line length of input fasta file. Default: 0.
|
||||
example: 0
|
||||
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
echo $(echo $(seqtk 2>&1) | sed -n 's/.*\(Version: [^ ]*\).*/\1/p') > /var/software_versions.txt
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
9
src/seqtk/seqtk_subseq/help.txt
Normal file
9
src/seqtk/seqtk_subseq/help.txt
Normal file
@@ -0,0 +1,9 @@
|
||||
```bash
|
||||
seqtk subseq
|
||||
```
|
||||
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
|
||||
Options:
|
||||
-t TAB delimited output
|
||||
-s strand aware
|
||||
-l INT sequence line length [0]
|
||||
Note: Use 'samtools faidx' if only a few regions are intended.
|
||||
15
src/seqtk/seqtk_subseq/script.sh
Normal file
15
src/seqtk/seqtk_subseq/script.sh
Normal file
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
[[ "$par_tab" == "false" ]] && unset par_tab
|
||||
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware
|
||||
|
||||
seqtk subseq \
|
||||
${par_tab:+-t} \
|
||||
${par_strand_aware:+-s} \
|
||||
${par_sequence_line_length:+-l "$par_sequence_line_length"} \
|
||||
"$par_input" \
|
||||
"$par_name_list" \
|
||||
> "$par_output"
|
||||
182
src/seqtk/seqtk_subseq/test.sh
Normal file
182
src/seqtk/seqtk_subseq/test.sh
Normal file
@@ -0,0 +1,182 @@
|
||||
#!/bin/bash
|
||||
|
||||
# exit on error
|
||||
set -e
|
||||
|
||||
## VIASH START
|
||||
meta_executable="target/executable/seqtk/seqtk_subseq"
|
||||
meta_resources_dir="src/seqtk"
|
||||
## VIASH END
|
||||
|
||||
# Create directories for tests
|
||||
echo "Creating Test Data..."
|
||||
mkdir test_data
|
||||
|
||||
# Create and populate input.fasta
|
||||
cat > "test_data/input.fasta" <<EOL
|
||||
>KU562861.1
|
||||
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
|
||||
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
|
||||
>GU056837.1
|
||||
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
|
||||
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
|
||||
>CP097510.1
|
||||
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
|
||||
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
|
||||
>JAMFTS010000002.1
|
||||
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
|
||||
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
|
||||
>MH150936.1
|
||||
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
|
||||
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
|
||||
EOL
|
||||
|
||||
# Update id.list with new entries
|
||||
cat > "test_data/id.list" <<EOL
|
||||
KU562861.1
|
||||
MH150936.1
|
||||
EOL
|
||||
|
||||
# Create and populate reg.bed
|
||||
cat > "test_data/reg.bed" <<EOL
|
||||
KU562861.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")+$(echo -e "\n")
|
||||
MH150936.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")-
|
||||
EOL
|
||||
|
||||
#########################################################################################
|
||||
# Run basic test
|
||||
mkdir test1
|
||||
cd test1
|
||||
|
||||
echo "> Run seqtk_subseq on FASTA/Q file"
|
||||
"$meta_executable" \
|
||||
--input "../test_data/input.fasta" \
|
||||
--name_list "../test_data/id.list" \
|
||||
--output "sub_sample.fq"
|
||||
|
||||
expected_output_basic=">KU562861.1
|
||||
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
|
||||
>MH150936.1
|
||||
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
|
||||
output_basic=$(cat sub_sample.fq)
|
||||
|
||||
if [ "$output_basic" != "$expected_output_basic" ]; then
|
||||
echo "Test failed"
|
||||
echo "Expected:"
|
||||
echo "$expected_output_basic"
|
||||
echo "Got:"
|
||||
echo "$output_basic"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
#########################################################################################
|
||||
# Run reg.bed as name list input test
|
||||
cd ..
|
||||
mkdir test2
|
||||
cd test2
|
||||
|
||||
echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
|
||||
"$meta_executable" \
|
||||
--input "../test_data/input.fasta" \
|
||||
--name_list "../test_data/reg.bed" \
|
||||
--output "sub_sample.fq"
|
||||
|
||||
expected_output_basic=">KU562861.1:11-20
|
||||
AGTGTTCGAG
|
||||
>MH150936.1:11-20
|
||||
TGAAAACTTT"
|
||||
output_basic=$(cat sub_sample.fq)
|
||||
|
||||
if [ "$output_basic" != "$expected_output_basic" ]; then
|
||||
echo "Test failed"
|
||||
echo "Expected:"
|
||||
echo "$expected_output_basic"
|
||||
echo "Got:"
|
||||
echo "$output_basic"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
#########################################################################################
|
||||
# Run tab option output test
|
||||
cd ..
|
||||
mkdir test3
|
||||
cd test3
|
||||
|
||||
echo "> Run seqtk_subseq with TAB option"
|
||||
"$meta_executable" \
|
||||
--tab \
|
||||
--input "../test_data/input.fasta" \
|
||||
--name_list "../test_data/reg.bed" \
|
||||
--output "sub_sample.fq"
|
||||
|
||||
expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG\nMH150936.1\t11\tTGAAAACTTT'
|
||||
output_tabular=$(cat sub_sample.fq)
|
||||
|
||||
if [ "$output_tabular" != "$expected_output_tabular" ]; then
|
||||
echo "Test failed"
|
||||
echo "Expected:"
|
||||
echo "$expected_output_tabular"
|
||||
echo "Got:"
|
||||
echo "$output_tabular"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
#########################################################################################
|
||||
# Run line option output test
|
||||
cd ..
|
||||
mkdir test4
|
||||
cd test4
|
||||
|
||||
echo "> Run seqtk_subseq with line length option"
|
||||
"$meta_executable" \
|
||||
--sequence_line_length 5 \
|
||||
--input "../test_data/input.fasta" \
|
||||
--name_list "../test_data/reg.bed" \
|
||||
--output "sub_sample.fq"
|
||||
|
||||
expected_output_wrapped=">KU562861.1:11-20
|
||||
AGTGT
|
||||
TCGAG
|
||||
>MH150936.1:11-20
|
||||
TGAAA
|
||||
ACTTT"
|
||||
output_wrapped=$(cat sub_sample.fq)
|
||||
|
||||
if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
|
||||
echo "Test failed"
|
||||
echo "Expected:"
|
||||
echo "$expected_output_wrapped"
|
||||
echo "Got:"
|
||||
echo "$output_wrapped"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
#########################################################################################
|
||||
# Run Strand Aware option output test
|
||||
cd ..
|
||||
mkdir test5
|
||||
cd test5
|
||||
|
||||
echo "> Run seqtk_subseq with strand aware option"
|
||||
"$meta_executable" \
|
||||
--strand_aware \
|
||||
--input "../test_data/input.fasta" \
|
||||
--name_list "../test_data/reg.bed" \
|
||||
--output "sub_sample.fq"
|
||||
|
||||
expected_output_wrapped=">KU562861.1:11-20
|
||||
AGTGTTCGAG
|
||||
>MH150936.1:11-20
|
||||
AAAGTTTTCA"
|
||||
output_wrapped=$(cat sub_sample.fq)
|
||||
|
||||
if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
|
||||
echo "Test failed"
|
||||
echo "Expected:"
|
||||
echo "$expected_output_wrapped"
|
||||
echo "Got:"
|
||||
echo "$output_wrapped"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "All tests succeeded!"
|
||||
BIN
src/seqtk/test_data/reads/a.1.fastq.gz
Normal file
BIN
src/seqtk/test_data/reads/a.1.fastq.gz
Normal file
Binary file not shown.
BIN
src/seqtk/test_data/reads/a.2.fastq.gz
Normal file
BIN
src/seqtk/test_data/reads/a.2.fastq.gz
Normal file
Binary file not shown.
4
src/seqtk/test_data/reads/a.fastq
Normal file
4
src/seqtk/test_data/reads/a.fastq
Normal file
@@ -0,0 +1,4 @@
|
||||
@1
|
||||
ACGGCAT
|
||||
+
|
||||
!!!!!!!
|
||||
BIN
src/seqtk/test_data/reads/a.fastq.gz
Normal file
BIN
src/seqtk/test_data/reads/a.fastq.gz
Normal file
Binary file not shown.
1
src/seqtk/test_data/reads/id.list
Normal file
1
src/seqtk/test_data/reads/id.list
Normal file
@@ -0,0 +1 @@
|
||||
1
|
||||
9
src/seqtk/test_data/script.sh
Executable file
9
src/seqtk/test_data/script.sh
Executable file
@@ -0,0 +1,9 @@
|
||||
# clone repo
|
||||
if [ ! -d /tmp/snakemake-wrappers ]; then
|
||||
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
|
||||
fi
|
||||
|
||||
# copy test data
|
||||
cp -r /tmp/snakemake-wrappers/bio/seqtk/test/* src/seqtk/test_data
|
||||
|
||||
rm src/seqtk/test_data/Snakefile
|
||||
File diff suppressed because it is too large
Load Diff
@@ -11,6 +11,11 @@ references:
|
||||
license: MIT
|
||||
requirements:
|
||||
commands: [ STAR, python, ps, zcat, bzcat ]
|
||||
authors:
|
||||
- __merge__: /src/_authors/angela_o_pisco.yaml
|
||||
roles: [ author ]
|
||||
- __merge__: /src/_authors/robrecht_cannoodt.yaml
|
||||
roles: [ author, maintainer ]
|
||||
# manually taking care of the main input and output arguments
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
@@ -113,6 +118,8 @@ engines:
|
||||
rm -rf /tmp/STAR-${STAR_VERSION} /tmp/${STAR_VERSION}.zip && \
|
||||
apt-get --purge autoremove -y ${PACKAGES} && \
|
||||
apt-get clean
|
||||
- type: python
|
||||
packages: [ pyyaml ]
|
||||
- type: docker
|
||||
run: |
|
||||
STAR --version | sed 's#\(.*\)#star: "\1"#' > /var/software_versions.txt
|
||||
|
||||
@@ -2,6 +2,7 @@ import tempfile
|
||||
import subprocess
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
import yaml
|
||||
|
||||
## VIASH START
|
||||
par = {
|
||||
@@ -18,10 +19,20 @@ par = {
|
||||
}
|
||||
meta = {
|
||||
"cpus": 8,
|
||||
"temp_dir": "/tmp"
|
||||
"temp_dir": "/tmp",
|
||||
"config": "target/executable/star/star_align_reads/.config.vsh.yaml",
|
||||
}
|
||||
## VIASH END
|
||||
|
||||
# read config
|
||||
with open(meta["config"], 'r') as stream:
|
||||
config = yaml.safe_load(stream)
|
||||
all_arguments = {
|
||||
arg["name"].lstrip('-'): arg
|
||||
for argument_group in config["argument_groups"]
|
||||
for arg in argument_group["arguments"]
|
||||
}
|
||||
|
||||
##################################################
|
||||
# check and process SE / PE R1 input files
|
||||
input_r1 = par["input"]
|
||||
@@ -87,8 +98,13 @@ with tempfile.TemporaryDirectory(prefix="star-", dir=meta["temp_dir"], ignore_cl
|
||||
cmd_args = [ "STAR" ]
|
||||
for name, value in par.items():
|
||||
if value is not None:
|
||||
if name in all_arguments:
|
||||
arg_info = all_arguments[name].get("info", {})
|
||||
cli_name = arg_info.get("orig_name", f"--{name}")
|
||||
else:
|
||||
cli_name = f"--{name}"
|
||||
val_to_add = value if isinstance(value, list) else [value]
|
||||
cmd_args.extend([f"--{name}"] + [str(x) for x in val_to_add])
|
||||
cmd_args.extend([cli_name] + [str(x) for x in val_to_add])
|
||||
print("", flush=True)
|
||||
|
||||
# run command
|
||||
|
||||
@@ -7,35 +7,34 @@ meta_executable="target/docker/star/star_align_reads/star_align_reads"
|
||||
meta_resources_dir="src/star/star_align_reads"
|
||||
## VIASH END
|
||||
|
||||
#########################################################################################
|
||||
|
||||
#############################################
|
||||
# helper functions
|
||||
assert_file_exists() {
|
||||
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
|
||||
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
|
||||
}
|
||||
assert_file_doesnt_exist() {
|
||||
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
|
||||
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_empty() {
|
||||
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
|
||||
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
|
||||
}
|
||||
assert_file_not_empty() {
|
||||
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
|
||||
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
|
||||
}
|
||||
assert_file_contains() {
|
||||
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
|
||||
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains() {
|
||||
grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
|
||||
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
assert_file_contains_regex() {
|
||||
grep -q -E "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
|
||||
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
|
||||
}
|
||||
assert_file_not_contains_regex() {
|
||||
grep -q -E "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
|
||||
grep -q -E "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
|
||||
}
|
||||
#############################################
|
||||
|
||||
#########################################################################################
|
||||
echo "> Prepare test data"
|
||||
|
||||
cat > reads_R1.fastq <<'EOF'
|
||||
@@ -89,14 +88,14 @@ cd star_align_reads_se
|
||||
echo "> Run star_align_reads on SE"
|
||||
"$meta_executable" \
|
||||
--input "../reads_R1.fastq" \
|
||||
--genomeDir "../index/" \
|
||||
--genome_dir "../index/" \
|
||||
--aligned_reads "output.sam" \
|
||||
--log "log.txt" \
|
||||
--outReadsUnmapped "Fastx" \
|
||||
--out_reads_unmapped "Fastx" \
|
||||
--unmapped "unmapped.sam" \
|
||||
--quantMode "TranscriptomeSAM;GeneCounts" \
|
||||
--quant_mode "TranscriptomeSAM;GeneCounts" \
|
||||
--reads_per_gene "reads_per_gene.tsv" \
|
||||
--outSJtype Standard \
|
||||
--out_sj_type Standard \
|
||||
--splice_junctions "splice_junctions.tsv" \
|
||||
--reads_aligned_to_transcriptome "transcriptome_aligned.bam" \
|
||||
${meta_cpus:+---cpus $meta_cpus}
|
||||
@@ -144,10 +143,10 @@ echo ">> Run star_align_reads on PE"
|
||||
"$meta_executable" \
|
||||
--input ../reads_R1.fastq \
|
||||
--input_r2 ../reads_R2.fastq \
|
||||
--genomeDir ../index/ \
|
||||
--genome_dir ../index/ \
|
||||
--aligned_reads output.bam \
|
||||
--log log.txt \
|
||||
--outReadsUnmapped Fastx \
|
||||
--out_reads_unmapped Fastx \
|
||||
--unmapped unmapped_r1.bam \
|
||||
--unmapped_r2 unmapped_r2.bam \
|
||||
${meta_cpus:+---cpus $meta_cpus}
|
||||
|
||||
@@ -14,6 +14,14 @@ param_txt <- iconv(param_txt, "UTF-8", "ASCII//TRANSLIT")
|
||||
dev_begin <- grep("#####UnderDevelopment_begin", param_txt)
|
||||
dev_end <- grep("#####UnderDevelopment_end", param_txt)
|
||||
|
||||
camel_case_to_snake_case <- function(x) {
|
||||
x %>%
|
||||
str_replace_all("([A-Z][A-Z][A-Z]*)", "_\\1_") %>%
|
||||
str_replace_all("([a-z])([A-Z])", "\\1_\\2") %>%
|
||||
str_to_lower() %>%
|
||||
str_replace_all("_$", "")
|
||||
}
|
||||
|
||||
# strip development sections
|
||||
nondev_ix <- unlist(map2(c(1, dev_end + 1), c(dev_begin - 1, length(param_txt)), function(i, j) {
|
||||
if (i >= 1 && i < j) {
|
||||
@@ -128,9 +136,8 @@ out2 <- out %>%
|
||||
# remove arguments that are related to a different runmode
|
||||
filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description)) %>%
|
||||
filter(!grepl("--runMode", group_name) | grepl("--runMode alignReads", group_name)) %>%
|
||||
filter(!grepl("STARsolo", group_name)) %>%
|
||||
mutate(
|
||||
viash_arg = paste0("--", name),
|
||||
viash_arg = paste0("--", camel_case_to_snake_case(name)),
|
||||
type_step1 = type %>%
|
||||
str_replace_all(".*(int, string|string|int|real|double)\\(?(s?).*", "\\1\\2"),
|
||||
viash_type = type_map[gsub("(int, string|string|int|real|double).*", "\\1", type_step1)],
|
||||
@@ -155,28 +162,41 @@ out2 <- out %>%
|
||||
group_name = gsub(" - .*", "", group_name),
|
||||
required = ifelse(name %in% required_args, TRUE, NA)
|
||||
)
|
||||
print(out2, n = 200)
|
||||
out2 %>% mutate(i = row_number()) %>%
|
||||
# filter(is.na(default_step1) != is.na(viash_default)) %>%
|
||||
|
||||
# change references to argument names
|
||||
out3 <- out2
|
||||
for (i in seq_len(nrow(out2))) {
|
||||
orig_name <- paste0("--", out2$name[[i]])
|
||||
new_name <- out2$viash_arg[[i]]
|
||||
out3$description <- str_replace_all(out3$description, orig_name, new_name)
|
||||
}
|
||||
|
||||
# sanity checks
|
||||
out3 %>% select(name, viash_arg) %>% as.data.frame()
|
||||
print(out3, n = 200)
|
||||
out3 %>%
|
||||
mutate(i = row_number()) %>%
|
||||
select(-group_name, -description)
|
||||
out3 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))
|
||||
|
||||
out2 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))
|
||||
|
||||
argument_groups <- map(unique(out2$group_name), function(group_name) {
|
||||
args <- out2 %>%
|
||||
# create argument groups
|
||||
argument_groups <- map(unique(out3$group_name), function(group_name) {
|
||||
args <- out3 %>%
|
||||
filter(group_name == !!group_name) %>%
|
||||
pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, ...) {
|
||||
li <- lst(
|
||||
pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, name, ...) {
|
||||
li <- list(
|
||||
name = viash_arg,
|
||||
type = viash_type,
|
||||
description = description
|
||||
description = description,
|
||||
info = list(
|
||||
orig_name = paste0("--", name)
|
||||
)
|
||||
)
|
||||
if (all(!is.na(viash_default))) {
|
||||
li$example <- viash_default
|
||||
}
|
||||
if (!is.na(multiple) && multiple) {
|
||||
li$multiple <- multiple
|
||||
li$multiple_sep <- ";"
|
||||
}
|
||||
if (!is.na(required) && required) {
|
||||
li$required <- required
|
||||
@@ -186,4 +206,10 @@ argument_groups <- map(unique(out2$group_name), function(group_name) {
|
||||
list(name = group_name, arguments = args)
|
||||
})
|
||||
|
||||
yaml::write_yaml(list(argument_groups = argument_groups), yaml_file)
|
||||
yaml::write_yaml(
|
||||
list(argument_groups = argument_groups),
|
||||
yaml_file,
|
||||
handlers = list(
|
||||
logical = yaml::verbatim_logical
|
||||
)
|
||||
)
|
||||
|
||||
@@ -11,75 +11,74 @@ references:
|
||||
license: MIT
|
||||
requirements:
|
||||
commands: [ STAR ]
|
||||
|
||||
authors:
|
||||
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--genomeFastaFiles"
|
||||
- name: "--genome_fasta_files"
|
||||
type: file
|
||||
description: |
|
||||
Path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped.
|
||||
required: true
|
||||
multiple: yes
|
||||
multiple_sep: ;
|
||||
- name: "--sjdbGTFfile"
|
||||
multiple: true
|
||||
- name: "--sjdb_gtf_file"
|
||||
type: file
|
||||
description: Path to the GTF file with annotations
|
||||
- name: --sjdbOverhang
|
||||
- name: --sjdb_overhang
|
||||
type: integer
|
||||
description: Length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
|
||||
example: 100
|
||||
- name: --sjdbGTFchrPrefix
|
||||
- name: --sjdb_gtf_chr_prefix
|
||||
type: string
|
||||
description: Prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSMEBL annotations with UCSC genomes)
|
||||
- name: --sjdbGTFfeatureExon
|
||||
- name: --sjdb_gtf_feature_exon
|
||||
type: string
|
||||
description: Feature type in GTF file to be used as exons for building transcripts
|
||||
example: exon
|
||||
- name: --sjdbGTFtagExonParentTranscript
|
||||
- name: --sjdb_gtf_tag_exon_parent_transcript
|
||||
type: string
|
||||
description: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)
|
||||
example: transcript_id
|
||||
- name: --sjdbGTFtagExonParentGene
|
||||
- name: --sjdb_gtf_tag_exon_parent_gene
|
||||
type: string
|
||||
description: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)
|
||||
example: gene_id
|
||||
- name: --sjdbGTFtagExonParentGeneName
|
||||
- name: --sjdb_gtf_tag_exon_parent_gene_name
|
||||
type: string
|
||||
description: GTF attribute name for parent gene name
|
||||
example: gene_name
|
||||
multiple: yes
|
||||
multiple_sep: ;
|
||||
- name: --sjdbGTFtagExonParentGeneType
|
||||
multiple: true
|
||||
- name: --sjdb_gtf_tag_exon_parent_gene_type
|
||||
type: string
|
||||
description: GTF attribute name for parent gene type
|
||||
example:
|
||||
- gene_type
|
||||
- gene_biotype
|
||||
multiple: yes
|
||||
multiple_sep: ;
|
||||
- name: --limitGenomeGenerateRAM
|
||||
multiple: true
|
||||
- name: --limit_genome_generate_ram
|
||||
type: long
|
||||
description: Maximum available RAM (bytes) for genome generation
|
||||
example: '31000000000'
|
||||
- name: --genomeSAindexNbases
|
||||
example: 31000000000
|
||||
- name: --genome_sa_index_nbases
|
||||
type: integer
|
||||
description: Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, this parameter must be scaled down to min(14, log2(GenomeLength)/2 - 1).
|
||||
example: 14
|
||||
- name: --genomeChrBinNbits
|
||||
- name: --genome_chr_bin_nbits
|
||||
type: integer
|
||||
description: Defined as log2(chrBin), where chrBin is the size of the bins for genome storage. Each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).
|
||||
example: 18
|
||||
- name: --genomeSAsparseD
|
||||
- name: --genome_sa_sparse_d
|
||||
type: integer
|
||||
min: 0
|
||||
example: 1
|
||||
description: Suffux array sparsity, i.e. distance between indices. Use bigger numbers to decrease needed RAM at the cost of mapping speed reduction.
|
||||
- name: --genomeSuffixLengthMax
|
||||
- name: --genome_suffix_length_max
|
||||
type: integer
|
||||
description: Maximum length of the suffixes, has to be longer than read length. Use -1 for infinite length.
|
||||
example: -1
|
||||
- name: --genomeTransformType
|
||||
- name: --genome_transform_type
|
||||
type: string
|
||||
description: |
|
||||
Type of genome transformation
|
||||
@@ -87,7 +86,7 @@ argument_groups:
|
||||
Haploid ... replace reference alleles with alternative alleles from VCF file (e.g. consensus allele)
|
||||
Diploid ... create two haplotypes for each chromosome listed in VCF file, for genotypes 1|2, assumes perfect phasing (e.g. personal genome)
|
||||
example: None
|
||||
- name: --genomeTransformVCF
|
||||
- name: --genome_transform_vcf
|
||||
type: file
|
||||
description: path to VCF file for genome transformation
|
||||
|
||||
|
||||
@@ -10,20 +10,20 @@ mkdir -p $par_index
|
||||
STAR \
|
||||
--runMode genomeGenerate \
|
||||
--genomeDir $par_index \
|
||||
--genomeFastaFiles $par_genomeFastaFiles \
|
||||
--genomeFastaFiles $par_genome_fasta_files \
|
||||
${meta_cpus:+--runThreadN "${meta_cpus}"} \
|
||||
${par_sjdbGTFfile:+--sjdbGTFfile "${par_sjdbGTFfile}"} \
|
||||
${par_sjdb_gtf_file:+--sjdbGTFfile "${par_sjdb_gtf_file}"} \
|
||||
${par_sjdbOverhang:+--sjdbOverhang "${par_sjdbOverhang}"} \
|
||||
${par_genomeSAindexNbases:+--genomeSAindexNbases "${par_genomeSAindexNbases}"} \
|
||||
${par_sjdbGTFchrPrefix:+--sjdbGTFchrPrefix "${par_sjdbGTFchrPrefix}"} \
|
||||
${par_sjdbGTFfeatureExon:+--sjdbGTFfeatureExon "${par_sjdbGTFfeatureExon}"} \
|
||||
${par_sjdbGTFtagExonParentTranscript:+--sjdbGTFtagExonParentTranscript "${par_sjdbGTFtagExonParentTranscript}"} \
|
||||
${par_sjdbGTFtagExonParentGene:+--sjdbGTFtagExonParentGene "${par_sjdbGTFtagExonParentGene}"} \
|
||||
${par_sjdbGTFtagExonParentGeneName:+--sjdbGTFtagExonParentGeneName "${par_sjdbGTFtagExonParentGeneName}"} \
|
||||
${par_sjdbGTFtagExonParentGeneType:+--sjdbGTFtagExonParentGeneType "${sjdbGTFtagExonParentGeneType}"} \
|
||||
${par_limitGenomeGenerateRAM:+--limitGenomeGenerateRAM "${par_limitGenomeGenerateRAM}"} \
|
||||
${par_genomeChrBinNbits:+--genomeChrBinNbits "${par_genomeChrBinNbits}"} \
|
||||
${par_genomeSAsparseD:+--genomeSAsparseD "${par_genomeSAsparseD}"} \
|
||||
${par_genomeSuffixLengthMax:+--genomeSuffixLengthMax "${par_genomeSuffixLengthMax}"} \
|
||||
${par_genomeTransformType:+--genomeTransformType "${par_genomeTransformType}"} \
|
||||
${par_genomeTransformVCF:+--genomeTransformVCF "${par_genomeTransformVCF}"} \
|
||||
${par_genome_sa_index_nbases:+--genomeSAindexNbases "${par_genome_sa_index_nbases}"} \
|
||||
${par_sjdb_gtf_chr_prefix:+--sjdbGTFchrPrefix "${par_sjdb_gtf_chr_prefix}"} \
|
||||
${par_sjdb_gtf_feature_exon:+--sjdbGTFfeatureExon "${par_sjdb_gtf_feature_exon}"} \
|
||||
${par_sjdb_gtf_tag_exon_parent_transcript:+--sjdbGTFtag_exon_parent_transcript "${par_sjdb_gtf_tag_exon_parent_transcript}"} \
|
||||
${par_sjdb_gtf_tag_exon_parent_gene:+--sjdbGTFtag_exon_parent_gene "${par_sjdb_gtf_tag_exon_parent_gene}"} \
|
||||
${par_sjdb_gtf_tag_exon_parent_geneName:+--sjdbGTFtag_exon_parent_geneName "${par_sjdb_gtf_tag_exon_parent_geneName}"} \
|
||||
${par_sjdb_gtf_tag_exon_parent_geneType:+--sjdbGTFtag_exon_parent_geneType "${sjdbGTFtag_exon_parent_geneType}"} \
|
||||
${par_limit_genome_generate_ram:+--limitGenomeGenerateRAM "${par_limit_genome_generate_ram}"} \
|
||||
${par_genome_chr_bin_nbits:+--genomeChrBinNbits "${par_genome_chr_bin_nbits}"} \
|
||||
${par_genome_sa_sparse_d:+--genomeSAsparseD "${par_genome_sa_sparse_d}"} \
|
||||
${par_genome_suffix_length_max:+--genomeSuffixLengthMax "${par_genome_suffix_length_max}"} \
|
||||
${par_genome_transform_type:+--genomeTransformType "${par_genome_transform_type}"} \
|
||||
${par_genome_transform_vcf:+--genomeTransformVCF "${par_genome_transform_vCF}"} \
|
||||
|
||||
@@ -27,9 +27,9 @@ echo "> Generate index"
|
||||
"$meta_executable" \
|
||||
${meta_cpus:+---cpus $meta_cpus} \
|
||||
--index "star_index/" \
|
||||
--genomeFastaFiles "genome.fasta" \
|
||||
--sjdbGTFfile "genes.gtf" \
|
||||
--genomeSAindexNbases 2
|
||||
--genome_fasta_files "genome.fasta" \
|
||||
--sjdb_gtf_file "genes.gtf" \
|
||||
--genome_sa_index_nbases 4
|
||||
|
||||
files=("Genome" "Log.out" "SA" "SAindex" "chrLength.txt" "chrName.txt" "chrNameLength.txt" "chrStart.txt" "exonGeTrInfo.tab" "exonInfo.tab" "geneInfo.tab" "genomeParameters.txt" "sjdbInfo.txt" "sjdbList.fromGTF.out.tab" "sjdbList.out.tab" "transcriptInfo.tab")
|
||||
|
||||
|
||||
305
src/umi_tools/umi_tools_dedup/config.vsh.yaml
Normal file
305
src/umi_tools/umi_tools_dedup/config.vsh.yaml
Normal file
@@ -0,0 +1,305 @@
|
||||
name: umi_tools_dedup
|
||||
namespace: umi_tools
|
||||
description: |
|
||||
Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read.
|
||||
keywords: [umi_tools, deduplication, dedup]
|
||||
links:
|
||||
homepage: https://umi-tools.readthedocs.io/en/latest/
|
||||
documentation: https://umi-tools.readthedocs.io/en/latest/reference/dedup.html
|
||||
repository: https://github.com/CGATOxford/UMI-tools
|
||||
references:
|
||||
doi: 10.1101/gr.209601.116
|
||||
license: MIT
|
||||
authors:
|
||||
- __merge__: /src/_authors/emma_rousseau.yaml
|
||||
roles: [ author, maintainer ]
|
||||
argument_groups:
|
||||
- name: Inputs
|
||||
arguments:
|
||||
- name: --input
|
||||
alternatives: --stdin
|
||||
type: file
|
||||
description: Input BAM or SAM file. Use --in_sam to specify SAM format.
|
||||
required: true
|
||||
- name: --in_sam
|
||||
type: boolean_true
|
||||
description: |
|
||||
By default, inputs are assumed to be in BAM format. Use this options to specify the use of SAM
|
||||
format for input.
|
||||
- name: --bai
|
||||
type: file
|
||||
description: BAM index
|
||||
- name: --random_seed
|
||||
type: integer
|
||||
description: Random seed to initialize number generator with.
|
||||
|
||||
- name: Outputs
|
||||
arguments:
|
||||
- name: --output
|
||||
alternatives: --stdout
|
||||
type: file
|
||||
description: Deduplicated BAM file.
|
||||
required: true
|
||||
direction: output
|
||||
- name: --out_sam
|
||||
type: boolean_true
|
||||
description: |
|
||||
By default, outputa are written in BAM format. Use this options to specify the use of SAM format
|
||||
for output.
|
||||
- name: --paired
|
||||
type: boolean_true
|
||||
description: |
|
||||
BAM is paired end - output both read pairs. This will also force the use of the template length
|
||||
to determine reads with the same mapping coordinates.
|
||||
- name: --output_stats
|
||||
type: string
|
||||
description: |
|
||||
Generate files containing UMI based deduplication statistics files with this prefix in the file names.
|
||||
- name: --extract_umi_method
|
||||
type: string
|
||||
choices: [read_id, tag, umis]
|
||||
description: |
|
||||
Specify the method by which the barcodes were encoded in the read.
|
||||
The options are:
|
||||
* read_id (default)
|
||||
* tag
|
||||
* umis
|
||||
example: "read_id"
|
||||
- name: --umi_tag
|
||||
type: string
|
||||
description: |
|
||||
The tag containing the UMI sequence. This is only required if the extract_umi_method is set to tag.
|
||||
- name: --umi_separator
|
||||
type: string
|
||||
description: |
|
||||
The separator used to separate the UMI from the read sequence. This is only required if the
|
||||
extract_umi_method is set to id_read. Default: `_`.
|
||||
example: '_'
|
||||
- name: --umi_tag_split
|
||||
type: string
|
||||
description: Separate the UMI in tag by <SPLIT> and take the first element.
|
||||
- name: --umi_tag_delimiter
|
||||
type: string
|
||||
description: Separate the UMI in by <DELIMITER> and concatenate the elements.
|
||||
- name: --cell_tag
|
||||
type: string
|
||||
description: |
|
||||
The tag containing the cell barcode sequence. This is only required if the extract_umi_method
|
||||
is set to tag.
|
||||
- name: --cell_tag_split
|
||||
type: string
|
||||
description: Separate the cell barcode in tag by <SPLIT> and take the first element.
|
||||
- name: --cell_tag_delimiter
|
||||
type: string
|
||||
description: Separate the cell barcode in by <DELIMITER> and concatenate the elements.
|
||||
|
||||
- name: Grouping Options
|
||||
arguments:
|
||||
- name: --method
|
||||
type: string
|
||||
choices: [unique, percentile, cluster, adjacency, directional]
|
||||
description: |
|
||||
The method to use for grouping reads.
|
||||
The options are:
|
||||
* unique
|
||||
* percentile
|
||||
* cluster
|
||||
* adjacency
|
||||
* directional (default)
|
||||
example: "directional"
|
||||
- name: --edit_distance_threshold
|
||||
type: integer
|
||||
description: |
|
||||
For the adjacency and cluster methods the threshold for the edit distance to connect two
|
||||
UMIs in the network can be increased. The default value of 1 works best unless the UMI is
|
||||
very long (>14bp). Default: `1`.
|
||||
example: 1
|
||||
- name: --spliced_is_unique
|
||||
type: boolean_true
|
||||
description: |
|
||||
Causes two reads that start in the same position on the same strand and having the same UMI
|
||||
to be considered unique if one is spliced and the other is not. (Uses the 'N' cigar operation
|
||||
to test for splicing).
|
||||
- name: --soft_clip_threshold
|
||||
type: integer
|
||||
description: |
|
||||
Mappers that soft clip will sometimes do so rather than mapping a spliced read if there is only
|
||||
a small overhang over the exon junction. By setting this option, you can treat reads with at
|
||||
least this many bases soft-clipped at the 3' end as spliced. Default: `4`.
|
||||
example: 4
|
||||
- name: --multimapping_detection_method
|
||||
type: string
|
||||
description: |
|
||||
If the sam/bam contains tags to identify multimapping reads, you can specify for use when selecting
|
||||
the best read at a given loci. Supported tags are `NH`, `X0` and `XT`. If not specified, the read
|
||||
with the highest mapping quality will be selected.
|
||||
- name: --read_length
|
||||
type: boolean_true
|
||||
description: Use the read length as a criteria when deduping, for e.g. sRNA-Seq.
|
||||
|
||||
- name: Single-cell RNA-Seq Options
|
||||
arguments:
|
||||
- name: --per_gene
|
||||
type: boolean_true
|
||||
description: |
|
||||
Reads will be grouped together if they have the same gene. This is useful if your library prep
|
||||
generates PCR duplicates with non identical alignment positions such as CEL-Seq. Note this option
|
||||
is hardcoded to be on with the count command. I.e. counting is always performed per-gene. Must be
|
||||
combined with either --gene_tag or --per_contig option.
|
||||
- name: --gene_tag
|
||||
type: string
|
||||
description: |
|
||||
Deduplicate per gene. The gene information is encoded in the bam read tag specified.
|
||||
- name: --assigned_status_tag
|
||||
type: string
|
||||
description: |
|
||||
BAM tag which describes whether a read is assigned to a gene. Defaults to the same value as given
|
||||
for --gene_tag.
|
||||
- name: --skip_tags_regex
|
||||
type: string
|
||||
description: |
|
||||
Use in conjunction with the --assigned_status_tag option to skip any reads where the tag matches
|
||||
this regex. Default ("^[__|Unassigned]") matches anything which starts with "__" or "Unassigned".
|
||||
- name: --per_contig
|
||||
type: boolean_true
|
||||
description: |
|
||||
Deduplicate per contig (field 3 in BAM; RNAME). All reads with the sam contig will be considered to
|
||||
have the same alignment position. This is useful if you have aligned to a reference transcriptome
|
||||
with one transcript per gene. If you have aligned to a transcriptome with more than one transcript
|
||||
per gene, you can supply a map between transcripts and gene using the --gene_transcript_map option.
|
||||
- name: --gene_transcript_map
|
||||
type: file
|
||||
description: |
|
||||
A file containing a mapping between gene names and transcript names. The file should be tab
|
||||
separated with the gene name in the first column and the transcript name in the second column.
|
||||
- name: --per_cell
|
||||
type: boolean_true
|
||||
description: |
|
||||
Reads will only be grouped together if they have the same cell barcode. Can be combined with
|
||||
--per_gene.
|
||||
|
||||
- name: SAM/BAM Options
|
||||
arguments:
|
||||
- name: --mapping_quality
|
||||
type: integer
|
||||
description: |
|
||||
Minimium mapping quality (MAPQ) for a read to be retained. Default: `0`.
|
||||
example: 0
|
||||
- name: --unmapped_reads
|
||||
type: string
|
||||
description: |
|
||||
How unmapped reads should be handled.
|
||||
The options are:
|
||||
* "discard": Discard all unmapped reads. (default)
|
||||
* "use": If read2 is unmapped, deduplicate using read1 only. Requires --paired.
|
||||
* "output": Output unmapped reads/read pairs without UMI grouping/deduplication. Only available in umi_tools group.
|
||||
example: "discard"
|
||||
- name: --chimeric_pairs
|
||||
type: string
|
||||
choices: [discard, use, output]
|
||||
description: |
|
||||
How chimeric pairs should be handled.
|
||||
The options are:
|
||||
* "discard": Discard all chimeric read pairs.
|
||||
* "use": Deduplicate using read1 only. (default)
|
||||
* "output": Output chimeric pairs without UMI grouping/deduplication. Only available in
|
||||
umi_tools group.
|
||||
example: "use"
|
||||
- name: --unpaired_reads
|
||||
type: string
|
||||
choices: [discard, use, output]
|
||||
description: |
|
||||
How unpaired reads should be handled.
|
||||
The options are:
|
||||
* "discard": Discard all unmapped reads.
|
||||
* "use": If read2 is unmapped, deduplicate using read1 only. Requires --paired. (default)
|
||||
* "output": Output unmapped reads/read pairs without UMI grouping/deduplication. Only available
|
||||
in umi_tools group.
|
||||
example: "use"
|
||||
- name: --ignore_umi
|
||||
type: boolean_true
|
||||
description: Ignore the UMI and group reads using mapping coordinates only.
|
||||
- name: --subset
|
||||
type: double
|
||||
description: |
|
||||
Only consider a fraction of the reads, chosen at random. This is useful for doing saturation
|
||||
analyses.
|
||||
- name: --chrom
|
||||
type: string
|
||||
description: Only consider a single chromosome. This is useful for debugging/testing purposes.
|
||||
|
||||
- name: Group/Dedup Options
|
||||
arguments:
|
||||
- name: --no_sort_output
|
||||
type: boolean_true
|
||||
description: |
|
||||
By default, output is sorted. This involves the use of a temporary unsorted file (saved in
|
||||
--temp_dir). Use this option to turn off sorting.
|
||||
- name: --buffer_whole_contig
|
||||
type: boolean_true
|
||||
description: |
|
||||
Forces dedup to parse an entire contig before yielding any reads for deduplication. This is the
|
||||
only way to absolutely guarantee that all reads with the same start position are grouped together
|
||||
for deduplication since dedup uses the start position of the read, not the alignment coordinate on
|
||||
which the reads are sorted. However, by default, dedup reads for another 1000bp before outputting
|
||||
read groups which will avoid any reads being missed with short read sequencing (<1000bp).
|
||||
|
||||
- name: Common Options
|
||||
arguments:
|
||||
- name: --log
|
||||
alternatives: -L
|
||||
type: file
|
||||
description: File with logging information.
|
||||
- name: --log2stderr
|
||||
type: boolean_true
|
||||
description: Send logging information to stderr.
|
||||
- name: --verbose
|
||||
alternatives: -v
|
||||
type: integer
|
||||
description: |
|
||||
Log level. The higher, the more output. Default: `0`.
|
||||
example: 0
|
||||
- name: --error
|
||||
alternatives: -E
|
||||
type: file
|
||||
description: File with error information.
|
||||
- name: --temp_dir
|
||||
type: string
|
||||
description: |
|
||||
Directory for temporary files. If not set, the bash environmental variable TMPDIR is used.
|
||||
- name: --compresslevel
|
||||
type: integer
|
||||
description: |
|
||||
Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default.
|
||||
Default: `6`.
|
||||
example: 6
|
||||
- name: --timeit
|
||||
type: file
|
||||
description: Store timing information in file.
|
||||
- name: --timeit_name
|
||||
type: string
|
||||
description: |
|
||||
Name in timing file for this class of jobs. Default: `all`.
|
||||
example: "all"
|
||||
- name: --timeit_header
|
||||
type: string
|
||||
description: Add header for timing information.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- type: file
|
||||
path: test_data
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/umi_tools:1.1.5--py39hf95cd2a_1
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
umi_tools -v | sed 's/ version//g' > /var/software_versions.txt
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
113
src/umi_tools/umi_tools_dedup/help.txt
Normal file
113
src/umi_tools/umi_tools_dedup/help.txt
Normal file
@@ -0,0 +1,113 @@
|
||||
'''
|
||||
Generated from the following UMI-tools documentation:
|
||||
https://umi-tools.readthedocs.io/en/latest/common_options.html#common-options
|
||||
https://umi-tools.readthedocs.io/en/latest/reference/dedup.html
|
||||
'''
|
||||
|
||||
|
||||
dedup - Deduplicate reads using UMI and mapping coordinates
|
||||
|
||||
Usage: umi_tools dedup [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM]
|
||||
|
||||
note: If --stdout is ommited, standard out is output. To
|
||||
generate a valid BAM file on standard out, please
|
||||
redirect log with --log=LOGFILE or --log2stderr
|
||||
|
||||
Common UMI-tools Options:
|
||||
|
||||
-S, --stdout File where output is to go [default = stdout].
|
||||
-L, --log File with logging information [default = stdout].
|
||||
--log2stderr Send logging information to stderr [default = False].
|
||||
-v, --verbose Log level. The higher, the more output [default = 1].
|
||||
-E, --error File with error information [default = stderr].
|
||||
--temp-dir Directory for temporary files. If not set, the bash environmental variable TMPDIR is used[default = None].
|
||||
--compresslevel Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default (which is 9)
|
||||
|
||||
profiling and debugging options:
|
||||
--timeit Store timing information in file [default=none].
|
||||
--timeit-name Name in timing file for this class of jobs [default=all].
|
||||
--timeit-header Add header for timing information [default=none].
|
||||
--random-seed Random seed to initialize number generator with [default=none].
|
||||
|
||||
Dedup Options:
|
||||
--output-stats=<prefix> One can use the edit distance between UMIs at the same position as an quality control for the
|
||||
deduplication process by comparing with a null expectation of random sampling. For the random
|
||||
sampling, the observed frequency of UMIs is used to more reasonably model the null expectation.
|
||||
Use this option to generate a stats outfiles called:
|
||||
[PREFIX]_stats_edit_distance.tsv
|
||||
Reports the (binned) average edit distance between the UMIs at each position.
|
||||
In addition, this option will trigger reporting of further summary statistics for the UMIs which
|
||||
may be informative for selecting the optimal deduplication method or debugging.
|
||||
Each unique UMI sequence may be observed [0-many] times at multiple positions in the BAM. The
|
||||
following files report the distribution for the frequencies of each UMI.
|
||||
[PREFIX]_stats_per_umi_per_position.tsv
|
||||
Tabulates the counts for unique combinations of UMI and position.
|
||||
[PREFIX]_stats_per_umi_per.tsv
|
||||
The _stats_per_umi_per.tsv table provides UMI-level summary statistics.
|
||||
--extract-umi-method=<method> How are the barcodes encoded in the read?
|
||||
Options are: read_id (default), tag, umis
|
||||
--umi-separator=<separator> Separator between read id and UMI. See --extract-umi-method above. Default=_
|
||||
--umi-tag=<tag> Tag which contains UMI. See --extract-umi-method above
|
||||
--umi-tag-split=<split> Separate the UMI in tag by SPLIT and take the first element
|
||||
--umi-tag-delimiter=<delimiter> Separate the UMI in by DELIMITER and concatenate the elements
|
||||
--cell-tag=<tag> Tag which contains cell barcode. See --extract-umi-method above
|
||||
--cell-tag-split=<split> Separate the cell barcode in tag by SPLIT and take the first element
|
||||
--cell-tag-delimiter=<delimiter> Separate the cell barcode in by DELIMITER and concatenate the elements
|
||||
--method=<method> What method to use to identify group of reads with the same (or similar) UMI(s)?
|
||||
All methods start by identifying the reads with the same mapping position.
|
||||
The simplest methods, unique and percentile, group reads with the exact same UMI.
|
||||
The network-based methods, cluster, adjacency and directional, build networks where
|
||||
nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1).
|
||||
The groups of reads are then defined from the network in a method-specific manner.
|
||||
For all the network-based methods, each read group is equivalent to one read count for the gene.
|
||||
--edit-distance-threshold=<threshold> For the adjacency and cluster methods the threshold for the edit distance to connect
|
||||
two UMIs in the network can be increased. The default value of 1 works best unless
|
||||
the UMI is very long (>14bp).
|
||||
--spliced-is-unique Causes two reads that start in the same position on the same strand and having the
|
||||
same UMI to be considered unique if one is spliced and the other is not.
|
||||
(Uses the 'N' cigar operation to test for splicing).
|
||||
--soft-clip-threshold=<threshold> Mappers that soft clip will sometimes do so rather than mapping a spliced read if
|
||||
there is only a small overhang over the exon junction. By setting this option, you
|
||||
can treat reads with at least this many bases soft-clipped at the 3' end as spliced.
|
||||
Default=4.
|
||||
--multimapping-detection-method=<method> If the sam/bam contains tags to identify multimapping reads, you can specify
|
||||
for use when selecting the best read at a given loci. Supported tags are "NH",
|
||||
"X0" and "XT". If not specified, the read with the highest mapping quality will be selected.
|
||||
--read-length Use the read length as a criteria when deduping, for e.g sRNA-Seq.
|
||||
--per-gene Reads will be grouped together if they have the same gene. This is useful if your
|
||||
library prep generates PCR duplicates with non identical alignment positions such as CEL-Seq.
|
||||
Note this option is hardcoded to be on with the count command. I.e counting is always
|
||||
performed per-gene. Must be combined with either --gene-tag or --per-contig option.
|
||||
--gene-tag=<tag> Deduplicate per gene. The gene information is encoded in the bam read tag specified
|
||||
--assigned-status-tag=<tag> BAM tag which describes whether a read is assigned to a gene. Defaults to the same value
|
||||
as given for --gene-tag
|
||||
--skip-tags-regex=<regex> Use in conjunction with the --assigned-status-tag option to skip any reads where the
|
||||
tag matches this regex. Default ("^[__|Unassigned]") matches anything which starts with "__"
|
||||
or "Unassigned":
|
||||
--per-contig Deduplicate per contig (field 3 in BAM; RNAME). All reads with the same contig will be
|
||||
considered to have the same alignment position. This is useful if you have aligned to a
|
||||
reference transcriptome with one transcript per gene. If you have aligned to a transcriptome
|
||||
with more than one transcript per gene, you can supply a map between transcripts and gene
|
||||
using the --gene-transcript-map option
|
||||
--gene-transcript-map=<file> File mapping genes to transcripts (tab separated)
|
||||
--per-cell Reads will only be grouped together if they have the same cell barcode. Can be combined with --per-gene.
|
||||
--mapping-quality=<quality> Minimium mapping quality (MAPQ) for a read to be retained. Default is 0.
|
||||
--unmapped-reads=<option> How should unmapped reads be handled.
|
||||
--chimeric-pairs=<option> How should chimeric read pairs be handled.
|
||||
--unpaired-reads=<option> How should unpaired reads be handled.
|
||||
--ignore-umi Ignore the UMI and group reads using mapping coordinates only
|
||||
--subset=<fraction> Only consider a fraction of the reads, chosen at random. This is useful for doing saturation analyses.
|
||||
--chrom=<chromosome> Only consider a single chromosome. This is useful for debugging/testing purposes
|
||||
--in-sam Input is in SAM format
|
||||
--out-sam Output is in SAM format
|
||||
--paired BAM is paired end - output both read pairs. This will also force the use of the template
|
||||
length to determine reads with the same mapping coordinates.
|
||||
--no-sort-output By default, output is sorted. This involves the use of a temporary unsorted file since
|
||||
reads are considered in the order of their start position which may not be the same as
|
||||
their alignment coordinate due to soft-clipping and reverse alignments. The temp file
|
||||
will be saved (in --temp-dir) and deleted when it has been sorted to the outfile. Use
|
||||
this option to turn off sorting.
|
||||
--buffer-whole-contig Forces dedup to parse an entire contig before yielding any reads for deduplication.
|
||||
This is the only way to absolutely guarantee that all reads with the same start position
|
||||
are grouped together for deduplication since dedup uses the start position of the read,
|
||||
not the alignment coordinate on which the reads are
|
||||
72
src/umi_tools/umi_tools_dedup/script.sh
Normal file
72
src/umi_tools/umi_tools_dedup/script.sh
Normal file
@@ -0,0 +1,72 @@
|
||||
#!/bin/bash
|
||||
|
||||
## VIASH START
|
||||
## VIASH END
|
||||
|
||||
set -e
|
||||
|
||||
test_dir="${metal_executable}/test_data"
|
||||
|
||||
[[ "$par_paired" == "false" ]] && unset par_paired
|
||||
[[ "$par_in_sam" == "false" ]] && unset par_in_sam
|
||||
[[ "$par_out_sam" == "false" ]] && unset par_out_sam
|
||||
[[ "$par_spliced_is_unique" == "false" ]] && unset par_spliced_is_unique
|
||||
[[ "$par_per_gene" == "false" ]] && unset par_per_gene
|
||||
[[ "$par_per_contig" == "false" ]] && unset par_per_contig
|
||||
[[ "$par_per_cell" == "false" ]] && unset par_per_cell
|
||||
[[ "$par_no_sort_output" == "false" ]] && unset par_no_sort_output
|
||||
[[ "$par_buffer_whole_contig" == "false" ]] && unset par_buffer_whole_contig
|
||||
[[ "$par_ignore_umi" == "false" ]] && unset par_ignore_umi
|
||||
[[ "$par_subset" == "false" ]] && unset par_subset
|
||||
[[ "$par_log2stderr" == "false" ]] && unset par_log2stderr
|
||||
[[ "$par_read_length" == "false" ]] && unset par_read_length
|
||||
|
||||
umi_tools dedup \
|
||||
--stdin "$par_input" \
|
||||
${par_in_sam:+--in-sam} \
|
||||
--stdout "$par_output" \
|
||||
${par_out_sam:+--out-sam} \
|
||||
${par_paired:+--paired} \
|
||||
${par_output_stats:+--output-stats "$par_output_stats"} \
|
||||
${par_extract_umi_method:+--extract-umi-method "$par_extract_umi_method"} \
|
||||
${par_umi_tag:+--umi-tag "$par_umi_tag"} \
|
||||
${par_umi_separator:+--umi-separator "$par_umi_separator"} \
|
||||
${par_umi_tag_split:+--umi-tag-split "$par_umi_tag_split"} \
|
||||
${par_umi_tag_delimiter:+--umi-tag-delimiter "$par_umi_tag_delimiter"} \
|
||||
${par_cell_tag:+--cell-tag "$par_cell_tag"} \
|
||||
${par_cell_tag_split:+--cell-tag-split "$par_cell_tag_split"} \
|
||||
${par_cell_tag_delimiter:+--cell-tag-delimiter "$par_cell_tag_delimiter"} \
|
||||
${par_method:+--method "$par_method"} \
|
||||
${par_edit_distance_threshold:+--edit-distance-threshold "$par_edit_distance_threshold"} \
|
||||
${par_spliced_is_unique:+--spliced-is-unique} \
|
||||
${par_soft_clip_threshold:+--soft-clip-threshold "$par_soft_clip_threshold"} \
|
||||
${par_multimapping_detection_method:+--multimapping-detection-method "$par_multimapping_detection_method"} \
|
||||
${par_read_length:+--read-length} \
|
||||
${par_per_gene:+--per-gene} \
|
||||
${par_gene_tag:+--gene-tag "$par_gene_tag"} \
|
||||
${par_assigned_status_tag:+--assigned-status-tag "$par_assigned_status_tag"} \
|
||||
${par_skip_tags_regex:+--skip-tags-regex "$par_skip_tags_regex"} \
|
||||
${par_per_contig:+--per-contig} \
|
||||
${par_gene_transcript_map:+--gene-transcript-map "$par_gene_transcript_map"} \
|
||||
${par_per_cell:+--per-cell} \
|
||||
${par_mapping_quality:+--mapping-quality "$par_mapping_quality"} \
|
||||
${par_unmapped_reads:+--unmapped-reads "$par_unmapped_reads"} \
|
||||
${par_chimeric_pairs:+--chimeric-pairs "$par_chimeric_pairs"} \
|
||||
${par_unpaired_reads:+--unpaired-reads "$par_unpaired_reads"} \
|
||||
${par_ignore_umi:+--ignore-umi} \
|
||||
${par_subset:+--subset "$par_subset"} \
|
||||
${par_chrom:+--chrom "$par_chrom"} \
|
||||
${par_no_sort_output:+--no-sort-output} \
|
||||
${par_buffer_whole_contig:+--buffer-whole-contig} \
|
||||
${par_log:+-L "$par_log"} \
|
||||
${par_log2stderr:+--log2stderr} \
|
||||
${par_verbose:+-v "$par_verbose"} \
|
||||
${par_error:+-E "$par_error"} \
|
||||
${par_temp_dir:+--temp-dir "$par_temp_dir"} \
|
||||
${par_compresslevel:+--compresslevel "$par_compresslevel"} \
|
||||
${par_timeit:+--timeit "$par_timeit"} \
|
||||
${par_timeit_name:+--timeit-name "$par_timeit_name"} \
|
||||
${par_timeit_header:+--timeit-header "$par_timeit_header"} \
|
||||
${par_random_seed:+--random-seed "$par_random_seed"}
|
||||
|
||||
exit 0
|
||||
87
src/umi_tools/umi_tools_dedup/test.sh
Normal file
87
src/umi_tools/umi_tools_dedup/test.sh
Normal file
@@ -0,0 +1,87 @@
|
||||
#!/bin/bash
|
||||
|
||||
test_dir="${meta_resources_dir}/test_data"
|
||||
out_dir="${meta_resources_dir}/out"
|
||||
|
||||
mkdir -p "$out_dir"
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo ">>> Test 1: Basic usage of $meta_functionality_name with statistics output"
|
||||
|
||||
"$meta_executable" \
|
||||
--paired \
|
||||
--input "$test_dir/sample.bam" \
|
||||
--bai "$test_dir/sample.bam.bai" \
|
||||
--output "$out_dir/deduped.sam" \
|
||||
--out_sam \
|
||||
--output_stats "$out_dir/dedup" \
|
||||
--random_seed 1
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "$out_dir/deduped.sam" ] && echo "File 'deduped.sam' does not exist!" && exit 1
|
||||
[ ! -f "$out_dir/dedup_edit_distance.tsv" ] && echo "File 'dedup_edit_distance.tsv' does not exist!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output is non-empty"
|
||||
[ ! -s "$out_dir/deduped.sam" ] && echo "File 'deduped.sam' is empty!" && exit 1
|
||||
[ ! -s "$out_dir/dedup_edit_distance.tsv" ] && echo "File 'dedup_edit_distance.tsv' is empty!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output is correct"
|
||||
diff "$out_dir/deduped.sam" "$test_dir/deduped.sam" || \
|
||||
(echo "Output file deduped.sam does not match expected output" && exit 1)
|
||||
diff "$out_dir/dedup_edit_distance.tsv" "$test_dir/dedup_edit_distance.tsv" || \
|
||||
(echo "Output file dedup_edit_distance.tsv does not match expected output" && exit 1)
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo ">>> Test 2: $meta_functionality_name with random subset selection"
|
||||
|
||||
"$meta_executable" \
|
||||
--paired \
|
||||
--input "$test_dir/sample.bam" \
|
||||
--bai "$test_dir/sample.bam.bai" \
|
||||
--output "$out_dir/deduped_fraction.sam" \
|
||||
--out_sam \
|
||||
--subset 0.5 \
|
||||
--random_seed 1
|
||||
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "$out_dir/deduped_fraction.sam" ] && echo "File 'deduped_fraction.sam' does not exist!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output is non-empty"
|
||||
[ ! -s "$out_dir/deduped_fraction.sam" ] && echo "File 'deduped_fraction.sam' is empty!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output is correct"
|
||||
diff "$out_dir/deduped_fraction.sam" "$test_dir/deduped_fraction.sam" || \
|
||||
(echo "Output file deduped_fraction.sam does not match expected output" && exit 1)
|
||||
|
||||
############################################################################################
|
||||
|
||||
echo ">>> Test 3: $meta_functionality_name with --method unique"
|
||||
|
||||
"$meta_executable" \
|
||||
--paired \
|
||||
--input "$test_dir/sample.bam" \
|
||||
--bai "$test_dir/sample.bam.bai" \
|
||||
--output "$out_dir/deduped_unique.sam" \
|
||||
--out_sam \
|
||||
--method "unique" \
|
||||
--random_seed 1
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "$out_dir/deduped_unique.sam" ] && echo "File 'deduped_unique.sam' does not exist!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output is non-empty"
|
||||
[ ! -s "$out_dir/deduped_unique.sam" ] && echo "File 'deduped_unique.sam' is empty!" && exit 1
|
||||
|
||||
echo ">>> Checking whether output is correct"
|
||||
diff "$out_dir/deduped_unique.sam" "$test_dir/deduped_unique.sam" || \
|
||||
(echo "Output file deduped_unique.sam does not match expected output" && exit 1)
|
||||
|
||||
############################################################################################
|
||||
|
||||
rm -rf "$out_dir"
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
@@ -0,0 +1,5 @@
|
||||
unique unique_null directional directional_null edit_distance
|
||||
3 3 4 4 Single_UMI
|
||||
0 1 0 0 0
|
||||
1 0 0 0 1
|
||||
0 0 0 0 2
|
||||
|
@@ -0,0 +1,6 @@
|
||||
UMI median_counts_pre times_observed_pre total_counts_pre median_counts_post times_observed_post total_counts_post
|
||||
ACCGGTTTA 74 1 74 74 1 74
|
||||
ACTGGTTTC 48 1 48 49 1 49
|
||||
AGCGGTTAC 1 1 1 1 1 1
|
||||
CCAGGTTCT 1 1 1 1 1 1
|
||||
TCTGGTTTC 1 1 1 0 0 0
|
||||
|
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user