Build branch qualimap with version qualimap (28cd122)

Build pipeline: viash-hub.biobox.qualimap-6tqq7

Source commit: 28cd122935

Source message: Merge branch 'main' into qualimap
CI
2024-07-29 15:00:07 +00:00
parent fbfdc19532
commit aa043fdc19
283 changed files with 42125 additions and 3915 deletions

View File

@@ -1,18 +1,29 @@
# biobox x.x.x
## BUG FIXES
## BREAKING CHANGES
* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
* `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
* `star/star_genome_generate`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
## NEW FUNCTIONALITY
* `multiqc`: update multiple separator to `;` (PR #81).
* `star/star_align_reads`: Add star solo related arguments (PR #62).
* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).
* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
* `seqtk`:
- `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
- `seqtk/seqtk_subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
based on a provided sequence IDs or region coordinates file (PR #85).
* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
## MINOR CHANGES
* `busco` components: update BUSCO to `5.7.1`.
* `busco` components: update BUSCO to `5.7.1` (PR #72).
## NEW FEATURES
@@ -20,12 +31,36 @@
# biobox 0.1.0
## BREAKING CHANGES
* Update CI to reusable workflow in `viash-io/viash-actions` (PR #86).
* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
splitting up certain file paths.
## DOCUMENTATION
* Extend the contributing guidelines (PR #82):
- Update format to Viash 0.9.
- Descriptions should be formatted in markdown.
- Add defaults to descriptions, not as a default of the argument.
- Explain parameter expansion.
- Mention that the contents of the output of components in tests should be checked.
* Add authorship to existing components (PR #88).
## BUG FIXES
* `pear`: fix component not exiting with the correct exitcode when PEAR fails (PR #70).
* `cutadapt`: fix `--par_quality_cutoff_r2` argument (PR #69).
* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode` (PR #69).
* `multiqc`: update multiple separator to `;` (PR #81).
# biobox 0.1.0
## NEW FEATURES
@@ -74,12 +109,11 @@
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
- `samtools/samtools_fasta`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).
* `umi_tools`:
-`umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).
* `umitools`:
- `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
* `bedtools`:
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).
@@ -104,4 +138,4 @@
* Add escaping character before leading hashtag in the description field of the config file (PR #50).
* Format URL in biobase/bcl_convert description (PR #55).

View File

@@ -65,22 +65,21 @@ runners:
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
```yaml
functionality:
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
```
### Step 4: Find a suitable container
@@ -162,7 +161,7 @@ argument_groups:
type: file
description: |
File in SAM/BAM/CRAM format with main alignments as generated by STAR
(Aligned.out.sam). Arriba extracts candidate reads from this file.
(`Aligned.out.sam`). Arriba extracts candidate reads from this file.
required: true
example: Aligned.out.bam
```
@@ -175,7 +174,7 @@ Several notes:
* Input arguments can have `multiple: true` to allow the user to specify multiple files.
* The description should be formatted in markdown.
### Step 8: Add arguments for the output files
@@ -220,7 +219,7 @@ argument_groups:
Note:
* Preferably, these outputs should not be directores but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
### Step 9: Add arguments for the other arguments
@@ -230,6 +229,8 @@ Finally, add all other arguments to the config file. There are a few exceptions:
* Arguments related to printing the information such as printing the version (`-v`, `--version`) or printing the help (`-h`, `--help`) should not be added to the config file.
* If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`
### Step 10: Add a Docker engine
@@ -275,10 +276,13 @@ Next, we need to write a runner script that runs the tool with the input argumen
## VIASH START
## VIASH END
# unset flags
[[ "$par_option" == "false" ]] && unset par_option
xxx \
--input "$par_input" \
--output "$par_output" \
$([ "$par_option" = "true" ] && echo "--option")
${par_option:+--option}
```
When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
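During development it can be convenient to run the script directly, before Viash has built the component. A minimal sketch, assuming placeholder values are placed between the two marker lines (the values `input.txt`, `output.txt`, and `true` are hypothetical, and `xxx` stands in for the real tool, so the command is printed rather than executed):

```bash
#!/bin/bash

## VIASH START
# Hypothetical values for running the script directly during development;
# Viash replaces this whole block when it builds the component.
par_input="input.txt"
par_output="output.txt"
par_option="true"
## VIASH END

# unset flags
[[ "$par_option" == "false" ]] && unset par_option

# Print the command instead of running it, since `xxx` is a placeholder tool.
echo xxx \
  --input "$par_input" \
  --output "$par_output" \
  ${par_option:+--option}
```

Setting `par_option="false"` in the placeholder block should make the `--option` flag disappear from the printed command.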
@@ -291,6 +295,11 @@ As an example, this is what the Bash script for the `arriba` component looks lik
## VIASH START
## VIASH END
# unset flags
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
arriba \
-x "$par_bam" \
-a "$par_genome" \
@@ -298,26 +307,30 @@ arriba \
-o "$par_fusions" \
${par_known_fusions:+-k "${par_known_fusions}"} \
${par_blacklist:+-b "${par_blacklist}"} \
${par_structural_variants:+-d "${par_structural_variants}"} \
$([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
$([ "$par_extra_information" = "true" ] && echo "-X") \
$([ "$par_fill_gaps" = "true" ] && echo "-I")
# ...
${par_extra_information:+-X} \
${par_fill_gaps:+-I}
```
Notes:
* If your arguments can contain special characters (e.g. `$`), you can use the `@Q` operator of Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html), which quotes the value in a format that can be reused as input. Example: `-x ${par_bam@Q}`.
* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`
* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
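
The notes above can be sketched together as follows. This is an illustration only: the values assigned to the `par_*` variables are hypothetical, and the `-f` flag is assumed here to be the option that receives the comma-separated list. The arguments are collected in an array and printed, one per line, so the effect of each expansion is visible.

```bash
#!/bin/bash

# Hypothetical values, standing in for what Viash would inject.
par_known_fusions="known fusions.tsv"
par_fill_gaps="false"
par_disable_filters="blacklist;duplicates"

# Boolean flags: unset when "false" so that ${var:+...} expands to nothing.
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps

# Convert the Viash separator ';' to the ',' assumed to be expected by the tool.
par_disable_filters=$(echo "$par_disable_filters" | tr ';' ',')

# Collect the arguments in an array so they can be inspected before use.
args=(
  ${par_known_fusions:+-k "${par_known_fusions}"}
  ${par_fill_gaps:+-I}
  ${par_disable_filters:+-f "${par_disable_filters}"}
)

# Print one argument per line.
printf '%s\n' "${args[@]}"
```

Because the inner `"${par_known_fusions}"` is quoted, the file name with a space stays a single argument, and the unset `par_fill_gaps` contributes no argument at all.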
### Step 12: Create test script
If the unit test requires test resources, these should be provided in the `test_resources` section of the component.
```yaml
functionality:
# ...
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
```
Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
@@ -325,48 +338,64 @@ Create a test script at `src/xxx/test.sh` that runs the component with the test
```bash
#!/bin/bash
set -e
## VIASH START
## VIASH END
echo "> Run xxx with test data"
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
! grep -q -E "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
echo "> Run $meta_name with test data"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.txt" \
--input "$meta_resources_dir/test_data/reads_R1.fastq" \
--output "output.txt" \
--option
echo ">> Checking output"
[ ! -f "output.txt" ] && echo "Output file output.txt does not exist" && exit 1
```
For example, this is what the test script for the `arriba` component looks like:
```bash
#!/bin/bash
## VIASH START
## VIASH END
echo "> Run arriba with blacklist"
"$meta_executable" \
--bam "$meta_resources_dir/test_data/A.bam" \
--genome "$meta_resources_dir/test_data/genome.fasta" \
--gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
--blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
--fusions "fusions.tsv" \
--fusions_discarded "fusions_discarded.tsv" \
--interesting_contigs "1,2"
echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
echo ">> Check if output exists"
assert_file_exists "output.txt"
echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
assert_file_not_empty "output.txt"
echo ">> Check if output is correct"
assert_file_contains "output.txt" "some expected output"
echo "> All tests succeeded!"
```
### Step 12: Create a `/var/software_versions.txt` file
Notes:
* Always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.
* If possible, generate your own test data instead of copying it from an external resource.
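
As a sketch of the first note, the check below matches the stable structure of a line while ignoring the part that changes between runs. The log line, the temporary file, and the pattern are hypothetical; only the `assert_file_contains_regex` helper is taken from the test script template.

```bash
#!/bin/bash

# Same helper as in the test script template.
assert_file_contains_regex() {
  grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}

# Hypothetical log with a timestamp that changes on every run.
log_file=$(mktemp)
echo "Finished at $(date +%Y-%m-%dT%H:%M:%S): 42 reads processed" > "$log_file"

# Match the stable structure, not the exact timestamp.
assert_file_contains_regex "$log_file" \
  '^Finished at [0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:]{8}: [0-9]+ reads processed$'

echo "> Regex check passed"
```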
### Step 13: Create a `/var/software_versions.txt` file
For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.
@@ -378,6 +407,8 @@ engines:
image: quay.io/biocontainers/xxx:0.1.0--py_0
setup:
- type: docker
# note: /var/software_versions.txt should contain:
# arriba: "2.4.0"
run: |
echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```
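
Instead of hardcoding the version, the line can also be derived from the tool's own output. A minimal sketch of that pattern, assuming a tool whose `--version` output looks like `AGAT v1.4.0` (the sample string and the temporary output path are assumptions; in a real component the result would be written to `/var/software_versions.txt` during the Docker setup):

```bash
#!/bin/bash

# Hypothetical `--version` output; a real tool may print something different.
version_output="AGAT v1.4.0"

# Write to a temporary file here instead of /var/software_versions.txt.
versions_file=$(mktemp)

# Rewrite "<NAME> <version>" into the `name: "version"` format.
echo "$version_output" \
  | sed 's/^AGAT[[:space:]]*\(.*\)$/agat: "\1"/' \
  > "$versions_file"

cat "$versions_file"
```

The POSIX character class `[[:space:]]` is used here rather than `\s` so that the expression does not depend on GNU `sed`.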

View File

@@ -0,0 +1,14 @@
name: Angela Oliveira Pisco
info:
role: Contributor
links:
github: aopisco
orcid: "0000-0003-0142-2355"
linkedin: aopisco
organizations:
- name: Insitro
href: https://insitro.com
role: Director of Computational Biology
- name: Open Problems
href: https://openproblems.bio
role: Core Member

View File

@@ -0,0 +1,10 @@
name: Dorien Roosen
info:
links:
email: dorien@data-intuitive.com
github: dorien-er
linkedin: dorien-roosen
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist

View File

@@ -0,0 +1,11 @@
name: Dries Schaumont
info:
links:
email: dries@data-intuitive.com
github: DriesSchaumont
orcid: "0000-0002-4389-0440"
linkedin: dries-schaumont
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist

View File

@@ -0,0 +1,10 @@
name: Emma Rousseau
info:
links:
email: emma@data-intuitive.com
github: emmarousseau
linkedin: emmarousseau1
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician

View File

@@ -0,0 +1,10 @@
name: Jakub Majercik
info:
links:
email: jakub@data-intuitive.com
github: jakubmajercik
linkedin: jakubmajercik
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatics Engineer

View File

@@ -0,0 +1,14 @@
name: Kai Waldrant
info:
links:
email: kai@data-intuitive.com
github: KaiWaldrant
orcid: "0009-0003-8555-1361"
linkedin: kaiwaldrant
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician
- name: Open Problems
href: https://openproblems.bio
role: Contributor

View File

@@ -0,0 +1,10 @@
name: Leïla Paquay
info:
links:
email: leila@data-intuitive.com
github: Leila011
linkedin: leilapaquay
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Software Developer

View File

@@ -0,0 +1,14 @@
name: Robrecht Cannoodt
info:
links:
email: robrecht@data-intuitive.com
github: rcannood
orcid: "0000-0003-3641-729X"
linkedin: robrechtcannoodt
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Science Engineer
- name: Open Problems
href: https://openproblems.bio
role: Core Member

View File

@@ -0,0 +1,10 @@
name: Sai Nirmayi Yasa
info:
links:
email: nirmayi@data-intuitive.com
github: sainirmayi
linkedin: sai-nirmayi-yasa
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Junior Bioinformatics Researcher

View File

@@ -0,0 +1,10 @@
name: Theodoro Gasperin Terra Camargo
info:
links:
email: theodorogtc@gmail.com
github: tgaspe
linkedin: theodoro-gasperin-terra-camargo
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician

View File

@@ -0,0 +1,9 @@
name: Toni Verbeiren
info:
links:
github: tverbeiren
linkedin: verbeiren
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist and CEO

View File

@@ -0,0 +1,5 @@
name: Weiwei Schultz
info:
organizations:
- name: Janssen R&D US
role: Associate Director Data Sciences

View File

@@ -0,0 +1,94 @@
name: agat_convert_sp_gff2gtf
namespace: agat
description: |
The script aims to convert any GTF/GFF file into a proper GTF file. Full
information about the format can be found here:
https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending on the
version selected the script will filter out the features that are not
accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
pseudogene) will be converted into gene feature and every level2 feature
(e.g mRNA ncRNA) will be converted into transcript feature. Using the
"relax" option you will produce a GTF-like output keeping all original
feature types (3rd column). No modification will occur e.g. mRNA to
transcript.
To be fully GTF compliant, all features must have a gene_id and a transcript_id
attribute. The gene_id is a unique identifier for the genomic source of
the transcript, which is used to group transcripts into genes. The
transcript_id is a unique identifier for the predicted transcript, which
is used to group features into transcripts.
keywords: [gene annotations, GTF conversion]
links:
homepage: https://github.com/NBISweden/AGAT
documentation: https://agat.readthedocs.io/
issue_tracker: https://github.com/NBISweden/AGAT/issues
repository: https://github.com/NBISweden/AGAT
references:
doi: 10.5281/zenodo.3552717
license: GPL-3.0
authors:
- __merge__: /src/_authors/leila_paquay.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --gff
alternatives: [-i]
description: Input GFF/GTF file that will be read
type: file
required: true
direction: input
example: input.gff
- name: Outputs
arguments:
- name: --output
alternatives: [-o, --out, --outfile, --gtf]
description: Output GTF file. If no output file is specified, the output will be written to STDOUT.
type: file
direction: output
required: true
example: output.gtf
- name: Arguments
arguments:
- name: --gtf_version
description: |
Version of the GTF output (1,2,2.1,2.2,2.5,3 or relax). Default value from AGAT config file (relax for the default config). The script option has the higher priority.
* relax: all feature types are accepted.
* GTF3 (9 feature types accepted): gene, transcript, exon, CDS, Selenocysteine, start_codon, stop_codon, three_prime_utr and five_prime_utr.
* GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS, UTR, start_codon, stop_codon, Selenocysteine.
* GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon, 5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon.
* GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon, exon, 5UTR, 3UTR.
* GTF2 (4 feature types accepted): CDS, start_codon, stop_codon, exon.
* GTF1 (5 feature types accepted): CDS, start_codon, stop_codon, exon, intron.
type: string
choices: [relax, "1", "2", "2.1", "2.2", "2.5", "3"]
required: false
example: "3"
- name: --config
alternatives: [-c]
description: |
Input agat config file. By default AGAT takes as input agat_config.yaml file from the working directory if any, otherwise it takes the original agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose". The --config option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
type: file
required: false
example: custom_agat_config.yaml
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
setup:
- type: docker
run: |
agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,102 @@
```sh
agat_convert_sp_gff2gtf.pl --help
```
------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------
Name:
agat_convert_sp_gff2gtf.pl
Description:
The script aims to convert any GTF/GFF file into a proper GTF file. Full
information about the format can be found here:
https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending the
version selected the script will filter out the features that are not
accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
pseudogene) will be converted into gene feature and every level2 feature
(e.g mRNA ncRNA) will be converted into transcript feature. Using the
"relax" option you will produce a GTF-like output keeping all original
feature types (3rd column). No modification will occur e.g. mRNA to
transcript.
To be fully GTF compliant all feature have a gene_id and a transcript_id
attribute. The gene_id is unique identifier for the genomic source of
the transcript, which is used to group transcripts into genes. The
transcript_id is a unique identifier for the predicted transcript, which
is used to group features into transcripts.
Usage:
agat_convert_sp_gff2gtf.pl --gff infile.gff [ -o outfile ]
agat_convert_sp_gff2gtf -h
Options:
--gff, --gtf or -i
Input GFF/GTF file that will be read
--gtf_version version of the GTF output (1,2,2.1,2.2,2.5,3 or relax).
Default value from AGAT config file (relax for the default config). The
script option has the higher priority.
relax: all feature types are accepted.
GTF3 (9 feature types accepted): gene, transcript, exon, CDS,
Selenocysteine, start_codon, stop_codon, three_prime_utr and
five_prime_utr
GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS,
UTR, start_codon, stop_codon, Selenocysteine
GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon,
5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon
GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon,
exon, 5UTR, 3UTR
GTF2 (4 feature types accepted): CDS, start_codon, stop_codon,
exon
GTF1 (5 feature types accepted): CDS, start_codon, stop_codon,
exon, intron
-o , --output , --out , --outfile or --gtf
Output GTF file. If no output file is specified, the output will
be written to STDOUT.
-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the orignal agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).
-h or --help
Display this helpful text.
Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:
https://github.com/NBISweden/AGAT/issues
Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.
Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md

View File

@@ -0,0 +1,10 @@
#!/bin/bash
## VIASH START
## VIASH END
agat_convert_sp_gff2gtf.pl \
-i "$par_gff" \
-o "$par_output" \
${par_gtf_version:+--gtf_version "${par_gtf_version}"} \
${par_config:+--config "${par_config}"}

View File

@@ -0,0 +1,37 @@
#!/bin/bash
## VIASH START
## VIASH END
test_dir="${meta_resources_dir}/test_data"
echo "> Run $meta_name with test data"
"$meta_executable" \
--gff "$test_dir/0_test.gff" \
--output "output.gtf"
echo ">> Checking output"
[ ! -f "output.gtf" ] && echo "Output file output.gtf does not exist" && exit 1
echo ">> Check if output is empty"
[ ! -s "output.gtf" ] && echo "Output file output.gtf is empty" && exit 1
echo ">> Check if the conversion resulted in the right GTF format"
idGFF=$(head -n 2 "$test_dir/0_test.gff" | grep -o 'ID=[^;]*' | cut -d '=' -f 2-)
expectedGTF="gene_id \"$idGFF\"; ID \"$idGFF\";"
extractedGTF=$(head -n 3 "output.gtf" | grep -o 'gene_id "[^"]*"; ID "[^"]*";')
[ "$extractedGTF" != "$expectedGTF" ] && echo "Output file output.gtf does not have the right format" && exit 1
rm output.gtf
echo "> Run $meta_name with test data and GTF version 2.5"
"$meta_executable" \
--gff "$test_dir/0_test.gff" \
--output "output.gtf" \
--gtf_version "2.5"
echo ">> Check if the output file header display the right GTF version"
grep -q "##gtf-version 2.5" "output.gtf"
[ $? -ne 0 ] && echo "Output file output.gtf header does not display the right GTF version" && exit 1
echo "> Test successful"

View File

@@ -0,0 +1,36 @@
##gff-version 3
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon:1404;Parent=CLUHART00000008717
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon:1405;Parent=CLUHART00000008717
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon:1406;Parent=CLUHART00000008717
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon:1407;Parent=CLUHART00000008717
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker five_prime_UTR 337818 337914 . + . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625 maker three_prime_UTR 343034 343277 . + . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
scaffold789 maker gene 558184 564780 . + . ID=CLUHARG00000003852;Name=PF11_0240
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006146;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006146:exon:995;Parent=CLUHART00000006146
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006146:exon:996;Parent=CLUHART00000006146
scaffold789 maker exon 564171 564235 . + . ID=CLUHART00000006146:exon:997;Parent=CLUHART00000006146
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006146:exon:998;Parent=CLUHART00000006146
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564171 564235 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006146:five_prime_utr;Parent=CLUHART00000006146
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006146:three_prime_utr;Parent=CLUHART00000006146
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006147;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006147:exon:997;Parent=CLUHART00000006147
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006147:exon:998;Parent=CLUHART00000006147
scaffold789 maker exon 562057 562121 . + . ID=CLUHART00000006147:exon:999;Parent=CLUHART00000006147
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006147:exon:1000;Parent=CLUHART00000006147
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 562057 562121 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006147:five_prime_utr;Parent=CLUHART00000006147
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006147:three_prime_utr;Parent=CLUHART00000006147

View File

@@ -0,0 +1,9 @@
#!/bin/bash
# clone repo
if [ ! -d /tmp/agat_source ]; then
git clone --depth 1 --single-branch --branch master https://github.com/NBISweden/AGAT /tmp/agat_source
fi
# copy test data
cp -r /tmp/agat_source/t/gff_syntax/in/0_test.gff src/agat/agat_convert_sp_gff2gtf/test_data

View File

@@ -11,6 +11,9 @@ license: MIT
requirements:
cpus: 1
commands: [ arriba ]
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -4,6 +4,17 @@ description: |
Information about upgrading from bcl2fastq via
[Upgrading from bcl2fastq to BCL Convert](https://emea.support.illumina.com/bulletins/2020/10/upgrading-from-bcl2fastq-to-bcl-convert.html)
and [BCL Convert Compatible Products](https://support.illumina.com/sequencing/sequencing_software/bcl-convert/compatibility.html)
keywords: [demultiplex, fastq, bcl, illumina]
links:
homepage: https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html
documentation: https://support.illumina.com/downloads/bcl-convert-user-guide.html
license: Proprietary
authors:
- __merge__: /src/_authors/toni_verbeiren.yaml
roles: [ author, maintainer ]
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author ]
argument_groups:
- name: Input arguments
arguments:

View File

@@ -0,0 +1,143 @@
name: bd_rhapsody_make_reference
namespace: bd_rhapsody
description: |
The Reference Files Generator creates an archive containing Genome Index
and Transcriptome annotation files needed for the BD Rhapsody Sequencing
Analysis Pipeline. The app takes as input one or more FASTA and GTF files
and produces a compressed archive in the form of a tar.gz file. The
archive contains:
- STAR index
- Filtered GTF file
keywords: [genome, reference, index, align]
links:
repository: https://bitbucket.org/CRSwDev/cwl/src/master/v2.2.1/Extra_Utilities/
documentation: https://bd-rhapsody-bioinfo-docs.genomics.bd.com/resources/extra_utilities.html#make-rhapsody-reference
license: Unknown
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
- __merge__: /src/_authors/weiwei_schultz.yaml
roles: [ contributor ]
argument_groups:
- name: Inputs
arguments:
- type: file
name: --genome_fasta
required: true
description: Reference genome file in FASTA or FASTA.GZ format. The BD Rhapsody Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
example: genome_sequence.fa.gz
multiple: true
info:
config_key: Genome_fasta
- type: file
name: --gtf
required: true
description: |
File path to the transcript annotation files in GTF or GTF.GZ format. The Sequence Analysis Pipeline requires the 'gene_name' or
'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons
must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or
'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.
example: transcriptome_annotation.gtf.gz
multiple: true
info:
config_key: Gtf
- type: file
name: --extra_sequences
description: |
File path to additional sequences in FASTA format to use when building the STAR index. (e.g. transgenes or CRISPR guide barcodes).
GTF lines for these sequences will be automatically generated and combined with the main GTF.
required: false
multiple: true
info:
config_key: Extra_sequences
- name: Outputs
arguments:
- type: file
name: --reference_archive
direction: output
required: true
description: |
A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an
input in the BD Rhapsody Sequencing Analysis Pipeline.
example: star_index.tar.gz
- name: Arguments
arguments:
- type: string
name: --mitochondrial_contigs
description: |
Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are
identified as 'nuclear fragments' in the ATACseq analysis pipeline.
required: false
multiple: true
default: [chrM, chrMT, M, MT]
info:
config_key: Mitochondrial_Contigs
- type: boolean_true
name: --filtering_off
description: |
By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features
having the following attribute values are kept:
- protein_coding
- lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
- IG_LV_gene
- IG_V_gene
- IG_V_pseudogene
- IG_D_gene
- IG_J_gene
- IG_J_pseudogene
- IG_C_gene
- IG_C_pseudogene
- TR_V_gene
- TR_V_pseudogene
- TR_D_gene
- TR_J_gene
- TR_J_pseudogene
- TR_C_gene
If you have already pre-filtered the input Annotation files and/or wish to turn off the filtering, please set this option to True.
info:
config_key: Filtering_off
- type: boolean_true
name: --wta_only_index
description: Build a WTA only index, otherwise builds a WTA + ATAC index.
info:
config_key: WTA_Only
- type: string
name: --extra_star_params
description: Additional parameters to pass to STAR when building the genome index. Specify them exactly as you would on the command line.
example: --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
required: false
info:
config_key: Extra_STAR_params
resources:
- type: python_script
path: script.py
- path: make_rhap_reference_2.2.1_nodocker.cwl
test_resources:
- type: bash_script
path: test.sh
- path: test_data
requirements:
commands: [ "cwl-runner" ]
engines:
- type: docker
image: bdgenomics/rhapsody:2.2.1
setup:
- type: apt
packages: [procps]
- type: python
packages: [cwlref-runner, cwl-runner]
- type: docker
run: |
echo "bdgenomics/rhapsody: 2.2.1" > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,66 @@
```bash
cwl-runner src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl --help
```
usage: src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
[-h] [--Archive_prefix ARCHIVE_PREFIX]
[--Extra_STAR_params EXTRA_STAR_PARAMS]
[--Extra_sequences EXTRA_SEQUENCES] [--Filtering_off] --Genome_fasta
GENOME_FASTA --Gtf GTF [--Maximum_threads MAXIMUM_THREADS]
[--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS] [--WTA_Only]
[job_order]
The Reference Files Generator creates an archive containing Genome Index and
Transcriptome annotation files needed for the BD Rhapsody™ Sequencing
Analysis Pipeline. The app takes as input one or more FASTA and GTF files and
produces a compressed archive in the form of a tar.gz file. The archive
contains:\n - STAR index\n - Filtered GTF file
positional arguments:
job_order Job input json file
options:
-h, --help show this help message and exit
--Archive_prefix ARCHIVE_PREFIX
A prefix for naming the compressed archive file
containing the Reference genome index and annotation
files. The default value is constructed based on the
input Reference files.
--Extra_STAR_params EXTRA_STAR_PARAMS
Additional parameters to pass to STAR when building
the genome index. Specify exactly like how you would
on the command line. Example: --limitGenomeGenerateRAM
48000 --genomeSAindexNbases 11
--Extra_sequences EXTRA_SEQUENCES
Additional sequences in FASTA format to use when
building the STAR index. (E.g. phiX genome)
--Filtering_off By default the input Transcript Annotation files are
filtered based on the gene_type/gene_biotype
attribute. Only features having the following
attribute values are are kept: - protein_coding -
lncRNA (lincRNA and antisense for Gencode <
v31/M22/Ensembl97) - IG_LV_gene - IG_V_gene -
IG_V_pseudogene - IG_D_gene - IG_J_gene -
IG_J_pseudogene - IG_C_gene - IG_C_pseudogene -
TR_V_gene - TR_V_pseudogene - TR_D_gene - TR_J_gene -
TR_J_pseudogene - TR_C_gene If you have already pre-
filtered the input Annotation files and/or wish to
turn-off the filtering, please set this option to
True.
--Genome_fasta GENOME_FASTA
Reference genome file in FASTA format. The BD
Rhapsody™ Sequencing Analysis Pipeline uses GRCh38
for Human and GRCm39 for Mouse.
--Gtf GTF Transcript annotation files in GTF format. The BD
Rhapsody™ Sequencing Analysis Pipeline uses Gencode
v42 for Human and M31 for Mouse.
--Maximum_threads MAXIMUM_THREADS
The maximum number of threads to use in the pipeline.
By default, all available cores are used.
--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS
Names of the Mitochondrial contigs in the provided
Reference Genome. Fragments originating from contigs
other than these are identified as 'nuclear fragments'
in the ATACseq analysis pipeline.
--WTA_Only Build a WTA only index, otherwise builds a WTA + ATAC
index.
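The default filtering described under `--Filtering_off` can be pictured with the short sketch below. It is an illustration only, not BD's implementation: `keep_gtf_line` and `BIOTYPE_RE` are hypothetical names, the set of kept values is copied from the help text above, and the legacy `lincRNA`/`antisense` names for older Gencode/Ensembl releases are deliberately left out.

```python
import re

# Attribute values kept by default, copied from the --Filtering_off help text.
# Assumption: legacy names (lincRNA, antisense) are not handled in this sketch.
KEPT_BIOTYPES = {
    "protein_coding", "lncRNA",
    "IG_LV_gene", "IG_V_gene", "IG_V_pseudogene", "IG_D_gene", "IG_J_gene",
    "IG_J_pseudogene", "IG_C_gene", "IG_C_pseudogene",
    "TR_V_gene", "TR_V_pseudogene", "TR_D_gene", "TR_J_gene",
    "TR_J_pseudogene", "TR_C_gene",
}

# Hypothetical helper: matches gene_type "x" or gene_biotype "x" in column 9.
BIOTYPE_RE = re.compile(r'\bgene_(?:bio)?type "([^"]+)"')


def keep_gtf_line(line: str) -> bool:
    """Return True if a GTF line would pass the (assumed) biotype filter."""
    if line.startswith("#"):
        return True  # assumption: comment/header lines pass through untouched
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # not a complete GTF record
    match = BIOTYPE_RE.search(fields[8])
    return bool(match) and match.group(1) in KEPT_BIOTYPES
```

Under these assumptions, an lncRNA exon line is kept while a miRNA gene line is dropped. Passing `--Filtering_off` corresponds to skipping such a check entirely.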

View File

@@ -0,0 +1,115 @@
requirements:
InlineJavascriptRequirement: {}
class: CommandLineTool
label: Reference Files Generator for BD Rhapsody™ Sequencing Analysis Pipeline
cwlVersion: v1.2
doc: >-
The Reference Files Generator creates an archive containing Genome Index and Transcriptome annotation files needed for the BD Rhapsody™ Sequencing Analysis Pipeline. The app takes as input one or more FASTA and GTF files and produces a compressed archive in the form of a tar.gz file. The archive contains:\n - STAR index\n - Filtered GTF file
baseCommand: run_reference_generator.sh
inputs:
Genome_fasta:
type: File[]
label: Reference Genome
doc: |-
Reference genome file in FASTA format. The BD Rhapsody™ Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
inputBinding:
prefix: --reference-genome
shellQuote: false
Gtf:
type: File[]
label: Transcript Annotations
doc: |-
Transcript annotation files in GTF format. The BD Rhapsody™ Sequencing Analysis Pipeline uses Gencode v42 for Human and M31 for Mouse.
inputBinding:
prefix: --gtf
shellQuote: false
Extra_sequences:
type: File[]?
label: Extra Sequences
doc: |-
Additional sequences in FASTA format to use when building the STAR index. (E.g. phiX genome)
inputBinding:
prefix: --extra-sequences
shellQuote: false
Mitochondrial_Contigs:
type: string[]?
default: ["chrM", "chrMT", "M", "MT"]
label: Mitochondrial Contig Names
doc: |-
Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are identified as 'nuclear fragments' in the ATACseq analysis pipeline.
inputBinding:
prefix: --mitochondrial-contigs
shellQuote: false
Filtering_off:
type: boolean?
label: Turn off filtering
doc: |-
By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features having the following attribute values are kept:
- protein_coding
- lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
- IG_LV_gene
- IG_V_gene
- IG_V_pseudogene
- IG_D_gene
- IG_J_gene
- IG_J_pseudogene
- IG_C_gene
- IG_C_pseudogene
- TR_V_gene
- TR_V_pseudogene
- TR_D_gene
- TR_J_gene
- TR_J_pseudogene
- TR_C_gene
If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
inputBinding:
prefix: --filtering-off
shellQuote: false
WTA_Only:
type: boolean?
label: WTA only index
doc: Build a WTA only index, otherwise builds a WTA + ATAC index.
inputBinding:
prefix: --wta-only-index
shellQuote: false
Archive_prefix:
type: string?
label: Archive Prefix
doc: |-
A prefix for naming the compressed archive file containing the Reference genome index and annotation files. The default value is constructed based on the input Reference files.
inputBinding:
prefix: --archive-prefix
shellQuote: false
Extra_STAR_params:
type: string?
label: Extra STAR Params
doc: |-
Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
Example:
--limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
inputBinding:
prefix: --extra-star-params
shellQuote: true
Maximum_threads:
type: int?
label: Maximum Number of Threads
doc: |-
The maximum number of threads to use in the pipeline. By default, all available cores are used.
inputBinding:
prefix: --maximum-threads
shellQuote: false
outputs:
Archive:
type: File
doc: |-
A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an input in the BD Rhapsody™ Sequencing Analysis Pipeline.
id: Reference_Archive
label: Reference Files Archive
outputBinding:
glob: '*.tar.gz'

View File

@@ -0,0 +1,161 @@
import os
import re
import subprocess
import tempfile
from typing import Any
import yaml
import shutil

## VIASH START
par = {
    "genome_fasta": [],
    "gtf": [],
    "extra_sequences": [],
    "mitochondrial_contigs": ["chrM", "chrMT", "M", "MT"],
    "filtering_off": False,
    "wta_only_index": False,
    "extra_star_params": None,
    "reference_archive": "output.tar.gz",
}
meta = {
    "config": "target/nextflow/reference/build_bdrhap_2_reference/.config.vsh.yaml",
    "resources_dir": os.path.abspath("src/reference/build_bdrhap_2_reference"),
    "temp_dir": os.getenv("VIASH_TEMP"),
    "memory_mb": None,
    "cpus": None
}
## VIASH END


def clean_arg(argument):
    # strip the leading dashes from the argument name, e.g. "--gtf" -> "gtf"
    argument["clean_name"] = re.sub("^-*", "", argument["name"])
    return argument


def read_config(path: str) -> dict[str, Any]:
    with open(path, "r") as f:
        config = yaml.safe_load(f)

    # flatten the arguments of all argument groups into a single list
    config["all_arguments"] = [
        clean_arg(arg)
        for grp in config["argument_groups"]
        for arg in grp["arguments"]
    ]

    return config


def strip_margin(text: str) -> str:
    # remove leading whitespace up to and including the "|" margin marker
    return re.sub(r"(\n?)[ \t]*\|", r"\1", text)


def process_params(par: dict[str, Any], config) -> dict[str, Any]:
    # check input parameters
    assert par["genome_fasta"], "Pass at least one set of inputs to --genome_fasta."
    assert par["gtf"], "Pass at least one set of inputs to --gtf."
    assert par["reference_archive"].endswith(".tar.gz"), "Output reference_archive must end with .tar.gz."

    # make paths absolute
    for argument in config["all_arguments"]:
        if par[argument["clean_name"]] and argument["type"] == "file":
            if isinstance(par[argument["clean_name"]], list):
                par[argument["clean_name"]] = [ os.path.abspath(f) for f in par[argument["clean_name"]] ]
            else:
                par[argument["clean_name"]] = os.path.abspath(par[argument["clean_name"]])

    return par


def generate_config(par: dict[str, Any], meta, config) -> str:
    content_list = [strip_margin(f"""\
        |#!/usr/bin/env cwl-runner
        |
        |""")]

    # collect every argument that has both a value and a CWL config key
    config_key_value_pairs = []
    for argument in config["all_arguments"]:
        config_key = (argument.get("info") or {}).get("config_key")
        arg_type = argument["type"]
        par_value = par[argument["clean_name"]]
        if par_value and config_key:
            config_key_value_pairs.append((config_key, arg_type, par_value))

    if meta["cpus"]:
        config_key_value_pairs.append(("Maximum_threads", "integer", meta["cpus"]))

    # print(config_key_value_pairs)

    for config_key, arg_type, par_value in config_key_value_pairs:
        if arg_type == "file":
            # "entry" avoids shadowing the builtin "str"
            entry = strip_margin(f"""\
                |{config_key}:
                |""")
            if isinstance(par_value, list):
                for file in par_value:
                    entry += strip_margin(f"""\
                        |  - class: File
                        |    location: "{file}"
                        |""")
            else:
                entry += strip_margin(f"""\
                    |  class: File
                    |  location: "{par_value}"
                    |""")
            content_list.append(entry)
        else:
            content_list.append(strip_margin(f"""\
                |{config_key}: {par_value}
                |"""))

    ## Return the config content; the caller writes it to file
    return "".join(content_list)


def get_cwl_file(meta: dict[str, Any]) -> str:
    # locate the cwl file bundled with the component resources
    cwl_file = os.path.join(meta["resources_dir"], "make_rhap_reference_2.2.1_nodocker.cwl")

    return cwl_file


def main(par: dict[str, Any], meta: dict[str, Any]):
    config = read_config(meta["config"])

    # Preprocess params
    par = process_params(par, config)

    # fetch cwl file
    cwl_file = get_cwl_file(meta)

    # Create output dir if not exists
    outdir = os.path.dirname(par["reference_archive"])
    os.makedirs(outdir, exist_ok=True)

    ## Run pipeline
    with tempfile.TemporaryDirectory(prefix="cwl-bd_rhapsody_wta-", dir=meta["temp_dir"]) as temp_dir:
        # Create params file
        config_file = os.path.join(temp_dir, "config.yml")
        config_content = generate_config(par, meta, config)
        with open(config_file, "w") as f:
            f.write(config_content)

        cmd = [
            "cwl-runner",
            "--no-container",
            "--preserve-entire-environment",
            "--outdir",
            temp_dir,
            cwl_file,
            config_file
        ]

        env = dict(os.environ)
        env["TMPDIR"] = temp_dir

        print("> " + " ".join(cmd), flush=True)
        _ = subprocess.check_call(
            cmd,
            cwd=os.path.dirname(config_file),
            env=env
        )

        # move the archive out before the temporary directory is removed
        shutil.move(os.path.join(temp_dir, "Rhap_reference.tar.gz"), par["reference_archive"])


if __name__ == "__main__":
    main(par, meta)

View File

@@ -0,0 +1,65 @@
#!/bin/bash
set -e
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
in_fa="$meta_resources_dir/test_data/reference_small.fa"
in_gtf="$meta_resources_dir/test_data/reference_small.gtf"
echo "#############################################"
echo "> Simple run"
mkdir simple_run
cd simple_run
out_tar="myreference.tar.gz"
echo "> Running $meta_name."
$meta_executable \
--genome_fasta "$in_fa" \
--gtf "$in_gtf" \
--reference_archive "$out_tar" \
--extra_star_params "--genomeSAindexNbases 6" \
---cpus 2
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
assert_file_exists "$out_tar"
assert_file_not_empty "$out_tar"
echo ">> Checking whether output contains the expected files"
tar -xvf "$out_tar" > /dev/null
assert_file_exists "BD_Rhapsody_Reference_Files/star_index/genomeParameters.txt"
assert_file_exists "BD_Rhapsody_Reference_Files/bwa-mem2_index/reference_small.ann"
assert_file_exists "BD_Rhapsody_Reference_Files/reference_small-processed.gtf"
assert_file_exists "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt"
assert_file_contains "BD_Rhapsody_Reference_Files/reference_small-processed.gtf" "chr1.*HAVANA.*ENSG00000243485"
assert_file_contains "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt" 'chrMT'
cd ..
echo "#############################################"
echo "> Tests succeeded!"

View File

@@ -0,0 +1,27 @@
>chr1 1
TGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGT
GATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTA
ATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTT
TTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTT
TAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCT
GGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAA
AAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAA
ATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGTTGTTTATC
TGAGATTCAGAATTAAGCATTTTATATTTTATTTGCTGCCTCTGGCCACCCTACTCTCTT
CCTAACACTCTCTCCCTCTCCCAGTTTTGTCCGCCTTCCCTGCCTCCTCTTCTGGGGGAG
TTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCT
TACCAGAGGTAAAGAAGAGATGGATCTCCACTCATGTTGTAGACAGAATGTTTATGTCCT
CTCCAAATGCTTATGTTGAAACCCTAACCCCTAATGTGATGGTATGTGGAGATGGGCCTT
TGGTAGGTAATTACGGTTAGATGAGGTCATGGGGTGGGGCCCTCATTATAGATCTGGTAA
GAAAAGAGAGCATTGTCTCTGTGTCTCCCTCTCTCTCTCTCTCTCTCTCTCTCATTTCTC
TCTATCTCATTTCTCTCTCTCTCGCTATCTCATTTTTCTCTCTCTCTCTTTCTCTCCTCT
GTCTTTTCCCACCAAGTGAGGATGCGAAGAGAAGGTGGCTGTCTGCAAACCAGGAAGAGA
GCCCTCACCGGGAACCCGTCCAGCTGCCACCTTGAACTTGGACTTCCAAGCCTCCAGAAC
TGTGAGGGATAAATGTATGATTTTAAAGTCGCCCAGTGTGTGGTATTTTGTTTTGACTAA
TACAACCTGAAAACATTTTCCCCTCACTCCACCTGAGCAATATCTGAGTGGCTTAAGGTA
CTCAGGACACAACAAAGGAGAAATGTCCCATGCACAAGGTGCACCCATGCCTGGGTAAAG
CAGCCTGGCACAGAGGGAAGCACACAGGCTCAGGGATCTGCTATTCATTCTTTGTGTGAC
CCTGGGCAAGCCATGAATGGAGCTTCAGTCACCCCATTTGTAATGGGATTTAATTGTGCT
TGCCCTGCCTCCTTTTGAGGGCTGTAGAGAAAAGATGTCAAAGTATTTTGTAATCTGGCT
GGGCGTGGTGGCTCATGCCTGTAATCCTAGCACTTTGGTAGGCTGACGCGAGAGGACTGC
T

View File

@@ -0,0 +1,8 @@
chr1 HAVANA exon 565 668 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 977 1098 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA transcript 268 1110 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
chr1 HAVANA exon 268 668 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 1; exon_id "ENSE00001841699.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
chr1 HAVANA exon 977 1110 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 2; exon_id "ENSE00001890064.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
chr1 ENSEMBL gene 367 504 . + . gene_id "ENSG00000284332.1"; gene_type "miRNA"; gene_name "MIR1302-2"; level 3; hgnc_id "HGNC:35294";
chr1 ENSEMBL transcript 367 504 . + . gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";
chr1 ENSEMBL exon 367 504 . + . gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; exon_number 1; exon_id "ENSE00003695741.1"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";

View File

@@ -0,0 +1,47 @@
#!/bin/bash
TMP_DIR=/tmp/bd_rhapsody_make_reference
OUT_DIR=src/bd_rhapsody/bd_rhapsody_make_reference/test_data
# check if seqkit is installed
if ! command -v seqkit &> /dev/null; then
echo "seqkit could not be found"
exit 1
fi
# create temporary directory and clean up on exit
mkdir -p $TMP_DIR
function clean_up {
rm -rf "$TMP_DIR"
}
trap clean_up EXIT
# fetch reference
ORIG_FA=$TMP_DIR/reference.fa.gz
if [ ! -f $ORIG_FA ]; then
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/GRCh38.primary_assembly.genome.fa.gz \
-O $ORIG_FA
fi
ORIG_GTF=$TMP_DIR/reference.gtf.gz
if [ ! -f $ORIG_GTF ]; then
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz \
-O $ORIG_GTF
fi
# create small reference
START=30000
END=31500
CHR=chr1
# subset to small region
seqkit grep -r -p "^$CHR\$" "$ORIG_FA" | \
seqkit subseq -r "$START:$END" > $OUT_DIR/reference_small.fa
zcat "$ORIG_GTF" | \
awk -v FS='\t' -v OFS='\t' "
\$1 == \"$CHR\" && \$4 >= $START && \$5 <= $END {
\$4 = \$4 - $START + 1;
\$5 = \$5 - $START + 1;
print;
}" > $OUT_DIR/reference_small.gtf

View File

@@ -10,6 +10,9 @@ references:
license: GPL-2.0
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/dries_schaumont.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Input arguments

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Outputs
arguments:

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.1007/978-1-4939-9173-0_14
license: MIT
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,6 +9,9 @@ links:
references:
doi: 10.14806/ej.17.1.200
license: MIT
authors:
- __merge__: /src/_authors/toni_verbeiren.yaml
roles: [ author, maintainer ]
argument_groups:
####################################################################
- name: Specify Adapters for R1

View File

@@ -6,25 +6,25 @@ set -eo pipefail
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
mkdir test_multiple_output
cd test_multiple_output

View File

@@ -9,6 +9,9 @@ references:
license: GPL-3.0
requirements:
commands: [falco]
authors:
- __merge__: /src/_authors/toni_verbeiren.yaml
roles: [ author, maintainer ]
# Notes:
# - falco has arguments similar to -subsample and we update those to --subsample

View File

@@ -26,6 +26,9 @@ links:
references:
doi: "10.1093/bioinformatics/bty560"
license: MIT
authors:
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
description: |

View File

@@ -11,7 +11,9 @@ references:
license: GPL-3.0
requirements:
commands: [ featureCounts ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -8,8 +8,9 @@ links:
references:
doi: 10.12688/f1000research.23297.2
license: MIT
requirements:
commands: [ gffread ]
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
@@ -52,7 +53,7 @@ argument_groups:
required: true
description: |
Write the output records into <outfile>.
default: output.gff
example: output.gff
- name: --force_exons
type: boolean_true
description: |
@@ -154,7 +155,6 @@ argument_groups:
- name: --table
type: string
multiple: true
multiple_sep: ","
description: |
Output a simple tab delimited format instead of GFF, with columns having the values
of GFF attributes given in <attrlist>; special pseudo-attributes (prefixed by @) are

View File

@@ -50,6 +50,8 @@
[[ "$par_expose_dups" == "false" ]] && unset par_expose_dups
[[ "$par_cluster_only" == "false" ]] && unset par_cluster_only
# if par_table is not empty, replace ";" with ","
par_table=$(echo "$par_table" | tr ';' ',')
$(which gffread) \
"$par_input" \

View File

@@ -86,7 +86,7 @@ diff "$expected_output_dir/transcripts.fa" "$test_output_dir/transcripts.fa" ||
echo "> Test 4 - Generate table from GFF annotation file"
"$meta_executable" \
--table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
--table "@id;@chr;@start;@end;@strand;@exons;Name;gene;product" \
--outfile "$test_output_dir/annotation.tbl" \
--input "$test_dir/sequence.gff3"

View File

@@ -17,6 +17,9 @@ references:
license: "MIT"
requirements:
commands: [ lofreq ]
authors:
- __merge__: /src/_authors/kai_waldrant.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -18,6 +18,9 @@ references:
license: "MIT"
requirements:
commands: [ lofreq ]
authors:
- __merge__: /src/_authors/kai_waldrant.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -11,7 +11,9 @@ info:
references:
doi: 10.1093/bioinformatics/btw354
licence: GPL v3 or later
authors:
- __merge__: /src/_authors/dorien_roosen.yaml
roles: [ author, maintainer ]
argument_groups:
- name: "Input"
arguments:

View File

@@ -12,7 +12,10 @@ references:
doi: 10.1093/bioinformatics/btt593
license: "CC-BY-NC-SA-3.0"
requirements:
commands: [ pear , gzip ]
commands: [ pear, gzip ]
authors:
- __merge__: /src/_authors/kai_waldrant.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -12,7 +12,9 @@ references:
license: GPL-3.0
requirements:
commands: [ salmon ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -12,7 +12,9 @@ references:
license: GPL-3.0
requirements:
commands: [ salmon ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Common input options
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
@@ -30,10 +32,10 @@ argument_groups:
- name: --coverage
alternatives: -c
type: integer
description: |
Coverage distribution min,max,step [1,1000,1].
multiple: true
multiple_sep: ','
description: |
Coverage distribution min;max;step. Default: [1, 1000, 1].
example: [1, 1000, 1]
- name: --remove_dups
alternatives: -d
type: boolean_true
@@ -48,25 +50,25 @@ argument_groups:
alternatives: -f
type: string
description: |
Required flag, 0 for unset. See also `samtools flags`.
default: "0"
Required flag, 0 for unset. See also `samtools flags`. Default: `"0"`.
example: "0"
- name: --filtering_flag
alternatives: -F
type: string
description: |
Filtering flag, 0 for unset. See also `samtools flags`.
default: "0"
Filtering flag, 0 for unset. See also `samtools flags`. Default: `0`.
example: "0"
- name: --GC_depth
type: double
description: |
The size of GC-depth bins (decreasing bin size increases memory requirement).
default: 20000.0
The size of GC-depth bins (decreasing bin size increases memory requirement). Default: `20000`.
example: 20000.0
- name: --insert_size
alternatives: -i
type: integer
description: |
Maximum insert size.
default: 8000
Maximum insert size. Default: `8000`.
example: 8000
- name: --id
alternatives: -I
type: string
@@ -76,14 +78,14 @@ argument_groups:
alternatives: -l
type: integer
description: |
Include in the statistics only reads with the given read length.
default: -1
Include in the statistics only reads with the given read length. Default: `-1`.
example: -1
- name: --most_inserts
alternatives: -m
type: double
description: |
Report only the main part of inserts.
default: 0.99
Report only the main part of inserts. Default: `0.99`.
example: 0.99
- name: --split_prefix
alternatives: -P
type: string
@@ -93,8 +95,8 @@ argument_groups:
alternatives: -q
type: integer
description: |
The BWA trimming parameter.
default: 0
The BWA trimming parameter. Default: `0`.
example: 0
- name: --ref_seq
alternatives: -r
type: file
@@ -124,8 +126,8 @@ argument_groups:
alternatives: -g
type: integer
description: |
Only bases with coverage above this value will be included in the target percentage computation.
default: 0
Only bases with coverage above this value will be included in the target percentage computation. Default: `0`.
example: 0
- name: --input_fmt_option
type: string
description: |
@@ -141,7 +143,7 @@ argument_groups:
type: file
description: |
Output file.
default: "out.txt"
example: "out.txt"
required: true
direction: output

View File

@@ -10,6 +10,9 @@ set -e
[[ "$par_sparse" == "false" ]] && unset par_sparse
[[ "$par_remove_overlaps" == "false" ]] && unset par_remove_overlaps
# change the coverage input from X;X;X to X,X,X
par_coverage=$(echo "$par_coverage" | tr ';' ',')
samtools stats \
${par_coverage:+-c "$par_coverage"} \
${par_remove_dups:+-d} \

View File

@@ -17,7 +17,7 @@ echo ">>> Checking whether output is non-empty"
[ ! -s "$test_dir/test.paired_end.sorted.txt" ] && echo "File 'test.paired_end.sorted.txt' is empty!" && exit 1
echo ">>> Checking whether output is correct"
# compare using diff, ignoring the line stating the command that was passed.
# compare using diff, ignoring the line stating the command that was passed.
diff <(grep -v "^# The command" "$test_dir/test.paired_end.sorted.txt") \
<(grep -v "^# The command" "$test_dir/ref.paired_end.sorted.txt") || \
(echo "Output file ref.paired_end.sorted.txt does not match expected output" && exit 1)

View File

@@ -9,7 +9,9 @@ links:
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:


@@ -0,0 +1,57 @@
name: seqtk_sample
namespace: seqtk
description: Subsamples sequences from FASTA/Q files.
keywords: [sample, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT
authors:
- __merge__: /src/_authors/jakub_majercik.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: The input FASTA/Q file.
required: true
- name: Outputs
arguments:
- name: --output
type: file
description: The output FASTA/Q file.
required: true
direction: output
- name: Options
arguments:
- name: --seed
type: integer
description: Seed for random generator.
example: 42
- name: --fraction_number
type: double
description: Fraction or number of sequences to sample.
required: true
example: 0.1
- name: --two_pass_mode
type: boolean_true
description: Twice as slow but with much reduced memory
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: ../test_data
engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,9 @@
```
seqtk_subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.


@@ -0,0 +1,11 @@
#!/bin/bash
## VIASH START
## VIASH END
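# boolean_true arguments arrive as "false" when not set; unset so the flag is omitted
[[ "$par_two_pass_mode" == "false" ]] && unset par_two_pass_mode
seqtk sample \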
${par_two_pass_mode:+-2} \
${par_seed:+-s "$par_seed"} \
"$par_input" \
"$par_fraction_number" \
> "$par_output"


@@ -0,0 +1,104 @@
#!/bin/bash
set -e
## VIASH START
meta_executable="target/executable/seqtk/seqtk_sample"
meta_resources_dir="src/seqtk"
## VIASH END
#########################################################################################
mkdir seqtk_sample_se
cd seqtk_sample_se
echo "> Run seqtk_sample on fastq SE"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled.fastq"
echo ">> Check if output exists"
if [ ! -f "sampled.fastq" ]; then
echo ">> sampled.fastq does not exist"
exit 1
fi
echo ">> Count number of samples"
num_samples=$(grep -c '^@' sampled.fastq)
if [ "$num_samples" -ne 3 ]; then
echo ">> sampled.fastq does not contain 3 samples"
exit 1
fi
#########################################################################################
cd ..
mkdir seqtk_sample_pe_number
cd seqtk_sample_pe_number
echo ">> Run seqtk_sample on fastq.gz PE with number of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_1.fastq"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_2.fastq"
echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi
echo ">> Compare reads"
# Extract headers
headers1=$(grep '^@' sampled_1.fastq | sed -e 's/ 1$//' | sort)
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
echo ">> Count number of samples"
num_headers=$(echo "$headers1" | wc -l)
if [ "$num_headers" -ne 3 ]; then
echo ">> sampled_1.fastq does not contain 3 headers"
exit 1
fi
#########################################################################################
cd ..
mkdir seqtk_sample_pe_fraction
cd seqtk_sample_pe_fraction
echo ">> Run seqtk_sample on fastq.gz PE with fraction of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_1.fastq"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_2.fastq"
echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi
echo ">> Compare reads"
# Extract headers
headers1=$(grep '^@' sampled_1.fastq | sed -e 's/ 1$//' | sort)
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
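Note on the paired-end tests above: they rely on the same `--seed` selecting the same record positions in both mate files, which only keeps pairs together when both files list the reads in the same order. The sketch below illustrates that principle with a generic reservoir sampler. It is an assumption-laden illustration, not seqtk's implementation: the function name is hypothetical and Python's `random` module stands in for seqtk's own random number generator.

```python
import random


def reservoir_sample_indices(n_records, k, seed):
    """Pick k record positions out of n_records with reservoir sampling.

    Hypothetical helper: it only shows that the chosen positions depend
    on the seed and the record count, not on the record contents.
    """
    rng = random.Random(seed)
    reservoir = []
    for i in range(n_records):
        if i < k:
            # Fill the reservoir with the first k positions.
            reservoir.append(i)
        else:
            # Replace a random slot with decreasing probability.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = i
    return sorted(reservoir)


# Two mate files with the same number of records and the same seed
# yield the same positions, so the sampled reads stay paired.
picked_r1 = reservoir_sample_indices(n_records=100, k=3, seed=42)
picked_r2 = reservoir_sample_indices(n_records=100, k=3, seed=42)
print(picked_r1 == picked_r2)
```

If the two files differ in record count or order, identical seeds no longer guarantee matching pairs, which is why the tests compare the sorted headers afterwards.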


@@ -0,0 +1,78 @@
name: seqtk_subseq
namespace: seqtk
description: |
Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence ids file) or a reg.bed (genomic regions file).
keywords: [subseq, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: "--input"
type: file
direction: input
description: The input FASTA/Q file.
required: true
example: input.fa
- name: "--name_list"
type: file
direction: input
description: |
List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
required: true
example: list.lst
- name: Outputs
arguments:
- name: "--output"
alternatives: -o
type: file
direction: output
description: The output FASTA/Q file.
required: true
default: output.fa
- name: Options
arguments:
- name: "--tab"
alternatives: -t
type: boolean_true
description: TAB delimited output.
- name: "--strand_aware"
alternatives: -s
type: boolean_true
description: Strand aware.
- name: "--sequence_line_length"
alternatives: -l
type: integer
description: |
Sequence line length of input fasta file. Default: 0.
example: 0
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
setup:
- type: docker
run: |
echo $(echo $(seqtk 2>&1) | sed -n 's/.*\(Version: [^ ]*\).*/\1/p') > /var/software_versions.txt
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,9 @@
```bash
seqtk subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.


@@ -0,0 +1,15 @@
#!/bin/bash
## VIASH START
## VIASH END
[[ "$par_tab" == "false" ]] && unset par_tab
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware
seqtk subseq \
${par_tab:+-t} \
${par_strand_aware:+-s} \
${par_sequence_line_length:+-l "$par_sequence_line_length"} \
"$par_input" \
"$par_name_list" \
> "$par_output"


@@ -0,0 +1,182 @@
#!/bin/bash
# exit on error
set -e
## VIASH START
meta_executable="target/executable/seqtk/seqtk_subseq"
meta_resources_dir="src/seqtk"
## VIASH END
# Create directories for tests
echo "Creating Test Data..."
mkdir test_data
# Create and populate input.fasta
cat > "test_data/input.fasta" <<EOL
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
EOL
# Update id.list with new entries
cat > "test_data/id.list" <<EOL
KU562861.1
MH150936.1
EOL
# Create and populate reg.bed
cat > "test_data/reg.bed" <<EOL
KU562861.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")+$(echo -e "\n")
MH150936.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")-
EOL
#########################################################################################
# Run basic test
mkdir test1
cd test1
echo "> Run seqtk_subseq on FASTA/Q file"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/id.list" \
--output "sub_sample.fq"
expected_output_basic=">KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
output_basic=$(cat sub_sample.fq)
if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi
#########################################################################################
# Run reg.bed as name list input test
cd ..
mkdir test2
cd test2
echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_basic=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
TGAAAACTTT"
output_basic=$(cat sub_sample.fq)
if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi
#########################################################################################
# Run tab option output test
cd ..
mkdir test3
cd test3
echo "> Run seqtk_subseq with TAB option"
"$meta_executable" \
--tab \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG\nMH150936.1\t11\tTGAAAACTTT'
output_tabular=$(cat sub_sample.fq)
if [ "$output_tabular" != "$expected_output_tabular" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_tabular"
echo "Got:"
echo "$output_tabular"
exit 1
fi
#########################################################################################
# Run line option output test
cd ..
mkdir test4
cd test4
echo "> Run seqtk_subseq with line length option"
"$meta_executable" \
--sequence_line_length 5 \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_wrapped=">KU562861.1:11-20
AGTGT
TCGAG
>MH150936.1:11-20
TGAAA
ACTTT"
output_wrapped=$(cat sub_sample.fq)
if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi
#########################################################################################
# Run Strand Aware option output test
cd ..
mkdir test5
cd test5
echo "> Run seqtk_subseq with strand aware option"
"$meta_executable" \
--strand_aware \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"
expected_output_wrapped=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
AAAGTTTTCA"
output_wrapped=$(cat sub_sample.fq)
if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi
echo "All tests succeeded!"
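Note on the expected outputs in the tests above: the BED intervals are 0-based and half-open, so `10  20` selects bases 11 to 20 (hence the `:11-20` labels), and the strand-aware test expects the reverse complement for the `-` strand record. A minimal sketch of that arithmetic, using the first bases of two test sequences; the helper names are hypothetical and this is not how seqtk itself is implemented.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(seq, start, end, strand="+", strand_aware=False):
    """Return seq[start:end] for a 0-based, half-open BED interval.

    With strand_aware=True, a '-' strand interval is reverse-complemented.
    """
    sub = seq[start:end]
    if strand_aware and strand == "-":
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub


def region_label(name, start, end):
    """Build the 1-based, inclusive label used in the output header."""
    return f"{name}:{start + 1}-{end}"


# Prefixes of the KU562861.1 and MH150936.1 records from input.fasta.
ku = "GGAGCAGGAGAGTGTTCGAGTTCAGAGATG"
mh = "TAGAAGCTAATGAAAACTTTTCCTTTAC"

print(region_label("KU562861.1", 10, 20))                 # KU562861.1:11-20
print(extract_region(ku, 10, 20))                         # AGTGTTCGAG
print(extract_region(mh, 10, 20, "-"))                    # TGAAAACTTT
print(extract_region(mh, 10, 20, "-", strand_aware=True)) # AAAGTTTTCA
```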

Binary file not shown.

Binary file not shown.


@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!

Binary file not shown.


@@ -0,0 +1 @@
1

src/seqtk/test_data/script.sh Executable file

@@ -0,0 +1,9 @@
# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi
# copy test data
cp -r /tmp/snakemake-wrappers/bio/seqtk/test/* src/seqtk/test_data
rm src/seqtk/test_data/Snakefile

File diff suppressed because it is too large


@@ -11,6 +11,11 @@ references:
license: MIT
requirements:
commands: [ STAR, python, ps, zcat, bzcat ]
authors:
- __merge__: /src/_authors/angela_o_pisco.yaml
roles: [ author ]
- __merge__: /src/_authors/robrecht_cannoodt.yaml
roles: [ author, maintainer ]
# manually taking care of the main input and output arguments
argument_groups:
- name: Inputs
@@ -113,6 +118,8 @@ engines:
rm -rf /tmp/STAR-${STAR_VERSION} /tmp/${STAR_VERSION}.zip && \
apt-get --purge autoremove -y ${PACKAGES} && \
apt-get clean
- type: python
packages: [ pyyaml ]
- type: docker
run: |
STAR --version | sed 's#\(.*\)#star: "\1"#' > /var/software_versions.txt


@@ -2,6 +2,7 @@ import tempfile
import subprocess
import shutil
from pathlib import Path
import yaml
## VIASH START
par = {
@@ -18,10 +19,20 @@ par = {
}
meta = {
"cpus": 8,
"temp_dir": "/tmp"
"temp_dir": "/tmp",
"config": "target/executable/star/star_align_reads/.config.vsh.yaml",
}
## VIASH END
# read config
with open(meta["config"], 'r') as stream:
config = yaml.safe_load(stream)
all_arguments = {
arg["name"].lstrip('-'): arg
for argument_group in config["argument_groups"]
for arg in argument_group["arguments"]
}
##################################################
# check and process SE / PE R1 input files
input_r1 = par["input"]
@@ -87,8 +98,13 @@ with tempfile.TemporaryDirectory(prefix="star-", dir=meta["temp_dir"], ignore_cl
cmd_args = [ "STAR" ]
for name, value in par.items():
if value is not None:
if name in all_arguments:
arg_info = all_arguments[name].get("info", {})
cli_name = arg_info.get("orig_name", f"--{name}")
else:
cli_name = f"--{name}"
val_to_add = value if isinstance(value, list) else [value]
cmd_args.extend([f"--{name}"] + [str(x) for x in val_to_add])
cmd_args.extend([cli_name] + [str(x) for x in val_to_add])
print("", flush=True)
# run command
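Note on the change above: the script now looks up each parameter in the component config and uses `info.orig_name` as the flag passed to STAR, falling back to `--<name>` when no entry exists. The sketch below isolates that mapping so it can be run on its own. The `example_config` and `example_par` dictionaries are hypothetical stand-ins for the parsed `.config.vsh.yaml` and the Viash `par` dictionary, and `build_cmd_args` is an illustrative name, not a function in the script.

```python
def build_cmd_args(par, config, executable="STAR"):
    """Translate snake_case parameter names back to the tool's own flags."""
    # Index every declared argument by its name without leading dashes.
    all_arguments = {
        arg["name"].lstrip("-"): arg
        for group in config["argument_groups"]
        for arg in group["arguments"]
    }
    cmd_args = [executable]
    for name, value in par.items():
        if value is None:
            continue  # unset parameters are not passed on
        arg_info = all_arguments.get(name, {}).get("info", {})
        # Fall back to --<name> when no original name is recorded.
        cli_name = arg_info.get("orig_name", f"--{name}")
        values = value if isinstance(value, list) else [value]
        cmd_args.extend([cli_name] + [str(v) for v in values])
    return cmd_args


# Hypothetical, heavily trimmed config and parameters for illustration.
example_config = {
    "argument_groups": [
        {
            "name": "Inputs",
            "arguments": [
                {"name": "--genome_dir", "info": {"orig_name": "--genomeDir"}},
                {"name": "--quant_mode", "info": {"orig_name": "--quantMode"}},
                {"name": "--out_sj_type", "info": {"orig_name": "--outSJtype"}},
            ],
        }
    ]
}
example_par = {
    "genome_dir": "index/",
    "quant_mode": ["TranscriptomeSAM", "GeneCounts"],
    "out_sj_type": None,
    "not_in_config": 8,
}

print(build_cmd_args(example_par, example_config))
```

Keeping the original flag in `info.orig_name` means the snake_case renaming stays a presentation detail of the component, while the command line sent to the tool is unchanged.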


@@ -7,35 +7,34 @@ meta_executable="target/docker/star/star_align_reads/star_align_reads"
meta_resources_dir="src/star/star_align_reads"
## VIASH END
#########################################################################################
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
grep -q -E "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
grep -q -E "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
#########################################################################################
echo "> Prepare test data"
cat > reads_R1.fastq <<'EOF'
@@ -89,14 +88,14 @@ cd star_align_reads_se
echo "> Run star_align_reads on SE"
"$meta_executable" \
--input "../reads_R1.fastq" \
--genomeDir "../index/" \
--genome_dir "../index/" \
--aligned_reads "output.sam" \
--log "log.txt" \
--outReadsUnmapped "Fastx" \
--out_reads_unmapped "Fastx" \
--unmapped "unmapped.sam" \
--quantMode "TranscriptomeSAM;GeneCounts" \
--quant_mode "TranscriptomeSAM;GeneCounts" \
--reads_per_gene "reads_per_gene.tsv" \
--outSJtype Standard \
--out_sj_type Standard \
--splice_junctions "splice_junctions.tsv" \
--reads_aligned_to_transcriptome "transcriptome_aligned.bam" \
${meta_cpus:+---cpus $meta_cpus}
@@ -144,10 +143,10 @@ echo ">> Run star_align_reads on PE"
"$meta_executable" \
--input ../reads_R1.fastq \
--input_r2 ../reads_R2.fastq \
--genomeDir ../index/ \
--genome_dir ../index/ \
--aligned_reads output.bam \
--log log.txt \
--outReadsUnmapped Fastx \
--out_reads_unmapped Fastx \
--unmapped unmapped_r1.bam \
--unmapped_r2 unmapped_r2.bam \
${meta_cpus:+---cpus $meta_cpus}


@@ -14,6 +14,14 @@ param_txt <- iconv(param_txt, "UTF-8", "ASCII//TRANSLIT")
dev_begin <- grep("#####UnderDevelopment_begin", param_txt)
dev_end <- grep("#####UnderDevelopment_end", param_txt)
camel_case_to_snake_case <- function(x) {
x %>%
str_replace_all("([A-Z][A-Z][A-Z]*)", "_\\1_") %>%
str_replace_all("([a-z])([A-Z])", "\\1_\\2") %>%
str_to_lower() %>%
str_replace_all("_$", "")
}
# strip development sections
nondev_ix <- unlist(map2(c(1, dev_end + 1), c(dev_begin - 1, length(param_txt)), function(i, j) {
if (i >= 1 && i < j) {
@@ -128,9 +136,8 @@ out2 <- out %>%
# remove arguments that are related to a different runmode
filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description)) %>%
filter(!grepl("--runMode", group_name) | grepl("--runMode alignReads", group_name)) %>%
filter(!grepl("STARsolo", group_name)) %>%
mutate(
viash_arg = paste0("--", name),
viash_arg = paste0("--", camel_case_to_snake_case(name)),
type_step1 = type %>%
str_replace_all(".*(int, string|string|int|real|double)\\(?(s?).*", "\\1\\2"),
viash_type = type_map[gsub("(int, string|string|int|real|double).*", "\\1", type_step1)],
@@ -155,28 +162,41 @@ out2 <- out %>%
group_name = gsub(" - .*", "", group_name),
required = ifelse(name %in% required_args, TRUE, NA)
)
print(out2, n = 200)
out2 %>% mutate(i = row_number()) %>%
# filter(is.na(default_step1) != is.na(viash_default)) %>%
# change references to argument names
out3 <- out2
for (i in seq_len(nrow(out2))) {
orig_name <- paste0("--", out2$name[[i]])
new_name <- out2$viash_arg[[i]]
out3$description <- str_replace_all(out3$description, orig_name, new_name)
}
# sanity checks
out3 %>% select(name, viash_arg) %>% as.data.frame()
print(out3, n = 200)
out3 %>%
mutate(i = row_number()) %>%
select(-group_name, -description)
out3 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))
out2 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))
argument_groups <- map(unique(out2$group_name), function(group_name) {
args <- out2 %>%
# create argument groups
argument_groups <- map(unique(out3$group_name), function(group_name) {
args <- out3 %>%
filter(group_name == !!group_name) %>%
pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, ...) {
li <- lst(
pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, name, ...) {
li <- list(
name = viash_arg,
type = viash_type,
description = description
description = description,
info = list(
orig_name = paste0("--", name)
)
)
if (all(!is.na(viash_default))) {
li$example <- viash_default
}
if (!is.na(multiple) && multiple) {
li$multiple <- multiple
li$multiple_sep <- ";"
}
if (!is.na(required) && required) {
li$required <- required
@@ -186,4 +206,10 @@ argument_groups <- map(unique(out2$group_name), function(group_name) {
list(name = group_name, arguments = args)
})
yaml::write_yaml(list(argument_groups = argument_groups), yaml_file)
yaml::write_yaml(
list(argument_groups = argument_groups),
yaml_file,
handlers = list(
logical = yaml::verbatim_logical
)
)
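Note on `camel_case_to_snake_case` in the R script above: it chains four regex steps (wrap acronyms in underscores, split lowercase-to-uppercase boundaries, lowercase, strip a trailing underscore). The following is a Python port of the same steps, offered as a sketch for checking individual names; it is not part of the repository, and the example names are taken from the arguments renamed in this commit.

```python
import re


def camel_case_to_snake_case(name):
    """Python port of the four regex steps used in the R helper."""
    # 1. Wrap runs of two or more capitals (acronyms such as GTF) in underscores.
    s = re.sub(r"([A-Z][A-Z][A-Z]*)", r"_\1_", name)
    # 2. Split a lowercase letter from the capital that follows it.
    s = re.sub(r"([a-z])([A-Z])", r"\1_\2", s)
    # 3. Lowercase everything.
    s = s.lower()
    # 4. Drop the trailing underscore left by an acronym at the end.
    return re.sub(r"_$", "", s)


for original in ["genomeDir", "outSJtype", "sjdbGTFfile", "genomeTransformVCF"]:
    print(original, "->", camel_case_to_snake_case(original))
# genomeDir -> genome_dir
# outSJtype -> out_sj_type
# sjdbGTFfile -> sjdb_gtf_file
# genomeTransformVCF -> genome_transform_vcf
```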


@@ -11,75 +11,74 @@ references:
license: MIT
requirements:
commands: [ STAR ]
authors:
- __merge__: /src/_authors/sai_nirmayi_yasa.yaml
roles: [ author, maintainer ]
argument_groups:
- name: "Input"
arguments:
- name: "--genomeFastaFiles"
- name: "--genome_fasta_files"
type: file
description: |
Path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped.
required: true
multiple: yes
multiple_sep: ;
- name: "--sjdbGTFfile"
multiple: true
- name: "--sjdb_gtf_file"
type: file
description: Path to the GTF file with annotations
- name: --sjdbOverhang
- name: --sjdb_overhang
type: integer
description: Length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
example: 100
- name: --sjdbGTFchrPrefix
- name: --sjdb_gtf_chr_prefix
type: string
description: Prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSEMBL annotations with UCSC genomes)
- name: --sjdbGTFfeatureExon
- name: --sjdb_gtf_feature_exon
type: string
description: Feature type in GTF file to be used as exons for building transcripts
example: exon
- name: --sjdbGTFtagExonParentTranscript
- name: --sjdb_gtf_tag_exon_parent_transcript
type: string
description: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)
example: transcript_id
- name: --sjdbGTFtagExonParentGene
- name: --sjdb_gtf_tag_exon_parent_gene
type: string
description: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)
example: gene_id
- name: --sjdbGTFtagExonParentGeneName
- name: --sjdb_gtf_tag_exon_parent_gene_name
type: string
description: GTF attribute name for parent gene name
example: gene_name
multiple: yes
multiple_sep: ;
- name: --sjdbGTFtagExonParentGeneType
multiple: true
- name: --sjdb_gtf_tag_exon_parent_gene_type
type: string
description: GTF attribute name for parent gene type
example:
- gene_type
- gene_biotype
multiple: yes
multiple_sep: ;
- name: --limitGenomeGenerateRAM
multiple: true
- name: --limit_genome_generate_ram
type: long
description: Maximum available RAM (bytes) for genome generation
example: '31000000000'
- name: --genomeSAindexNbases
example: 31000000000
- name: --genome_sa_index_nbases
type: integer
description: Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, this parameter must be scaled down to min(14, log2(GenomeLength)/2 - 1).
example: 14
- name: --genomeChrBinNbits
- name: --genome_chr_bin_nbits
type: integer
description: Defined as log2(chrBin), where chrBin is the size of the bins for genome storage. Each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).
example: 18
- name: --genomeSAsparseD
- name: --genome_sa_sparse_d
type: integer
min: 0
example: 1
description: Suffix array sparsity, i.e. distance between indices. Use bigger numbers to decrease needed RAM at the cost of mapping speed reduction.
- name: --genomeSuffixLengthMax
- name: --genome_suffix_length_max
type: integer
description: Maximum length of the suffixes, has to be longer than read length. Use -1 for infinite length.
example: -1
- name: --genomeTransformType
- name: --genome_transform_type
type: string
description: |
Type of genome transformation
@@ -87,7 +86,7 @@ argument_groups:
Haploid ... replace reference alleles with alternative alleles from VCF file (e.g. consensus allele)
Diploid ... create two haplotypes for each chromosome listed in VCF file, for genotypes 1|2, assumes perfect phasing (e.g. personal genome)
example: None
- name: --genomeTransformVCF
- name: --genome_transform_vcf
type: file
description: path to VCF file for genome transformation
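Note on `--genome_sa_index_nbases` above: its description gives a scaling rule for small genomes, min(14, log2(GenomeLength)/2 - 1). A minimal sketch of that arithmetic follows. The function name is hypothetical, and truncating the result to an integer is an assumption made here because the argument is typed as `integer`; neither is taken from STAR or from this component.

```python
import math


def suggested_sa_index_nbases(genome_length):
    """Evaluate min(14, log2(GenomeLength)/2 - 1), truncated to an integer.

    Truncation is an assumption for illustration; the formula in the
    argument description does not state how to round.
    """
    return min(14, int(math.log2(genome_length) / 2 - 1))


# A genome of 2**20 bases (about 1 Mb): log2 = 20, 20/2 - 1 = 9.
print(suggested_sa_index_nbases(2**20))          # 9
# A genome of about 3.1 Gb hits the cap of 14.
print(suggested_sa_index_nbases(3_100_000_000))  # 14
```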


@@ -10,20 +10,20 @@ mkdir -p $par_index
STAR \
--runMode genomeGenerate \
--genomeDir $par_index \
--genomeFastaFiles $par_genomeFastaFiles \
--genomeFastaFiles $par_genome_fasta_files \
${meta_cpus:+--runThreadN "${meta_cpus}"} \
${par_sjdbGTFfile:+--sjdbGTFfile "${par_sjdbGTFfile}"} \
${par_sjdb_gtf_file:+--sjdbGTFfile "${par_sjdb_gtf_file}"} \
${par_sjdb_overhang:+--sjdbOverhang "${par_sjdb_overhang}"} \
${par_genomeSAindexNbases:+--genomeSAindexNbases "${par_genomeSAindexNbases}"} \
${par_sjdbGTFchrPrefix:+--sjdbGTFchrPrefix "${par_sjdbGTFchrPrefix}"} \
${par_sjdbGTFfeatureExon:+--sjdbGTFfeatureExon "${par_sjdbGTFfeatureExon}"} \
${par_sjdbGTFtagExonParentTranscript:+--sjdbGTFtagExonParentTranscript "${par_sjdbGTFtagExonParentTranscript}"} \
${par_sjdbGTFtagExonParentGene:+--sjdbGTFtagExonParentGene "${par_sjdbGTFtagExonParentGene}"} \
${par_sjdbGTFtagExonParentGeneName:+--sjdbGTFtagExonParentGeneName "${par_sjdbGTFtagExonParentGeneName}"} \
${par_sjdbGTFtagExonParentGeneType:+--sjdbGTFtagExonParentGeneType "${sjdbGTFtagExonParentGeneType}"} \
${par_limitGenomeGenerateRAM:+--limitGenomeGenerateRAM "${par_limitGenomeGenerateRAM}"} \
${par_genomeChrBinNbits:+--genomeChrBinNbits "${par_genomeChrBinNbits}"} \
${par_genomeSAsparseD:+--genomeSAsparseD "${par_genomeSAsparseD}"} \
${par_genomeSuffixLengthMax:+--genomeSuffixLengthMax "${par_genomeSuffixLengthMax}"} \
${par_genomeTransformType:+--genomeTransformType "${par_genomeTransformType}"} \
${par_genomeTransformVCF:+--genomeTransformVCF "${par_genomeTransformVCF}"} \
${par_genome_sa_index_nbases:+--genomeSAindexNbases "${par_genome_sa_index_nbases}"} \
${par_sjdb_gtf_chr_prefix:+--sjdbGTFchrPrefix "${par_sjdb_gtf_chr_prefix}"} \
${par_sjdb_gtf_feature_exon:+--sjdbGTFfeatureExon "${par_sjdb_gtf_feature_exon}"} \
${par_sjdb_gtf_tag_exon_parent_transcript:+--sjdbGTFtagExonParentTranscript "${par_sjdb_gtf_tag_exon_parent_transcript}"} \
${par_sjdb_gtf_tag_exon_parent_gene:+--sjdbGTFtagExonParentGene "${par_sjdb_gtf_tag_exon_parent_gene}"} \
${par_sjdb_gtf_tag_exon_parent_gene_name:+--sjdbGTFtagExonParentGeneName "${par_sjdb_gtf_tag_exon_parent_gene_name}"} \
${par_sjdb_gtf_tag_exon_parent_gene_type:+--sjdbGTFtagExonParentGeneType "${par_sjdb_gtf_tag_exon_parent_gene_type}"} \
${par_limit_genome_generate_ram:+--limitGenomeGenerateRAM "${par_limit_genome_generate_ram}"} \
${par_genome_chr_bin_nbits:+--genomeChrBinNbits "${par_genome_chr_bin_nbits}"} \
${par_genome_sa_sparse_d:+--genomeSAsparseD "${par_genome_sa_sparse_d}"} \
${par_genome_suffix_length_max:+--genomeSuffixLengthMax "${par_genome_suffix_length_max}"} \
${par_genome_transform_type:+--genomeTransformType "${par_genome_transform_type}"} \
${par_genome_transform_vcf:+--genomeTransformVCF "${par_genome_transform_vcf}"} \


@@ -27,9 +27,9 @@ echo "> Generate index"
"$meta_executable" \
${meta_cpus:+---cpus $meta_cpus} \
--index "star_index/" \
--genomeFastaFiles "genome.fasta" \
--sjdbGTFfile "genes.gtf" \
--genomeSAindexNbases 2
--genome_fasta_files "genome.fasta" \
--sjdb_gtf_file "genes.gtf" \
--genome_sa_index_nbases 4
files=("Genome" "Log.out" "SA" "SAindex" "chrLength.txt" "chrName.txt" "chrNameLength.txt" "chrStart.txt" "exonGeTrInfo.tab" "exonInfo.tab" "geneInfo.tab" "genomeParameters.txt" "sjdbInfo.txt" "sjdbList.fromGTF.out.tab" "sjdbList.out.tab" "transcriptInfo.tab")


@@ -10,7 +10,9 @@ links:
references:
doi: 10.1101/gr.209601.116
license: MIT
authors:
- __merge__: /src/_authors/emma_rousseau.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:


@@ -0,0 +1,197 @@
name: umi_tools_extract
namespace: umi_tools
description: |
Flexible removal of UMI sequences from fastq reads.
UMIs are removed and appended to the read name. Any other barcode, for example a library barcode,
is left on the read. Can also filter reads by quality or against a whitelist.
keywords: [ extract, umi-tools, umi, fastq ]
links:
homepage: https://umi-tools.readthedocs.io/en/latest/
documentation: https://umi-tools.readthedocs.io/en/latest/reference/extract.html
repository: https://github.com/CGATOxford/UMI-tools
references:
doi: 10.1101/gr.209601.116
license: MIT
argument_groups:
- name: Input
arguments:
- name: --input
type: file
required: true
description: File containing the input data.
example: sample.fastq
- name: --read2_in
type: file
required: false
description: File containing the input data for the R2 reads (if paired). If provided, a <list of other required arguments> need to be provided.
example: sample_R2.fastq
- name: --bc_pattern
alternatives: -p
type: string
description: |
The UMI barcode pattern to use e.g. 'NNNNNN' indicates that the first 6 nucleotides
of the read are from the UMI.
- name: --bc_pattern2
type: string
description: The UMI barcode pattern to use for read 2.
- name: "Output"
arguments:
- name: --output
type: file
required: true
description: Output file for read 1.
direction: output
- name: --read2_out
type: file
description: Output file for read 2.
direction: output
- name: --filtered_out
type: file
description: |
Write out reads not matching regex pattern or cell barcode whitelist to this file.
- name: --filtered_out2
type: file
description: |
Write out read pairs not matching regex pattern or cell barcode whitelist to this file.
- name: Extract Options
arguments:
- name: --extract_method
type: string
choices: [string, regex]
description: |
UMI pattern to use. Default: `string`.
example: "string"
- name: --error_correct_cell
type: boolean_true
description: Error correct cell barcodes to the whitelist.
- name: --whitelist
type: file
description: |
Whitelist of accepted cell barcodes in tab-separated format, where column 1 is the whitelisted
cell barcodes and column 2 is the list (comma-separated) of other cell barcodes which should
be corrected to the barcode in column 1. If the --error_correct_cell option is not used, this
column will be ignored.
- name: --blacklist
type: file
description: Blacklist of cell barcodes to discard.
- name: --subset_reads
type: integer
description: Only parse the first N reads.
- name: --quality_filter_threshold
type: integer
description: Remove reads where any UMI base quality score falls below this threshold.
- name: --quality_filter_mask
type: string
description: |
If a UMI base has a quality below this threshold, replace the base with 'N'.
- name: --quality_encoding
type: string
choices: [phred33, phred64, solexa]
description: |
Quality score encoding. Choose from:
* phred33 [33-77]
* phred64 [64-106]
* solexa [59-106]
- name: --reconcile_pairs
type: boolean_true
description: |
Allow read 2 infile to contain reads not in read 1 infile. This enables support for upstream protocols
where read one contains cell barcodes, and the read pairs have been filtered and corrected without regard
to the read2.
- name: --three_prime
alternatives: --3prime
type: boolean_true
description: |
By default the barcode is assumed to be on the 5' end of the read, but use this option to specify that it is
on the 3' end instead. This option only works with --extract_method=string since 3' encoding can be specified
explicitly with a regex, e.g. `.*(?P<umi_1>.{5})$`.
- name: --ignore_read_pair_suffixes
type: boolean_true
description: |
Ignore "/1" and "/2" read name suffixes. Note that this option is required if the suffixes are not whitespace
separated from the rest of the read name.
arguments:
- name: --umi_separator
type: string
description: |
The character that separates the UMI in the read name. Most likely a colon if you skipped the extraction with
UMI-tools and used other software. Default: `_`
example: "_"
- name: --grouping_method
type: string
choices: [unique, percentile, cluster, adjacency, directional]
description: |
Method to use to determine read groups by subsuming those with similar UMIs. All methods start by identifying
the reads with the same mapping position, but treat similar yet nonidentical UMIs differently. Default: `directional`
example: "directional"
- name: --umi_discard_read
type: integer
choices: [0, 1, 2]
description: |
After UMI barcode extraction discard either R1 or R2 by setting this parameter to 1 or 2, respectively. Default: `0`
example: 0
- name: Common Options
arguments:
- name: --log
type: file
description: File with logging information.
direction: output
- name: --log2stderr
type: boolean_true
description: Send logging information to stderr.
direction: output
- name: --verbose
type: integer
description: Log level. The higher, the more output.
- name: --error
type: file
description: File with error information.
direction: output
- name: --temp_dir
type: string
description: |
Directory for temporary files. If not set, the bash environmental variable TMPDIR is used.
- name: --compresslevel
type: integer
description: |
Level of Gzip compression to use. The default of 6 matches GNU gzip rather than the Python gzip default (which is 9).
Default: `6`
example: 6
- name: --timeit
type: file
description: Store timing information in file.
direction: output
- name: --timeit_name
type: string
description: Name in timing file for this class of jobs.
default: all
- name: --timeit_header
type: boolean_true
description: Add header for timing information.
- name: --random_seed
type: integer
description: Random seed to initialize number generator with.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/umi_tools:1.1.4--py310h4b81fae_2
setup:
- type: docker
run: |
umi_tools -v | sed 's/ version//g' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,106 @@
'''
Generated from the following UMI-tools documentation:
https://umi-tools.readthedocs.io/en/latest/common_options.html#common-options
https://umi-tools.readthedocs.io/en/latest/reference/extract.html
'''
extract - Extract UMI from fastq
Usage:
Single-end:
umi_tools extract [OPTIONS] -p PATTERN [-I IN_FASTQ[.gz]] [-S OUT_FASTQ[.gz]]
Paired end:
umi_tools extract [OPTIONS] -p PATTERN [-I IN_FASTQ[.gz]] [-S OUT_FASTQ[.gz]] --read2-in=IN2_FASTQ[.gz] --read2-out=OUT2_FASTQ[.gz]
note: If -I/-S are omitted, standard in and standard out are used
for input and output. To generate a valid BAM file on
standard out, please redirect log with --log=LOGFILE or
--log2stderr. Input/Output will be (de)compressed if a
filename provided to -S/-I/--read2-in/--read2-out ends in .gz
Common UMI-tools Options:
-S, --stdout File where output is to go [default = stdout].
-L, --log File with logging information [default = stdout].
--log2stderr Send logging information to stderr [default = False].
-v, --verbose Log level. The higher, the more output [default = 1].
-E, --error File with error information [default = stderr].
--temp-dir Directory for temporary files. If not set, the bash environmental variable TMPDIR is used [default = None].
--compresslevel Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default (which is 9)
profiling and debugging options:
--timeit Store timing information in file [default=none].
--timeit-name Name in timing file for this class of jobs [default=all].
--timeit-header Add header for timing information [default=none].
--random-seed Random seed to initialize number generator with [default=none].
Extract Options:
-I, --stdin File containing the input data [default = stdin].
--error-correct-cell Error correct cell barcodes to the whitelist (see --whitelist)
--whitelist Whitelist of accepted cell barcodes. The whitelist should be in the following format (tab-separated):
AAAAAA AGAAAA
AAAATC
AAACAT
AAACTA AAACTN,GAACTA
AAATAC
AAATCA GAATCA
AAATGT AAAGGT,CAATGT
Where column 1 is the whitelisted cell barcodes and column 2 is the list (comma-separated) of other cell
barcodes which should be corrected to the barcode in column 1. If the --error-correct-cell option is not
used, this column will be ignored. Any additional columns in the whitelist input, such as the counts columns
from the output of umi_tools whitelist, will be ignored.
--blacklist Blacklist of cell barcodes to discard
--subset-reads=[N] Only parse the first N reads
--quality-filter-threshold Remove reads where any UMI base quality score falls below this threshold
--quality-filter-mask If a UMI base has a quality below this threshold, replace the base with 'N'
--quality-encoding Quality score encoding. Choose from:
'phred33' [33-77]
'phred64' [64-106]
'solexa' [59-106]
--reconcile-pairs Allow read 2 infile to contain reads not in read 1 infile. This enables support for upstream protocols
where read one contains cell barcodes, and the read pairs have been filtered and corrected without regard
to the read2s.
Experimental options:
Note: These options have not been extensively tested to ensure behaviour is as expected. If you have some suitable input files which
we can use for testing, please contact us.
If you have a library preparation method where the UMI may be in either read, you can use the following options to search for the
UMI in either read:
--either-read --extract-method --bc-pattern=[PATTERN1] --bc-pattern2=[PATTERN2]
Where both patterns match, the default behaviour is to discard both reads. If you want to select the read with the UMI with highest
sequence quality, provide --either-read-resolve=quality.
--bc-pattern Pattern for barcode(s) on read 1. See --extract-method
--bc-pattern2 Pattern for barcode(s) on read 2. See --extract-method
--extract-method There are two methods enabled to extract the umi barcode (+/- cell barcode). For both methods, the patterns
should be provided using the --bc-pattern and --bc-pattern2 options.
string:
This should be used where the barcodes are always in the same place in the read.
N = UMI position (required)
C = cell barcode position (optional)
X = sample position (optional)
Bases with Ns and Cs will be extracted and added to the read name. The corresponding sequence qualities will
be removed from the read. Bases with an X will be reattached to the read.
regex:
This method allows for more flexible barcode extraction and should be used where the cell barcodes are variable
in length. Alternatively, the regex option can also be used to filter out reads which do not contain an expected
adapter sequence. The regex must contain groups to define how the barcodes are encoded in the read.
The expected groups in the regex are:
umi_n = UMI positions, where n can be any value (required)
cell_n = cell barcode positions, where n can be any value (optional)
discard_n = positions to discard, where n can be any value (optional)
--3prime By default the barcode is assumed to be on the 5' end of the read, but use this option to specify that it is
on the 3' end instead. This option only works with --extract-method=string since 3' encoding can be specified
explicitly with a regex, e.g. .*(?P<umi_1>.{5})$
--read2-in Filename for read pairs
--filtered-out Write out reads not matching regex pattern or cell barcode whitelist to this file
--filtered-out2 Write out read pairs not matching regex pattern or cell barcode whitelist to this file
--ignore-read-pair-suffixes Ignore "/1" and "/2" read name suffixes. Note that this option is required if the suffixes are not whitespace
separated from the rest of the read name
For full UMI-tools documentation, see https://umi-tools.readthedocs.io/en/latest/
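
The `string` extraction method described above can be illustrated with a short Bash sketch. The function name `extract_string` is hypothetical and is not part of UMI-tools; it only mimics how the `C` and `N` positions of a pattern are moved out of the read while all other bases stay in place. Quality strings and `--3prime` are ignored here.

```shell
#!/bin/bash
# Hypothetical helper (not a UMI-tools function): split a read according to a
# "string" barcode pattern. C = cell barcode, N = UMI, anything else is kept.
extract_string() {
  local pattern="$1" seq="$2"
  local cell="" umi="" rest="" i
  for (( i = 0; i < ${#seq}; i++ )); do
    case "${pattern:i:1}" in
      C) cell+="${seq:i:1}" ;;
      N) umi+="${seq:i:1}" ;;
      *) rest+="${seq:i:1}" ;;  # X positions and bases past the pattern stay in the read
    esac
  done
  printf '%s %s %s\n' "$cell" "$umi" "$rest"
}

# First read 1 record of the paired-end test data.
extract_string "CCCCCCNNNNNNNNNN" "AATAACTTCCCGCGTCG"
```

For this record the sketch prints `AATAAC TTCCCGCGTC G`, which matches the read 1 parts of the cell barcode and UMI in the expected read name `SRR1058032.1_AATAACCCTACA_TTCCCGCGTCCTCTTTCCCT`.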

@@ -0,0 +1,88 @@
#!/bin/bash
## VIASH START
## VIASH END
set -exo pipefail
test_dir="${meta_resources_dir}/test_data"
[[ "$par_error_correct_cell" == "false" ]] && unset par_error_correct_cell
[[ "$par_reconcile_pairs" == "false" ]] && unset par_reconcile_pairs
[[ "$par_three_prime" == "false" ]] && unset par_three_prime
[[ "$par_ignore_read_pair_suffixes" == "false" ]] && unset par_ignore_read_pair_suffixes
[[ "$par_timeit_header" == "false" ]] && unset par_timeit_header
[[ "$par_log2stderr" == "false" ]] && unset par_log2stderr
# Check if we have the correct number of input files and patterns for paired-end or single-end reads
# For paired-end reads, check that we have two read files and two patterns
# Check for paired-end inputs
if [ -n "$par_input" ] && [ -n "$par_read2_in" ]; then
# Paired-end checks: Ensure both UMI patterns are provided
if [ -z "$par_bc_pattern" ] || [ -z "$par_bc_pattern2" ]; then
echo "Paired end input requires two UMI patterns."
exit 1
fi
elif [ -n "$par_input" ]; then
# Single-end checks: Ensure no second read or UMI pattern for the second read is provided
if [ -n "$par_bc_pattern2" ]; then
echo "Single end input requires only one read file and one UMI pattern."
exit 1
fi
# Check that discard_read is not set or set to 0 for single-end reads
if [ -n "$par_umi_discard_read" ] && [ "$par_umi_discard_read" != 0 ]; then
echo "umi_discard_read is only valid when processing paired end reads."
exit 1
fi
else
# No inputs provided
echo "No input files provided."
exit 1
fi
umi_tools extract \
-I "$par_input" \
${par_read2_in:+ --read2-in "$par_read2_in"} \
-S "$par_output" \
${par_read2_out:+--read2-out "$par_read2_out"} \
${par_extract_method:+--extract-method "$par_extract_method"} \
--bc-pattern "$par_bc_pattern" \
${par_bc_pattern2:+ --bc-pattern2 "$par_bc_pattern2"} \
${par_umi_separator:+--umi-separator "$par_umi_separator"} \
${par_output_stats:+--output-stats "$par_output_stats"} \
${par_error_correct_cell:+--error-correct-cell} \
${par_whitelist:+--whitelist "$par_whitelist"} \
${par_blacklist:+--blacklist "$par_blacklist"} \
${par_subset_reads:+--subset-reads "$par_subset_reads"} \
${par_quality_filter_threshold:+--quality-filter-threshold "$par_quality_filter_threshold"} \
${par_quality_filter_mask:+--quality-filter-mask "$par_quality_filter_mask"} \
${par_quality_encoding:+--quality-encoding "$par_quality_encoding"} \
${par_reconcile_pairs:+--reconcile-pairs} \
${par_three_prime:+--3prime} \
${par_filtered_out:+--filtered-out "$par_filtered_out"} \
${par_filtered_out2:+--filtered-out2 "$par_filtered_out2"} \
${par_ignore_read_pair_suffixes:+--ignore-read-pair-suffixes} \
${par_random_seed:+--random-seed "$par_random_seed"} \
${par_temp_dir:+--temp-dir "$par_temp_dir"} \
${par_compresslevel:+--compresslevel "$par_compresslevel"} \
${par_timeit:+--timeit "$par_timeit"} \
${par_timeit_name:+--timeit-name "$par_timeit_name"} \
${par_timeit_header:+--timeit-header} \
${par_log:+--log "$par_log"} \
${par_log2stderr:+--log2stderr} \
${par_verbose:+--verbose "$par_verbose"} \
${par_error:+--error "$par_error"}
if [ "$par_umi_discard_read" == 1 ]; then
# discard read 1
rm "$par_output"
elif [ "$par_umi_discard_read" == 2 ]; then
# discard read 2 (-f to bypass file existence check)
rm -f "$par_read2_out"
fi
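
The wrapper above relies on two Bash idioms: boolean parameters that arrive as the string `false` are unset, and `${var:+...}` then expands to nothing for unset or empty variables. The following is a minimal sketch of that behaviour, assuming hypothetical parameter values and using `printf` as a stand-in for `umi_tools`; the function name `build_args` is illustrative only.

```shell
#!/bin/bash
# Hypothetical values standing in for the parameters the wrapper receives.
par_three_prime="false"
par_umi_separator="_"
par_whitelist=""

# A boolean flag that was not requested arrives as "false"; unset it so that
# the ${var:+...} expansion below drops the flag entirely.
[[ "$par_three_prime" == "false" ]] && unset par_three_prime

# Print one argument per line; printf stands in for the real command.
build_args() {
  printf '%s\n' \
    ${par_three_prime:+--3prime} \
    ${par_umi_separator:+--umi-separator "$par_umi_separator"} \
    ${par_whitelist:+--whitelist "$par_whitelist"}
}

build_args
```

Only `--umi-separator` and `_` are printed: the unset boolean and the empty whitelist contribute no arguments at all, which is why the command line does not need an `if` for every optional flag.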

@@ -0,0 +1,86 @@
#!/bin/bash
test_dir="${meta_resources_dir}/test_data"
echo ">>> Testing $meta_functionality_name"
############################################################################################################
echo ">>> Test 1: Testing for paired-end reads"
"$meta_executable" \
--input "$test_dir/scrb_seq_fastq.1_30" \
--read2_in "$test_dir/scrb_seq_fastq.2_30" \
--bc_pattern "CCCCCCNNNNNNNNNN" \
--bc_pattern2 "CCCCCCNNNNNNNNNN" \
--extract_method string \
--umi_separator '_' \
--grouping_method directional \
--umi_discard_read 0 \
--output scrb_seq_fastq.1_30.extract \
--read2_out scrb_seq_fastq.2_30.extract \
--random_seed 1
echo ">> Checking if the correct files are present"
[[ ! -f "scrb_seq_fastq.1_30.extract" ]] || [[ ! -f "scrb_seq_fastq.2_30.extract" ]] && echo "Reads file missing" && exit 1
[ ! -s "scrb_seq_fastq.1_30.extract" ] && echo "Read 1 file is empty" && exit 1
[ ! -s "scrb_seq_fastq.2_30.extract" ] && echo "Read 2 file is empty" && exit 1
echo ">> Checking if the files are correct"
diff -q "${meta_resources_dir}/scrb_seq_fastq.1_30.extract" "$test_dir/scrb_seq_fastq.1_30.extract" || \
{ echo "Read 1 file is not correct"; exit 1; }
diff -q "${meta_resources_dir}/scrb_seq_fastq.2_30.extract" "$test_dir/scrb_seq_fastq.2_30.extract" || \
{ echo "Read 2 file is not correct"; exit 1; }
rm scrb_seq_fastq.1_30.extract scrb_seq_fastq.2_30.extract
############################################################################################################
echo ">>> Test 2: Testing for paired-end reads with umi_discard_read option"
"$meta_executable" \
--input "$test_dir/scrb_seq_fastq.1_30" \
--read2_in "$test_dir/scrb_seq_fastq.2_30" \
--bc_pattern CCCCCCNNNNNNNNNN \
--bc_pattern2 CCCCCCNNNNNNNNNN \
--extract_method string \
--umi_separator '_' \
--grouping_method directional \
--umi_discard_read 2 \
--output scrb_seq_fastq.1_30.extract \
--random_seed 1
echo ">> Checking if the correct files are present"
[ ! -f "scrb_seq_fastq.1_30.extract" ] && echo "Read 1 file is missing" && exit 1
[ ! -s "scrb_seq_fastq.1_30.extract" ] && echo "Read 1 file is empty" && exit 1
[ -f "scrb_seq_fastq.2_30.extract" ] && echo "Read 2 is not discarded" && exit 1
echo ">> Checking if the files are correct"
diff -q "${meta_resources_dir}/scrb_seq_fastq.1_30.extract" "$test_dir/scrb_seq_fastq.1_30.extract" || \
{ echo "Read 1 file is not correct"; exit 1; }
rm scrb_seq_fastq.1_30.extract
############################################################################################################
echo ">>> Test 3: Testing for single-end reads"
"$meta_executable" \
--input "$test_dir/slim_30.fastq" \
--bc_pattern "^(?P<umi_1>.{3}).{4}(?P<umi_2>.{2})" \
--extract_method regex \
--umi_separator '_' \
--grouping_method directional \
--output slim_30.extract \
--random_seed 1
echo ">> Checking if the correct files are present"
[ ! -f "slim_30.extract" ] && echo "Trimmed reads file missing" && exit 1
[ ! -s "slim_30.extract" ] && echo "Trimmed reads file is empty" && exit 1
echo ">> Checking if the files are correct"
diff -q "${meta_resources_dir}/slim_30.extract" "$test_dir/slim_30.extract" || \
{ echo "Trimmed reads file is not correct"; exit 1; }
rm slim_30.extract
echo ">>> Test finished successfully"
exit 0

@@ -0,0 +1,120 @@
@SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
AATAACTTCCCGCGTCG
+SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
@@@DDDBDDF>FFHGIB
@SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=17
AGCGGGGTGCTCGTCGT
+SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=17
CCCFFFFFHHHHHJJJJ
@SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=17
CTTTAGTACCAGTCCTT
+SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=17
BBCFFDADHHHHHHIJJ
@SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=17
AGGCGTTGTTTTTTTTT
+SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=17
CCCFFFFFHHHHHJJJJ
@SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=17
ATCGAGACATAATTGAT
+SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=17
@B@FFFFFHHHHHJJJJ
@SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=17
TGGGGGCGGTACATGAT
+SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=17
BBBFFFFFHHHHHJJJJ
@SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=17
CTATATGTTTGCGCTGT
+SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=17
1=BDFFFFHHHHHJJJJ
@SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=17
CTCCCGCATGCTGCTGT
+SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=17
?BBFFFFFHHHHHJJJJ
@SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=17
GAGCCCTGAGGGGATCT
+SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=17
1??DDDFD>DFDGFGHG
@SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=17
AGCGGGGTTCGCGGTTT
+SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=17
CCCFFFFFHHHHHJIJI
@SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=17
AGAATTGCCTGGATTTT
+SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=17
@CCFFFFAFHHHGJJJJ
@SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=17
AGGCGGGGCAACGGGTT
+SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=17
CCCFFFFFHHGHHJJHH
@SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=17
GTCCCCGCGTCGTGTAG
+SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=17
@C@FFFFFHFFGHJJJJ
@SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=17
CCACGCATTCACTCGGT
+SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=17
BBBDFFFFHHHHHJJJJ
@SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=17
TGCGCAATAAGCGCTAT
+SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=17
+:=DDDDDBHHGDIBEH
@SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=17
CGCTGGCAGAGCCCGGT
+SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=17
@BCFFFFFHHHHHJJJJ
@SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=17
AGGCGGTGCATAGTCTT
+SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=17
CCCFFFFFHHHHHIJIH
@SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=17
GTCCCCCGCGTGTGACT
+SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=17
<BBFFFFFHHHHHIIJJ
@SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=17
TATAGAAAAAACTTTTT
+SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=17
B@BFDDFFGHHFHIJIJ
@SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=17
CATTATGGGCTTATTTT
+SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=17
BBBFFFFFHHHHHJJJJ
@SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=17
AAATGTGCAGTTCAGAT
+SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=17
BCCFFFFFHHHHHJJJJ
@SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=17
TGGGGGCTAAAGGGACT
+SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=17
BBBDFFFFHHHHHJIJI
@SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=17
GATAATACTTACGGTGT
+SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=17
CCCFFFFFHHHHHJFHI
@SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=17
CGTTAATAATTGTGGTT
+SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=17
BBBFFFFFHHHHHIIHG
@SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=17
AAAAAAAAAAAAAAAAA
+SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=17
@CCFFFFFGHAA<:46'
@SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=17
GCCGACCAACGATTTTT
+SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=17
:=?DD@?DH;AFBFDFF
@SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=17
AATCAAGACCACTGAAT
+SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=17
@CCFFFFFHHHHHJJJI
@SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=17
CGCGCTTTTGTTTTTTT
+SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=17
BB@FFFFFHHHHHJJJJ
@SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=17
AAATACTTTGAGGGAAT
+SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=17
@CCFFEFFHHFHGJJII
@SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=17
AGCGGGGTTTTATCGGT
+SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=17
CCCFFFFDHHHHHHJJJ

@@ -0,0 +1,120 @@
@SRR1058032.1_AATAACCCTACA_TTCCCGCGTCCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
G
+
B
@SRR1058032.2_AGCGGGACGCTA_GTGCTCGTCGTACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1191:2236 length=17
T
+
J
@SRR1058032.3_CTTTAGACGCTA_TACCAGTCCTCACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1715:2245 length=17
T
+
J
@SRR1058032.4_AGGCGTACTTTA_TGTTTTTTTTCACTCTCTCC HISEQ:653:H12WDADXX:1:1101:1905:2212 length=17
T
+
J
@SRR1058032.5_ATCGAGGTGTAG_ACATAATTGAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:1927:2237 length=17
T
+
J
@SRR1058032.6_TGGGGGCCTATA_CGGTACATGATAGTATAGCT HISEQ:653:H12WDADXX:1:1101:1876:2243 length=17
T
+
J
@SRR1058032.7_CTATATATTAAA_GTTTGCGCTGGACAAACTAC HISEQ:653:H12WDADXX:1:1101:2491:2207 length=17
T
+
J
@SRR1058032.8_CTCCCGGCCTAG_CATGCTGCTGTTGTGAACCA HISEQ:653:H12WDADXX:1:1101:2513:2219 length=17
T
+
J
@SRR1058032.9_GAGCCCCCCTTC_TGAGGGGATCACGACGCTAC HISEQ:653:H12WDADXX:1:1101:2604:2231 length=17
T
+
G
@SRR1058032.10_AGCGGGGGGAAA_GTTCGCGGTTGAGTGTGTCG HISEQ:653:H12WDADXX:1:1101:2936:2218 length=17
T
+
I
@SRR1058032.11_AGAATTCCCACA_GCCTGGATTTCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:3447:2241 length=17
T
+
J
@SRR1058032.12_AGGCGGGTGTAT_GGCAACGGGTGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:3620:2196 length=17
T
+
H
@SRR1058032.13_GTCCCCCTCTTT_GCGTCGTGTACCCTACACTC HISEQ:653:H12WDADXX:1:1101:3875:2206 length=17
G
+
J
@SRR1058032.14_CCACGCGTGTAG_ATTCACTCGGCGTCGTGTAG HISEQ:653:H12WDADXX:1:1101:4131:2215 length=17
T
+
J
@SRR1058032.15_TGCGCAGTGTAT_ATAAGCGCTAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:4284:2241 length=17
T
+
H
@SRR1058032.16_CGCTGGACTCTT_CAGAGCCCGGTCCCTACACT HISEQ:653:H12WDADXX:1:1101:4599:2232 length=17
T
+
J
@SRR1058032.17_AGGCGGGATTCT_TGCATAGTCTTCAAATGAGG HISEQ:653:H12WDADXX:1:1101:5428:2200 length=17
T
+
H
@SRR1058032.18_GTCCCCGCGTCG_CGCGTGTGACTGTAGGGAAA HISEQ:653:H12WDADXX:1:1101:5336:2218 length=17
T
+
J
@SRR1058032.19_TATAGACCATCA_AAAAACTTTTCGCCTGCCCT HISEQ:653:H12WDADXX:1:1101:5397:2220 length=17
T
+
J
@SRR1058032.20_CATTATTTAATG_GGGCTTATTTGACTGTTTCA HISEQ:653:H12WDADXX:1:1101:5605:2194 length=17
T
+
J
@SRR1058032.21_AAATGTTATCTA_GCAGTTCAGAGACTGCTCGT HISEQ:653:H12WDADXX:1:1101:5519:2196 length=17
T
+
J
@SRR1058032.22_TGGGGGACTGTT_CTAAAGGGACCTTTAACCAA HISEQ:653:H12WDADXX:1:1101:5705:2220 length=17
T
+
I
@SRR1058032.23_GATAATTTCCAT_ACTTACGGTGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:5558:2236 length=17
T
+
I
@SRR1058032.24_CGTTAAAGACGG_TAATTGTGGTACCAGAGCGA HISEQ:653:H12WDADXX:1:1101:5649:2244 length=17
T
+
G
@SRR1058032.25_AAAAAAGAGTAT_AAAAAAAAAAAGGGAAAGAG HISEQ:653:H12WDADXX:1:1101:5910:2207 length=17
A
+
'
@SRR1058032.26_GCCGACCCTTTT_CAACGATTTTATACAATACA HISEQ:653:H12WDADXX:1:1101:5757:2217 length=17
T
+
F
@SRR1058032.27_AATCAAATCACA_GACCACTGAAGCTGGAGAGA HISEQ:653:H12WDADXX:1:1101:5790:2248 length=17
T
+
I
@SRR1058032.28_CGCGCTGTACTA_TTTGTTTTTTGGCATCGTCA HISEQ:653:H12WDADXX:1:1101:6079:2195 length=17
T
+
J
@SRR1058032.29_AAATACCCAATA_TTTGAGGGAAACTTGACCAA HISEQ:653:H12WDADXX:1:1101:6133:2213 length=17
T
+
I
@SRR1058032.30_AGCGGGGAGTGT_GTTTTATCGGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:6651:2198 length=17
T
+
J

@@ -0,0 +1,120 @@
@SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34
CCTACACTCTTTCCCTACACGACGCTACACTCTN
+SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34
@@@DFEDABD?A?ABGHGGGIGGEGIIIJJJFI#
@SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=34
ACGCTATACTCTTTCCCTACACGACGCTACACTN
+SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=34
CCCDFFFFHHHHHJJJJGGICGE6FDH<?F<F<#
@SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=34
ACGCTACACTCTTTCCCTACACGACGCTACACTN
+SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=34
C@CFFFFFGHHGAEHIIEIGIIAGFHIFG@FBE#
@SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=34
ACTTTACACTCTCTCCCTACACGACGCTACACTN
+SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=34
??;==DBDD?F:D<EGH<HGHIF>GEGCDG9FD#
@SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=34
GTGTAGGGAAAGAGTGTAAGGAAAGAGTGTAGCN
+SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=34
?=??B?DB2ACCAEAEFHHIHHHIHFHCEHHIG#
@SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=34
CCTATATAGTATAGCTTCCCATCTTCTTTGAGAN
+SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=34
CCCFFFFFHDHBHEIIJJJJIIIJJJGGGIGIE#
@SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=34
ATTAAAGACAAACTACAACTCATATGAGGCATTN
+SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=34
@@@DDDADDHHHFBFAHIGBHH<H<BHDFGIIG#
@SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=34
GCCTAGTTGTGAACCAAATGTGAAAAAACCTCCN
+SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=34
@@@FFDDDFFFFFIIGHIFI<HHEHCEBEFEED#
@SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=34
CCCTTCACGACGCTACACTCTTTCCCTACACGAN
+SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=34
C@CFFFFFHHHHHJJJIJJJJIJJJIGBHBFG:#
@SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=34
GGGAAAGAGTGTGTCGTGTATGGAAAGAGTGTAN
+SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=34
CCCFFFFDD>FAH;E@@?AB>F@BF3;3?1C?<#
@SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=34
CCCACACTCTTTCCCTACACGACGCTACACTCTN
+SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=34
@@@DDFDDBHBFHGI<F@GFBFEE>)C:D@@@B#
@SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=34
GTGTATGGAAAGAGTGTAGGGAAAGAGTGTAGGN
+SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=34
@@@DDDDAHHHFHIABEEEAB??CFBF?C@BFF#
@SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=34
CTCTTTCCCTACACTCTTTCCCTACACGACGCTN
+SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=34
@@@DDDAAADHDHDGDGIIIIIJJJJJJIJIIJ#
@SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=34
GTGTAGCGTCGTGTAGGGAAAGAGTGTGTGGAAN
+SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=34
@@@DDDDD?DFDCAEFHIGGFHEH:D1C:CG@F#
@SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=34
GTGTATGGAAAGAGTGTGCGTCGTACGTGTAGAN
+SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=34
@?@DDFFFHHHHGDAC:CHGGIIGIIIFHFGHB#
@SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=34
ACTCTTTCCCTACACTCTTTCCCTACACGACGCN
+SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=34
@@<DAAAA?>BCBE@9;EGGGGGIHJJIJHIGG#
@SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=34
GATTCTTCAAATGAGGACTATGCGGGACATGAAN
+SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=34
@@@DDDDDFHHFAHB;FHIIIIIIIIFHEHIHI#
@SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=34
GCGTCGTGTAGGGAAAGAGTGTAGCGTCGTGTAN
+SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=34
@@@DDDDD<FFD?GIIDGF+<<CBAFCGE@FB@#
@SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=34
CCATCACGCCTGCCCTTCCTTGAAATTACACCTN
+SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=34
;===AAA<@A72??A+22<+,+<+@+++*:***#
@SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=34
TTAATGGACTGTTTCAGGTAAAAGAGAATGAATN
+SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=34
CCCFFDAEHHHHDEHIGCEIIJJIGIJGIGGHE#
@SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=34
TATCTAGACTGCTCGTCATTTAGAAGACACGTCN
+SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=34
@B@FDDFFHFHBHEIIGIIJJGHGHIIIGIGII#
@SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=34
ACTGTTCTTTAACCAAACATCCGTGCGATTCGTN
+SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=34
CCCFFFFFHHHHHJJJJGHIJJIGIIIBEFG?G#
@SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=34
TTCCATACACTCTTTCCCTACACGACGCACACTN
+SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=34
@@@DFBEFHFFD<A<CD>BHEGGFGHGGIEGII#
@SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=34
AGACGGACCAGAGCGAAAGCATTTGCCAAGAATN
+SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=34
CCCFFFDFGHHHGJIIJJIJHEDD919CGGHJ@#
@SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=34
GAGTATAGGGAAAGAGTTTTTTTTTTTTTTTTTN
+SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=34
?=?DDDD>AB:ACEEGHIJJIJJJJIIJJHFDD#
@SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=34
CCTTTTATACAATACAAAGCTTTGCTTTTTTTTN
+SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=34
???DDDDDDDDD4EEEII@A<:33<33,22110#
@SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=34
ATCACAGCTGGAGAGATCTTGATCTTCATGGTGN
+SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=34
CCCFFFFFHHFHGGIIIIJIEAHCEHHEFECGD#
@SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=34
GTACTAGGCATCGTCATCCAATGCGACGAGTCCN
+SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=34
@@CFFDDFHHGHHIJJJIJJJIGGHIDG<GFHG#
@SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=34
CCAATAACTTGACCAACGGAACAAGTTACCCTAN
+SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=34
@CCFFFFFHHGHHIJJJJIJIIIIIIIIIJIJI#
@SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=34
GAGTGTACACTCTTTCCCTACACGACGTTACACN
+SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=34
???A:2ABDBDDDBEEIIA:F:CC8F<))1:??#

@@ -0,0 +1,120 @@
@SRR1058032.1_AATAACCCTACA_TTCCCGCGTCCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34
ACACGACGCTACACTCTN
+
HGGGIGGEGIIIJJJFI#
@SRR1058032.2_AGCGGGACGCTA_GTGCTCGTCGTACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1191:2236 length=34
CTACACGACGCTACACTN
+
JGGICGE6FDH<?F<F<#
@SRR1058032.3_CTTTAGACGCTA_TACCAGTCCTCACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1715:2245 length=34
CTACACGACGCTACACTN
+
IEIGIIAGFHIFG@FBE#
@SRR1058032.4_AGGCGTACTTTA_TGTTTTTTTTCACTCTCTCC HISEQ:653:H12WDADXX:1:1101:1905:2212 length=34
CTACACGACGCTACACTN
+
H<HGHIF>GEGCDG9FD#
@SRR1058032.5_ATCGAGGTGTAG_ACATAATTGAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:1927:2237 length=34
TAAGGAAAGAGTGTAGCN
+
FHHIHHHIHFHCEHHIG#
@SRR1058032.6_TGGGGGCCTATA_CGGTACATGATAGTATAGCT HISEQ:653:H12WDADXX:1:1101:1876:2243 length=34
TCCCATCTTCTTTGAGAN
+
JJJJIIIJJJGGGIGIE#
@SRR1058032.7_CTATATATTAAA_GTTTGCGCTGGACAAACTAC HISEQ:653:H12WDADXX:1:1101:2491:2207 length=34
AACTCATATGAGGCATTN
+
HIGBHH<H<BHDFGIIG#
@SRR1058032.8_CTCCCGGCCTAG_CATGCTGCTGTTGTGAACCA HISEQ:653:H12WDADXX:1:1101:2513:2219 length=34
AATGTGAAAAAACCTCCN
+
HIFI<HHEHCEBEFEED#
@SRR1058032.9_GAGCCCCCCTTC_TGAGGGGATCACGACGCTAC HISEQ:653:H12WDADXX:1:1101:2604:2231 length=34
ACTCTTTCCCTACACGAN
+
IJJJJIJJJIGBHBFG:#
@SRR1058032.10_AGCGGGGGGAAA_GTTCGCGGTTGAGTGTGTCG HISEQ:653:H12WDADXX:1:1101:2936:2218 length=34
TGTATGGAAAGAGTGTAN
+
@?AB>F@BF3;3?1C?<#
@SRR1058032.11_AGAATTCCCACA_GCCTGGATTTCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:3447:2241 length=34
ACACGACGCTACACTCTN
+
F@GFBFEE>)C:D@@@B#
@SRR1058032.12_AGGCGGGTGTAT_GGCAACGGGTGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:3620:2196 length=34
TAGGGAAAGAGTGTAGGN
+
EEEAB??CFBF?C@BFF#
@SRR1058032.13_GTCCCCCTCTTT_GCGTCGTGTACCCTACACTC HISEQ:653:H12WDADXX:1:1101:3875:2206 length=34
TTTCCCTACACGACGCTN
+
GIIIIIJJJJJJIJIIJ#
@SRR1058032.14_CCACGCGTGTAG_ATTCACTCGGCGTCGTGTAG HISEQ:653:H12WDADXX:1:1101:4131:2215 length=34
GGAAAGAGTGTGTGGAAN
+
HIGGFHEH:D1C:CG@F#
@SRR1058032.15_TGCGCAGTGTAT_ATAAGCGCTAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:4284:2241 length=34
TGCGTCGTACGTGTAGAN
+
:CHGGIIGIIIFHFGHB#
@SRR1058032.16_CGCTGGACTCTT_CAGAGCCCGGTCCCTACACT HISEQ:653:H12WDADXX:1:1101:4599:2232 length=34
CTTTCCCTACACGACGCN
+
;EGGGGGIHJJIJHIGG#
@SRR1058032.17_AGGCGGGATTCT_TGCATAGTCTTCAAATGAGG HISEQ:653:H12WDADXX:1:1101:5428:2200 length=34
ACTATGCGGGACATGAAN
+
FHIIIIIIIIFHEHIHI#
@SRR1058032.18_GTCCCCGCGTCG_CGCGTGTGACTGTAGGGAAA HISEQ:653:H12WDADXX:1:1101:5336:2218 length=34
GAGTGTAGCGTCGTGTAN
+
DGF+<<CBAFCGE@FB@#
@SRR1058032.19_TATAGACCATCA_AAAAACTTTTCGCCTGCCCT HISEQ:653:H12WDADXX:1:1101:5397:2220 length=34
TCCTTGAAATTACACCTN
+
22<+,+<+@+++*:***#
@SRR1058032.20_CATTATTTAATG_GGGCTTATTTGACTGTTTCA HISEQ:653:H12WDADXX:1:1101:5605:2194 length=34
GGTAAAAGAGAATGAATN
+
GCEIIJJIGIJGIGGHE#
@SRR1058032.21_AAATGTTATCTA_GCAGTTCAGAGACTGCTCGT HISEQ:653:H12WDADXX:1:1101:5519:2196 length=34
CATTTAGAAGACACGTCN
+
GIIJJGHGHIIIGIGII#
@SRR1058032.22_TGGGGGACTGTT_CTAAAGGGACCTTTAACCAA HISEQ:653:H12WDADXX:1:1101:5705:2220 length=34
ACATCCGTGCGATTCGTN
+
JGHIJJIGIIIBEFG?G#
@SRR1058032.23_GATAATTTCCAT_ACTTACGGTGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:5558:2236 length=34
CCTACACGACGCACACTN
+
D>BHEGGFGHGGIEGII#
@SRR1058032.24_CGTTAAAGACGG_TAATTGTGGTACCAGAGCGA HISEQ:653:H12WDADXX:1:1101:5649:2244 length=34
AAGCATTTGCCAAGAATN
+
JJIJHEDD919CGGHJ@#
@SRR1058032.25_AAAAAAGAGTAT_AAAAAAAAAAAGGGAAAGAG HISEQ:653:H12WDADXX:1:1101:5910:2207 length=34
TTTTTTTTTTTTTTTTTN
+
HIJJIJJJJIIJJHFDD#
@SRR1058032.26_GCCGACCCTTTT_CAACGATTTTATACAATACA HISEQ:653:H12WDADXX:1:1101:5757:2217 length=34
AAGCTTTGCTTTTTTTTN
+
II@A<:33<33,22110#
@SRR1058032.27_AATCAAATCACA_GACCACTGAAGCTGGAGAGA HISEQ:653:H12WDADXX:1:1101:5790:2248 length=34
TCTTGATCTTCATGGTGN
+
IIJIEAHCEHHEFECGD#
@SRR1058032.28_CGCGCTGTACTA_TTTGTTTTTTGGCATCGTCA HISEQ:653:H12WDADXX:1:1101:6079:2195 length=34
TCCAATGCGACGAGTCCN
+
JIJJJIGGHIDG<GFHG#
@SRR1058032.29_AAATACCCAATA_TTTGAGGGAAACTTGACCAA HISEQ:653:H12WDADXX:1:1101:6133:2213 length=34
CGGAACAAGTTACCCTAN
+
JJIJIIIIIIIIIJIJI#
@SRR1058032.30_AGCGGGGAGTGT_GTTTTATCGGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:6651:2198 length=34
CCTACACGACGTTACACN
+
IIA:F:CC8F<))1:??#

@@ -0,0 +1,34 @@
#!/bin/bash
# Download test data
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/slim.fastq.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.1.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.2.gz
gunzip -f slim.fastq.gz scrb_seq_fastq.1.gz scrb_seq_fastq.2.gz
# smaller datasets
head -n 120 slim.fastq > slim_30.fastq
head -n 120 scrb_seq_fastq.1 > scrb_seq_fastq.1_30
head -n 120 scrb_seq_fastq.2 > scrb_seq_fastq.2_30
rm slim.fastq scrb_seq_fastq.1 scrb_seq_fastq.2
# Generate expected output
# Test 1 and 2
umi_tools extract \
--stdin "scrb_seq_fastq.1_30" \
--read2-in "scrb_seq_fastq.2_30" \
--bc-pattern "CCCCCCNNNNNNNNNN" \
--bc-pattern2 "CCCCCCNNNNNNNNNN" \
--extract-method string \
--stdout scrb_seq_fastq.1_30.extract \
--read2-out scrb_seq_fastq.2_30.extract \
--random-seed 1
# Test 3
umi_tools extract \
--stdin "slim_30.fastq" \
--bc-pattern "^(?P<umi_1>.{3}).{4}(?P<umi_2>.{2})" \
--extract-method regex \
--stdout slim_30.extract \
--random-seed 1
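To illustrate what the Test 3 pattern selects, the sketch below emulates it with `sed` instead of calling `umi_tools`. It assumes, as the expected output suggests, that the bases matched by the `umi_1` and `umi_2` groups are moved to the read name while the four unnamed bases between them stay in the read. `split_umi` is a hypothetical helper introduced here for illustration and is not part of UMI-tools or of this test script.

```shell
# Emulate ^(?P<umi_1>.{3}).{4}(?P<umi_2>.{2}) with POSIX extended regex groups:
# bases 1-3 and 8-9 form the UMI, bases 4-7 remain at the start of the read.
split_umi() {
  # Prints "<UMI> <remaining read>" for the sequence passed as $1.
  printf '%s\n' "$1" | sed -E 's/^(.{3})(.{4})(.{2})(.*)$/\1\3 \2\4/'
}

# First read of slim_30.fastq (SRR2057595.7)
split_umi "CAGGTTCAATCTCGGTGGGACCTC"
# → CAGAA GTTCTCTCGGTGGGACCTC
```

The result agrees with the first record of the expected output, whose header ends in `_CAGAA` and whose sequence is `GTTCTCTCGGTGGGACCTC`.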

View File

@@ -0,0 +1,120 @@
@SRR2057595.7_CAGAA
GTTCTCTCGGTGGGACCTC
+
FFFFHHHJJJFGIJIJJIJ
@SRR2057595.9_TTGAA
GTTCTCTGATGCCCTCTTCTGGTGCATCTGAAGACAGCTACAGTGTACTTAGATATAATAAATAAATCTT
+
FDBDFHHIGGEHJGGIHGHGGCAFCHGIGEHIJJJJIJJJIHIIIIIIJIIIIIGHIIGGIJGIIJIIJ@
@SRR2057595.14_TGGAT
GTTAGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
FFFFHHHJJIJJJJIGHJJIIJJJJJIJHFHHFFEDEEEEDDDDBDDDD
@SRR2057595.22_ACGAT
GTTAGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGC
+
FFFFHHHJJJJJJJJIJJJJJJJJJJJJHHHFFFEDEEEEDDDDBDDD
@SRR2057595.23_GCGTT
GTTACCTAAGGCGAGCTCAGGGAGGACAGAAACCTCCCGTGGAGCAGAAGGGCAAAAGCTCGCTTGATCT
+
FFFFHHHJJJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIJJHHHHHFFFFDDDDDDDDDDDDDDDDDDA
@SRR2057595.29_ACGTT
GTTCGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT
+
FFFFHHHJJJJJJJJHIJJJJJJIJJJJHHHFFDEDEEDDCDDDBDDDDD
@SRR2057595.30_GAGAA
GTTGAATCCGTGCTAAGAAGAA
+
DFFFHHHJJJJIJJJJJJJJJJ
@SRR2057595.33_TCGAT
GTTTCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCTGGG
+
FFFFHHHJJJJJJJJJJJJJJJJJJJJJJJJJDHIJJJJIJJJHGGEEHFFFFFFEDDEDDDDDDDDDDB
@SRR2057595.35_ACGCT
GTTACCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
DFFFHHHJJJJJJIJJJJJJIJJJJJJJHIIJJ
@SRR2057595.38_GGGCC
GTTATGCATGTTTATAGTTTCTAGTTTTGGCATTTTGTGTGGTCTCTTTTTTGTT
+
DFFFHHHJJJJJJJJJJHJJIJJJIJJJJJJJJJJJJGIGHJHIJJIJJJJJJJJ
@SRR2057595.42_TAGGA
GTTGTAAGTTATACACTGACTAAGTCATCTGTTACTGCCTTCACTGAGTTTTTATTTCCTTT
+
DFFFHHHJJJJJJJJJJJJJJJJJIIJJJJGJJJJJJJJJJJJJJJIIHIJJJJJJJIJJJI
@SRR2057595.45_CTGGC
GTTTTGCGGAAGGATCATTA
+
DDDDFFDFFAGFE<EB8?BF
@SRR2057595.46_CAGTT
GTTTTGGCTTTTTTTTAAAACCATTTTGTGAAAGGTTTCTGAAACTTGATAATAAAAAGCAGTTGGTGTA
+
DDDDHHFIGIJJJJJJJIIIJIIIJJJICHHIGIJFHHGHIEIHGHFHEDFFEFEFEEDEDD@CDD<@B:
@SRR2057595.56_GGGCG
GTTTATGAAGAACGCAGCTAGCTGCGAGAATTAATGTGAATTGCAGGACACATTGATCATCGACACTTCG
+
FFFFHHHJJJJJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJJJJJJJJJJJJHHHHHHFFFFFDDDDDDD
@SRR2057595.59_GCGCC
GTTATCCTGTCTTATCATTGTCTTTTGAGCCTGGGCCTTGCCAGGTAGCTCTAGACTGGCCTAGAACTCA
+
FFFFHHHJJJJJJJJJJJJ4CHHJJJJJJJJJJJJJJJJJJJJIJDHIJJJIIJJJJJIJJJJHHHHHHB
@SRR2057595.60_ATGCA
GTTTTCTCGTCTGATCTCGGAAGCTAAGCAGGGCCGGGCCTGGTTAGTACTTGGATGGGAGACCGCC
+
DDFDBBBFECFE@HHIBCBG<2CGEC49?1CBD)86:;AB=7C.=;=)77;A3;?C@;96=?@B8;?
@SRR2057595.61_GAGAG
GTTTCAGGACACATTGATCATCGACACTTCGAACGCACTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCC
+
DFFFHHHGGHGHIIJGEFEGGFH9GGIIFGGGGIFGDHBG@FGGGHEFCCB?@@CDCCD?B7>B@ACB9<
@SRR2057595.65_GCGCG
GTTTGAGCTTGCTCCGTCCACTCAACGCATCGACCTGGTATTGCAGTACCTCCAGGAACGGTGCACCAAG
+
FFFFHHHJJJJJJHJIHHIIIIIIIJHJBHIHBFHHJI@EHJJHHHHHHHFFFBDE?AEBD=AB@CDBD?
@SRR2057595.67_AAGGT
GTTGTTTTGAGGTCCTGCTCGTGCAGGGT
+
DDDFHHHHGFHGFGGIIDGHHIGIJJJJ9
@SRR2057595.69_ATTAT
GGTTTTTGTTTTTCCTCCTTCTCTTTCTAAA
+
FFFFHHHHJJJJJJJJJJJJJJJJJJIJIJJ
@SRR2057595.70_TTAAA
GGTTTTGTAATTTTATGAGGTCCCATTTGTCAATTCTT
+
DDDD2CDFA@FBGHCCHFHGBFHGHIGGDHGHIIFCFF
@SRR2057595.71_TGCCA
GGTTTATTAGCATGGCCCCTGCGCAAGGATGACACGCAAATTCGTGAAGCGTTCCATATTT
+
FFFFHGHHJJJJJJJJJJIIJJIJIJJIFHJIIIJJJIJJJJJJHIIHHHHFFFDEECEEE
@SRR2057595.73_TGACA
GGTTGCGAGTGCCTAGTGGGCCACTTTTGGTAAGCAGAACTGGCGCTGCGGGA
+
FFFFGFFHC@EBHGHGAEGIIHIIIIJJJJGHIIIJIJIIGHIJIJJIGGEFD
@SRR2057595.74_AATTC
GGTTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
FFFFDFFHFIJJJGGGGJJGDDDDDDDDDDDDDBDDDDDDBBDDDDDDDDDDDDDDDDDDDBBBDDDDBD>
@SRR2057595.77_GCGGA
GTTCTCCCACTTCTGAC
+
FFFFDHHHIJJJIJJJJ
@SRR2057595.82_GAGAC
GGTTTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
FFFFHHHHJJJJJJJJJJJJJJJJIJJJJJIJIIJJJH
@SRR2057595.83_TGGAT
GTTGCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
DFFFHHHJJIJJJJJJIJJJIJJIGGHIFHGEH
@SRR2057595.86_ACCAC
GGTTTTTTTTTAAATGTAAAGCATAAATAAAAAGCCTTTGTGGACTGTGAAAAAAAAAAAAAAAAAAAAAA
+
FFFFHHHHJJJJJJJJIIJJJJJJJJJIJJJJJJJJJJJJGIJIIJJIJJJJJJHFDDDDDDDDDDDDDB>
@SRR2057595.88_TCAGC
GGTTCTAAGCATAGATAACCATATATCAGGGGGAGCTCCATGTTCTAGTCCTGCAAGCGCCTGGGCAATAA
+
FFFFHHHHJJJJJJIJJJJJIJJJJJJIJJIJJIJJJJJJJJJHIJJJJJJIIIHJIHHHFFDDDDEDDD@
@SRR2057595.99_TGACA
GGTTTCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT
+
FFFFDHHHIHIIIJJIJJJJJIGEHGFHIJJGHIHADHIIJIJJJIJG

View File

@@ -0,0 +1,120 @@
@SRR2057595.7
CAGGTTCAATCTCGGTGGGACCTC
+SRR2057595.7
1=DFFFFHHHHHJJJFGIJIJJIJ
@SRR2057595.9
TTGGTTCAATCTGATGCCCTCTTCTGGTGCATCTGAAGACAGCTACAGTGTACTTAGATATAATAAATAAATCTT
+SRR2057595.9
4=DFDBDHHFHHIGGEHJGGIHGHGGCAFCHGIGEHIJJJJIJJJIHIIIIIIJIIIIIGHIIGGIJGIIJIIJ@
@SRR2057595.14
TGGGTTAATGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.14
1=DFFFFHHHHHJJIJJJJIGHJJIIJJJJJIJHFHHFFEDEEEEDDDDBDDDD
@SRR2057595.22
ACGGTTAATGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGC
+SRR2057595.22
1=DFFFFHHHHHJJJJJJJJIJJJJJJJJJJJJHHHFFFEDEEEEDDDDBDDD
@SRR2057595.23
GCGGTTATTCCTAAGGCGAGCTCAGGGAGGACAGAAACCTCCCGTGGAGCAGAAGGGCAAAAGCTCGCTTGATCT
+SRR2057595.23
1=DFFFFHHHHHJJJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIJJHHHHHFFFFDDDDDDDDDDDDDDDDDDA
@SRR2057595.29
ACGGTTCTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT
+SRR2057595.29
1=DFFFFHHHHHJJJJJJJJHIJJJJJJIJJJJHHHFFDEDEEDDCDDDBDDDDD
@SRR2057595.30
GAGGTTGAAAATCCGTGCTAAGAAGAA
+SRR2057595.30
4=DDFFFHHHHHJJJJIJJJJJJJJJJ
@SRR2057595.33
TCGGTTTATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCTGGG
+SRR2057595.33
1=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJDHIJJJJIJJJHGGEEHFFFFFFEDDEDDDDDDDDDDB
@SRR2057595.35
ACGGTTACTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.35
1=DDFFFHHHHHJJJJJJIJJJJJJIJJJJJJJHIIJJ
@SRR2057595.38
GGGGTTACCTGCATGTTTATAGTTTCTAGTTTTGGCATTTTGTGTGGTCTCTTTTTTGTT
+SRR2057595.38
1=DDFFFHHHHHJJJJJJJJJJHJJIJJJIJJJJJJJJJJJJGIGHJHIJJIJJJJJJJJ
@SRR2057595.42
TAGGTTGGATAAGTTATACACTGACTAAGTCATCTGTTACTGCCTTCACTGAGTTTTTATTTCCTTT
+SRR2057595.42
1=DDFFFHHHHHJJJJJJJJJJJJJJJJJIIJJJJGJJJJJJJJJJJJJJJIIHIJJJJJJJIJJJI
@SRR2057595.45
CTGGTTTGCTGCGGAAGGATCATTA
+SRR2057595.45
1:DDDDDDDFFDFFAGFE<EB8?BF
@SRR2057595.46
CAGGTTTTTTGGCTTTTTTTTAAAACCATTTTGTGAAAGGTTTCTGAAACTTGATAATAAAAAGCAGTTGGTGTA
+SRR2057595.46
4=DDDDDHHHHFIGIJJJJJJJIIIJIIIJJJICHHIGIJFHHGHIEIHGHFHEDFFEFEFEEDEDD@CDD<@B:
@SRR2057595.56
GGGGTTTCGATGAAGAACGCAGCTAGCTGCGAGAATTAATGTGAATTGCAGGACACATTGATCATCGACACTTCG
+SRR2057595.56
4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJJJJJJJJJJJJHHHHHHFFFFFDDDDDDD
@SRR2057595.59
GCGGTTACCTCCTGTCTTATCATTGTCTTTTGAGCCTGGGCCTTGCCAGGTAGCTCTAGACTGGCCTAGAACTCA
+SRR2057595.59
1=DFFFFHHHHHJJJJJJJJJJJJ4CHHJJJJJJJJJJJJJJJJJJJJIJDHIJJJIIJJJJJIJJJJHHHHHHB
@SRR2057595.60
ATGGTTTCATCTCGTCTGATCTCGGAAGCTAAGCAGGGCCGGGCCTGGTTAGTACTTGGATGGGAGACCGCC
+SRR2057595.60
11BDDFDFFBBBFECFE@HHIBCBG<2CGEC49?1CBD)86:;AB=7C.=;=)77;A3;?C@;96=?@B8;?
@SRR2057595.61
GAGGTTTAGCAGGACACATTGATCATCGACACTTCGAACGCACTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCC
+SRR2057595.61
1=DDFFFGHHHHGGHGHIIJGEFEGGFH9GGIIFGGGGIFGDHBG@FGGGHEFCCB?@@CDCCD?B7>B@ACB9<
@SRR2057595.65
GCGGTTTCGGAGCTTGCTCCGTCCACTCAACGCATCGACCTGGTATTGCAGTACCTCCAGGAACGGTGCACCAAG
+SRR2057595.65
1=DFFFFHHHHHJJJJJJHJIHHIIIIIIIJHJBHIHBFHHJI@EHJJHHHHHHHFFFBDE?AEBD=AB@CDBD?
@SRR2057595.67
AAGGTTGGTTTTTGAGGTCCTGCTCGTGCAGGGT
+SRR2057595.67
1:BDDDFHFHHHHGFHGFGGIIDGHHIGIJJJJ9
@SRR2057595.69
ATTGGTTATTTTGTTTTTCCTCCTTCTCTTTCTAAA
+SRR2057595.69
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJIJIJJ
@SRR2057595.70
TTAGGTTAATTGTAATTTTATGAGGTCCCATTTGTCAATTCTT
+SRR2057595.70
@@@DDDDD+2CDFA@FBGHCCHFHGBFHGHIGGDHGHIIFCFF
@SRR2057595.71
TGCGGTTCATATTAGCATGGCCCCTGCGCAAGGATGACACGCAAATTCGTGAAGCGTTCCATATTT
+SRR2057595.71
CCCFFFFFHHGHHJJJJJJJJJJIIJJIJIJJIFHJIIIJJJIJJJJJJHIIHHHHFFFDEECEEE
@SRR2057595.73
TGAGGTTCAGCGAGTGCCTAGTGGGCCACTTTTGGTAAGCAGAACTGGCGCTGCGGGA
+SRR2057595.73
@@@FFFFFHGFFHC@EBHGHGAEGIIHIIIIJJJJGHIIIJIJIIGHIJIJJIGGEFD
@SRR2057595.74
AATGGTTTCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR2057595.74
@CCFFFFFGDFFHFIJJJGGGGJJGDDDDDDDDDDDDDBDDDDDDBBDDDDDDDDDDDDDDDDDDDBBBDDDDBD>
@SRR2057595.77
GCGGTTCGATCCCACTTCTGAC
+SRR2057595.77
1=DFFFFHGDHHHIJJJIJJJJ
@SRR2057595.82
GAGGGTTACTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.82
CBCFFFFFHHHHHJJJJJJJJJJJJJJJJIJJJJJIJIIJJJH
@SRR2057595.83
TGGGTTGATCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.83
1=DDFFFHHHHHJJIJJJJJJIJJJIJJIGGHIFHGEH
@SRR2057595.86
ACCGGTTACTTTTTTTAAATGTAAAGCATAAATAAAAAGCCTTTGTGGACTGTGAAAAAAAAAAAAAAAAAAAAAA
+SRR2057595.86
BCCFFFFFHHHHHJJJJJJJJIIJJJJJJJJJIJJJJJJJJJJJJGIJIIJJIJJJJJJHFDDDDDDDDDDDDDB>
@SRR2057595.88
TCAGGTTGCCTAAGCATAGATAACCATATATCAGGGGGAGCTCCATGTTCTAGTCCTGCAAGCGCCTGGGCAATAA
+SRR2057595.88
CCCFFFFFHHHHHJJJJJJIJJJJJIJJJJJJIJJIJJIJJJJJJJJJHIJJJJJJIIIHJIHHHFFDDDDEDDD@
@SRR2057595.99
TGAGGTTCATCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT
+SRR2057595.99
B@CFFFFFFDHHHIHIIIJJIJJJJJIGEHGFHIJJGHIHADHIIJIJJJIJG

View File

@@ -0,0 +1,254 @@
name: "agat_convert_sp_gff2gtf"
namespace: "agat"
version: "qualimap"
authors:
- name: "Leïla Paquay"
roles:
- "author"
- "maintainer"
info:
links:
email: "leila@data-intuitive.com"
github: "Leila011"
linkedin: "leilapaquay"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Software Developer"
argument_groups:
- name: "Inputs"
arguments:
- type: "file"
name: "--gff"
alternatives:
- "-i"
description: "Input GFF/GTF file that will be read"
info: null
example:
- "input.gff"
must_exist: true
create_parent: true
required: true
direction: "input"
multiple: false
multiple_sep: ";"
- name: "Outputs"
arguments:
- type: "file"
name: "--output"
alternatives:
- "-o"
- "--out"
- "--outfile"
- "--gtf"
description: "Output GTF file. If no output file is specified, the output will\
\ be written to STDOUT."
info: null
example:
- "output.gtf"
must_exist: true
create_parent: true
required: true
direction: "output"
multiple: false
multiple_sep: ";"
- name: "Arguments"
arguments:
- type: "string"
name: "--gtf_version"
description: "Version of the GTF output (1,2,2.1,2.2,2.5,3 or relax). Default\
\ value from AGAT config file (relax for the default config). The script option\
\ has the higher priority. \n\n * relax: all feature types are accepted. \
\ \n * GTF3 (9 feature types accepted): gene, transcript, exon, CDS, Selenocysteine,\
\ start_codon, stop_codon, three_prime_utr and five_prime_utr. \n * GTF2.5\
\ (8 feature types accepted): gene, transcript, exon, CDS, UTR, start_codon,\
\ stop_codon, Selenocysteine. \n * GTF2.2 (9 feature types accepted): CDS,\
\ start_codon, stop_codon, 5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon.\
\ \n * GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon, exon,\
\ 5UTR, 3UTR. \n * GTF2 (4 feature types accepted): CDS, start_codon, stop_codon,\
\ exon. \n * GTF1 (5 feature types accepted): CDS, start_codon, stop_codon,\
\ exon, intron. \n"
info: null
example:
- "3"
required: false
choices:
- "relax"
- "1"
- "2"
- "2.1"
- "2.2"
- "2.5"
- "3"
direction: "input"
multiple: false
multiple_sep: ";"
- type: "file"
name: "--config"
alternatives:
- "-c"
description: "Input agat config file. By default AGAT takes as input agat_config.yaml\
\ file from the working directory if any, otherwise it takes the original agat_config.yaml\
\ shipped with AGAT. To get the agat_config.yaml locally type: \"agat config\
\ --expose\". The --config option gives you the possibility to use your own\
\ AGAT config file (located elsewhere or named differently).\n"
info: null
example:
- "custom_agat_config.yaml"
must_exist: true
create_parent: true
required: false
direction: "input"
multiple: false
multiple_sep: ";"
resources:
- type: "bash_script"
path: "script.sh"
is_executable: true
description: "The script aims to convert any GTF/GFF file into a proper GTF file.\
\ Full\ninformation about the format can be found here:\nhttps://agat.readthedocs.io/en/latest/gxf.html\
\ You can choose among 7\ndifferent GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax).\
\ Depending on the\nversion selected, the script will filter out the features that are\
\ not\naccepted. For GTF2.5 and 3, every level1 feature (e.g. nc_gene,\npseudogene)\
\ will be converted into a gene feature and every level2 feature\n(e.g. mRNA, ncRNA)\
\ will be converted into a transcript feature. Using the\n\"relax\" option you will\
\ produce a GTF-like output keeping all original\nfeature types (3rd column). No\
\ modification will occur e.g. mRNA to\ntranscript.\n\nTo be fully GTF compliant,\
\ all features have a gene_id and a transcript_id\nattribute. The gene_id is a unique\
\ identifier for the genomic source of\nthe transcript, which is used to group transcripts\
\ into genes. The\ntranscript_id is a unique identifier for the predicted transcript,\
\ which\nis used to group features into transcripts.\n"
test_resources:
- type: "bash_script"
path: "test.sh"
is_executable: true
- type: "file"
path: "test_data"
info: null
status: "enabled"
requirements:
commands:
- "ps"
keywords:
- "gene annotations"
- "GTF conversion"
license: "GPL-3.0"
references:
doi:
- "10.5281/zenodo.3552717"
links:
repository: "https://github.com/NBISweden/AGAT"
homepage: "https://github.com/NBISweden/AGAT"
documentation: "https://agat.readthedocs.io/"
issue_tracker: "https://github.com/NBISweden/AGAT/issues"
runners:
- type: "executable"
id: "executable"
docker_setup_strategy: "ifneedbepullelsecachedbuild"
- type: "nextflow"
id: "nextflow"
directives:
tag: "$id"
auto:
simplifyInput: true
simplifyOutput: false
transcript: false
publish: false
config:
labels:
mem1gb: "memory = 1000000000.B"
mem2gb: "memory = 2000000000.B"
mem5gb: "memory = 5000000000.B"
mem10gb: "memory = 10000000000.B"
mem20gb: "memory = 20000000000.B"
mem50gb: "memory = 50000000000.B"
mem100gb: "memory = 100000000000.B"
mem200gb: "memory = 200000000000.B"
mem500gb: "memory = 500000000000.B"
mem1tb: "memory = 1000000000000.B"
mem2tb: "memory = 2000000000000.B"
mem5tb: "memory = 5000000000000.B"
mem10tb: "memory = 10000000000000.B"
mem20tb: "memory = 20000000000000.B"
mem50tb: "memory = 50000000000000.B"
mem100tb: "memory = 100000000000000.B"
mem200tb: "memory = 200000000000000.B"
mem500tb: "memory = 500000000000000.B"
mem1gib: "memory = 1073741824.B"
mem2gib: "memory = 2147483648.B"
mem4gib: "memory = 4294967296.B"
mem8gib: "memory = 8589934592.B"
mem16gib: "memory = 17179869184.B"
mem32gib: "memory = 34359738368.B"
mem64gib: "memory = 68719476736.B"
mem128gib: "memory = 137438953472.B"
mem256gib: "memory = 274877906944.B"
mem512gib: "memory = 549755813888.B"
mem1tib: "memory = 1099511627776.B"
mem2tib: "memory = 2199023255552.B"
mem4tib: "memory = 4398046511104.B"
mem8tib: "memory = 8796093022208.B"
mem16tib: "memory = 17592186044416.B"
mem32tib: "memory = 35184372088832.B"
mem64tib: "memory = 70368744177664.B"
mem128tib: "memory = 140737488355328.B"
mem256tib: "memory = 281474976710656.B"
mem512tib: "memory = 562949953421312.B"
cpu1: "cpus = 1"
cpu2: "cpus = 2"
cpu5: "cpus = 5"
cpu10: "cpus = 10"
cpu20: "cpus = 20"
cpu50: "cpus = 50"
cpu100: "cpus = 100"
cpu200: "cpus = 200"
cpu500: "cpus = 500"
cpu1000: "cpus = 1000"
debug: false
container: "docker"
engines:
- type: "docker"
id: "docker"
image: "quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0"
target_registry: "images.viash-hub.com"
target_tag: "qualimap"
namespace_separator: "/"
setup:
- type: "docker"
run:
- "agat --version | sed 's/AGAT\\s\\(.*\\)/agat: \"\\1\"/' > /var/software_versions.txt\n"
entrypoint: []
cmd: null
- type: "native"
id: "native"
build_info:
config: "src/agat/agat_convert_sp_gff2gtf/config.vsh.yaml"
runner: "executable"
engine: "docker|native"
output: "target/executable/agat/agat_convert_sp_gff2gtf"
executable: "target/executable/agat/agat_convert_sp_gff2gtf/agat_convert_sp_gff2gtf"
viash_version: "0.9.0-RC6"
git_commit: "28cd12293505544b3e09ff6343e4724dedb772d3"
git_remote: "https://github.com/viash-hub/biobox"
package_config:
name: "biobox"
version: "qualimap"
description: "A collection of bioinformatics tools for working with sequence data.\n"
info: null
viash_version: "0.9.0-RC6"
source: "src"
target: "target"
config_mods:
- ".requirements.commands := ['ps']\n"
- ".engines += { type: \"native\" }"
- ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
- ".engines[.type == 'docker'].target_tag := 'qualimap'"
keywords:
- "bioinformatics"
- "modules"
- "sequencing"
license: "MIT"
organization: "vsh"
links:
repository: "https://github.com/viash-hub/biobox"
issue_tracker: "https://github.com/viash-hub/biobox/issues"

File diff suppressed because it is too large

View File

@@ -1,5 +1,23 @@
name: "arriba"
version: "qualimap"
authors:
- name: "Robrecht Cannoodt"
roles:
- "author"
- "maintainer"
info:
links:
email: "robrecht@data-intuitive.com"
github: "rcannood"
orcid: "0000-0003-3641-729X"
linkedin: "robrechtcannoodt"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Science Engineer"
- name: "Open Problems"
href: "https://openproblems.bio"
role: "Core Member"
argument_groups:
- name: "Inputs"
arguments:
@@ -688,7 +706,7 @@ build_info:
output: "target/executable/arriba"
executable: "target/executable/arriba/arriba"
viash_version: "0.9.0-RC6"
git_commit: "e6420cd80f226128b7223ff79ce1297f99993657"
git_commit: "28cd12293505544b3e09ff6343e4724dedb772d3"
git_remote: "https://github.com/viash-hub/biobox"
package_config:
name: "biobox"

View File

@@ -10,6 +10,9 @@
# authors of this component should specify the license in the header of such
# files, or include a separate license file detailing the licenses of all included
# files.
#
# Component authors:
# * Robrecht Cannoodt (author, maintainer)
set -e
@@ -748,10 +751,11 @@ FROM quay.io/biocontainers/arriba:2.4.0--h0033a41_2
ENTRYPOINT []
RUN arriba -h | grep 'Version:' 2>&1 | sed 's/Version:\s\(.*\)/arriba: "\1"/' > /var/software_versions.txt
LABEL org.opencontainers.image.authors="Robrecht Cannoodt"
LABEL org.opencontainers.image.description="Companion container for running component arriba"
LABEL org.opencontainers.image.created="2024-07-29T14:42:19Z"
LABEL org.opencontainers.image.created="2024-07-29T14:45:24Z"
LABEL org.opencontainers.image.source="https://github.com/suhrig/arriba"
LABEL org.opencontainers.image.revision="e6420cd80f226128b7223ff79ce1297f99993657"
LABEL org.opencontainers.image.revision="28cd12293505544b3e09ff6343e4724dedb772d3"
LABEL org.opencontainers.image.version="qualimap"
VIASHDOCKER

View File

@@ -1,5 +1,30 @@
name: "bcl_convert"
version: "qualimap"
authors:
- name: "Toni Verbeiren"
roles:
- "author"
- "maintainer"
info:
links:
github: "tverbeiren"
linkedin: "verbeiren"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist and CEO"
- name: "Dorien Roosen"
roles:
- "author"
info:
links:
email: "dorien@data-intuitive.com"
github: "dorien-er"
linkedin: "dorien-roosen"
organizations:
- name: "Data Intuitive"
href: "https://www.data-intuitive.com"
role: "Data Scientist"
argument_groups:
- name: "Input arguments"
arguments:
@@ -281,9 +306,16 @@ status: "enabled"
requirements:
commands:
- "ps"
license: "MIT"
keywords:
- "demultiplex"
- "fastq"
- "bcl"
- "illumina"
license: "Proprietary"
links:
repository: "https://github.com/viash-hub/biobox"
homepage: "https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html"
documentation: "https://support.illumina.com/downloads/bcl-convert-user-guide.html"
runners:
- type: "executable"
id: "executable"
@@ -386,7 +418,7 @@ build_info:
output: "target/executable/bcl_convert"
executable: "target/executable/bcl_convert/bcl_convert"
viash_version: "0.9.0-RC6"
git_commit: "e6420cd80f226128b7223ff79ce1297f99993657"
git_commit: "28cd12293505544b3e09ff6343e4724dedb772d3"
git_remote: "https://github.com/viash-hub/biobox"
package_config:
name: "biobox"

View File

@@ -10,6 +10,10 @@
# authors of this component should specify the license in the header of such
# files, or include a separate license file detailing the licenses of all included
# files.
#
# Component authors:
# * Toni Verbeiren (author, maintainer)
# * Dorien Roosen (author)
set -e
@@ -592,10 +596,11 @@ rm /tmp/bcl-convert.rpm
RUN echo "bcl-convert: \"$(bcl-convert -V 2>&1 >/dev/null | sed -n '/Version/ s/^bcl-convert\ Version //p')\"" > /var/software_versions.txt
LABEL org.opencontainers.image.authors="Toni Verbeiren, Dorien Roosen"
LABEL org.opencontainers.image.description="Companion container for running component bcl_convert"
LABEL org.opencontainers.image.created="2024-07-29T14:42:19Z"
LABEL org.opencontainers.image.created="2024-07-29T14:45:25Z"
LABEL org.opencontainers.image.source="https://github.com/viash-hub/biobox"
LABEL org.opencontainers.image.revision="e6420cd80f226128b7223ff79ce1297f99993657"
LABEL org.opencontainers.image.revision="28cd12293505544b3e09ff6343e4724dedb772d3"
LABEL org.opencontainers.image.version="qualimap"
VIASHDOCKER

Some files were not shown because too many files have changed in this diff