Build branch qualimap with version qualimap (28cd122)

Build pipeline: viash-hub.biobox.qualimap-6tqq7
Source commit: 28cd122935
Source message: Merge branch 'main' into qualimap
2024-07-29 15:00:07 +00:00
parent fbfdc19532
commit aa043fdc19
283 changed files with 42125 additions and 3915 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,18 +1,29 @@
 # biobox x.x.x

-## BUG FIXES
+## BREAKING CHANGES

-* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
+* `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).

-* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
+* `star/star_genome_generate`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).

-* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
+## NEW FUNCTIONALITY

-* `multiqc`: update multiple separator to `;` (PR #81).
+* `star/star_align_reads`: Add star solo related arguments (PR #62).
+
+* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).
+
+* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
+
+* `seqtk`:
+  - `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
+  - `seqtk/seqtk_subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
+                based on a provided sequence IDs or region coordinates file (PR #85).
+
+* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

 ## MINOR CHANGES

-* `busco` components: update BUSCO to `5.7.1`.
+* `busco` components: update BUSCO to `5.7.1` (PR #72).

 ## NEW FEATURES

@@ -20,12 +31,36 @@

 # biobox 0.1.0

-## BREAKING CHANGES
+* Update CI to reusable workflow in `viash-io/viash-actions` (PR #86).

-* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
-  Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
-  splitting up certain file paths.
+## DOCUMENTATION

+* Extend the contributing guidelines (PR #82):
+
+  - Update format to Viash 0.9.
+
+  - Descriptions should be formatted in markdown.
+
+  - Add defaults to descriptions, not as a default of the argument.
+
+  - Explain parameter expansion.
+
+  - Mention that the contents of the output of components in tests should be checked.
+
+* Add authorship to existing components (PR #88).
+
+## BUG FIXES
+
+* `pear`: fix component not exiting with the correct exitcode when PEAR fails (PR #70).
+
+* `cutadapt`: fix `--par_quality_cutoff_r2` argument (PR #69).
+
+* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode` (PR #69).
+
+* `multiqc`: update multiple separator to `;` (PR #81).
+
+
+# biobox 0.1.0

 ## NEW FEATURES

@@ -74,12 +109,11 @@
    - `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
    - `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).

+* `umi_tools`:
+    - `umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).

 * `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

-* `umitools`:
-    - `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
-
 * `bedtools`:
    - `bedtools_getfasta`: extract sequences from a FASTA file for each of the
                           intervals defined in a BED/GFF/VCF file (PR #59).
@@ -104,4 +138,4 @@

 * Add escaping character before leading hashtag in the description field of the config file (PR #50).

-* Format URL in biobase/bcl_convert description (PR #55).
+* Format URL in biobase/bcl_convert description (PR #55).
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -65,22 +65,21 @@ runners:
 Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.

 ```yaml
-functionality:
-  name: arriba
-  description: Detect gene fusions from RNA-Seq data
-  keywords: [Gene fusion, RNA-Seq]
-  links:
-    homepage: https://arriba.readthedocs.io/en/latest/
-    documentation: https://arriba.readthedocs.io/en/latest/
-    repository: https://github.com/suhrig/arriba
-    issue_tracker: https://github.com/suhrig/arriba/issues
-  references:
-    doi: 10.1101/gr.257246.119
-    bibtex: |
-      @article{
-        ... a bibtex entry in case the doi is not available ...
-      }
-  license: MIT
+name: arriba
+description: Detect gene fusions from RNA-Seq data
+keywords: [Gene fusion, RNA-Seq]
+links:
+  homepage: https://arriba.readthedocs.io/en/latest/
+  documentation: https://arriba.readthedocs.io/en/latest/
+  repository: https://github.com/suhrig/arriba
+  issue_tracker: https://github.com/suhrig/arriba/issues
+references:
+  doi: 10.1101/gr.257246.119
+  bibtex: |
+    @article{
+      ... a bibtex entry in case the doi is not available ...
+    }
+license: MIT
 ```

 ### Step 4: Find a suitable container
@@ -162,7 +161,7 @@ argument_groups:
      type: file
      description: |
        File in SAM/BAM/CRAM format with main alignments as generated by STAR
-        (Aligned.out.sam). Arriba extracts candidate reads from this file.
+        (`Aligned.out.sam`). Arriba extracts candidate reads from this file.
      required: true
      example: Aligned.out.bam
 ```
@@ -175,7 +174,7 @@ Several notes:

 * Input arguments can have `multiple: true` to allow the user to specify multiple files.

-
+* The description should be formatted in markdown.

 ### Step 8: Add arguments for the output files

@@ -220,7 +219,7 @@ argument_groups:

 Note: 

-* Preferably, these outputs should not be directores but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
+* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).

 ### Step 9: Add arguments for the other arguments

@@ -230,6 +229,8 @@ Finally, add all other arguments to the config file. There are a few exceptions:

 * Arguments related to printing the information such as printing the version (`-v`, `--version`) or printing the help (`-h`, `--help`) should not be added to the config file.

+* If the help lists defaults, do not set them as the `default` of the argument; mention them in the description instead. Example: `description: <Explanation of parameter>. Default: 10.`
+

 ### Step 10: Add a Docker engine

@@ -275,10 +276,13 @@ Next, we need to write a runner script that runs the tool with the input argumen
 ## VIASH START
 ## VIASH END

+# unset flags
+[[ "$par_option" == "false" ]] && unset par_option
+
 xxx \
  --input "$par_input" \
  --output "$par_output" \
-  $([ "$par_option" = "true" ] && echo "--option")
+  ${par_option:+--option}
 ```

 When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
@@ -291,6 +295,11 @@ As an example, this is what the Bash script for the `arriba` component looks lik
 ## VIASH START
 ## VIASH END

+# unset flags
+[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
+[[ "$par_extra_information" == "false" ]] && unset par_extra_information
+[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
+
 arriba \
  -x "$par_bam" \
  -a "$par_genome" \
@@ -298,26 +307,30 @@ arriba \
  -o "$par_fusions" \
  ${par_known_fusions:+-k "${par_known_fusions}"} \
  ${par_blacklist:+-b "${par_blacklist}"} \
-  ${par_structural_variants:+-d "${par_structural_variants}"} \
-  $([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
-  $([ "$par_extra_information" = "true" ] && echo "-X") \
-  $([ "$par_fill_gaps" = "true" ] && echo "-I")
+  # ...
+  ${par_extra_information:+-X} \
+  ${par_fill_gaps:+-I}
 ```

+Notes:
+
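+* If your arguments can contain special characters (e.g. `$`), you can use the Bash `@Q` parameter transformation (described under [Shell Parameter Expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)) to quote the value so that it can be reused as input. Example: `-x ${par_bam@Q}`.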
+
+* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`
+
+* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
+

 ### Step 12: Create test script

-
 If the unit test requires test resources, these should be provided in the `test_resources` section of the component. 

 ```yaml
-functionality:
-  # ...
-  test_resources:
-    - type: bash_script
-      path: test.sh
-    - type: file
-      path: test_data
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: test_data
 ```

 Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:
@@ -325,48 +338,64 @@ Create a test script at `src/xxx/test.sh` that runs the component with the test
 ```bash
 #!/bin/bash

+set -e
+
 ## VIASH START
 ## VIASH END

-echo "> Run xxx with test data"
+#############################################
+# helper functions
+assert_file_exists() {
+  [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
+}
+assert_file_doesnt_exist() {
+  [ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
+}
+assert_file_empty() {
+  [ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
+}
+assert_file_not_empty() {
+  [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
+}
+assert_file_contains() {
+  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
+}
+assert_file_not_contains() {
+  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
+}
+assert_file_contains_regex() {
+  grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
+}
+assert_file_not_contains_regex() {
+  ! grep -q -E "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
+}
+#############################################
+
+echo "> Run $meta_name with test data"
 "$meta_executable" \
-  --input "$meta_resources_dir/test_data/input.txt" \
+  --input "$meta_resources_dir/test_data/reads_R1.fastq" \
  --output "output.txt" \
  --option

-echo ">> Checking output"
-[ ! -f "output.txt" ] && echo "Output file output.txt does not exist" && exit 1
-```
-
-
-For example, this is what the test script for the `arriba` component looks like:
-
-```bash
-#!/bin/bash
-
-## VIASH START
-## VIASH END
-
-echo "> Run arriba with blacklist"
-"$meta_executable" \
-  --bam "$meta_resources_dir/test_data/A.bam" \
-  --genome "$meta_resources_dir/test_data/genome.fasta" \
-  --gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
-  --blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
-  --fusions "fusions.tsv" \
-  --fusions_discarded "fusions_discarded.tsv" \
-  --interesting_contigs "1,2"
-
-echo ">> Checking output"
-[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
-[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
+echo ">> Check if output exists"
+assert_file_exists "output.txt"

 echo ">> Check if output is empty"
-[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
-[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
+assert_file_not_empty "output.txt"
+
+echo ">> Check if output is correct"
+assert_file_contains "output.txt" "some expected output"
+
+echo "> All tests succeeded!"
 ```

-### Step 12: Create a `/var/software_versions.txt` file
+Notes:
+
+* Always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.
+
+* If possible, generate your own test data instead of copying it from an external resource.
+
+### Step 13: Create a `/var/software_versions.txt` file

 For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.

@@ -378,6 +407,8 @@ engines:
    image: quay.io/biocontainers/xxx:0.1.0--py_0
    setup:
      - type: docker
+        # note: /var/software_versions.txt should contain:
+        #   arriba: "2.4.0"
        run: |
          echo "xxx: \"0.1.0\"" > /var/software_versions.txt
 ```
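Negative assertions need some care under `set -e`: if the last command of a helper is a list of the form `grep ... && { ...; }` and `grep` finds nothing, the helper returns a non-zero status and the calling script exits even though the check passed. The following is a minimal sketch of helpers that avoid this by negating `grep`, assuming Bash. The temporary file and its contents are made up for illustration and are not part of the repository.

```shell
#!/bin/bash
set -e

# Succeeds when the pattern is present; prints a message and exits otherwise.
assert_file_contains() {
  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'"; exit 1; }
}

# Negated grep: the helper returns 0 when the pattern is absent,
# so `set -e` does not abort the caller on a passing check.
assert_file_not_contains() {
  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't"; exit 1; }
}

# Illustrative data only.
tmp_file=$(mktemp)
printf 'gene_id "g1"; transcript_id "t1";\n' > "$tmp_file"

assert_file_contains "$tmp_file" 'gene_id "g1"'
assert_file_not_contains "$tmp_file" 'ERROR'

rm "$tmp_file"
echo "> All checks passed"
```

Both helpers return status 0 on success, so they can be called as plain commands in a test script that enables `set -e`.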
--- a/src/_authors/angela_o_pisco.yaml
+++ b/src/_authors/angela_o_pisco.yaml
@@ -0,0 +1,14 @@
+name: Angela Oliveira Pisco
+info:
+  role: Contributor
+  links:
+    github: aopisco
+    orcid: "0000-0003-0142-2355"
+    linkedin: aopisco
+  organizations:
+    - name: Insitro
+      href: https://insitro.com
+      role: Director of Computational Biology
+    - name: Open Problems
+      href: https://openproblems.bio
+      role: Core Member
--- a/src/_authors/dorien_roosen.yaml
+++ b/src/_authors/dorien_roosen.yaml
@@ -0,0 +1,10 @@
+name: Dorien Roosen
+info:
+  links:
+    email: dorien@data-intuitive.com
+    github: dorien-er
+    linkedin: dorien-roosen
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Data Scientist
--- a/src/_authors/dries_schaumont.yaml
+++ b/src/_authors/dries_schaumont.yaml
@@ -0,0 +1,11 @@
+name: Dries Schaumont
+info:
+  links:
+    email: dries@data-intuitive.com
+    github: DriesSchaumont
+    orcid: "0000-0002-4389-0440"
+    linkedin: dries-schaumont
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Data Scientist
--- a/src/_authors/emma_rousseau.yaml
+++ b/src/_authors/emma_rousseau.yaml
@@ -0,0 +1,10 @@
+name: Emma Rousseau
+info:
+  links:
+    email: emma@data-intuitive.com
+    github: emmarousseau
+    linkedin: emmarousseau1
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Bioinformatician
--- a/src/_authors/jakub_majercik.yaml
+++ b/src/_authors/jakub_majercik.yaml
@@ -0,0 +1,10 @@
+name: Jakub Majercik
+info:
+  links:
+    email: jakub@data-intuitive.com
+    github: jakubmajercik
+    linkedin: jakubmajercik
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Bioinformatics Engineer
--- a/src/_authors/kai_waldrant.yaml
+++ b/src/_authors/kai_waldrant.yaml
@@ -0,0 +1,14 @@
+name: Kai Waldrant
+info:
+  links:
+    email: kai@data-intuitive.com
+    github: KaiWaldrant
+    orcid: "0009-0003-8555-1361"
+    linkedin: kaiwaldrant
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Bioinformatician
+    - name: Open Problems
+      href: https://openproblems.bio
+      role: Contributor
--- a/src/_authors/leila_paquay.yaml
+++ b/src/_authors/leila_paquay.yaml
@@ -0,0 +1,10 @@
+name: Leïla Paquay
+info:
+  links:
+    email: leila@data-intuitive.com
+    github: Leila011
+    linkedin: leilapaquay
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Software Developer
--- a/src/_authors/robrecht_cannoodt.yaml
+++ b/src/_authors/robrecht_cannoodt.yaml
@@ -0,0 +1,14 @@
+name: Robrecht Cannoodt
+info:
+  links:
+    email: robrecht@data-intuitive.com
+    github: rcannood
+    orcid: "0000-0003-3641-729X"
+    linkedin: robrechtcannoodt
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Data Science Engineer
+    - name: Open Problems
+      href: https://openproblems.bio
+      role: Core Member
--- a/src/_authors/sai_nirmayi_yasa.yaml
+++ b/src/_authors/sai_nirmayi_yasa.yaml
@@ -0,0 +1,10 @@
+name: Sai Nirmayi Yasa
+info:
+  links:
+    email: nirmayi@data-intuitive.com
+    github: sainirmayi
+    linkedin: sai-nirmayi-yasa
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Junior Bioinformatics Researcher
--- a/src/_authors/theodoro_gasperin.yaml
+++ b/src/_authors/theodoro_gasperin.yaml
@@ -0,0 +1,10 @@
+name: Theodoro Gasperin Terra Camargo
+info:
+  links:
+    email: theodorogtc@gmail.com
+    github: tgaspe
+    linkedin: theodoro-gasperin-terra-camargo
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Bioinformatician
--- a/src/_authors/toni_verbeiren.yaml
+++ b/src/_authors/toni_verbeiren.yaml
@@ -0,0 +1,9 @@
+name: Toni Verbeiren
+info:
+  links:
+    github: tverbeiren
+    linkedin: verbeiren
+  organizations:
+  - name: Data Intuitive
+    href: https://www.data-intuitive.com
+    role: Data Scientist and CEO
--- a/src/_authors/weiwei_schultz.yaml
+++ b/src/_authors/weiwei_schultz.yaml
@@ -0,0 +1,5 @@
+name: Weiwei Schultz
+info:
+  organizations:
+    - name: Janssen R&D US
+      role: Associate Director Data Sciences
--- a/src/agat/agat_convert_sp_gff2gtf/config.vsh.yaml
+++ b/src/agat/agat_convert_sp_gff2gtf/config.vsh.yaml
@@ -0,0 +1,94 @@
+name: agat_convert_sp_gff2gtf
+namespace: agat
+description: |
+  The script aims to convert any GTF/GFF file into a proper GTF file. Full
+  information about the format can be found here:
+  https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
+  different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending on the
+  version selected the script will filter out the features that are not
+  accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
+  pseudogene) will be converted into gene feature and every level2 feature
+  (e.g mRNA ncRNA) will be converted into transcript feature. Using the
+  "relax" option you will produce a GTF-like output keeping all original
+  feature types (3rd column). No modification will occur e.g. mRNA to
+  transcript.
+
+  To be fully GTF compliant all features have a gene_id and a transcript_id
+  attribute. The gene_id is a unique identifier for the genomic source of
+  the transcript, which is used to group transcripts into genes. The
+  transcript_id is a unique identifier for the predicted transcript, which
+  is used to group features into transcripts.
+keywords: [gene annotations, GTF conversion]
+links:
+  homepage: https://github.com/NBISweden/AGAT
+  documentation: https://agat.readthedocs.io/
+  issue_tracker: https://github.com/NBISweden/AGAT/issues
+  repository: https://github.com/NBISweden/AGAT
+references: 
+  doi: 10.5281/zenodo.3552717
+license: GPL-3.0
+authors:
+  - __merge__: /src/_authors/leila_paquay.yaml
+    roles: [ author, maintainer ]
+
+argument_groups:
+  - name: Inputs
+    arguments:
+      - name: --gff
+        alternatives: [-i]
+        description: Input GFF/GTF file that will be read
+        type: file
+        required: true
+        direction: input
+        example: input.gff
+  - name: Outputs
+    arguments:       
+      - name: --output
+        alternatives: [-o, --out, --outfile, --gtf]
+        description: Output GTF file. If no output file is specified, the output will be written to STDOUT.
+        type: file
+        direction: output
+        required: true
+        example: output.gtf
+  - name: Arguments
+    arguments:
+      - name: --gtf_version
+        description: |
+          Version of the GTF output (1,2,2.1,2.2,2.5,3 or relax). Default value from AGAT config file (relax for the default config). The script option has the higher priority.  
+          
+            * relax: all feature types are accepted.  
+            * GTF3 (9 feature types accepted): gene, transcript, exon, CDS, Selenocysteine, start_codon, stop_codon, three_prime_utr and five_prime_utr.  
+            * GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS, UTR, start_codon, stop_codon, Selenocysteine.  
+            * GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon, 5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon.  
+            * GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon, exon, 5UTR, 3UTR.  
+            * GTF2 (4 feature types accepted): CDS, start_codon, stop_codon, exon.  
+            * GTF1 (5 feature types accepted): CDS, start_codon, stop_codon, exon, intron.  
+        type: string
+        choices: [relax, "1", "2", "2.1", "2.2", "2.5", "3"]
+        required: false
+        example: "3"
+      - name: --config
+        alternatives: [-c]
+        description: |
+          Input agat config file. By default AGAT takes as input agat_config.yaml file from the working directory if any, otherwise it takes the original agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose". The --config option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
+        type: file
+        required: false
+        example: custom_agat_config.yaml
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: test_data
+engines:
+  - type: docker
+    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
+    setup:
+      - type: docker
+        run: |
+          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
+runners:
+  - type: executable
+  - type: nextflow
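The `setup` step above rewrites the tool's version line into the `name: "version"` form expected in `/var/software_versions.txt`. A small sketch of that substitution follows. It assumes the version command prints a line such as `AGAT v1.4.0`, which is not shown in this diff. `format_version` is a hypothetical helper name, and it uses the POSIX class `[[:space:]]` where the config uses the GNU sed shorthand `\s`.

```shell
#!/bin/bash

# Assumption: the version command prints a line such as "AGAT v1.4.0".
# The real output of `agat --version` is not shown in this diff.
format_version() {
  # Capture everything after "AGAT" plus one whitespace character
  # and wrap it in double quotes after the "agat: " key.
  sed 's/AGAT[[:space:]]\(.*\)/agat: "\1"/'
}

echo "AGAT v1.4.0" | format_version
# prints: agat: "v1.4.0"
```

Lines that do not match the pattern pass through unchanged, so it is worth checking the resulting file when the upstream version banner changes.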
--- a/src/agat/agat_convert_sp_gff2gtf/help.txt
+++ b/src/agat/agat_convert_sp_gff2gtf/help.txt
@@ -0,0 +1,102 @@
+```sh
+agat_convert_sp_gff2gtf.pl --help
+```
+ ------------------------------------------------------------------------------
+|   Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0                      |
+|   https://github.com/NBISweden/AGAT                                          |
+|   National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se         |
+ ------------------------------------------------------------------------------
+
+
+Name:
+    agat_convert_sp_gff2gtf.pl
+
+Description:
+    The script aims to convert any GTF/GFF file into a proper GTF file. Full
+    information about the format can be found here:
+    https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
+    different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending the
+    version selected the script will filter out the features that are not
+    accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
+    pseudogene) will be converted into gene feature and every level2 feature
+    (e.g mRNA ncRNA) will be converted into transcript feature. Using the
+    "relax" option you will produce a GTF-like output keeping all original
+    feature types (3rd column). No modification will occur e.g. mRNA to
+    transcript.
+
+    To be fully GTF compliant all feature have a gene_id and a transcript_id
+    attribute. The gene_id is unique identifier for the genomic source of
+    the transcript, which is used to group transcripts into genes. The
+    transcript_id is a unique identifier for the predicted transcript, which
+    is used to group features into transcripts.
+
+Usage:
+        agat_convert_sp_gff2gtf.pl --gff infile.gff [ -o outfile ]
+        agat_convert_sp_gff2gtf -h
+
+Options:
+    --gff, --gtf or -i
+            Input GFF/GTF file that will be read
+
+    --gtf_version version of the GTF output (1,2,2.1,2.2,2.5,3 or relax).
+    Default value from AGAT config file (relax for the default config). The
+    script option has the higher priority.
+            relax: all feature types are accepted.
+
+            GTF3 (9 feature types accepted): gene, transcript, exon, CDS,
+            Selenocysteine, start_codon, stop_codon, three_prime_utr and
+            five_prime_utr
+
+            GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS,
+            UTR, start_codon, stop_codon, Selenocysteine
+
+            GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon,
+            5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon
+
+            GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon,
+            exon, 5UTR, 3UTR
+
+            GTF2 (4 feature types accepted): CDS, start_codon, stop_codon,
+            exon
+
+            GTF1 (5 feature types accepted): CDS, start_codon, stop_codon,
+            exon, intron
+
+    -o , --output , --out , --outfile or --gtf
+            Output GTF file. If no output file is specified, the output will
+            be written to STDOUT.
+
+    -c or --config
+            String - Input agat config file. By default AGAT takes as input
+            agat_config.yaml file from the working directory if any,
+            otherwise it takes the orignal agat_config.yaml shipped with
+            AGAT. To get the agat_config.yaml locally type: "agat config
+            --expose". The --config option gives you the possibility to use
+            your own AGAT config file (located elsewhere or named
+            differently).
+
+    -h or --help
+            Display this helpful text.
+
+Feedback:
+  Did you find a bug?:
+    Do not hesitate to report bugs to help us keep track of the bugs and
+    their resolution. Please use the GitHub issue tracking system available
+    at this address:
+
+                https://github.com/NBISweden/AGAT/issues
+
+     Ensure that the bug was not already reported by searching under Issues.
+     If you're unable to find an (open) issue addressing the problem, open a new one.
+     Try as much as possible to include in the issue when relevant:
+     - a clear description,
+     - as much relevant information as possible,
+     - the command used,
+     - a data sample,
+     - an explanation of the expected behaviour that is not occurring.
+
+  Do you want to contribute?:
+    You are very welcome, visit this address for the Contributing
+    guidelines:
+    https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
+
--- a/src/agat/agat_convert_sp_gff2gtf/script.sh
+++ b/src/agat/agat_convert_sp_gff2gtf/script.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+## VIASH START
+## VIASH END
+
+agat_convert_sp_gff2gtf.pl \
+  -i "$par_gff" \
+  -o "$par_output" \
+  ${par_gtf_version:+--gtf_version "${par_gtf_version}"} \
+  ${par_config:+--config "${par_config}"}
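The `${var:+word}` expansions in the script above emit an option only when the corresponding `par_*` variable is set and non-empty. The sketch below makes this visible, assuming Bash. `show_args` is a hypothetical helper that prints one argument per line instead of calling `agat_convert_sp_gff2gtf.pl`, and the file names are placeholders.

```shell
#!/bin/bash

# Hypothetical helper: prints the arguments that would be passed to the tool.
show_args() {
  local par_gff="$1" par_output="$2" par_gtf_version="$3" par_config="$4"
  printf '%s\n' \
    -i "$par_gff" \
    -o "$par_output" \
    ${par_gtf_version:+--gtf_version "${par_gtf_version}"} \
    ${par_config:+--config "${par_config}"}
}

echo "# only required arguments"
show_args "input.gff" "output.gtf" "" ""

echo "# optional GTF version and a config path containing a space"
show_args "input.gff" "output.gtf" "2.5" "my config.yaml"
```

In the first call the two optional expansions produce no words at all. In the second, the inner double quotes keep `my config.yaml` as a single argument.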
--- a/src/agat/agat_convert_sp_gff2gtf/test.sh
+++ b/src/agat/agat_convert_sp_gff2gtf/test.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+## VIASH START
+## VIASH END
+
+test_dir="${meta_resources_dir}/test_data"
+
+echo "> Run $meta_name with test data"
+"$meta_executable" \
+  --gff "$test_dir/0_test.gff" \
+  --output "output.gtf" 
+
+echo ">> Checking output"
+[ ! -f "output.gtf" ] && echo "Output file output.gtf does not exist" && exit 1
+
+echo ">> Check if output is empty"
+[ ! -s "output.gtf" ] && echo "Output file output.gtf is empty" && exit 1
+
+echo ">> Check if the conversion resulted in the right GTF format"
+idGFF=$(head -n 2 "$test_dir/0_test.gff" | grep -o 'ID=[^;]*' | cut -d '=' -f 2-)
+expectedGTF="gene_id \"$idGFF\"; ID \"$idGFF\";"
+extractedGTF=$(head -n 3 "output.gtf" | grep -o 'gene_id "[^"]*"; ID "[^"]*";')
+[ "$extractedGTF" != "$expectedGTF" ] && echo "Output file output.gtf does not have the right format" && exit 1
+
+rm output.gtf
+
+echo "> Run $meta_name with test data and GTF version 2.5"
+"$meta_executable" \
+  --gff "$test_dir/0_test.gff" \
+  --output "output.gtf" \
+  --gtf_version "2.5"
+
+echo ">> Check if the output file header displays the right GTF version"
+grep -q "##gtf-version 2.5" "output.gtf"
+[ $? -ne 0 ] && echo "Output file output.gtf header does not display the right GTF version" && exit 1
+
+echo "> Test successful"
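The format check in the test above derives the expected attribute string from the first gene record of the input GFF. The sketch below runs that extraction in isolation, assuming Bash. It embeds the first two lines of `0_test.gff` inline, and `extract_gene_id` is a hypothetical wrapper around the same `head | grep | cut` pipeline.

```shell
#!/bin/bash

# First two lines of the test GFF: the header plus the first gene record.
gff_head=$(printf '%s\n' \
  '##gff-version 3' \
  $'scaffold625\tmaker\tgene\t337818\t343277\t.\t+\t.\tID=CLUHARG00000005458;Name=TUBB3_2')

# Same pipeline as the test: keep the ID=... attribute, then drop the "ID=" prefix.
extract_gene_id() {
  head -n 2 | grep -o 'ID=[^;]*' | cut -d '=' -f 2-
}

idGFF=$(echo "$gff_head" | extract_gene_id)
expectedGTF="gene_id \"$idGFF\"; ID \"$idGFF\";"

echo "$idGFF"
echo "$expectedGTF"
```

The header line contains no `ID=` attribute, so only the gene record contributes a match and `idGFF` holds a single identifier.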
--- a/src/agat/agat_convert_sp_gff2gtf/test_data/0_test.gff
+++ b/src/agat/agat_convert_sp_gff2gtf/test_data/0_test.gff
@@ -0,0 +1,36 @@
+##gff-version 3
+scaffold625	maker	gene	337818	343277	.	+	.	ID=CLUHARG00000005458;Name=TUBB3_2
+scaffold625	maker	mRNA	337818	343277	.	+	.	ID=CLUHART00000008717;Parent=CLUHARG00000005458
+scaffold625	maker	exon	337818	337971	.	+	.	ID=CLUHART00000008717:exon:1404;Parent=CLUHART00000008717
+scaffold625	maker	exon	340733	340841	.	+	.	ID=CLUHART00000008717:exon:1405;Parent=CLUHART00000008717
+scaffold625	maker	exon	341518	341628	.	+	.	ID=CLUHART00000008717:exon:1406;Parent=CLUHART00000008717
+scaffold625	maker	exon	341964	343277	.	+	.	ID=CLUHART00000008717:exon:1407;Parent=CLUHART00000008717
+scaffold625	maker	CDS	337915	337971	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
+scaffold625	maker	CDS	340733	340841	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
+scaffold625	maker	CDS	341518	341628	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
+scaffold625	maker	CDS	341964	343033	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
+scaffold625	maker	five_prime_UTR	337818	337914	.	+	.	ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
+scaffold625	maker	three_prime_UTR	343034	343277	.	+	.	ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
+scaffold789	maker	gene	558184	564780	.	+	.	ID=CLUHARG00000003852;Name=PF11_0240
+scaffold789	maker	mRNA	558184	564780	.	+	.	ID=CLUHART00000006146;Parent=CLUHARG00000003852
+scaffold789	maker	exon	558184	560123	.	+	.	ID=CLUHART00000006146:exon:995;Parent=CLUHART00000006146
+scaffold789	maker	exon	561401	561519	.	+	.	ID=CLUHART00000006146:exon:996;Parent=CLUHART00000006146
+scaffold789	maker	exon	564171	564235	.	+	.	ID=CLUHART00000006146:exon:997;Parent=CLUHART00000006146
+scaffold789	maker	exon	564372	564780	.	+	.	ID=CLUHART00000006146:exon:998;Parent=CLUHART00000006146
+scaffold789	maker	CDS	558191	560123	.	+	0	ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
+scaffold789	maker	CDS	561401	561519	.	+	2	ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
+scaffold789	maker	CDS	564171	564235	.	+	0	ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
+scaffold789	maker	CDS	564372	564588	.	+	1	ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
+scaffold789	maker	five_prime_UTR	558184	558190	.	+	.	ID=CLUHART00000006146:five_prime_utr;Parent=CLUHART00000006146
+scaffold789	maker	three_prime_UTR	564589	564780	.	+	.	ID=CLUHART00000006146:three_prime_utr;Parent=CLUHART00000006146
+scaffold789	maker	mRNA	558184	564780	.	+	.	ID=CLUHART00000006147;Parent=CLUHARG00000003852
+scaffold789	maker	exon	558184	560123	.	+	.	ID=CLUHART00000006147:exon:997;Parent=CLUHART00000006147
+scaffold789	maker	exon	561401	561519	.	+	.	ID=CLUHART00000006147:exon:998;Parent=CLUHART00000006147
+scaffold789	maker	exon	562057	562121	.	+	.	ID=CLUHART00000006147:exon:999;Parent=CLUHART00000006147
+scaffold789	maker	exon	564372	564780	.	+	.	ID=CLUHART00000006147:exon:1000;Parent=CLUHART00000006147
+scaffold789	maker	CDS	558191	560123	.	+	0	ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
+scaffold789	maker	CDS	561401	561519	.	+	2	ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
+scaffold789	maker	CDS	562057	562121	.	+	0	ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
+scaffold789	maker	CDS	564372	564588	.	+	1	ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
+scaffold789	maker	five_prime_UTR	558184	558190	.	+	.	ID=CLUHART00000006147:five_prime_utr;Parent=CLUHART00000006147
+scaffold789	maker	three_prime_UTR	564589	564780	.	+	.	ID=CLUHART00000006147:three_prime_utr;Parent=CLUHART00000006147
--- a/src/agat/agat_convert_sp_gff2gtf/test_data/script.sh
+++ b/src/agat/agat_convert_sp_gff2gtf/test_data/script.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+# clone repo
+if [ ! -d /tmp/agat_source ]; then
+  git clone --depth 1 --single-branch --branch master https://github.com/NBISweden/AGAT /tmp/agat_source
+fi
+
+# copy test data
+cp -r /tmp/agat_source/t/gff_syntax/in/0_test.gff src/agat/agat_convert_sp_gff2gtf/test_data
--- a/src/arriba/config.vsh.yaml
+++ b/src/arriba/config.vsh.yaml
@@ -11,6 +11,9 @@ license: MIT
 requirements:
  cpus: 1
  commands: [ arriba ]
+authors:
+  - __merge__: /src/_authors/robrecht_cannoodt.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/bcl_convert/config.vsh.yaml
+++ b/src/bcl_convert/config.vsh.yaml
@@ -4,6 +4,17 @@ description: |
  Information about upgrading from bcl2fastq via
  [Upgrading from bcl2fastq to BCL Convert](https://emea.support.illumina.com/bulletins/2020/10/upgrading-from-bcl2fastq-to-bcl-convert.html)
  and [BCL Convert Compatible Products](https://support.illumina.com/sequencing/sequencing_software/bcl-convert/compatibility.html)
+keywords: [demultiplex, fastq, bcl, illumina]
+links:
+  homepage: https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html
+  documentation: https://support.illumina.com/downloads/bcl-convert-user-guide.html
+license: Proprietary
+authors:
+  - __merge__: /src/_authors/toni_verbeiren.yaml
+    roles: [ author, maintainer ]
+  - __merge__: /src/_authors/dorien_roosen.yaml
+    roles: [ author ]
+
 argument_groups:
  - name: Input arguments
    arguments:
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/config.vsh.yaml
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/config.vsh.yaml
@@ -0,0 +1,143 @@
+name: bd_rhapsody_make_reference
+namespace: bd_rhapsody
+description: |
+  The Reference Files Generator creates an archive containing Genome Index
+  and Transcriptome annotation files needed for the BD Rhapsody Sequencing
+  Analysis Pipeline. The app takes as input one or more FASTA and GTF files
+  and produces a compressed archive in the form of a tar.gz file. The 
+  archive contains:
+  
+  - STAR index
+  - Filtered GTF file
+keywords: [genome, reference, index, align]
+links:
+  repository: https://bitbucket.org/CRSwDev/cwl/src/master/v2.2.1/Extra_Utilities/
+  documentation: https://bd-rhapsody-bioinfo-docs.genomics.bd.com/resources/extra_utilities.html#make-rhapsody-reference
+license: Unknown
+authors:
+  - __merge__: /src/_authors/robrecht_cannoodt.yaml
+    roles: [ author, maintainer ]
+  - __merge__: /src/_authors/weiwei_schultz.yaml
+    roles: [ contributor ]
+
+argument_groups:
+  - name: Inputs
+    arguments:
+      - type: file
+        name: --genome_fasta
+        required: true
+        description: Reference genome file in FASTA or FASTA.GZ format. The BD Rhapsody Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
+        example: genome_sequence.fa.gz
+        multiple: true
+        info:
+          config_key: Genome_fasta
+      - type: file
+        name: --gtf
+        required: true
+        description: |
+          File path to the transcript annotation files in GTF or GTF.GZ format. The Sequencing Analysis Pipeline requires the 'gene_name' or 
+          'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons
+          must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or
+          'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.
+        example: transcriptome_annotation.gtf.gz
+        multiple: true
+        info:
+          config_key: Gtf
+      - type: file
+        name: --extra_sequences
+        description: |
+          File path to additional sequences in FASTA format to use when building the STAR index (e.g. transgenes or CRISPR guide barcodes).
+          GTF lines for these sequences will be automatically generated and combined with the main GTF.
+        required: false
+        multiple: true
+        info:
+          config_key: Extra_sequences
+  - name: Outputs
+    arguments:
+      - type: file
+        name: --reference_archive
+        direction: output
+        required: true
+        description: |
+          A compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an
+          input in the BD Rhapsody Sequencing Analysis Pipeline.
+        example: star_index.tar.gz
+  - name: Arguments
+    arguments:
+      - type: string
+        name: --mitochondrial_contigs
+        description: |
+          Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are
+          identified as 'nuclear fragments' in the ATACseq analysis pipeline.
+        required: false
+        multiple: true
+        default: [chrM, chrMT, M, MT]
+        info:
+          config_key: Mitochondrial_contigs
+      - type: boolean_true
+        name: --filtering_off
+        description: |
+          By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features 
+          having the following attribute values are kept:
+
+            - protein_coding
+            - lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
+            - IG_LV_gene
+            - IG_V_gene
+            - IG_V_pseudogene
+            - IG_D_gene
+            - IG_J_gene
+            - IG_J_pseudogene
+            - IG_C_gene
+            - IG_C_pseudogene
+            - TR_V_gene
+            - TR_V_pseudogene
+            - TR_D_gene
+            - TR_J_gene
+            - TR_J_pseudogene
+            - TR_C_gene
+
+          If you have already pre-filtered the input Annotation files and/or wish to turn off the filtering, set this flag.
+        info:
+          config_key: Filtering_off
+      - type: boolean_true
+        name: --wta_only_index
+        description: Build a WTA only index, otherwise builds a WTA + ATAC index.
+        info:
+          config_key: Wta_Only
+      - type: string
+        name: --extra_star_params
+        description: Additional parameters to pass to STAR when building the genome index. Specify them exactly as you would on the command line.
+        example: --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
+        required: false
+        info:
+          config_key: Extra_STAR_params
+
+resources:
+  - type: python_script
+    path: script.py
+  - path: make_rhap_reference_2.2.1_nodocker.cwl
+
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - path: test_data
+
+requirements:
+  commands: [ "cwl-runner" ]
+
+engines:
+  - type: docker
+    image: bdgenomics/rhapsody:2.2.1
+    setup:
+      - type: apt
+        packages: [procps]
+      - type: python
+        packages: [cwlref-runner, cwl-runner]
+      - type: docker
+        run: |
+          echo "bdgenomics/rhapsody: 2.2.1" > /var/software_versions.txt
+
+runners:
+  - type: executable
+  - type: nextflow
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/help.txt
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/help.txt
@@ -0,0 +1,66 @@
+```bash
+cwl-runner src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl --help
+```
+
+usage: src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
+       [-h] [--Archive_prefix ARCHIVE_PREFIX]
+       [--Extra_STAR_params EXTRA_STAR_PARAMS]
+       [--Extra_sequences EXTRA_SEQUENCES] [--Filtering_off] --Genome_fasta
+       GENOME_FASTA --Gtf GTF [--Maximum_threads MAXIMUM_THREADS]
+       [--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS] [--WTA_Only]
+       [job_order]
+
+The Reference Files Generator creates an archive containing Genome Index and
+Transcriptome annotation files needed for the BD Rhapsody™ Sequencing
+Analysis Pipeline. The app takes as input one or more FASTA and GTF files and
+produces a compressed archive in the form of a tar.gz file. The archive
+contains:\n - STAR index\n - Filtered GTF file
+
+positional arguments:
+  job_order             Job input json file
+
+options:
+  -h, --help            show this help message and exit
+  --Archive_prefix ARCHIVE_PREFIX
+                        A prefix for naming the compressed archive file
+                        containing the Reference genome index and annotation
+                        files. The default value is constructed based on the
+                        input Reference files.
+  --Extra_STAR_params EXTRA_STAR_PARAMS
+                        Additional parameters to pass to STAR when building
+                        the genome index. Specify exactly like how you would
+                        on the command line. Example: --limitGenomeGenerateRAM
+                        48000 --genomeSAindexNbases 11
+  --Extra_sequences EXTRA_SEQUENCES
+                        Additional sequences in FASTA format to use when
+                        building the STAR index. (E.g. phiX genome)
+  --Filtering_off       By default the input Transcript Annotation files are
+                        filtered based on the gene_type/gene_biotype
+                        attribute. Only features having the following
+                        attribute values are kept: - protein_coding -
+                        lncRNA (lincRNA and antisense for Gencode <
+                        v31/M22/Ensembl97) - IG_LV_gene - IG_V_gene -
+                        IG_V_pseudogene - IG_D_gene - IG_J_gene -
+                        IG_J_pseudogene - IG_C_gene - IG_C_pseudogene -
+                        TR_V_gene - TR_V_pseudogene - TR_D_gene - TR_J_gene -
+                        TR_J_pseudogene - TR_C_gene If you have already pre-
+                        filtered the input Annotation files and/or wish to
+                        turn-off the filtering, please set this option to
+                        True.
+  --Genome_fasta GENOME_FASTA
+                        Reference genome file in FASTA format. The BD
+                        Rhapsody™ Sequencing Analysis Pipeline uses GRCh38
+                        for Human and GRCm39 for Mouse.
+  --Gtf GTF             Transcript annotation files in GTF format. The BD
+                        Rhapsody™ Sequencing Analysis Pipeline uses Gencode
+                        v42 for Human and M31 for Mouse.
+  --Maximum_threads MAXIMUM_THREADS
+                        The maximum number of threads to use in the pipeline.
+                        By default, all available cores are used.
+  --Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS
+                        Names of the Mitochondrial contigs in the provided
+                        Reference Genome. Fragments originating from contigs
+                        other than these are identified as 'nuclear fragments'
+                        in the ATACseq analysis pipeline.
+  --WTA_Only            Build a WTA only index, otherwise builds a WTA + ATAC
+                        index.
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
@@ -0,0 +1,115 @@
+requirements:
+  InlineJavascriptRequirement: {}
+class: CommandLineTool
+label: Reference Files Generator for BD Rhapsody™ Sequencing Analysis Pipeline
+cwlVersion: v1.2
+doc: >- 
+    The Reference Files Generator creates an archive containing Genome Index and Transcriptome annotation files needed for the BD Rhapsody™ Sequencing Analysis Pipeline. The app takes as input one or more FASTA and GTF files and produces a compressed archive in the form of a tar.gz file. The archive contains:\n  - STAR index\n  - Filtered GTF file
+
+
+baseCommand: run_reference_generator.sh 
+inputs: 
+    Genome_fasta:
+        type: File[]
+        label: Reference Genome
+        doc: |-
+            Reference genome file in FASTA format. The BD Rhapsody™ Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
+        inputBinding:
+            prefix: --reference-genome
+            shellQuote: false
+    Gtf:
+        type: File[]
+        label: Transcript Annotations
+        doc: |-
+            Transcript annotation files in GTF format. The BD Rhapsody™ Sequencing Analysis Pipeline uses Gencode v42 for Human and M31 for Mouse.
+        inputBinding:
+            prefix: --gtf
+            shellQuote: false
+    Extra_sequences:
+        type: File[]?
+        label: Extra Sequences
+        doc: |-
+            Additional sequences in FASTA format to use when building the STAR index. (E.g. phiX genome)
+        inputBinding:
+            prefix: --extra-sequences
+            shellQuote: false
+    Mitochondrial_Contigs:
+        type: string[]?
+        default: ["chrM", "chrMT", "M", "MT"]
+        label: Mitochondrial Contig Names
+        doc: |-
+            Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are identified as 'nuclear fragments' in the ATACseq analysis pipeline.
+        inputBinding:
+            prefix: --mitochondrial-contigs
+            shellQuote: false
+    Filtering_off:
+        type: boolean?
+        label: Turn off filtering
+        doc: |-
+            By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features having the following attribute values are kept:
+            - protein_coding
+            - lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
+            - IG_LV_gene
+            - IG_V_gene
+            - IG_V_pseudogene
+            - IG_D_gene
+            - IG_J_gene
+            - IG_J_pseudogene
+            - IG_C_gene
+            - IG_C_pseudogene
+            - TR_V_gene
+            - TR_V_pseudogene
+            - TR_D_gene
+            - TR_J_gene
+            - TR_J_pseudogene
+            - TR_C_gene
+            If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
+        inputBinding: 
+            prefix: --filtering-off
+            shellQuote: false
+    WTA_Only:
+        type: boolean?
+        label: WTA only index
+        doc: Build a WTA only index, otherwise builds a WTA + ATAC index.
+        inputBinding:
+            prefix: --wta-only-index
+            shellQuote: false
+    Archive_prefix:
+        type: string?
+        label: Archive Prefix
+        doc: |-
+            A prefix for naming the compressed archive file containing the Reference genome index and annotation files. The default value is constructed based on the input Reference files.
+        inputBinding:
+            prefix: --archive-prefix
+            shellQuote: false
+    Extra_STAR_params:
+        type: string?
+        label: Extra STAR Params
+        doc: |-
+            Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
+            Example:
+              --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
+        inputBinding:
+            prefix: --extra-star-params 
+            shellQuote: true
+  
+    Maximum_threads:
+        type: int?
+        label: Maximum Number of Threads
+        doc: |-
+            The maximum number of threads to use in the pipeline. By default, all available cores are used.
+        inputBinding:
+            prefix: --maximum-threads
+            shellQuote: false
+
+outputs:
+
+    Archive:
+        type: File
+        doc: |- 
+            A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an input in the BD Rhapsody™ Sequencing Analysis Pipeline.
+        id: Reference_Archive
+        label: Reference Files Archive
+        outputBinding:
+            glob: '*.tar.gz'
+
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/script.py
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/script.py
@@ -0,0 +1,161 @@
+import os
+import re
+import subprocess
+import tempfile
+from typing import Any
+import yaml
+import shutil
+
+## VIASH START
+par = {
+    "genome_fasta": [],
+    "gtf": [],
+    "extra_sequences": [],
+    "mitochondrial_contigs": ["chrM", "chrMT", "M", "MT"],
+    "filtering_off": False,
+    "wta_only_index": False,
+    "extra_star_params": None,
+    "reference_archive": "output.tar.gz",
+}
+meta = {
+    "config": "target/nextflow/bd_rhapsody/bd_rhapsody_make_reference/.config.vsh.yaml",
+    "resources_dir": os.path.abspath("src/bd_rhapsody/bd_rhapsody_make_reference"),
+    "temp_dir": os.getenv("VIASH_TEMP"),
+    "memory_mb": None,
+    "cpus": None
+}
+## VIASH END
+
+def clean_arg(argument):
+    argument["clean_name"] = re.sub("^-*", "", argument["name"])
+    return argument
+
+def read_config(path: str) -> dict[str, Any]:
+    with open(path, "r") as f:
+        config = yaml.safe_load(f)
+    
+    config["all_arguments"] = [
+        clean_arg(arg)
+        for grp in config["argument_groups"]
+        for arg in grp["arguments"]
+    ]
+    
+    return config
+
+def strip_margin(text: str) -> str:
+    return re.sub(r"(\n?)[ \t]*\|", r"\1", text)
+
+def process_params(par: dict[str, Any], config) -> dict[str, Any]:
+    # check input parameters
+    assert par["genome_fasta"], "Pass at least one set of inputs to --genome_fasta."
+    assert par["gtf"], "Pass at least one set of inputs to --gtf."
+    assert par["reference_archive"].endswith(".tar.gz"), "Output reference_archive must end with .tar.gz."
+
+    # make paths absolute
+    for argument in config["all_arguments"]:
+        if par[argument["clean_name"]] and argument["type"] == "file":
+            if isinstance(par[argument["clean_name"]], list):
+                par[argument["clean_name"]] = [ os.path.abspath(f) for f in par[argument["clean_name"]] ]
+            else:
+                par[argument["clean_name"]] = os.path.abspath(par[argument["clean_name"]])
+    
+    return par
+
+def generate_config(par: dict[str, Any], meta, config) -> str:
+    content_list = [strip_margin(f"""\
+        |#!/usr/bin/env cwl-runner
+        |
+        |""")]
+        
+    
+    config_key_value_pairs = []
+    for argument in config["all_arguments"]:
+        config_key = (argument.get("info") or {}).get("config_key")
+        arg_type = argument["type"]
+        par_value = par[argument["clean_name"]]
+        if par_value and config_key:
+            config_key_value_pairs.append((config_key, arg_type, par_value))
+
+    if meta["cpus"]:
+        config_key_value_pairs.append(("Maximum_threads", "integer", meta["cpus"]))
+
+    # print(config_key_value_pairs)
+
+    for config_key, arg_type, par_value in config_key_value_pairs:
+        if arg_type == "file":
+            entry = strip_margin(f"""\
+                |{config_key}:
+                |""")
+            if isinstance(par_value, list):
+                for file in par_value:
+                    entry += strip_margin(f"""\
+                        | - class: File
+                        |   location: "{file}"
+                        |""")
+            else:
+                entry += strip_margin(f"""\
+                    |   class: File
+                    |   location: "{par_value}"
+                    |""")
+            content_list.append(entry)
+        else:
+            content_list.append(strip_margin(f"""\
+                |{config_key}: {par_value}
+                |"""))
+            
+    ## Return the config content; the caller writes it to file
+    return "".join(content_list)
+
+def get_cwl_file(meta: dict[str, Any]) -> str:
+    # locate the bundled cwl file in the resources dir
+    cwl_file=os.path.join(meta["resources_dir"], "make_rhap_reference_2.2.1_nodocker.cwl")
+
+    return cwl_file
+
+def main(par: dict[str, Any], meta: dict[str, Any]):
+    config = read_config(meta["config"])
+        
+    # Preprocess params
+    par = process_params(par, config)
+
+    # fetch cwl file
+    cwl_file = get_cwl_file(meta)
+
+    # Create output dir if not exists
+    outdir = os.path.dirname(par["reference_archive"])
+    if not os.path.exists(outdir):
+        os.makedirs(outdir)
+
+    ## Run pipeline
+    with tempfile.TemporaryDirectory(prefix="cwl-bd_rhapsody_wta-", dir=meta["temp_dir"]) as temp_dir:
+        # Create params file
+        config_file = os.path.join(temp_dir, "config.yml")
+        config_content = generate_config(par, meta, config)
+        with open(config_file, "w") as f:
+            f.write(config_content)
+
+
+        cmd = [
+            "cwl-runner",
+            "--no-container",
+            "--preserve-entire-environment",
+            "--outdir",
+            temp_dir,
+            cwl_file,
+            config_file
+        ]
+
+        env = dict(os.environ)
+        env["TMPDIR"] = temp_dir
+
+        print("> " + " ".join(cmd), flush=True)
+        _ = subprocess.check_call(
+            cmd,
+            cwd=os.path.dirname(config_file),
+            env=env
+        )
+
+        shutil.move(os.path.join(temp_dir, "Rhap_reference.tar.gz"), par["reference_archive"])
+
+if __name__ == "__main__":
+    main(par, meta)
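
The way `script.py` above turns file arguments into a CWL job-order document can be sketched in isolation. This is a minimal illustration, not the component's code: `file_entry` is a hypothetical helper that mirrors only the list branch of `generate_config`, and the path `/data/genome.fa` is made up.

```python
import re


def strip_margin(text: str) -> str:
    # Remove leading spaces/tabs up to and including the "|" margin marker.
    return re.sub(r"(\n?)[ \t]*\|", r"\1", text)


def file_entry(config_key: str, paths: list[str]) -> str:
    # Hypothetical helper: builds one job-order entry for a list of files.
    out = strip_margin(f"""\
        |{config_key}:
        |""")
    for path in paths:
        out += strip_margin(f"""\
            | - class: File
            |   location: "{path}"
            |""")
    return out


# Illustrative path only; any absolute file path would do.
job_order = file_entry("Genome_fasta", ["/data/genome.fa"])
print(job_order)
```

Note that the margin regex strips any `|` preceded only by spaces or tabs, so a value that itself contains `|` would be altered; this sketch assumes paths without that character.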
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/test.sh
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/test.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+set -e
+
+#############################################
+# helper functions
+assert_file_exists() {
+  [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
+}
+assert_file_doesnt_exist() {
+  [ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
+}
+assert_file_empty() {
+  [ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
+}
+assert_file_not_empty() {
+  [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
+}
+assert_file_contains() {
+  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
+}
+assert_file_not_contains() {
+  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
+}
+#############################################
+
+in_fa="$meta_resources_dir/test_data/reference_small.fa"
+in_gtf="$meta_resources_dir/test_data/reference_small.gtf"
+
+echo "#############################################"
+echo "> Simple run"
+
+mkdir simple_run
+cd simple_run
+
+out_tar="myreference.tar.gz"
+
+echo "> Running $meta_name."
+$meta_executable \
+  --genome_fasta "$in_fa" \
+  --gtf "$in_gtf" \
+  --reference_archive "$out_tar" \
+  --extra_star_params "--genomeSAindexNbases 6" \
+  ---cpus 2
+
+exit_code=$?
+[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
+
+assert_file_exists "$out_tar"
+assert_file_not_empty "$out_tar"
+
+echo ">> Checking whether output contains the expected files"
+tar -xvf "$out_tar" > /dev/null
+assert_file_exists "BD_Rhapsody_Reference_Files/star_index/genomeParameters.txt"
+assert_file_exists "BD_Rhapsody_Reference_Files/bwa-mem2_index/reference_small.ann"
+assert_file_exists "BD_Rhapsody_Reference_Files/reference_small-processed.gtf"
+assert_file_exists "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt"
+assert_file_contains "BD_Rhapsody_Reference_Files/reference_small-processed.gtf" "chr1.*HAVANA.*ENSG00000243485"
+assert_file_contains "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt" 'chrMT'
+
+cd ..
+
+echo "#############################################"
+
+echo "> Tests succeeded!"
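
The archive check in `test.sh` above (extract, then assert that specific files exist) could also be done without extracting. The sketch below is one possible way, using Python's `tarfile` module; `missing_members` and `build_demo_archive` are hypothetical names, and the demo archive is built in memory purely so the example runs without a real reference.

```python
import io
import tarfile


def missing_members(archive_bytes: bytes, expected: list[str]) -> list[str]:
    """Return the expected member names that are absent from a .tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        names = set(tar.getnames())
    return [name for name in expected if name not in names]


def build_demo_archive(members: dict[str, bytes]) -> bytes:
    """Build a small in-memory .tar.gz (stand-in for the real reference archive)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


demo = build_demo_archive({
    "BD_Rhapsody_Reference_Files/star_index/genomeParameters.txt": b"demo\n",
    "BD_Rhapsody_Reference_Files/mitochondrial_contigs.txt": b"chrMT\n",
})
missing = missing_members(demo, [
    "BD_Rhapsody_Reference_Files/star_index/genomeParameters.txt",
    "BD_Rhapsody_Reference_Files/reference_small-processed.gtf",
])
print(missing)
```

Member names are compared literally, so an archive whose entries carry a `./` prefix would need normalising first.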
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/test_data/reference_small.fa
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/test_data/reference_small.fa
@@ -0,0 +1,27 @@
+>chr1 1
+TGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGT
+GATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTA
+ATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTT
+TTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTT
+TAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCT
+GGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAA
+AAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAA
+ATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGTTGTTTATC
+TGAGATTCAGAATTAAGCATTTTATATTTTATTTGCTGCCTCTGGCCACCCTACTCTCTT
+CCTAACACTCTCTCCCTCTCCCAGTTTTGTCCGCCTTCCCTGCCTCCTCTTCTGGGGGAG
+TTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCT
+TACCAGAGGTAAAGAAGAGATGGATCTCCACTCATGTTGTAGACAGAATGTTTATGTCCT
+CTCCAAATGCTTATGTTGAAACCCTAACCCCTAATGTGATGGTATGTGGAGATGGGCCTT
+TGGTAGGTAATTACGGTTAGATGAGGTCATGGGGTGGGGCCCTCATTATAGATCTGGTAA
+GAAAAGAGAGCATTGTCTCTGTGTCTCCCTCTCTCTCTCTCTCTCTCTCTCTCATTTCTC
+TCTATCTCATTTCTCTCTCTCTCGCTATCTCATTTTTCTCTCTCTCTCTTTCTCTCCTCT
+GTCTTTTCCCACCAAGTGAGGATGCGAAGAGAAGGTGGCTGTCTGCAAACCAGGAAGAGA
+GCCCTCACCGGGAACCCGTCCAGCTGCCACCTTGAACTTGGACTTCCAAGCCTCCAGAAC
+TGTGAGGGATAAATGTATGATTTTAAAGTCGCCCAGTGTGTGGTATTTTGTTTTGACTAA
+TACAACCTGAAAACATTTTCCCCTCACTCCACCTGAGCAATATCTGAGTGGCTTAAGGTA
+CTCAGGACACAACAAAGGAGAAATGTCCCATGCACAAGGTGCACCCATGCCTGGGTAAAG
+CAGCCTGGCACAGAGGGAAGCACACAGGCTCAGGGATCTGCTATTCATTCTTTGTGTGAC
+CCTGGGCAAGCCATGAATGGAGCTTCAGTCACCCCATTTGTAATGGGATTTAATTGTGCT
+TGCCCTGCCTCCTTTTGAGGGCTGTAGAGAAAAGATGTCAAAGTATTTTGTAATCTGGCT
+GGGCGTGGTGGCTCATGCCTGTAATCCTAGCACTTTGGTAGGCTGACGCGAGAGGACTGC
+T
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/test_data/reference_small.gtf
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/test_data/reference_small.gtf
@@ -0,0 +1,8 @@
+chr1	HAVANA	exon	565	668	.	+	.	gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
+chr1	HAVANA	exon	977	1098	.	+	.	gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
+chr1	HAVANA	transcript	268	1110	.	+	.	gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
+chr1	HAVANA	exon	268	668	.	+	.	gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 1; exon_id "ENSE00001841699.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
+chr1	HAVANA	exon	977	1110	.	+	.	gene_id "ENSG00000243485.5"; transcript_id "ENST00000469289.1"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; exon_number 2; exon_id "ENSE00001890064.1"; level 2; transcript_support_level "5"; hgnc_id "HGNC:52482"; tag "not_best_in_genome_evidence"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
+chr1	ENSEMBL	gene	367	504	.	+	.	gene_id "ENSG00000284332.1"; gene_type "miRNA"; gene_name "MIR1302-2"; level 3; hgnc_id "HGNC:35294";
+chr1	ENSEMBL	transcript	367	504	.	+	.	gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";
+chr1	ENSEMBL	exon	367	504	.	+	.	gene_id "ENSG00000284332.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; exon_number 1; exon_id "ENSE00003695741.1"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:35294"; tag "basic"; tag "Ensembl_canonical";
--- a/src/bd_rhapsody/bd_rhapsody_make_reference/test_data/script.sh
+++ b/src/bd_rhapsody/bd_rhapsody_make_reference/test_data/script.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+
+TMP_DIR=/tmp/bd_rhapsody_make_reference
+OUT_DIR=src/bd_rhapsody/bd_rhapsody_make_reference/test_data
+
+# check if seqkit is installed
+if ! command -v seqkit &> /dev/null; then
+  echo "seqkit could not be found"
+  exit 1
+fi
+
+# create temporary directory and clean up on exit
+mkdir -p $TMP_DIR
+function clean_up {
+    rm -rf "$TMP_DIR"
+}
+trap clean_up EXIT
+
+# fetch reference
+ORIG_FA=$TMP_DIR/reference.fa.gz
+if [ ! -f $ORIG_FA ]; then
+  wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/GRCh38.primary_assembly.genome.fa.gz \
+    -O $ORIG_FA
+fi
+
+ORIG_GTF=$TMP_DIR/reference.gtf.gz
+if [ ! -f $ORIG_GTF ]; then
+  wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz \
+    -O $ORIG_GTF
+fi
+
+# create small reference
+START=30000
+END=31500
+CHR=chr1
+
+# subset to small region
+seqkit grep -r -p "^$CHR\$" "$ORIG_FA" | \
+  seqkit subseq -r "$START:$END" > $OUT_DIR/reference_small.fa
+
+zcat "$ORIG_GTF" | \
+  awk -v FS='\t' -v OFS='\t' "
+    \$1 == \"$CHR\" && \$4 >= $START && \$5 <= $END {
+      \$4 = \$4 - $START + 1;
+      \$5 = \$5 - $START + 1;
+      print;
+    }" > $OUT_DIR/reference_small.gtf
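
The `awk` step above keeps only features that lie entirely inside the window and rebases their coordinates so the window start becomes position 1. A rough Python equivalent, for illustration only: `subset_and_shift` is a hypothetical name and the input rows are made-up examples, not lines from the GENCODE file.

```python
def subset_and_shift(lines, chrom, start, end):
    """Keep features fully inside [start, end] on `chrom` and rebase them to 1."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments/short lines and other contigs.
        if len(fields) < 5 or fields[0] != chrom:
            continue
        feat_start, feat_end = int(fields[3]), int(fields[4])
        if feat_start >= start and feat_end <= end:
            fields[3] = str(feat_start - start + 1)
            fields[4] = str(feat_end - start + 1)
            kept.append("\t".join(fields))
    return kept


# Illustrative rows: one inside the window, one straddling it, one on another contig.
rows = [
    "##description: demo",
    'chr1\tHAVANA\texon\t30564\t30667\t.\t+\t.\tgene_id "demo";',
    'chr1\tHAVANA\texon\t29000\t30100\t.\t+\t.\tgene_id "demo";',
    'chr2\tHAVANA\texon\t30564\t30667\t.\t+\t.\tgene_id "demo";',
]
shifted = subset_and_shift(rows, "chr1", 30000, 31500)
print(shifted)
```

With `START=30000`, a feature at 30564..30667 becomes 565..668, while features that only partly overlap the window are dropped rather than clipped.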
--- a/src/bedtools/bedtools_getfasta/config.vsh.yaml
+++ b/src/bedtools/bedtools_getfasta/config.vsh.yaml
@@ -10,6 +10,9 @@ references:
 license: GPL-2.0
 requirements:
  commands: [bedtools]
+authors:
+  - __merge__: /src/_authors/dries_schaumont.yaml
+    roles: [ author, maintainer ]

 argument_groups:
  - name: Input arguments
--- a/src/busco/busco_download_datasets/config.vsh.yaml
+++ b/src/busco/busco_download_datasets/config.vsh.yaml
@@ -9,6 +9,9 @@ links:
 references:
  doi: 10.1007/978-1-4939-9173-0_14
 license: MIT
+authors:
+  - __merge__: /src/_authors/dorien_roosen.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/busco/busco_list_datasets/config.vsh.yaml
+++ b/src/busco/busco_list_datasets/config.vsh.yaml
@@ -9,6 +9,9 @@ links:
 references:
  doi: 10.1007/978-1-4939-9173-0_14
 license: MIT
+authors:
+  - __merge__: /src/_authors/dorien_roosen.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Outputs
    arguments:
--- a/src/busco/busco_run/config.vsh.yaml
+++ b/src/busco/busco_run/config.vsh.yaml
@@ -9,6 +9,9 @@ links:
 references:
  doi: 10.1007/978-1-4939-9173-0_14
 license: MIT
+authors:
+  - __merge__: /src/_authors/dorien_roosen.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/cutadapt/config.vsh.yaml
+++ b/src/cutadapt/config.vsh.yaml
@@ -9,6 +9,9 @@ links:
 references:
  doi: 10.14806/ej.17.1.200
 license: MIT
+authors:
+  - __merge__: /src/_authors/toni_verbeiren.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  ####################################################################
  - name: Specify Adapters for R1
--- a/src/cutadapt/test.sh
+++ b/src/cutadapt/test.sh
@@ -6,25 +6,25 @@ set -eo pipefail
 #############################################
 # helper functions
 assert_file_exists() {
-  [ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
+  [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
 }
 assert_file_doesnt_exist() {
-  [ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
+  [ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
 }
 assert_file_empty() {
-  [ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
+  [ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
 }
 assert_file_not_empty() {
-  [ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
+  [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
 }
 assert_file_contains() {
-  grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
+  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
 }
 assert_file_not_contains() {
-  grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
+  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
 }
-
 #############################################
+
 mkdir test_multiple_output
 cd test_multiple_output

--- a/src/falco/config.vsh.yaml
+++ b/src/falco/config.vsh.yaml
@@ -9,6 +9,9 @@ references:
 license: GPL-3.0
 requirements:
  commands: [falco]
+authors:
+  - __merge__: /src/_authors/toni_verbeiren.yaml
+    roles: [ author, maintainer ]

 # Notes:
 # - falco has arguments like -subsample, which we rename to --subsample
--- a/src/fastp/config.vsh.yaml
+++ b/src/fastp/config.vsh.yaml
@@ -26,6 +26,9 @@ links:
 references:
  doi: "10.1093/bioinformatics/bty560"
 license: MIT
+authors:
+  - __merge__: /src/_authors/robrecht_cannoodt.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    description: |
--- a/src/featurecounts/config.vsh.yaml
+++ b/src/featurecounts/config.vsh.yaml
@@ -11,7 +11,9 @@ references:
 license: GPL-3.0 
 requirements:
  commands: [ featureCounts ]
-
+authors:
+  - __merge__: /src/_authors/sai_nirmayi_yasa.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/gffread/config.vsh.yaml
+++ b/src/gffread/config.vsh.yaml
@@ -8,8 +8,9 @@ links:
 references: 
  doi: 10.12688/f1000research.23297.2
 license: MIT
-requirements:
-  commands: [ gffread ]
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
@@ -52,7 +53,7 @@ argument_groups:
        required: true
        description: |
          Write the output records into <outfile>.
-        default: output.gff
+        example: output.gff
      - name: --force_exons
        type: boolean_true
        description: |
@@ -154,7 +155,6 @@ argument_groups:
      - name: --table
        type: string
        multiple: true
-        multiple_sep: ","
        description: |
          Output a simple tab delimited format instead of GFF, with columns having the values 
          of GFF attributes given in <attrlist>; special pseudo-attributes (prefixed by @) are 
--- a/src/gffread/script.sh
+++ b/src/gffread/script.sh
@@ -50,6 +50,8 @@
 [[ "$par_expose_dups" == "false" ]] && unset par_expose_dups
 [[ "$par_cluster_only" == "false" ]] && unset par_cluster_only

+# replace ";" with "," in par_table (a no-op when par_table is empty)
+par_table=$(echo "$par_table" | tr ';' ',')

 $(which gffread) \
    "$par_input" \
--- a/src/gffread/test.sh
+++ b/src/gffread/test.sh
@@ -86,7 +86,7 @@ diff "$expected_output_dir/transcripts.fa" "$test_output_dir/transcripts.fa" ||
 echo "> Test 4 - Generate table from GFF annotation file"

 "$meta_executable" \
-  --table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
+  --table "@id;@chr;@start;@end;@strand;@exons;Name;gene;product" \
  --outfile "$test_output_dir/annotation.tbl" \
  --input "$test_dir/sequence.gff3"
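
The gffread change above moves the `--table` separator from `,` to `;` on the component side, while the wrapped tool still receives a comma-separated list. As a minimal sketch of that translation step (a Python equivalent of the `tr ';' ','` call in `script.sh`; the function name `to_tool_separator` is hypothetical and not part of the repository):

```python
def to_tool_separator(value, component_sep=";", tool_sep=","):
    """Translate a component-side multi-value string into the tool's format.

    Mirrors `echo "$par_table" | tr ';' ','`: an empty or missing value
    is passed through unchanged.
    """
    if not value:
        return value
    return value.replace(component_sep, tool_sep)


table_arg = to_tool_separator("@id;@chr;@start;@end;@strand;@exons;Name;gene;product")
print(table_arg)  # @id,@chr,@start,@end,@strand,@exons,Name,gene,product
```

The same idea applies to the `--coverage` argument of `samtools_stats`, where `1;1000;1` becomes `1,1000,1`. Note that a blanket replacement like this assumes the values themselves never contain the component-side separator.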

--- a/src/lofreq/call/config.vsh.yaml
+++ b/src/lofreq/call/config.vsh.yaml
@@ -17,6 +17,9 @@ references:
 license: "MIT"
 requirements:
  commands: [ lofreq ]
+authors:
+  - __merge__: /src/_authors/kai_waldrant.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/lofreq/indelqual/config.vsh.yaml
+++ b/src/lofreq/indelqual/config.vsh.yaml
@@ -18,6 +18,9 @@ references:
 license: "MIT"
 requirements:
  commands: [ lofreq ]
+authors:
+  - __merge__: /src/_authors/kai_waldrant.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/multiqc/config.vsh.yaml
+++ b/src/multiqc/config.vsh.yaml
@@ -11,7 +11,9 @@ info:
  references:
    doi: 10.1093/bioinformatics/btw354
  licence: GPL v3 or later
-
+authors:
+  - __merge__: /src/_authors/dorien_roosen.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: "Input"
    arguments:
--- a/src/pear/config.vsh.yaml
+++ b/src/pear/config.vsh.yaml
@@ -12,7 +12,10 @@ references:
  doi: 10.1093/bioinformatics/btt593
 license: "CC-BY-NC-SA-3.0"
 requirements:
-  commands: [ pear , gzip ]
+  commands: [ pear, gzip ]
+authors:
+  - __merge__: /src/_authors/kai_waldrant.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/salmon/salmon_index/config.vsh.yaml
+++ b/src/salmon/salmon_index/config.vsh.yaml
@@ -12,7 +12,9 @@ references:
 license: GPL-3.0 
 requirements:
  commands: [ salmon ]
-
+authors:
+  - __merge__: /src/_authors/sai_nirmayi_yasa.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/salmon/salmon_quant/config.vsh.yaml
+++ b/src/salmon/salmon_quant/config.vsh.yaml
@@ -12,7 +12,9 @@ references:
 license: GPL-3.0 
 requirements:
  commands: [ salmon ]
-
+authors:
+  - __merge__: /src/_authors/sai_nirmayi_yasa.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Common input options
    arguments:
--- a/src/samtools/samtools_collate/config.vsh.yaml
+++ b/src/samtools/samtools_collate/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_faidx/config.vsh.yaml
+++ b/src/samtools/samtools_faidx/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_fasta/config.vsh.yaml
+++ b/src/samtools/samtools_fasta/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_fastq/config.vsh.yaml
+++ b/src/samtools/samtools_fastq/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_flagstat/config.vsh.yaml
+++ b/src/samtools/samtools_flagstat/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_idxstats/config.vsh.yaml
+++ b/src/samtools/samtools_idxstats/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_index/config.vsh.yaml
+++ b/src/samtools/samtools_index/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_sort/config.vsh.yaml
+++ b/src/samtools/samtools_sort/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/samtools/samtools_stats/config.vsh.yaml
+++ b/src/samtools/samtools_stats/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
@@ -30,10 +32,10 @@ argument_groups:
    - name: --coverage
      alternatives: -c
      type: integer
-      description: |
-        Coverage distribution min,max,step [1,1000,1].
      multiple: true
-      multiple_sep: ','
+      description: |
+        Coverage distribution min;max;step. Default: [1, 1000, 1].
+      example: [1, 1000, 1]
    - name: --remove_dups
      alternatives: -d
      type: boolean_true
@@ -48,25 +50,25 @@ argument_groups:
      alternatives: -f
      type: string
      description: |
-        Required flag, 0 for unset. See also `samtools flags`.
-      default: "0"
+        Required flag, 0 for unset. See also `samtools flags`. Default: `0`.
+      example: "0"
    - name: --filtering_flag
      alternatives: -F
      type: string
      description: |
-        Filtering flag, 0 for unset. See also `samtools flags`.
-      default: "0"
+        Filtering flag, 0 for unset. See also `samtools flags`. Default: `0`.
+      example: "0"
    - name: --GC_depth
      type: double
      description: |
-        The size of GC-depth bins (decreasing bin size increases memory requirement).
-      default: 20000.0
+        The size of GC-depth bins (decreasing bin size increases memory requirement). Default: `20000`.
+      example: 20000.0
    - name: --insert_size
      alternatives: -i
      type: integer
      description: |
-        Maximum insert size.
-      default: 8000
+        Maximum insert size. Default: `8000`.
+      example: 8000
    - name: --id
      alternatives: -I
      type: string
@@ -76,14 +78,14 @@ argument_groups:
      alternatives: -l
      type: integer
      description: |
-        Include in the statistics only reads with the given read length.
-      default: -1
+        Include in the statistics only reads with the given read length. Default: `-1`.
+      example: -1
    - name: --most_inserts
      alternatives: -m
      type: double
      description: |
-        Report only the main part of inserts.
-      default: 0.99
+        Report only the main part of inserts. Default: `0.99`.
+      example: 0.99
    - name: --split_prefix
      alternatives: -P
      type: string
@@ -93,8 +95,8 @@ argument_groups:
      alternatives: -q
      type: integer
      description: |
-        The BWA trimming parameter.
-      default: 0
+        The BWA trimming parameter. Default: `0`.
+      example: 0
    - name: --ref_seq
      alternatives: -r
      type: file
@@ -124,8 +126,8 @@ argument_groups:
      alternatives: -g
      type: integer
      description: |
-        Only bases with coverage above this value will be included in the target percentage computation.
-      default: 0
+        Only bases with coverage above this value will be included in the target percentage computation. Default: `0`.
+      example: 0
    - name: --input_fmt_option
      type: string
      description: |
@@ -141,7 +143,7 @@ argument_groups:
      type: file
      description: |
        Output file.
-      default: "out.txt"
+      example: "out.txt"
      required: true
      direction: output

--- a/src/samtools/samtools_stats/script.sh
+++ b/src/samtools/samtools_stats/script.sh
@@ -10,6 +10,9 @@ set -e
 [[ "$par_sparse" == "false" ]] && unset par_sparse
 [[ "$par_remove_overlaps" == "false" ]] && unset par_remove_overlaps

+# change the coverage input from X;X;X to X,X,X
+par_coverage=$(echo "$par_coverage" | tr ';' ',')
+
 samtools stats \
    ${par_coverage:+-c "$par_coverage"} \
    ${par_remove_dups:+-d} \
--- a/src/samtools/samtools_stats/test.sh
+++ b/src/samtools/samtools_stats/test.sh
@@ -17,7 +17,7 @@ echo ">>> Checking whether output is non-empty"
 [ ! -s "$test_dir/test.paired_end.sorted.txt" ] && echo "File 'test.paired_end.sorted.txt' is empty!" && exit 1

 echo ">>> Checking whether output is correct"
-# compare using diff,  ignoring the line stating the command that was passed.
+# compare using diff, ignoring the line stating the command that was passed.
 diff <(grep -v "^# The command" "$test_dir/test.paired_end.sorted.txt") \
    <(grep -v "^# The command" "$test_dir/ref.paired_end.sorted.txt") || \
    (echo "Output file ref.paired_end.sorted.txt does not match expected output" && exit 1)
--- a/src/samtools/samtools_view/config.vsh.yaml
+++ b/src/samtools/samtools_view/config.vsh.yaml
@@ -9,7 +9,9 @@ links:
 references: 
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
 license: MIT/Expat
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/seqtk/seqtk_sample/config.vsh.yaml
+++ b/src/seqtk/seqtk_sample/config.vsh.yaml
@@ -0,0 +1,57 @@
+name: seqtk_sample
+namespace: seqtk
+description: Subsamples sequences from FASTA/Q files.
+keywords: [sample, FASTA, FASTQ]
+links:
+  repository: https://github.com/lh3/seqtk/tree/v1.4
+license: MIT
+authors:
+  - __merge__: /src/_authors/jakub_majercik.yaml
+    roles: [ author, maintainer ]
+
+argument_groups:
+  - name: Inputs
+    arguments:
+      - name: --input
+        type: file
+        description: The input FASTA/Q file.
+        required: true
+
+  - name: Outputs
+    arguments:
+      - name: --output
+        type: file
+        description: The output FASTA/Q file.
+        required: true
+        direction: output
+
+  - name: Options
+    arguments:
+      - name: --seed
+        type: integer
+        description: Seed for the random number generator.
+        example: 42
+      - name: --fraction_number
+        type: double
+        description: Fraction or number of sequences to sample.
+        required: true
+        example: 0.1
+      - name: --two_pass_mode
+        type: boolean_true
+        description: Use 2-pass mode, which is twice as slow but uses much less memory.
+
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: ../test_data
+
+engines:
+  - type: docker
+    image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
+runners:
+  - type: executable
+  - type: nextflow
--- a/src/seqtk/seqtk_sample/help.txt
+++ b/src/seqtk/seqtk_sample/help.txt
@@ -0,0 +1,7 @@
+```bash
+seqtk sample
+```
+Usage:   seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>
+
+Options: -s INT       RNG seed [11]
+         -2           2-pass mode: twice as slow but with much reduced memory
--- a/src/seqtk/seqtk_sample/script.sh
+++ b/src/seqtk/seqtk_sample/script.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+## VIASH START
+## VIASH END
+
+seqtk sample \
+    ${par_two_pass_mode:+-2} \
+    ${par_seed:+-s "$par_seed"} \
+    "$par_input" \
+    "$par_fraction_number" \
+    > "$par_output"
--- a/src/seqtk/seqtk_sample/test.sh
+++ b/src/seqtk/seqtk_sample/test.sh
@@ -0,0 +1,104 @@
+#!/bin/bash
+
+set -e
+
+## VIASH START
+meta_executable="target/executable/seqtk/seqtk_sample"
+meta_resources_dir="src/seqtk"
+## VIASH END
+
+#########################################################################################
+mkdir seqtk_sample_se
+cd seqtk_sample_se
+
+echo "> Run seqtk_sample on fastq SE"
+"$meta_executable" \
+  --input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
+  --seed 42 \
+  --fraction_number 3 \
+  --output "sampled.fastq"
+
+echo ">> Check if output exists"
+if [ ! -f "sampled.fastq" ]; then
+    echo ">> sampled.fastq does not exist"
+    exit 1
+fi
+
+echo ">> Count number of samples"
+num_samples=$(grep -c '^@' sampled.fastq)
+if [ "$num_samples" -ne 3 ]; then
+    echo ">> sampled.fastq does not contain 3 samples"
+    exit 1
+fi
+
+#########################################################################################
+cd ..
+mkdir seqtk_sample_pe_number
+cd seqtk_sample_pe_number
+
+echo ">> Run seqtk_sample on fastq.gz PE with number of reads"
+"$meta_executable" \
+  --input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
+  --seed 42 \
+  --fraction_number 3 \
+  --output "sampled_1.fastq"
+
+"$meta_executable" \
+  --input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
+  --seed 42 \
+  --fraction_number 3 \
+  --output "sampled_2.fastq"
+
+echo ">> Check if output exists"
+if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
+    echo ">> One or both output files do not exist"
+    exit 1
+fi
+
+echo ">> Compare reads"
+# Extract headers
+headers1=$(grep '^@' sampled_1.fastq | sed -e 's/ 1$//' | sort)
+headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
+
+# Compare headers
+diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
+
+echo ">> Count number of samples"
+num_headers=$(echo "$headers1" | wc -l)
+if [ "$num_headers" -ne 3 ]; then
+    echo ">> sampled_1.fastq does not contain 3 headers"
+    exit 1
+fi
+
+#########################################################################################
+cd ..
+mkdir seqtk_sample_pe_fraction
+cd seqtk_sample_pe_fraction
+
+echo ">> Run seqtk_sample on fastq.gz PE with fraction of reads"
+"$meta_executable" \
+  --input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
+  --seed 42 \
+  --fraction_number 0.5 \
+  --output "sampled_1.fastq"
+
+"$meta_executable" \
+  --input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
+  --seed 42 \
+  --fraction_number 0.5 \
+  --output "sampled_2.fastq"
+
+echo ">> Check if output exists"
+if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
+    echo ">> One or both output files do not exist"
+    exit 1
+fi
+
+echo ">> Compare reads"
+# Extract headers
+headers1=$(grep '^@' sampled_1.fastq | sed -e 's/ 1$//' | sort)
+headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)
+
+# Compare headers
+diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
+
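
The paired-end tests above rely on one property: sampling both mate files with the same seed selects the same records, so the headers still match after the ` 1` / ` 2` suffix is stripped. The sketch below illustrates that idea with Python's `random` module. It is not seqtk's sampling code, and the read names and the helper `sample_indices` are made up for the example.

```python
import random


def sample_indices(n_records, k, seed):
    """Pick k record positions reproducibly; same seed and size give the same picks."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_records), k))


def strip_mate_suffix(header):
    """Drop the trailing ' 1' or ' 2' mate marker from a header line."""
    return header.rsplit(" ", 1)[0]


# Hypothetical paired headers: record i in R1 pairs with record i in R2.
r1_headers = [f"@read{i} 1" for i in range(10)]
r2_headers = [f"@read{i} 2" for i in range(10)]

picked_1 = [r1_headers[i] for i in sample_indices(len(r1_headers), 3, seed=42)]
picked_2 = [r2_headers[i] for i in sample_indices(len(r2_headers), 3, seed=42)]

# Same seed, same positions: the mates stay in sync.
in_sync = [strip_mate_suffix(h) for h in picked_1] == [strip_mate_suffix(h) for h in picked_2]
print(in_sync)  # True
```

Using different seeds for the two files would generally break the pairing, which is what the header `diff` in the tests is designed to catch.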
--- a/src/seqtk/seqtk_subseq/config.vsh.yaml
+++ b/src/seqtk/seqtk_subseq/config.vsh.yaml
@@ -0,0 +1,78 @@
+name: seqtk_subseq
+namespace: seqtk
+description: | 
+  Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence ids file) or a reg.bed (genomic regions file).
+keywords: [subseq, FASTA, FASTQ]
+links:
+  repository: https://github.com/lh3/seqtk/tree/v1.4
+license: MIT
+authors:
+  - __merge__: /src/_authors/theodoro_gasperin.yaml
+    roles: [ author, maintainer ]
+
+argument_groups:
+  - name: Inputs
+    arguments:
+      - name: "--input"
+        type: file
+        direction: input
+        description: The input FASTA/Q file.
+        required: true
+        example: input.fa
+        
+      - name: "--name_list"
+        type: file
+        direction: input
+        description: | 
+          List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
+        required: true
+        example: list.lst
+
+  - name: Outputs
+    arguments:
+      - name: "--output"
+        alternatives: -o
+        type: file
+        direction: output
+        description: The output FASTA/Q file.
+        required: true
+        example: output.fa
+
+  - name: Options
+    arguments:
+      - name: "--tab"
+        alternatives: -t
+        type: boolean_true
+        description: TAB delimited output.
+        
+      - name: "--strand_aware"
+        alternatives: -s
+        type: boolean_true
+        description: Strand aware.
+        
+      - name: "--sequence_line_length"
+        alternatives: -l
+        type: integer
+        description: | 
+          Sequence line length of the output FASTA file; 0 disables line wrapping. Default: 0.
+        example: 0
+        
+
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+
+engines:
+  - type: docker
+    image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
+    setup:
+      - type: docker
+        run: |
+          echo $(echo $(seqtk 2>&1) | sed -n 's/.*\(Version: [^ ]*\).*/\1/p') > /var/software_versions.txt
+
+runners:
+  - type: executable
+  - type: nextflow
--- a/src/seqtk/seqtk_subseq/help.txt
+++ b/src/seqtk/seqtk_subseq/help.txt
@@ -0,0 +1,9 @@
+```bash
+seqtk subseq
+```
+Usage:   seqtk subseq [options] <in.fa> <in.bed>|<name.list>
+Options:
+  -t       TAB delimited output
+  -s       strand aware
+  -l INT   sequence line length [0]
+Note: Use 'samtools faidx' if only a few regions are intended.
--- a/src/seqtk/seqtk_subseq/script.sh
+++ b/src/seqtk/seqtk_subseq/script.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+## VIASH START
+## VIASH END
+
+[[ "$par_tab" == "false" ]] && unset par_tab
+[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware
+
+seqtk subseq \
+    ${par_tab:+-t} \
+    ${par_strand_aware:+-s} \
+    ${par_sequence_line_length:+-l "$par_sequence_line_length"} \
+    "$par_input" \
+    "$par_name_list" \
+    > "$par_output"
--- a/src/seqtk/seqtk_subseq/test.sh
+++ b/src/seqtk/seqtk_subseq/test.sh
@@ -0,0 +1,182 @@
+#!/bin/bash
+
+# exit on error
+set -e
+
+## VIASH START
+meta_executable="target/executable/seqtk/seqtk_subseq"
+meta_resources_dir="src/seqtk"
+## VIASH END
+
+# Create directories for tests
+echo "Creating Test Data..."
+mkdir test_data
+
+# Create and populate input.fasta
+cat > "test_data/input.fasta" <<EOL
+>KU562861.1
+GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
+AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
+>GU056837.1
+CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
+TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
+>CP097510.1
+CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
+AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
+>JAMFTS010000002.1
+CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
+ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
+>MH150936.1
+TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
+AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
+EOL
+
+# Create and populate id.list
+cat > "test_data/id.list" <<EOL
+KU562861.1
+MH150936.1
+EOL
+
+# Create and populate reg.bed
+printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
+  KU562861.1 10 20 region 0 + \
+  MH150936.1 10 20 region 0 - \
+  > "test_data/reg.bed"
+
+#########################################################################################
+# Run basic test
+mkdir test1
+cd test1
+
+echo "> Run seqtk_subseq on FASTA/Q file"
+"$meta_executable" \
+  --input "../test_data/input.fasta" \
+  --name_list "../test_data/id.list" \
+  --output "sub_sample.fq"
+
+expected_output_basic=">KU562861.1
+GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
+>MH150936.1
+TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
+output_basic=$(cat sub_sample.fq)
+
+if [ "$output_basic" != "$expected_output_basic" ]; then
+  echo "Test failed"
+  echo "Expected:"
+  echo "$expected_output_basic"
+  echo "Got:"
+  echo "$output_basic"
+  exit 1
+fi
+
+#########################################################################################
+# Run reg.bed as name list input test
+cd ..
+mkdir test2
+cd test2
+
+echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
+"$meta_executable" \
+  --input "../test_data/input.fasta" \
+  --name_list "../test_data/reg.bed" \
+  --output "sub_sample.fq"
+
+expected_output_basic=">KU562861.1:11-20
+AGTGTTCGAG
+>MH150936.1:11-20
+TGAAAACTTT"
+output_basic=$(cat sub_sample.fq)
+
+if [ "$output_basic" != "$expected_output_basic" ]; then
+  echo "Test failed"
+  echo "Expected:"
+  echo "$expected_output_basic"
+  echo "Got:"
+  echo "$output_basic"
+  exit 1
+fi
+
+#########################################################################################
+# Run tab option output test
+cd ..
+mkdir test3
+cd test3
+
+echo "> Run seqtk_subseq with TAB option"
+"$meta_executable" \
+  --tab \
+  --input "../test_data/input.fasta" \
+  --name_list "../test_data/reg.bed" \
+  --output "sub_sample.fq"
+
+expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG\nMH150936.1\t11\tTGAAAACTTT'
+output_tabular=$(cat sub_sample.fq)
+
+if [ "$output_tabular" != "$expected_output_tabular" ]; then
+  echo "Test failed"
+  echo "Expected:"
+  echo "$expected_output_tabular"
+  echo "Got:"
+  echo "$output_tabular"
+  exit 1
+fi
+
+#########################################################################################
+# Run line option output test
+cd ..
+mkdir test4
+cd test4
+
+echo "> Run seqtk_subseq with line length option"
+"$meta_executable" \
+  --sequence_line_length 5 \
+  --input "../test_data/input.fasta" \
+  --name_list "../test_data/reg.bed" \
+  --output "sub_sample.fq"
+
+expected_output_wrapped=">KU562861.1:11-20
+AGTGT
+TCGAG
+>MH150936.1:11-20
+TGAAA
+ACTTT"
+output_wrapped=$(cat sub_sample.fq)
+
+if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
+  echo "Test failed"
+  echo "Expected:"
+  echo "$expected_output_wrapped"
+  echo "Got:"
+  echo "$output_wrapped"
+  exit 1
+fi
+
+#########################################################################################
+# Run Strand Aware option output test
+cd ..
+mkdir test5
+cd test5
+
+echo "> Run seqtk_subseq with strand aware option"
+"$meta_executable" \
+  --strand_aware \
+  --input "../test_data/input.fasta" \
+  --name_list "../test_data/reg.bed" \
+  --output "sub_sample.fq"
+
+expected_output_stranded=">KU562861.1:11-20
+AGTGTTCGAG
+>MH150936.1:11-20
+AAAGTTTTCA"
+output_stranded=$(cat sub_sample.fq)
+
+if [ "$output_stranded" != "$expected_output_stranded" ]; then
+  echo "Test failed"
+  echo "Expected:"
+  echo "$expected_output_stranded"
+  echo "Got:"
+  echo "$output_stranded"
+  exit 1
+fi
+fi
+
+echo "All tests succeeded!"
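
The expected strings in the BED-based tests above follow from two conventions: BED intervals are 0-based and half-open (so `10 20` is reported as `11-20`), and with `--strand_aware` a region on the `-` strand is reverse-complemented. The following is a minimal sketch that recomputes those values from the first 30 bases of the two test records; the helper names are hypothetical and this is not seqtk's implementation.

```python
# First 30 bases of two records from test_data/input.fasta in the test above.
SEQUENCES = {
    "KU562861.1": "GGAGCAGGAGAGTGTTCGAGTTCAGAGATG",
    "MH150936.1": "TAGAAGCTAATGAAAACTTTTCCTTTACTA",
}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(name, start, end, strand="+", strand_aware=False):
    """Return (label, subsequence) for a 0-based, half-open BED interval."""
    sub = SEQUENCES[name][start:end]
    if strand_aware and strand == "-":
        sub = reverse_complement(sub)
    # The label uses 1-based, inclusive coordinates.
    return f"{name}:{start + 1}-{end}", sub


print(extract_region("KU562861.1", 10, 20, "+"))
# ('KU562861.1:11-20', 'AGTGTTCGAG')
print(extract_region("MH150936.1", 10, 20, "-"))
# ('MH150936.1:11-20', 'TGAAAACTTT')
print(extract_region("MH150936.1", 10, 20, "-", strand_aware=True))
# ('MH150936.1:11-20', 'AAAGTTTTCA')
```

Only the `-` strand record changes between the plain and strand-aware runs, which matches the difference between the expected outputs of the BED test and the strand-aware test.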
--- a/src/seqtk/test_data/reads/a.1.fastq.gz
+++ b/src/seqtk/test_data/reads/a.1.fastq.gz
--- a/src/seqtk/test_data/reads/a.2.fastq.gz
+++ b/src/seqtk/test_data/reads/a.2.fastq.gz
--- a/src/seqtk/test_data/reads/a.fastq
+++ b/src/seqtk/test_data/reads/a.fastq
@@ -0,0 +1,4 @@
+@1
+ACGGCAT
+
+!!!!!!!
--- a/src/seqtk/test_data/reads/a.fastq.gz
+++ b/src/seqtk/test_data/reads/a.fastq.gz
--- a/src/seqtk/test_data/reads/id.list
+++ b/src/seqtk/test_data/reads/id.list
@@ -0,0 +1 @@
+1
--- a/src/seqtk/test_data/script.sh
+++ b/src/seqtk/test_data/script.sh
@@ -0,0 +1,9 @@
+# clone repo
+if [ ! -d /tmp/snakemake-wrappers ]; then
+  git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
+fi
+
+# copy test data
+cp -r /tmp/snakemake-wrappers/bio/seqtk/test/* src/seqtk/test_data
+
+rm src/seqtk/test_data/Snakefile
--- a/src/star/star_align_reads/argument_groups.yaml
+++ b/src/star/star_align_reads/argument_groups.yaml
--- a/src/star/star_align_reads/config.vsh.yaml
+++ b/src/star/star_align_reads/config.vsh.yaml
@@ -11,6 +11,11 @@ references:
 license: MIT
 requirements:
  commands: [ STAR, python, ps, zcat, bzcat ]
+authors:
+  - __merge__: /src/_authors/angela_o_pisco.yaml
+    roles: [ author ]
+  - __merge__: /src/_authors/robrecht_cannoodt.yaml
+    roles: [ author, maintainer ]
 # manually taking care of the main input and output arguments
 argument_groups:
  - name: Inputs
@@ -113,6 +118,8 @@ engines:
            rm -rf /tmp/STAR-${STAR_VERSION} /tmp/${STAR_VERSION}.zip && \
            apt-get --purge autoremove -y ${PACKAGES} && \
            apt-get clean
+      - type: python
+        packages: [ pyyaml ]
      - type: docker
        run: |
          STAR --version | sed 's#\(.*\)#star: "\1"#' > /var/software_versions.txt
--- a/src/star/star_align_reads/script.py
+++ b/src/star/star_align_reads/script.py
@@ -2,6 +2,7 @@ import tempfile
 import subprocess
 import shutil
 from pathlib import Path
+import yaml

 ## VIASH START
 par = {
@@ -18,10 +19,20 @@ par = {
 }
 meta = {
    "cpus": 8,
-    "temp_dir": "/tmp"
+    "temp_dir": "/tmp",
+    "config": "target/executable/star/star_align_reads/.config.vsh.yaml",
 }
 ## VIASH END

+# read config
+with open(meta["config"], 'r') as stream:
+    config = yaml.safe_load(stream)
+all_arguments = {
+    arg["name"].lstrip('-'): arg
+    for argument_group in config["argument_groups"]
+    for arg in argument_group["arguments"]
+}
+
 ##################################################
 # check and process SE / PE R1 input files
 input_r1 = par["input"]
@@ -87,8 +98,13 @@ with tempfile.TemporaryDirectory(prefix="star-", dir=meta["temp_dir"], ignore_cl
    cmd_args = [ "STAR" ]
    for name, value in par.items():
        if value is not None:
+            if name in all_arguments:
+                arg_info = all_arguments[name].get("info", {})
+                cli_name = arg_info.get("orig_name", f"--{name}")
+            else:
+                cli_name = f"--{name}"
            val_to_add = value if isinstance(value, list) else [value]
-            cmd_args.extend([f"--{name}"] + [str(x) for x in val_to_add])
+            cmd_args.extend([cli_name] + [str(x) for x in val_to_add])
    print("", flush=True)

    # run command
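
The `script.py` change above maps each snake_case parameter back to the flag STAR itself understands by looking up `info.orig_name` in the component config, falling back to `--<name>` when no such entry exists. Below is a self-contained sketch of that lookup. The `config` dictionary is a hypothetical, trimmed-down stand-in for the parsed `.config.vsh.yaml`, and it assumes that `orig_name` already carries its leading dashes.

```python
# Hypothetical stand-in for yaml.safe_load() of the component config.
config = {
    "argument_groups": [
        {
            "name": "Inputs",
            "arguments": [
                {"name": "--input", "type": "file"},
                {"name": "--genome_dir", "type": "file",
                 "info": {"orig_name": "--genomeDir"}},
            ],
        },
        {
            "name": "Output",
            "arguments": [
                {"name": "--out_sj_type", "type": "string",
                 "info": {"orig_name": "--outSJtype"}},
            ],
        },
    ]
}


def build_cli_args(par, config, executable="STAR"):
    """Turn a parameter dict into a command line, restoring original flag names."""
    all_arguments = {
        arg["name"].lstrip("-"): arg
        for group in config["argument_groups"]
        for arg in group["arguments"]
    }
    cmd_args = [executable]
    for name, value in par.items():
        if value is None:
            continue  # unset parameters are not passed on
        arg_info = all_arguments.get(name, {}).get("info", {})
        cli_name = arg_info.get("orig_name", f"--{name}")
        values = value if isinstance(value, list) else [value]
        cmd_args.extend([cli_name] + [str(x) for x in values])
    return cmd_args


par = {"genome_dir": "index/", "out_sj_type": "Standard",
       "unknown_flag": 8, "out_sam_type": None}
cmd = build_cli_args(par, config)
print(cmd)
# ['STAR', '--genomeDir', 'index/', '--outSJtype', 'Standard', '--unknown_flag', '8']
```

Keeping the original name in the config rather than reversing the snake_case conversion avoids ambiguity, since lowercasing discards which letters were capitals.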
--- a/src/star/star_align_reads/test.sh
+++ b/src/star/star_align_reads/test.sh
@@ -7,35 +7,34 @@ meta_executable="target/docker/star/star_align_reads/star_align_reads"
 meta_resources_dir="src/star/star_align_reads"
 ## VIASH END

-#########################################################################################
-
+#############################################
 # helper functions
 assert_file_exists() {
-  [ -f "$1" ] || (echo "File '$1' does not exist" && exit 1)
+  [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
 }
 assert_file_doesnt_exist() {
-  [ ! -f "$1" ] || (echo "File '$1' exists but shouldn't" && exit 1)
+  [ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
 }
 assert_file_empty() {
-  [ ! -s "$1" ] || (echo "File '$1' is not empty but should be" && exit 1)
+  [ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
 }
 assert_file_not_empty() {
-  [ -s "$1" ] || (echo "File '$1' is empty but shouldn't be" && exit 1)
+  [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
 }
 assert_file_contains() {
-  grep -q "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
+  grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
 }
 assert_file_not_contains() {
-  grep -q "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
+  ! grep -q "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
 }
 assert_file_contains_regex() {
-  grep -q -E "$2" "$1" || (echo "File '$1' does not contain '$2'" && exit 1)
+  grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
 }
 assert_file_not_contains_regex() {
-  grep -q -E "$2" "$1" && (echo "File '$1' contains '$2' but shouldn't" && exit 1)
+  ! grep -q -E "$2" "$1" || { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
 }
+#############################################

-#########################################################################################
 echo "> Prepare test data"

 cat > reads_R1.fastq <<'EOF'
@@ -89,14 +88,14 @@ cd star_align_reads_se
 echo "> Run star_align_reads on SE"
 "$meta_executable" \
  --input "../reads_R1.fastq" \
-  --genomeDir "../index/" \
+  --genome_dir "../index/" \
  --aligned_reads "output.sam" \
  --log "log.txt" \
-  --outReadsUnmapped "Fastx" \
+  --out_reads_unmapped "Fastx" \
  --unmapped "unmapped.sam" \
-  --quantMode "TranscriptomeSAM;GeneCounts" \
+  --quant_mode "TranscriptomeSAM;GeneCounts" \
  --reads_per_gene "reads_per_gene.tsv" \
-  --outSJtype Standard \
+  --out_sj_type Standard \
  --splice_junctions "splice_junctions.tsv" \
  --reads_aligned_to_transcriptome "transcriptome_aligned.bam" \
  ${meta_cpus:+---cpus $meta_cpus}
@@ -144,10 +143,10 @@ echo ">> Run star_align_reads on PE"
 "$meta_executable" \
  --input ../reads_R1.fastq \
  --input_r2 ../reads_R2.fastq \
-  --genomeDir ../index/ \
+  --genome_dir ../index/ \
  --aligned_reads output.bam \
  --log log.txt \
-  --outReadsUnmapped Fastx \
+  --out_reads_unmapped Fastx \
  --unmapped unmapped_r1.bam \
  --unmapped_r2 unmapped_r2.bam \
  ${meta_cpus:+---cpus $meta_cpus}
--- a/src/star/star_align_reads/utils/process_params.R
+++ b/src/star/star_align_reads/utils/process_params.R
@@ -14,6 +14,14 @@ param_txt <- iconv(param_txt, "UTF-8", "ASCII//TRANSLIT")
 dev_begin <- grep("#####UnderDevelopment_begin", param_txt)
 dev_end <- grep("#####UnderDevelopment_end", param_txt)

+camel_case_to_snake_case <- function(x) {
+  x %>%
+    str_replace_all("([A-Z][A-Z][A-Z]*)", "_\\1_") %>%
+    str_replace_all("([a-z])([A-Z])", "\\1_\\2") %>%
+    str_to_lower() %>%
+    str_replace_all("_$", "")
+}
+
 # strip development sections
 nondev_ix <- unlist(map2(c(1, dev_end + 1), c(dev_begin - 1, length(param_txt)), function(i, j) {
  if (i >= 1 && i < j) {
@@ -128,9 +136,8 @@ out2 <- out %>%
  # remove arguments that are related to a different runmode
  filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description)) %>%
  filter(!grepl("--runMode", group_name) | grepl("--runMode alignReads", group_name)) %>%
-  filter(!grepl("STARsolo", group_name)) %>%
  mutate(
-    viash_arg = paste0("--", name),
+    viash_arg = paste0("--", camel_case_to_snake_case(name)),
    type_step1 = type %>%
      str_replace_all(".*(int, string|string|int|real|double)\\(?(s?).*", "\\1\\2"),
    viash_type = type_map[gsub("(int, string|string|int|real|double).*", "\\1", type_step1)],
@@ -155,28 +162,41 @@ out2 <- out %>%
    group_name = gsub(" - .*", "", group_name),
    required = ifelse(name %in% required_args, TRUE, NA)
  )
-print(out2, n = 200)
-out2 %>% mutate(i = row_number()) %>% 
-  # filter(is.na(default_step1) != is.na(viash_default)) %>%
+
+# change references to argument names
+out3 <- out2
+for (i in seq_len(nrow(out2))) {
+  orig_name <- paste0("--", out2$name[[i]])
+  new_name <- out2$viash_arg[[i]]
+  out3$description <- str_replace_all(out3$description, paste0(orig_name, "\\b"), new_name)
+}
+
+# sanity checks
+out3 %>% select(name, viash_arg) %>% as.data.frame()
+print(out3, n = 200)
+out3 %>%
+  mutate(i = row_number()) %>%
  select(-group_name, -description)
+out3 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))

-out2 %>% filter(!grepl("--runMode", description) | grepl("--runMode alignReads", description))
-
-argument_groups <- map(unique(out2$group_name), function(group_name) {
-  args <- out2 %>%
+# create argument groups
+argument_groups <- map(unique(out3$group_name), function(group_name) {
+  args <- out3 %>%
    filter(group_name == !!group_name) %>%
-    pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, ...) {
-      li <- lst(
+    pmap(function(viash_arg, viash_type, multiple, viash_default, description, required, name, ...) {
+      li <- list(
        name = viash_arg,
        type = viash_type,
-        description = description
+        description = description,
+        info = list(
+          orig_name = paste0("--", name)
+        )
      )
      if (all(!is.na(viash_default))) {
        li$example <- viash_default
      }
      if (!is.na(multiple) && multiple) {
        li$multiple <- multiple
-        li$multiple_sep <- ";"
      }
      if (!is.na(required) && required) {
        li$required <- required
@@ -186,4 +206,10 @@ argument_groups <- map(unique(out2$group_name), function(group_name) {
  list(name = group_name, arguments = args)
 })

-yaml::write_yaml(list(argument_groups = argument_groups), yaml_file)
+yaml::write_yaml(
+  list(argument_groups = argument_groups),
+  yaml_file,
+  handlers = list(
+    logical = yaml::verbatim_logical
+  )
+)
--- a/src/star/star_genome_generate/config.vsh.yaml
+++ b/src/star/star_genome_generate/config.vsh.yaml
@@ -11,75 +11,74 @@ references:
 license: MIT
 requirements:
  commands: [ STAR ]
-
+authors:
+  - __merge__: /src/_authors/sai_nirmayi_yasa.yaml
+    roles: [ author, maintainer ]
 argument_groups:
 - name: "Input"
  arguments: 
-  - name: "--genomeFastaFiles"
+  - name: "--genome_fasta_files"
    type: file
    description: |
      Path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped.
    required: true
-    multiple: yes
-    multiple_sep: ;  
-  - name: "--sjdbGTFfile"
+    multiple: true
+  - name: "--sjdb_gtf_file"
    type: file
    description: Path to the GTF file with annotations
-  - name: --sjdbOverhang
+  - name: --sjdb_overhang
    type: integer
    description: Length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
    example: 100
-  - name: --sjdbGTFchrPrefix
+  - name: --sjdb_gtf_chr_prefix
    type: string
    description: Prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSMEBL annotations with UCSC genomes)
-  - name: --sjdbGTFfeatureExon
+  - name: --sjdb_gtf_feature_exon
    type: string
    description: Feature type in GTF file to be used as exons for building transcripts
    example: exon
-  - name: --sjdbGTFtagExonParentTranscript
+  - name: --sjdb_gtf_tag_exon_parent_transcript
    type: string
    description: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)
    example: transcript_id
-  - name: --sjdbGTFtagExonParentGene
+  - name: --sjdb_gtf_tag_exon_parent_gene
    type: string
    description: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)
    example: gene_id
-  - name: --sjdbGTFtagExonParentGeneName
+  - name: --sjdb_gtf_tag_exon_parent_gene_name
    type: string
    description: GTF attribute name for parent gene name
    example: gene_name
-    multiple: yes
-    multiple_sep: ;
-  - name: --sjdbGTFtagExonParentGeneType
+    multiple: true
+  - name: --sjdb_gtf_tag_exon_parent_gene_type
    type: string
    description: GTF attribute name for parent gene type
    example:
    - gene_type
    - gene_biotype
-    multiple: yes
-    multiple_sep: ;
-  - name: --limitGenomeGenerateRAM
+    multiple: true
+  - name: --limit_genome_generate_ram
    type: long
    description: Maximum available RAM (bytes) for genome generation
-    example: '31000000000'
-  - name: --genomeSAindexNbases
+    example: 31000000000
+  - name: --genome_sa_index_nbases
    type: integer
    description: Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, this parameter must be scaled down to min(14, log2(GenomeLength)/2 - 1).
    example: 14
-  - name: --genomeChrBinNbits
+  - name: --genome_chr_bin_nbits
    type: integer
    description: Defined as log2(chrBin), where chrBin is the size of the bins for genome storage. Each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).
    example: 18
-  - name: --genomeSAsparseD
+  - name: --genome_sa_sparse_d
    type: integer
    min: 0
    example: 1
    description: Suffux array sparsity, i.e. distance between indices. Use bigger numbers to decrease needed RAM at the cost of mapping speed reduction.
-  - name: --genomeSuffixLengthMax
+  - name: --genome_suffix_length_max
    type: integer
    description: Maximum length of the suffixes, has to be longer than read length. Use -1 for infinite length.
    example: -1
-  - name: --genomeTransformType   
+  - name: --genome_transform_type   
    type: string
    description: |
      Type of genome transformation
@@ -87,7 +86,7 @@ argument_groups:
        Haploid    ... replace reference alleles with alternative alleles from VCF file (e.g. consensus allele)
        Diploid    ... create two haplotypes for each chromosome listed in VCF file, for genotypes 1|2, assumes perfect phasing (e.g. personal genome)
    example: None
-  - name: --genomeTransformVCF
+  - name: --genome_transform_vcf
    type: file
    description: path to VCF file for genome transformation
  
--- a/src/star/star_genome_generate/script.sh
+++ b/src/star/star_genome_generate/script.sh
@@ -10,20 +10,20 @@ mkdir -p $par_index
 STAR \
    --runMode genomeGenerate \
    --genomeDir $par_index \
-    --genomeFastaFiles $par_genomeFastaFiles \
+    --genomeFastaFiles $par_genome_fasta_files \
    ${meta_cpus:+--runThreadN "${meta_cpus}"} \
-    ${par_sjdbGTFfile:+--sjdbGTFfile "${par_sjdbGTFfile}"} \
+    ${par_sjdb_gtf_file:+--sjdbGTFfile "${par_sjdb_gtf_file}"} \
-    ${par_sjdbOverhang:+--sjdbOverhang "${par_sjdbOverhang}"} \
+    ${par_sjdb_overhang:+--sjdbOverhang "${par_sjdb_overhang}"} \
-    ${par_genomeSAindexNbases:+--genomeSAindexNbases "${par_genomeSAindexNbases}"} \
-    ${par_sjdbGTFchrPrefix:+--sjdbGTFchrPrefix "${par_sjdbGTFchrPrefix}"} \
-    ${par_sjdbGTFfeatureExon:+--sjdbGTFfeatureExon "${par_sjdbGTFfeatureExon}"} \
-    ${par_sjdbGTFtagExonParentTranscript:+--sjdbGTFtagExonParentTranscript "${par_sjdbGTFtagExonParentTranscript}"} \
-    ${par_sjdbGTFtagExonParentGene:+--sjdbGTFtagExonParentGene "${par_sjdbGTFtagExonParentGene}"} \
-    ${par_sjdbGTFtagExonParentGeneName:+--sjdbGTFtagExonParentGeneName "${par_sjdbGTFtagExonParentGeneName}"} \
-    ${par_sjdbGTFtagExonParentGeneType:+--sjdbGTFtagExonParentGeneType "${sjdbGTFtagExonParentGeneType}"} \
-    ${par_limitGenomeGenerateRAM:+--limitGenomeGenerateRAM "${par_limitGenomeGenerateRAM}"} \
-    ${par_genomeChrBinNbits:+--genomeChrBinNbits "${par_genomeChrBinNbits}"} \
-    ${par_genomeSAsparseD:+--genomeSAsparseD "${par_genomeSAsparseD}"} \
-    ${par_genomeSuffixLengthMax:+--genomeSuffixLengthMax "${par_genomeSuffixLengthMax}"} \
-    ${par_genomeTransformType:+--genomeTransformType "${par_genomeTransformType}"} \
-    ${par_genomeTransformVCF:+--genomeTransformVCF "${par_genomeTransformVCF}"} \
+    ${par_genome_sa_index_nbases:+--genomeSAindexNbases "${par_genome_sa_index_nbases}"} \
+    ${par_sjdb_gtf_chr_prefix:+--sjdbGTFchrPrefix "${par_sjdb_gtf_chr_prefix}"} \
+    ${par_sjdb_gtf_feature_exon:+--sjdbGTFfeatureExon "${par_sjdb_gtf_feature_exon}"} \
+    ${par_sjdb_gtf_tag_exon_parent_transcript:+--sjdbGTFtagExonParentTranscript "${par_sjdb_gtf_tag_exon_parent_transcript}"} \
+    ${par_sjdb_gtf_tag_exon_parent_gene:+--sjdbGTFtagExonParentGene "${par_sjdb_gtf_tag_exon_parent_gene}"} \
+    ${par_sjdb_gtf_tag_exon_parent_gene_name:+--sjdbGTFtagExonParentGeneName "${par_sjdb_gtf_tag_exon_parent_gene_name}"} \
+    ${par_sjdb_gtf_tag_exon_parent_gene_type:+--sjdbGTFtagExonParentGeneType "${par_sjdb_gtf_tag_exon_parent_gene_type}"} \
+    ${par_limit_genome_generate_ram:+--limitGenomeGenerateRAM "${par_limit_genome_generate_ram}"} \
+    ${par_genome_chr_bin_nbits:+--genomeChrBinNbits "${par_genome_chr_bin_nbits}"} \
+    ${par_genome_sa_sparse_d:+--genomeSAsparseD "${par_genome_sa_sparse_d}"} \
+    ${par_genome_suffix_length_max:+--genomeSuffixLengthMax "${par_genome_suffix_length_max}"} \
+    ${par_genome_transform_type:+--genomeTransformType "${par_genome_transform_type}"} \
+    ${par_genome_transform_vcf:+--genomeTransformVCF "${par_genome_transform_vcf}"} \
--- a/src/star/star_genome_generate/test.sh
+++ b/src/star/star_genome_generate/test.sh
@@ -27,9 +27,9 @@ echo "> Generate index"
 "$meta_executable" \
  ${meta_cpus:+---cpus $meta_cpus} \
  --index "star_index/" \
-  --genomeFastaFiles "genome.fasta" \
-  --sjdbGTFfile "genes.gtf" \
-  --genomeSAindexNbases 2
+  --genome_fasta_files "genome.fasta" \
+  --sjdb_gtf_file "genes.gtf" \
+  --genome_sa_index_nbases 4 

 files=("Genome" "Log.out" "SA" "SAindex" "chrLength.txt" "chrName.txt" "chrNameLength.txt" "chrStart.txt" "exonGeTrInfo.tab" "exonInfo.tab" "geneInfo.tab" "genomeParameters.txt" "sjdbInfo.txt" "sjdbList.fromGTF.out.tab" "sjdbList.out.tab" "transcriptInfo.tab")

--- a/src/umi_tools/umi_tools_dedup/config.vsh.yaml
+++ b/src/umi_tools/umi_tools_dedup/config.vsh.yaml
@@ -10,7 +10,9 @@ links:
 references: 
  doi: 10.1101/gr.209601.116
 license: MIT
-
+authors:
+  - __merge__: /src/_authors/emma_rousseau.yaml
+    roles: [ author, maintainer ]
 argument_groups:
  - name: Inputs
    arguments:
--- a/src/umi_tools/umi_tools_extract/config.vsh.yaml
+++ b/src/umi_tools/umi_tools_extract/config.vsh.yaml
@@ -0,0 +1,197 @@
+name: umi_tools_extract
+namespace: umi_tools
+description: |
+  Flexible removal of UMI sequences from fastq reads.
+  UMIs are removed and appended to the read name. Any other barcode, for example a library barcode,
+  is left on the read. Can also filter reads by quality or against a whitelist.
+keywords: [ extract, umi-tools, umi, fastq ]
+links:
+  homepage: https://umi-tools.readthedocs.io/en/latest/
+  documentation: https://umi-tools.readthedocs.io/en/latest/reference/extract.html
+  repository: https://github.com/CGATOxford/UMI-tools
+references: 
+  doi: 10.1101/gr.209601.116
+license: MIT
+
+argument_groups:
+
+  - name: Input
+    arguments: 
+    - name: --input
+      type: file
+      required: true
+      description: File containing the input data.
+      example: sample.fastq
+    - name: --read2_in
+      type: file
+      required: false
+      description: File containing the input data for the R2 reads (if paired). If provided, both --bc_pattern and --bc_pattern2 must also be provided.
+      example: sample_R2.fastq
+    - name: --bc_pattern
+      alternatives: -p
+      type: string
+      description: |
+        The UMI barcode pattern to use. For example, 'NNNNNN' indicates that the first 6 nucleotides
+        of the read are from the UMI.
+    - name: --bc_pattern2
+      type: string
+      description: The UMI barcode pattern to use for read 2.
+    
+  - name: "Output"
+    arguments:  
+    - name: --output
+      type: file
+      required: true
+      description: Output file for read 1.
+      direction: output
+    - name: --read2_out
+      type: file
+      description: Output file for read 2.
+      direction: output
+    - name: --filtered_out
+      type: file
+      description: |
+        Write out reads not matching regex pattern or cell barcode whitelist to this file.
+    - name: --filtered_out2
+      type: file
+      description: |
+        Write out read pairs not matching regex pattern or cell barcode whitelist to this file.
+
+  - name: Extract Options
+    arguments:
+    - name: --extract_method
+      type: string
+      choices: [string, regex]
+      description: |
+        Method used to extract the UMI barcode. Default: `string`.
+      example: "string"
+    - name: --error_correct_cell
+      type: boolean_true
+      description: Error correct cell barcodes to the whitelist.
+    - name: --whitelist
+      type: file
+      description: |
+        Whitelist of accepted cell barcodes in tab-separated format, where column 1 is the whitelisted
+        cell barcodes and column 2 is the list (comma-separated) of other cell barcodes which should 
+        be corrected to the barcode in column 1. If the --error_correct_cell option is not used, this
+        column will be ignored.
+    - name: --blacklist
+      type: file
+      description: Blacklist of cell barcodes to discard.
+    - name: --subset_reads
+      type: integer
+      description: Only parse the first N reads.
+    - name: --quality_filter_threshold
+      type: integer
+      description: Remove reads where any UMI base quality score falls below this threshold.
+    - name: --quality_filter_mask
+      type: string
+      description: |
+        If a UMI base has a quality below this threshold, replace the base with 'N'.
+    - name: --quality_encoding
+      type: string
+      choices: [phred33, phred64, solexa]
+      description: |
+        Quality score encoding. Choose from:
+          * phred33 [33-77]
+          * phred64 [64-106]
+          * solexa [59-106]
+    - name: --reconcile_pairs
+      type: boolean_true
+      description: |
+        Allow read 2 infile to contain reads not in read 1 infile. This enables support for upstream protocols
+        where read one contains cell barcodes, and the read pairs have been filtered and corrected without regard
+        to the read2.
+    - name: --three_prime
+      alternatives: --3prime
+      type: boolean_true
+      description: |
+        By default the barcode is assumed to be on the 5' end of the read, but use this option to specify that it is
+        on the 3' end instead. This option only works with --extract_method=string since 3' encoding can be specified
+        explicitly with a regex, e.g `.*(?P<umi_1>.{5})$`.
+    - name: --ignore_read_pair_suffixes
+      type: boolean_true
+      description: |
+        Ignore "/1" and "/2" read name suffixes. Note that this option is required if the suffixes are not whitespace
+        separated from the rest of the read name.
+
+    - name: --umi_separator
+      type: string
+      description: |
+        The character that separates the UMI in the read name. Most likely a colon if you skipped the extraction with
+        UMI-tools and used other software. Default: `_`
+      example: "_"
+    - name: --grouping_method
+      type: string
+      choices: [unique, percentile, cluster, adjacency, directional]
+      description: |
+        Method to use to determine read groups by subsuming those with similar UMIs. All methods start by identifying
+        the reads with the same mapping position, but treat similar yet nonidentical UMIs differently. Default: `directional`
+      example: "directional"
+    - name: --umi_discard_read
+      type: integer
+      choices: [0, 1, 2]
+      description: |
+        After UMI barcode extraction discard either R1 or R2 by setting this parameter to 1 or 2, respectively. Default: `0`
+      example: 0
+
+  - name: Common Options
+    arguments:
+    - name: --log
+      type: file
+      description: File with logging information.
+      direction: output
+    - name: --log2stderr
+      type: boolean_true
+      description: Send logging information to stderr.
+      direction: output
+    - name: --verbose
+      type: integer
+      description: Log level. The higher, the more output.
+    - name: --error
+      type: file
+      description: File with error information.
+      direction: output
+    - name: --temp_dir
+      type: string
+      description: |
+        Directory for temporary files. If not set, the bash environmental variable TMPDIR is used.
+    - name: --compresslevel
+      type: integer
+      description: |
+        Level of Gzip compression to use. The default of 6 matches GNU gzip rather than the Python gzip default (which is 9).
+        Default: `6`.
+      example: 6
+    - name: --timeit
+      type: file
+      description: Store timing information in file.
+      direction: output
+    - name: --timeit_name
+      type: string
+      description: Name in timing file for this class of jobs.
+      default: all
+    - name: --timeit_header
+      type: boolean_true
+      description: Add header for timing information.
+    - name: --random_seed
+      type: integer
+      description: Random seed to initialize number generator with.
+  
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: test_data
+engines:
+  - type: docker
+    image: quay.io/biocontainers/umi_tools:1.1.4--py310h4b81fae_2
+    setup:
+      - type: docker
+        run: |
+            umi_tools -v | sed 's/ version//g' > /var/software_versions.txt
+runners:
+- type: executable
+- type: nextflow
--- a/src/umi_tools/umi_tools_extract/help.txt
+++ b/src/umi_tools/umi_tools_extract/help.txt
@@ -0,0 +1,106 @@
+'''
+Generated from the following UMI-tools documentation:
+      https://umi-tools.readthedocs.io/en/latest/common_options.html#common-options
+      https://umi-tools.readthedocs.io/en/latest/reference/extract.html
+'''
+
+extract - Extract UMI from fastq
+
+Usage:
+
+   Single-end:
+      umi_tools extract [OPTIONS] -p PATTERN [-I IN_FASTQ[.gz]] [-S OUT_FASTQ[.gz]]
+
+   Paired end:
+      umi_tools extract [OPTIONS] -p PATTERN [-I IN_FASTQ[.gz]] [-S OUT_FASTQ[.gz]] --read2-in=IN2_FASTQ[.gz] --read2-out=OUT2_FASTQ[.gz]
+
+   note: If -I/-S are omitted, standard in and standard out are used
+         for input and output.  To generate a valid BAM file on
+         standard out, please redirect log with --log=LOGFILE or
+         --log2stderr. Input/Output will be (de)compressed if a
+         filename provided to -S/-I/--read2-in/--read2-out ends in .gz
+
+Common UMI-tools Options:
+
+      -S, --stdout                  File where output is to go [default = stdout].
+      -L, --log                     File with logging information [default = stdout].
+      --log2stderr                  Send logging information to stderr [default = False].
+      -v, --verbose                 Log level. The higher, the more output [default = 1].
+      -E, --error                   File with error information [default = stderr].
+      --temp-dir                    Directory for temporary files. If not set, the bash environmental variable TMPDIR is used [default = None].
+      --compresslevel               Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default (which is 9)
+
+      profiling and debugging options:
+      --timeit                      Store timing information in file [default=none].
+      --timeit-name                 Name in timing file for this class of jobs [default=all].
+      --timeit-header               Add header for timing information [default=none].
+      --random-seed                 Random seed to initialize number generator with [default=none].
+
+Extract Options:
+      -I, --stdin                   File containing the input data [default = stdin].
+      --error-correct-cell          Error correct cell barcodes to the whitelist (see --whitelist)
+      --whitelist                   Whitelist of accepted cell barcodes. The whitelist should be in the following format (tab-separated):
+                                          AAAAAA    AGAAAA
+                                          AAAATC
+                                          AAACAT
+                                          AAACTA    AAACTN,GAACTA
+                                          AAATAC
+                                          AAATCA    GAATCA
+                                          AAATGT    AAAGGT,CAATGT
+                                    Where column 1 is the whitelisted cell barcodes and column 2 is the list (comma-separated) of other cell 
+                                    barcodes which should be corrected to the barcode in column 1. If the --error-correct-cell option is not 
+                                    used, this column will be ignored. Any additional columns in the whitelist input, such as the counts columns 
+                                    from the output of umi_tools whitelist, will be ignored.
+      --blacklist                   Blacklist of cell barcodes to discard
+      --subset-reads=[N]            Only parse the first N reads
+      --quality-filter-threshold    Remove reads where any UMI base quality score falls below this threshold
+      --quality-filter-mask         If a UMI base has a quality below this threshold, replace the base with 'N'
+      --quality-encoding            Quality score encoding. Choose from:
+                                          'phred33' [33-77]
+                                          'phred64' [64-106]
+                                          'solexa' [59-106]
+      --reconcile-pairs             Allow read 2 infile to contain reads not in read 1 infile. This enables support for upstream protocols
+                                    where read one contains cell barcodes, and the read pairs have been filtered and corrected without regard
+                                    to the read2s.
+
+Experimental options:
+      Note: These options have not been extensively tested to ensure behaviour is as expected. If you have some suitable input files which
+            we can use for testing, please contact us.
+            If you have a library preparation method where the UMI may be in either read, you can use the following options to search for the
+            UMI in either read:
+
+                  --either-read --extract-method --bc-pattern=[PATTERN1] --bc-pattern2=[PATTERN2]
+
+            Where both patterns match, the default behaviour is to discard both reads. If you want to select the read with the UMI with highest
+            sequence quality, provide --either-read-resolve=quality.
+
+
+      --bc-pattern                  Pattern for barcode(s) on read 1. See --extract-method
+      --bc-pattern2                 Pattern for barcode(s) on read 2. See --extract-method
+      --extract-method              There are two methods enabled to extract the umi barcode (+/- cell barcode). For both methods, the patterns
+                                    should be provided using the --bc-pattern and --bc-pattern2 options.
+                                    string: 
+                                    This should be used where the barcodes are always in the same place in the read.
+                                          N = UMI position (required)
+                                          C = cell barcode position (optional)
+                                          X = sample position (optional)
+                                    Bases with Ns and Cs will be extracted and added to the read name. The corresponding sequence qualities will
+                                    be removed from the read. Bases with an X will be reattached to the read.
+                                    regex:
+                                    This method allows for more flexible barcode extraction and should be used where the cell barcodes are variable
+                                    in length. Alternatively, the regex option can also be used to filter out reads which do not contain an expected
+                                    adapter sequence. The regex must contain groups to define how the barcodes are encoded in the read. 
+                                    The expected groups in the regex are:
+                                          umi_n = UMI positions, where n can be any value (required) 
+                                          cell_n = cell barcode positions, where n can be any value (optional) 
+                                          discard_n = positions to discard, where n can be any value (optional)
+      --3prime                      By default the barcode is assumed to be on the 5' end of the read, but use this option to specify that it is
+                                    on the 3' end instead. This option only works with --extract-method=string since 3' encoding can be specified
+                                    explicitly with a regex, e.g .*(?P<umi_1>.{5})$
+      --read2-in                    Filename for read pairs
+      --filtered-out                Write out reads not matching regex pattern or cell barcode whitelist to this file
+      --filtered-out2               Write out read pairs not matching regex pattern or cell barcode whitelist to this file
+      --ignore-read-pair-suffixes   Ignore "/1" and "/2" read name suffixes. Note that this option is required if the suffixes are not whitespace
+                                    separated from the rest of the read name
+
+For full UMI-tools documentation, see https://umi-tools.readthedocs.io/en/latest/
--- a/src/umi_tools/umi_tools_extract/script.sh
+++ b/src/umi_tools/umi_tools_extract/script.sh
@@ -0,0 +1,88 @@
+#!/bin/bash
+
+## VIASH START
+## VIASH END
+
+set -exo pipefail
+
+test_dir="${meta_resources_dir}/test_data"
+
+[[ "$par_error_correct_cell" == "false" ]] && unset par_error_correct_cell
+[[ "$par_reconcile_pairs" == "false" ]] && unset par_reconcile_pairs
+[[ "$par_three_prime" == "false" ]] && unset par_three_prime
+[[ "$par_ignore_read_pair_suffixes" == "false" ]] && unset par_ignore_read_pair_suffixes
+[[ "$par_timeit_header" == "false" ]] && unset par_timeit_header
+[[ "$par_log2stderr" == "false" ]] && unset par_log2stderr
+
+
+# Check if we have the correct number of input files and patterns for paired-end or single-end reads
+
+# For paired-end reads, check that we have two read files and two patterns
+# Check for paired-end inputs
+if [ -n "$par_input" ] && [ -n "$par_read2_in" ]; then
+    # Paired-end checks: Ensure both UMI patterns are provided
+    if [ -z "$par_bc_pattern" ] || [ -z "$par_bc_pattern2" ]; then
+        echo "Paired end input requires two UMI patterns."
+        exit 1
+    fi
+elif [ -n "$par_input" ]; then
+    # Single-end checks: Ensure no second read or UMI pattern for the second read is provided
+    if [ -n "$par_bc_pattern2" ]; then
+        echo "Single end input requires only one read file and one UMI pattern."
+        exit 1
+    fi
+    # Check that discard_read is not set or set to 0 for single-end reads
+    if [ -n "$par_umi_discard_read" ] && [ "$par_umi_discard_read" != 0 ]; then
+        echo "umi_discard_read is only valid when processing paired end reads."
+        exit 1
+    fi
+else
+    # No inputs provided
+    echo "No input files provided."
+    exit 1
+fi
+
+
+
+
+umi_tools extract \
+    -I "$par_input" \
+    ${par_read2_in:+ --read2-in "$par_read2_in"} \
+    -S "$par_output" \
+    ${par_read2_out:+--read2-out "$par_read2_out"} \
+    ${par_extract_method:+--extract-method "$par_extract_method"} \
+    --bc-pattern "$par_bc_pattern" \
+    ${par_bc_pattern2:+ --bc-pattern2 "$par_bc_pattern2"} \
+    ${par_umi_separator:+--umi-separator "$par_umi_separator"} \
+    ${par_output_stats:+--output-stats "$par_output_stats"} \
+    ${par_error_correct_cell:+--error-correct-cell} \
+    ${par_whitelist:+--whitelist "$par_whitelist"} \
+    ${par_blacklist:+--blacklist "$par_blacklist"} \
+    ${par_subset_reads:+--subset-reads "$par_subset_reads"} \
+    ${par_quality_filter_threshold:+--quality-filter-threshold "$par_quality_filter_threshold"} \
+    ${par_quality_filter_mask:+--quality-filter-mask "$par_quality_filter_mask"} \
+    ${par_quality_encoding:+--quality-encoding "$par_quality_encoding"} \
+    ${par_reconcile_pairs:+--reconcile-pairs} \
+    ${par_three_prime:+--3prime} \
+    ${par_filtered_out:+--filtered-out "$par_filtered_out"} \
+    ${par_filtered_out2:+--filtered-out2 "$par_filtered_out2"} \
+    ${par_ignore_read_pair_suffixes:+--ignore-read-pair-suffixes} \
+    ${par_random_seed:+--random-seed "$par_random_seed"} \
+    ${par_temp_dir:+--temp-dir "$par_temp_dir"} \
+    ${par_compresslevel:+--compresslevel "$par_compresslevel"} \
+    ${par_timeit:+--timeit "$par_timeit"} \
+    ${par_timeit_name:+--timeit-name "$par_timeit_name"} \
+    ${par_timeit_header:+--timeit-header} \
+    ${par_log:+--log "$par_log"} \
+    ${par_log2stderr:+--log2stderr} \
+    ${par_verbose:+--verbose "$par_verbose"} \
+    ${par_error:+--error "$par_error"}
+
+
+if [ "$par_umi_discard_read" == 1 ]; then
+    # discard read 1 (written to $par_output via -S; -f to bypass file existence check)
+    rm -f "$par_output"
+elif [ "$par_umi_discard_read" == 2 ]; then
+    # discard read 2 (-f to bypass file existence check)
+    rm -f "$par_read2_out"
+fi
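The `umi_tools extract` call above relies on the `${var:+word}` expansion to pass a flag only when its parameter is set. A minimal sketch of that behaviour follows; `build_args` and its two positional parameters are hypothetical names used for illustration and are not part of the component.

```shell
#!/bin/bash
# Sketch of the ${var:+word} pattern. build_args is a hypothetical helper.
build_args() {
    local whitelist="$1" three_prime="$2"
    local args=()
    # ${var:+word} expands to nothing when var is unset or empty.
    # The quoted "$whitelist" inside the expansion stays a single word,
    # even when the value contains spaces.
    args+=( ${whitelist:+--whitelist "$whitelist"} )
    # Any non-empty value triggers the flag, including the string "false".
    args+=( ${three_prime:+--3prime} )
    printf '%s\n' "${args[@]}"
}

# Prints one argument per line: --whitelist, then cells.txt
build_args "cells.txt" ""
```

Because any non-empty string enables the flag, a boolean parameter that arrives as the literal string `false` would have to be unset before the call for this pattern to skip it.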
--- a/src/umi_tools/umi_tools_extract/test.sh
+++ b/src/umi_tools/umi_tools_extract/test.sh
@@ -0,0 +1,86 @@
+#!/bin/bash
+
+test_dir="${meta_resources_dir}/test_data"
+
+echo ">>> Testing $meta_functionality_name"
+
+############################################################################################################
+
+echo ">>> Test 1: Testing for paired-end reads"
+"$meta_executable" \
+    --input "$test_dir/scrb_seq_fastq.1_30" \
+    --read2_in "$test_dir/scrb_seq_fastq.2_30" \
+    --bc_pattern "CCCCCCNNNNNNNNNN" \
+    --bc_pattern2 "CCCCCCNNNNNNNNNN" \
+    --extract_method string \
+    --umi_separator '_' \
+    --grouping_method directional \
+    --umi_discard_read 0 \
+    --output scrb_seq_fastq.1_30.extract \
+    --read2_out scrb_seq_fastq.2_30.extract \
+    --random_seed 1
+
+echo ">> Checking if the correct files are present"
+[[ ! -f "scrb_seq_fastq.1_30.extract" ]] || [[ ! -f "scrb_seq_fastq.2_30.extract" ]] && echo "Reads file missing" && exit 1
+[ ! -s "scrb_seq_fastq.1_30.extract" ] && echo "Read 1 file is empty" && exit 1
+[ ! -s "scrb_seq_fastq.2_30.extract" ] && echo "Read 2 file is empty" && exit 1
+
+
+echo ">> Checking if the files are correct"
+diff -q "${meta_resources_dir}/scrb_seq_fastq.1_30.extract" "$test_dir/scrb_seq_fastq.1_30.extract" || \
+    { echo "Read 1 file is not correct"; exit 1; }
+diff -q "${meta_resources_dir}/scrb_seq_fastq.2_30.extract" "$test_dir/scrb_seq_fastq.2_30.extract" || \
+    { echo "Read 2 file is not correct"; exit 1; }
+
+rm scrb_seq_fastq.1_30.extract scrb_seq_fastq.2_30.extract
+
+############################################################################################################
+
+echo ">>> Test 2: Testing for paired-end reads with umi_discard_read option"
+"$meta_executable" \
+    --input "$test_dir/scrb_seq_fastq.1_30" \
+    --read2_in "$test_dir/scrb_seq_fastq.2_30" \
+    --bc_pattern CCCCCCNNNNNNNNNN \
+    --bc_pattern2 CCCCCCNNNNNNNNNN \
+    --extract_method string \
+    --umi_separator '_' \
+    --grouping_method directional \
+    --umi_discard_read 2 \
+    --output scrb_seq_fastq.1_30.extract \
+    --random_seed 1
+
+echo ">> Checking if the correct files are present"
+[ ! -f "scrb_seq_fastq.1_30.extract" ] && echo "Read 1 file is missing" && exit 1
+[ ! -s "scrb_seq_fastq.1_30.extract" ] && echo "Read 1 file is empty" && exit 1
+[ -f "scrb_seq_fastq.2_30.extract" ] && echo "Read 2 is not discarded" && exit 1
+
+echo ">> Checking if the files are correct"
+diff -q "${meta_resources_dir}/scrb_seq_fastq.1_30.extract" "$test_dir/scrb_seq_fastq.1_30.extract" || \
+    { echo "Read 1 file is not correct"; exit 1; }
+
+rm scrb_seq_fastq.1_30.extract
+
+############################################################################################################
+
+echo ">>> Test 3: Testing for single-end reads"
+"$meta_executable" \
+    --input "$test_dir/slim_30.fastq" \
+    --bc_pattern "^(?P<umi_1>.{3}).{4}(?P<umi_2>.{2})" \
+    --extract_method regex \
+    --umi_separator '_' \
+    --grouping_method directional \
+    --output slim_30.extract \
+    --random_seed 1
+
+echo ">> Checking if the correct files are present"
+[ ! -f "slim_30.extract" ] && echo "Trimmed reads file missing" && exit 1
+[ ! -s "slim_30.extract" ] && echo "Trimmed reads file is empty" && exit 1
+
+echo ">> Checking if the files are correct"
+diff -q "${meta_resources_dir}/slim_30.extract" "$test_dir/slim_30.extract" || \
+    { echo "Trimmed reads file is not correct"; exit 1; }
+
+rm slim_30.extract
+
+echo ">>> Test finished successfully"
+exit 0
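The file comparisons in this test only fail the run if `exit 1` actually terminates the script. A minimal sketch of how a parenthesised subshell and a brace group differ in that respect follows; both function names are hypothetical and are not part of the test script.

```shell
#!/bin/bash
# Hypothetical helpers contrasting ( ... ) and { ...; } after a failed check.
check_with_subshell() {
    # exit inside ( ... ) only ends the subshell, so execution continues.
    false || (echo "failed" && exit 1)
    echo "still running"
}

check_with_group() {
    # exit inside { ...; } ends the calling shell.
    false || { echo "failed"; exit 1; }
    echo "still running"
}

# Each call is wrapped in its own subshell so the demo itself keeps going.
( check_with_subshell ) && echo "subshell variant returned 0"
( check_with_group ) || echo "group variant exited with $?"
```

With the subshell form the failure message is printed but the caller carries on, so a later `exit 0` would still report success.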
--- a/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.1_30
+++ b/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.1_30
@@ -0,0 +1,120 @@
+@SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
+AATAACTTCCCGCGTCG
+SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
+@@@DDDBDDF>FFHGIB
+@SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=17
+AGCGGGGTGCTCGTCGT
+SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=17
+CCCFFFFFHHHHHJJJJ
+@SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=17
+CTTTAGTACCAGTCCTT
+SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=17
+BBCFFDADHHHHHHIJJ
+@SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=17
+AGGCGTTGTTTTTTTTT
+SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=17
+CCCFFFFFHHHHHJJJJ
+@SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=17
+ATCGAGACATAATTGAT
+SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=17
+@B@FFFFFHHHHHJJJJ
+@SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=17
+TGGGGGCGGTACATGAT
+SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=17
+BBBFFFFFHHHHHJJJJ
+@SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=17
+CTATATGTTTGCGCTGT
+SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=17
+1=BDFFFFHHHHHJJJJ
+@SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=17
+CTCCCGCATGCTGCTGT
+SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=17
+?BBFFFFFHHHHHJJJJ
+@SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=17
+GAGCCCTGAGGGGATCT
+SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=17
+1??DDDFD>DFDGFGHG
+@SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=17
+AGCGGGGTTCGCGGTTT
+SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=17
+CCCFFFFFHHHHHJIJI
+@SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=17
+AGAATTGCCTGGATTTT
+SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=17
+@CCFFFFAFHHHGJJJJ
+@SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=17
+AGGCGGGGCAACGGGTT
+SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=17
+CCCFFFFFHHGHHJJHH
+@SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=17
+GTCCCCGCGTCGTGTAG
+SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=17
+@C@FFFFFHFFGHJJJJ
+@SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=17
+CCACGCATTCACTCGGT
+SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=17
+BBBDFFFFHHHHHJJJJ
+@SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=17
+TGCGCAATAAGCGCTAT
+SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=17
+:=DDDDDBHHGDIBEH
+@SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=17
+CGCTGGCAGAGCCCGGT
+SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=17
+@BCFFFFFHHHHHJJJJ
+@SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=17
+AGGCGGTGCATAGTCTT
+SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=17
+CCCFFFFFHHHHHIJIH
+@SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=17
+GTCCCCCGCGTGTGACT
+SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=17
+<BBFFFFFHHHHHIIJJ
+@SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=17
+TATAGAAAAAACTTTTT
+SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=17
+B@BFDDFFGHHFHIJIJ
+@SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=17
+CATTATGGGCTTATTTT
+SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=17
+BBBFFFFFHHHHHJJJJ
+@SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=17
+AAATGTGCAGTTCAGAT
+SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=17
+BCCFFFFFHHHHHJJJJ
+@SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=17
+TGGGGGCTAAAGGGACT
+SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=17
+BBBDFFFFHHHHHJIJI
+@SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=17
+GATAATACTTACGGTGT
+SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=17
+CCCFFFFFHHHHHJFHI
+@SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=17
+CGTTAATAATTGTGGTT
+SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=17
+BBBFFFFFHHHHHIIHG
+@SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=17
+AAAAAAAAAAAAAAAAA
+SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=17
+@CCFFFFFGHAA<:46'
+@SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=17
+GCCGACCAACGATTTTT
+SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=17
+:=?DD@?DH;AFBFDFF
+@SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=17
+AATCAAGACCACTGAAT
+SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=17
+@CCFFFFFHHHHHJJJI
+@SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=17
+CGCGCTTTTGTTTTTTT
+SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=17
+BB@FFFFFHHHHHJJJJ
+@SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=17
+AAATACTTTGAGGGAAT
+SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=17
+@CCFFEFFHHFHGJJII
+@SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=17
+AGCGGGGTTTTATCGGT
+SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=17
+CCCFFFFDHHHHHHJJJ
--- a/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.1_30.extract
+++ b/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.1_30.extract
@@ -0,0 +1,120 @@
+@SRR1058032.1_AATAACCCTACA_TTCCCGCGTCCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
+G
+
+B
+@SRR1058032.2_AGCGGGACGCTA_GTGCTCGTCGTACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1191:2236 length=17
+T
+
+J
+@SRR1058032.3_CTTTAGACGCTA_TACCAGTCCTCACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1715:2245 length=17
+T
+
+J
+@SRR1058032.4_AGGCGTACTTTA_TGTTTTTTTTCACTCTCTCC HISEQ:653:H12WDADXX:1:1101:1905:2212 length=17
+T
+
+J
+@SRR1058032.5_ATCGAGGTGTAG_ACATAATTGAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:1927:2237 length=17
+T
+
+J
+@SRR1058032.6_TGGGGGCCTATA_CGGTACATGATAGTATAGCT HISEQ:653:H12WDADXX:1:1101:1876:2243 length=17
+T
+
+J
+@SRR1058032.7_CTATATATTAAA_GTTTGCGCTGGACAAACTAC HISEQ:653:H12WDADXX:1:1101:2491:2207 length=17
+T
+
+J
+@SRR1058032.8_CTCCCGGCCTAG_CATGCTGCTGTTGTGAACCA HISEQ:653:H12WDADXX:1:1101:2513:2219 length=17
+T
+
+J
+@SRR1058032.9_GAGCCCCCCTTC_TGAGGGGATCACGACGCTAC HISEQ:653:H12WDADXX:1:1101:2604:2231 length=17
+T
+
+G
+@SRR1058032.10_AGCGGGGGGAAA_GTTCGCGGTTGAGTGTGTCG HISEQ:653:H12WDADXX:1:1101:2936:2218 length=17
+T
+
+I
+@SRR1058032.11_AGAATTCCCACA_GCCTGGATTTCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:3447:2241 length=17
+T
+
+J
+@SRR1058032.12_AGGCGGGTGTAT_GGCAACGGGTGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:3620:2196 length=17
+T
+
+H
+@SRR1058032.13_GTCCCCCTCTTT_GCGTCGTGTACCCTACACTC HISEQ:653:H12WDADXX:1:1101:3875:2206 length=17
+G
+
+J
+@SRR1058032.14_CCACGCGTGTAG_ATTCACTCGGCGTCGTGTAG HISEQ:653:H12WDADXX:1:1101:4131:2215 length=17
+T
+
+J
+@SRR1058032.15_TGCGCAGTGTAT_ATAAGCGCTAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:4284:2241 length=17
+T
+
+H
+@SRR1058032.16_CGCTGGACTCTT_CAGAGCCCGGTCCCTACACT HISEQ:653:H12WDADXX:1:1101:4599:2232 length=17
+T
+
+J
+@SRR1058032.17_AGGCGGGATTCT_TGCATAGTCTTCAAATGAGG HISEQ:653:H12WDADXX:1:1101:5428:2200 length=17
+T
+
+H
+@SRR1058032.18_GTCCCCGCGTCG_CGCGTGTGACTGTAGGGAAA HISEQ:653:H12WDADXX:1:1101:5336:2218 length=17
+T
+
+J
+@SRR1058032.19_TATAGACCATCA_AAAAACTTTTCGCCTGCCCT HISEQ:653:H12WDADXX:1:1101:5397:2220 length=17
+T
+
+J
+@SRR1058032.20_CATTATTTAATG_GGGCTTATTTGACTGTTTCA HISEQ:653:H12WDADXX:1:1101:5605:2194 length=17
+T
+
+J
+@SRR1058032.21_AAATGTTATCTA_GCAGTTCAGAGACTGCTCGT HISEQ:653:H12WDADXX:1:1101:5519:2196 length=17
+T
+
+J
+@SRR1058032.22_TGGGGGACTGTT_CTAAAGGGACCTTTAACCAA HISEQ:653:H12WDADXX:1:1101:5705:2220 length=17
+T
+
+I
+@SRR1058032.23_GATAATTTCCAT_ACTTACGGTGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:5558:2236 length=17
+T
+
+I
+@SRR1058032.24_CGTTAAAGACGG_TAATTGTGGTACCAGAGCGA HISEQ:653:H12WDADXX:1:1101:5649:2244 length=17
+T
+
+G
+@SRR1058032.25_AAAAAAGAGTAT_AAAAAAAAAAAGGGAAAGAG HISEQ:653:H12WDADXX:1:1101:5910:2207 length=17
+A
+
+'
+@SRR1058032.26_GCCGACCCTTTT_CAACGATTTTATACAATACA HISEQ:653:H12WDADXX:1:1101:5757:2217 length=17
+T
+
+F
+@SRR1058032.27_AATCAAATCACA_GACCACTGAAGCTGGAGAGA HISEQ:653:H12WDADXX:1:1101:5790:2248 length=17
+T
+
+I
+@SRR1058032.28_CGCGCTGTACTA_TTTGTTTTTTGGCATCGTCA HISEQ:653:H12WDADXX:1:1101:6079:2195 length=17
+T
+
+J
+@SRR1058032.29_AAATACCCAATA_TTTGAGGGAAACTTGACCAA HISEQ:653:H12WDADXX:1:1101:6133:2213 length=17
+T
+
+I
+@SRR1058032.30_AGCGGGGAGTGT_GTTTTATCGGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:6651:2198 length=17
+T
+
+J
--- a/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.2_30
+++ b/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.2_30
@@ -0,0 +1,120 @@
+@SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34
+CCTACACTCTTTCCCTACACGACGCTACACTCTN
+SRR1058032.1 HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34
+@@@DFEDABD?A?ABGHGGGIGGEGIIIJJJFI#
+@SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=34
+ACGCTATACTCTTTCCCTACACGACGCTACACTN
+SRR1058032.2 HISEQ:653:H12WDADXX:1:1101:1191:2236 length=34
+CCCDFFFFHHHHHJJJJGGICGE6FDH<?F<F<#
+@SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=34
+ACGCTACACTCTTTCCCTACACGACGCTACACTN
+SRR1058032.3 HISEQ:653:H12WDADXX:1:1101:1715:2245 length=34
+C@CFFFFFGHHGAEHIIEIGIIAGFHIFG@FBE#
+@SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=34
+ACTTTACACTCTCTCCCTACACGACGCTACACTN
+SRR1058032.4 HISEQ:653:H12WDADXX:1:1101:1905:2212 length=34
+??;==DBDD?F:D<EGH<HGHIF>GEGCDG9FD#
+@SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=34
+GTGTAGGGAAAGAGTGTAAGGAAAGAGTGTAGCN
+SRR1058032.5 HISEQ:653:H12WDADXX:1:1101:1927:2237 length=34
+?=??B?DB2ACCAEAEFHHIHHHIHFHCEHHIG#
+@SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=34
+CCTATATAGTATAGCTTCCCATCTTCTTTGAGAN
+SRR1058032.6 HISEQ:653:H12WDADXX:1:1101:1876:2243 length=34
+CCCFFFFFHDHBHEIIJJJJIIIJJJGGGIGIE#
+@SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=34
+ATTAAAGACAAACTACAACTCATATGAGGCATTN
+SRR1058032.7 HISEQ:653:H12WDADXX:1:1101:2491:2207 length=34
+@@@DDDADDHHHFBFAHIGBHH<H<BHDFGIIG#
+@SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=34
+GCCTAGTTGTGAACCAAATGTGAAAAAACCTCCN
+SRR1058032.8 HISEQ:653:H12WDADXX:1:1101:2513:2219 length=34
+@@@FFDDDFFFFFIIGHIFI<HHEHCEBEFEED#
+@SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=34
+CCCTTCACGACGCTACACTCTTTCCCTACACGAN
+SRR1058032.9 HISEQ:653:H12WDADXX:1:1101:2604:2231 length=34
+C@CFFFFFHHHHHJJJIJJJJIJJJIGBHBFG:#
+@SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=34
+GGGAAAGAGTGTGTCGTGTATGGAAAGAGTGTAN
+SRR1058032.10 HISEQ:653:H12WDADXX:1:1101:2936:2218 length=34
+CCCFFFFDD>FAH;E@@?AB>F@BF3;3?1C?<#
+@SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=34
+CCCACACTCTTTCCCTACACGACGCTACACTCTN
+SRR1058032.11 HISEQ:653:H12WDADXX:1:1101:3447:2241 length=34
+@@@DDFDDBHBFHGI<F@GFBFEE>)C:D@@@B#
+@SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=34
+GTGTATGGAAAGAGTGTAGGGAAAGAGTGTAGGN
+SRR1058032.12 HISEQ:653:H12WDADXX:1:1101:3620:2196 length=34
+@@@DDDDAHHHFHIABEEEAB??CFBF?C@BFF#
+@SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=34
+CTCTTTCCCTACACTCTTTCCCTACACGACGCTN
+SRR1058032.13 HISEQ:653:H12WDADXX:1:1101:3875:2206 length=34
+@@@DDDAAADHDHDGDGIIIIIJJJJJJIJIIJ#
+@SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=34
+GTGTAGCGTCGTGTAGGGAAAGAGTGTGTGGAAN
+SRR1058032.14 HISEQ:653:H12WDADXX:1:1101:4131:2215 length=34
+@@@DDDDD?DFDCAEFHIGGFHEH:D1C:CG@F#
+@SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=34
+GTGTATGGAAAGAGTGTGCGTCGTACGTGTAGAN
+SRR1058032.15 HISEQ:653:H12WDADXX:1:1101:4284:2241 length=34
+@?@DDFFFHHHHGDAC:CHGGIIGIIIFHFGHB#
+@SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=34
+ACTCTTTCCCTACACTCTTTCCCTACACGACGCN
+SRR1058032.16 HISEQ:653:H12WDADXX:1:1101:4599:2232 length=34
+@@<DAAAA?>BCBE@9;EGGGGGIHJJIJHIGG#
+@SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=34
+GATTCTTCAAATGAGGACTATGCGGGACATGAAN
+SRR1058032.17 HISEQ:653:H12WDADXX:1:1101:5428:2200 length=34
+@@@DDDDDFHHFAHB;FHIIIIIIIIFHEHIHI#
+@SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=34
+GCGTCGTGTAGGGAAAGAGTGTAGCGTCGTGTAN
+SRR1058032.18 HISEQ:653:H12WDADXX:1:1101:5336:2218 length=34
+@@@DDDDD<FFD?GIIDGF+<<CBAFCGE@FB@#
+@SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=34
+CCATCACGCCTGCCCTTCCTTGAAATTACACCTN
+SRR1058032.19 HISEQ:653:H12WDADXX:1:1101:5397:2220 length=34
+;===AAA<@A72??A+22<+,+<+@+++*:***#
+@SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=34
+TTAATGGACTGTTTCAGGTAAAAGAGAATGAATN
+SRR1058032.20 HISEQ:653:H12WDADXX:1:1101:5605:2194 length=34
+CCCFFDAEHHHHDEHIGCEIIJJIGIJGIGGHE#
+@SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=34
+TATCTAGACTGCTCGTCATTTAGAAGACACGTCN
+SRR1058032.21 HISEQ:653:H12WDADXX:1:1101:5519:2196 length=34
+@B@FDDFFHFHBHEIIGIIJJGHGHIIIGIGII#
+@SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=34
+ACTGTTCTTTAACCAAACATCCGTGCGATTCGTN
+SRR1058032.22 HISEQ:653:H12WDADXX:1:1101:5705:2220 length=34
+CCCFFFFFHHHHHJJJJGHIJJIGIIIBEFG?G#
+@SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=34
+TTCCATACACTCTTTCCCTACACGACGCACACTN
+SRR1058032.23 HISEQ:653:H12WDADXX:1:1101:5558:2236 length=34
+@@@DFBEFHFFD<A<CD>BHEGGFGHGGIEGII#
+@SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=34
+AGACGGACCAGAGCGAAAGCATTTGCCAAGAATN
+SRR1058032.24 HISEQ:653:H12WDADXX:1:1101:5649:2244 length=34
+CCCFFFDFGHHHGJIIJJIJHEDD919CGGHJ@#
+@SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=34
+GAGTATAGGGAAAGAGTTTTTTTTTTTTTTTTTN
+SRR1058032.25 HISEQ:653:H12WDADXX:1:1101:5910:2207 length=34
+?=?DDDD>AB:ACEEGHIJJIJJJJIIJJHFDD#
+@SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=34
+CCTTTTATACAATACAAAGCTTTGCTTTTTTTTN
+SRR1058032.26 HISEQ:653:H12WDADXX:1:1101:5757:2217 length=34
+???DDDDDDDDD4EEEII@A<:33<33,22110#
+@SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=34
+ATCACAGCTGGAGAGATCTTGATCTTCATGGTGN
+SRR1058032.27 HISEQ:653:H12WDADXX:1:1101:5790:2248 length=34
+CCCFFFFFHHFHGGIIIIJIEAHCEHHEFECGD#
+@SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=34
+GTACTAGGCATCGTCATCCAATGCGACGAGTCCN
+SRR1058032.28 HISEQ:653:H12WDADXX:1:1101:6079:2195 length=34
+@@CFFDDFHHGHHIJJJIJJJIGGHIDG<GFHG#
+@SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=34
+CCAATAACTTGACCAACGGAACAAGTTACCCTAN
+SRR1058032.29 HISEQ:653:H12WDADXX:1:1101:6133:2213 length=34
+@CCFFFFFHHGHHIJJJJIJIIIIIIIIIJIJI#
+@SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=34
+GAGTGTACACTCTTTCCCTACACGACGTTACACN
+SRR1058032.30 HISEQ:653:H12WDADXX:1:1101:6651:2198 length=34
+???A:2ABDBDDDBEEIIA:F:CC8F<))1:??#
--- a/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.2_30.extract
+++ b/src/umi_tools/umi_tools_extract/test_data/scrb_seq_fastq.2_30.extract
@@ -0,0 +1,120 @@
+@SRR1058032.1_AATAACCCTACA_TTCCCGCGTCCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34
+ACACGACGCTACACTCTN
+
+HGGGIGGEGIIIJJJFI#
+@SRR1058032.2_AGCGGGACGCTA_GTGCTCGTCGTACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1191:2236 length=34
+CTACACGACGCTACACTN
+
+JGGICGE6FDH<?F<F<#
+@SRR1058032.3_CTTTAGACGCTA_TACCAGTCCTCACTCTTTCC HISEQ:653:H12WDADXX:1:1101:1715:2245 length=34
+CTACACGACGCTACACTN
+
+IEIGIIAGFHIFG@FBE#
+@SRR1058032.4_AGGCGTACTTTA_TGTTTTTTTTCACTCTCTCC HISEQ:653:H12WDADXX:1:1101:1905:2212 length=34
+CTACACGACGCTACACTN
+
+H<HGHIF>GEGCDG9FD#
+@SRR1058032.5_ATCGAGGTGTAG_ACATAATTGAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:1927:2237 length=34
+TAAGGAAAGAGTGTAGCN
+
+FHHIHHHIHFHCEHHIG#
+@SRR1058032.6_TGGGGGCCTATA_CGGTACATGATAGTATAGCT HISEQ:653:H12WDADXX:1:1101:1876:2243 length=34
+TCCCATCTTCTTTGAGAN
+
+JJJJIIIJJJGGGIGIE#
+@SRR1058032.7_CTATATATTAAA_GTTTGCGCTGGACAAACTAC HISEQ:653:H12WDADXX:1:1101:2491:2207 length=34
+AACTCATATGAGGCATTN
+
+HIGBHH<H<BHDFGIIG#
+@SRR1058032.8_CTCCCGGCCTAG_CATGCTGCTGTTGTGAACCA HISEQ:653:H12WDADXX:1:1101:2513:2219 length=34
+AATGTGAAAAAACCTCCN
+
+HIFI<HHEHCEBEFEED#
+@SRR1058032.9_GAGCCCCCCTTC_TGAGGGGATCACGACGCTAC HISEQ:653:H12WDADXX:1:1101:2604:2231 length=34
+ACTCTTTCCCTACACGAN
+
+IJJJJIJJJIGBHBFG:#
+@SRR1058032.10_AGCGGGGGGAAA_GTTCGCGGTTGAGTGTGTCG HISEQ:653:H12WDADXX:1:1101:2936:2218 length=34
+TGTATGGAAAGAGTGTAN
+
+@?AB>F@BF3;3?1C?<#
+@SRR1058032.11_AGAATTCCCACA_GCCTGGATTTCTCTTTCCCT HISEQ:653:H12WDADXX:1:1101:3447:2241 length=34
+ACACGACGCTACACTCTN
+
+F@GFBFEE>)C:D@@@B#
+@SRR1058032.12_AGGCGGGTGTAT_GGCAACGGGTGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:3620:2196 length=34
+TAGGGAAAGAGTGTAGGN
+
+EEEAB??CFBF?C@BFF#
+@SRR1058032.13_GTCCCCCTCTTT_GCGTCGTGTACCCTACACTC HISEQ:653:H12WDADXX:1:1101:3875:2206 length=34
+TTTCCCTACACGACGCTN
+
+GIIIIIJJJJJJIJIIJ#
+@SRR1058032.14_CCACGCGTGTAG_ATTCACTCGGCGTCGTGTAG HISEQ:653:H12WDADXX:1:1101:4131:2215 length=34
+GGAAAGAGTGTGTGGAAN
+
+HIGGFHEH:D1C:CG@F#
+@SRR1058032.15_TGCGCAGTGTAT_ATAAGCGCTAGGAAAGAGTG HISEQ:653:H12WDADXX:1:1101:4284:2241 length=34
+TGCGTCGTACGTGTAGAN
+
+:CHGGIIGIIIFHFGHB#
+@SRR1058032.16_CGCTGGACTCTT_CAGAGCCCGGTCCCTACACT HISEQ:653:H12WDADXX:1:1101:4599:2232 length=34
+CTTTCCCTACACGACGCN
+
+;EGGGGGIHJJIJHIGG#
+@SRR1058032.17_AGGCGGGATTCT_TGCATAGTCTTCAAATGAGG HISEQ:653:H12WDADXX:1:1101:5428:2200 length=34
+ACTATGCGGGACATGAAN
+
+FHIIIIIIIIFHEHIHI#
+@SRR1058032.18_GTCCCCGCGTCG_CGCGTGTGACTGTAGGGAAA HISEQ:653:H12WDADXX:1:1101:5336:2218 length=34
+GAGTGTAGCGTCGTGTAN
+
+DGF+<<CBAFCGE@FB@#
+@SRR1058032.19_TATAGACCATCA_AAAAACTTTTCGCCTGCCCT HISEQ:653:H12WDADXX:1:1101:5397:2220 length=34
+TCCTTGAAATTACACCTN
+
+22<+,+<+@+++*:***#
+@SRR1058032.20_CATTATTTAATG_GGGCTTATTTGACTGTTTCA HISEQ:653:H12WDADXX:1:1101:5605:2194 length=34
+GGTAAAAGAGAATGAATN
+
+GCEIIJJIGIJGIGGHE#
+@SRR1058032.21_AAATGTTATCTA_GCAGTTCAGAGACTGCTCGT HISEQ:653:H12WDADXX:1:1101:5519:2196 length=34
+CATTTAGAAGACACGTCN
+
+GIIJJGHGHIIIGIGII#
+@SRR1058032.22_TGGGGGACTGTT_CTAAAGGGACCTTTAACCAA HISEQ:653:H12WDADXX:1:1101:5705:2220 length=34
+ACATCCGTGCGATTCGTN
+
+JGHIJJIGIIIBEFG?G#
+@SRR1058032.23_GATAATTTCCAT_ACTTACGGTGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:5558:2236 length=34
+CCTACACGACGCACACTN
+
+D>BHEGGFGHGGIEGII#
+@SRR1058032.24_CGTTAAAGACGG_TAATTGTGGTACCAGAGCGA HISEQ:653:H12WDADXX:1:1101:5649:2244 length=34
+AAGCATTTGCCAAGAATN
+
+JJIJHEDD919CGGHJ@#
+@SRR1058032.25_AAAAAAGAGTAT_AAAAAAAAAAAGGGAAAGAG HISEQ:653:H12WDADXX:1:1101:5910:2207 length=34
+TTTTTTTTTTTTTTTTTN
+
+HIJJIJJJJIIJJHFDD#
+@SRR1058032.26_GCCGACCCTTTT_CAACGATTTTATACAATACA HISEQ:653:H12WDADXX:1:1101:5757:2217 length=34
+AAGCTTTGCTTTTTTTTN
+
+II@A<:33<33,22110#
+@SRR1058032.27_AATCAAATCACA_GACCACTGAAGCTGGAGAGA HISEQ:653:H12WDADXX:1:1101:5790:2248 length=34
+TCTTGATCTTCATGGTGN
+
+IIJIEAHCEHHEFECGD#
+@SRR1058032.28_CGCGCTGTACTA_TTTGTTTTTTGGCATCGTCA HISEQ:653:H12WDADXX:1:1101:6079:2195 length=34
+TCCAATGCGACGAGTCCN
+
+JIJJJIGGHIDG<GFHG#
+@SRR1058032.29_AAATACCCAATA_TTTGAGGGAAACTTGACCAA HISEQ:653:H12WDADXX:1:1101:6133:2213 length=34
+CGGAACAAGTTACCCTAN
+
+JJIJIIIIIIIIIJIJI#
+@SRR1058032.30_AGCGGGGAGTGT_GTTTTATCGGACACTCTTTC HISEQ:653:H12WDADXX:1:1101:6651:2198 length=34
+CCTACACGACGTTACACN
+
+IIA:F:CC8F<))1:??#
--- a/src/umi_tools/umi_tools_extract/test_data/script.sh
+++ b/src/umi_tools/umi_tools_extract/test_data/script.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+# Download test data
+wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/slim.fastq.gz
+wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.1.gz
+wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.2.gz
+
+gunzip -f slim.fastq.gz scrb_seq_fastq.1.gz scrb_seq_fastq.2.gz
+
+# smaller datasets
+head -n 120 slim.fastq > slim_30.fastq
+head -n 120 scrb_seq_fastq.1 > scrb_seq_fastq.1_30
+head -n 120 scrb_seq_fastq.2 > scrb_seq_fastq.2_30
+rm slim.fastq scrb_seq_fastq.1 scrb_seq_fastq.2
+
+# Generate expected output
+# Test 1 and 2
+umi_tools extract \
+    --stdin "scrb_seq_fastq.1_30" \
+    --read2-in "scrb_seq_fastq.2_30" \
+    --bc-pattern "CCCCCCNNNNNNNNNN" \
+    --bc-pattern2 "CCCCCCNNNNNNNNNN" \
+    --extract-method string \
+    --stdout scrb_seq_fastq.1_30.extract \
+    --read2-out scrb_seq_fastq.2_30.extract \
+    --random-seed 1
+
+# Test 3
+umi_tools extract \
+    --stdin "slim_30.fastq" \
+    --bc-pattern "^(?P<umi_1>.{3}).{4}(?P<umi_2>.{2})" \
+    --extract-method regex \
+    --stdout slim_30.extract \
+    --random-seed 1
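The two commands above produce the expected outputs for the string and regex extraction methods. The sketch below mimics, in plain bash, what those patterns do to the first record of each test file. `split_by_pattern` and `split_by_regex` are hypothetical helpers written for illustration. They are not umi_tools code and ignore quality strings, filtering and the other options.

```shell
#!/bin/bash
# Hypothetical helper illustrating --extract-method string.
# Pattern characters: C = cell barcode, N = UMI, anything else stays in the read.
split_by_pattern() {
    local pattern="$1" read="$2"
    local cell="" umi="" rest="" i
    for (( i = 0; i < ${#read}; i++ )); do
        case "${pattern:i:1}" in
            C) cell+="${read:i:1}" ;;
            N) umi+="${read:i:1}" ;;
            *) rest+="${read:i:1}" ;;  # X, or past the end of the pattern
        esac
    done
    echo "$cell $umi $rest"
}

# Hypothetical helper illustrating --extract-method regex for the pattern
# ^(?P<umi_1>.{3}).{4}(?P<umi_2>.{2}). Bash ERE has no named groups, so the
# groups are positional: 1 = umi_1, 2 = kept bases, 3 = umi_2, 4 = remainder.
split_by_regex() {
    local read="$1"
    local re='^(.{3})(.{4})(.{2})(.*)$'
    if [[ "$read" =~ $re ]]; then
        echo "${BASH_REMATCH[1]}${BASH_REMATCH[3]} ${BASH_REMATCH[2]}${BASH_REMATCH[4]}"
    fi
}

# First record of scrb_seq_fastq.1_30 and scrb_seq_fastq.2_30.
read -r c1 u1 r1 <<< "$(split_by_pattern CCCCCCNNNNNNNNNN AATAACTTCCCGCGTCG)"
read -r c2 u2 r2 <<< "$(split_by_pattern CCCCCCNNNNNNNNNN CCTACACTCTTTCCCTACACGACGCTACACTCTN)"
echo "@SRR1058032.1_${c1}${c2}_${u1}${u2}"   # @SRR1058032.1_AATAACCCTACA_TTCCCGCGTCCTCTTTCCCT
echo "$r1 $r2"                               # G ACACGACGCTACACTCTN

# First record of slim_30.fastq.
split_by_regex CAGGTTCAATCTCGGTGGGACCTC      # CAGAA GTTCTCTCGGTGGGACCTC
```

The printed read name and trimmed sequences match the first records of `scrb_seq_fastq.1_30.extract`, `scrb_seq_fastq.2_30.extract` and `slim_30.extract` in this commit.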
--- a/src/umi_tools/umi_tools_extract/test_data/slim_30.extract
+++ b/src/umi_tools/umi_tools_extract/test_data/slim_30.extract
@@ -0,0 +1,120 @@
+@SRR2057595.7_CAGAA
+GTTCTCTCGGTGGGACCTC
+
+FFFFHHHJJJFGIJIJJIJ
+@SRR2057595.9_TTGAA
+GTTCTCTGATGCCCTCTTCTGGTGCATCTGAAGACAGCTACAGTGTACTTAGATATAATAAATAAATCTT
+
+FDBDFHHIGGEHJGGIHGHGGCAFCHGIGEHIJJJJIJJJIHIIIIIIJIIIIIGHIIGGIJGIIJIIJ@
+@SRR2057595.14_TGGAT
+GTTAGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
+FFFFHHHJJIJJJJIGHJJIIJJJJJIJHFHHFFEDEEEEDDDDBDDDD
+@SRR2057595.22_ACGAT
+GTTAGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGC
+
+FFFFHHHJJJJJJJJIJJJJJJJJJJJJHHHFFFEDEEEEDDDDBDDD
+@SRR2057595.23_GCGTT
+GTTACCTAAGGCGAGCTCAGGGAGGACAGAAACCTCCCGTGGAGCAGAAGGGCAAAAGCTCGCTTGATCT
+
+FFFFHHHJJJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIJJHHHHHFFFFDDDDDDDDDDDDDDDDDDA
+@SRR2057595.29_ACGTT
+GTTCGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT
+
+FFFFHHHJJJJJJJJHIJJJJJJIJJJJHHHFFDEDEEDDCDDDBDDDDD
+@SRR2057595.30_GAGAA
+GTTGAATCCGTGCTAAGAAGAA
+
+DFFFHHHJJJJIJJJJJJJJJJ
+@SRR2057595.33_TCGAT
+GTTTCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCTGGG
+
+FFFFHHHJJJJJJJJJJJJJJJJJJJJJJJJJDHIJJJJIJJJHGGEEHFFFFFFEDDEDDDDDDDDDDB
+@SRR2057595.35_ACGCT
+GTTACCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
+DFFFHHHJJJJJJIJJJJJJIJJJJJJJHIIJJ
+@SRR2057595.38_GGGCC
+GTTATGCATGTTTATAGTTTCTAGTTTTGGCATTTTGTGTGGTCTCTTTTTTGTT
+
+DFFFHHHJJJJJJJJJJHJJIJJJIJJJJJJJJJJJJGIGHJHIJJIJJJJJJJJ
+@SRR2057595.42_TAGGA
+GTTGTAAGTTATACACTGACTAAGTCATCTGTTACTGCCTTCACTGAGTTTTTATTTCCTTT
+
+DFFFHHHJJJJJJJJJJJJJJJJJIIJJJJGJJJJJJJJJJJJJJJIIHIJJJJJJJIJJJI
+@SRR2057595.45_CTGGC
+GTTTTGCGGAAGGATCATTA
+
+DDDDFFDFFAGFE<EB8?BF
+@SRR2057595.46_CAGTT
+GTTTTGGCTTTTTTTTAAAACCATTTTGTGAAAGGTTTCTGAAACTTGATAATAAAAAGCAGTTGGTGTA
+
+DDDDHHFIGIJJJJJJJIIIJIIIJJJICHHIGIJFHHGHIEIHGHFHEDFFEFEFEEDEDD@CDD<@B:
+@SRR2057595.56_GGGCG
+GTTTATGAAGAACGCAGCTAGCTGCGAGAATTAATGTGAATTGCAGGACACATTGATCATCGACACTTCG
+
+FFFFHHHJJJJJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJJJJJJJJJJJJHHHHHHFFFFFDDDDDDD
+@SRR2057595.59_GCGCC
+GTTATCCTGTCTTATCATTGTCTTTTGAGCCTGGGCCTTGCCAGGTAGCTCTAGACTGGCCTAGAACTCA
+
+FFFFHHHJJJJJJJJJJJJ4CHHJJJJJJJJJJJJJJJJJJJJIJDHIJJJIIJJJJJIJJJJHHHHHHB
+@SRR2057595.60_ATGCA
+GTTTTCTCGTCTGATCTCGGAAGCTAAGCAGGGCCGGGCCTGGTTAGTACTTGGATGGGAGACCGCC
+
+DDFDBBBFECFE@HHIBCBG<2CGEC49?1CBD)86:;AB=7C.=;=)77;A3;?C@;96=?@B8;?
+@SRR2057595.61_GAGAG
+GTTTCAGGACACATTGATCATCGACACTTCGAACGCACTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCC
+
+DFFFHHHGGHGHIIJGEFEGGFH9GGIIFGGGGIFGDHBG@FGGGHEFCCB?@@CDCCD?B7>B@ACB9<
+@SRR2057595.65_GCGCG
+GTTTGAGCTTGCTCCGTCCACTCAACGCATCGACCTGGTATTGCAGTACCTCCAGGAACGGTGCACCAAG
+
+FFFFHHHJJJJJJHJIHHIIIIIIIJHJBHIHBFHHJI@EHJJHHHHHHHFFFBDE?AEBD=AB@CDBD?
+@SRR2057595.67_AAGGT
+GTTGTTTTGAGGTCCTGCTCGTGCAGGGT
+
+DDDFHHHHGFHGFGGIIDGHHIGIJJJJ9
+@SRR2057595.69_ATTAT
+GGTTTTTGTTTTTCCTCCTTCTCTTTCTAAA
+
+FFFFHHHHJJJJJJJJJJJJJJJJJJIJIJJ
+@SRR2057595.70_TTAAA
+GGTTTTGTAATTTTATGAGGTCCCATTTGTCAATTCTT
+
+DDDD2CDFA@FBGHCCHFHGBFHGHIGGDHGHIIFCFF
+@SRR2057595.71_TGCCA
+GGTTTATTAGCATGGCCCCTGCGCAAGGATGACACGCAAATTCGTGAAGCGTTCCATATTT
+
+FFFFHGHHJJJJJJJJJJIIJJIJIJJIFHJIIIJJJIJJJJJJHIIHHHHFFFDEECEEE
+@SRR2057595.73_TGACA
+GGTTGCGAGTGCCTAGTGGGCCACTTTTGGTAAGCAGAACTGGCGCTGCGGGA
+
+FFFFGFFHC@EBHGHGAEGIIHIIIIJJJJGHIIIJIJIIGHIJIJJIGGEFD
+@SRR2057595.74_AATTC
+GGTTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
+FFFFDFFHFIJJJGGGGJJGDDDDDDDDDDDDDBDDDDDDBBDDDDDDDDDDDDDDDDDDDBBBDDDDBD>
+@SRR2057595.77_GCGGA
+GTTCTCCCACTTCTGAC
+
+FFFFDHHHIJJJIJJJJ
+@SRR2057595.82_GAGAC
+GGTTTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
+FFFFHHHHJJJJJJJJJJJJJJJJIJJJJJIJIIJJJH
+@SRR2057595.83_TGGAT
+GTTGCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+
+DFFFHHHJJIJJJJJJIJJJIJJIGGHIFHGEH
+@SRR2057595.86_ACCAC
+GGTTTTTTTTTAAATGTAAAGCATAAATAAAAAGCCTTTGTGGACTGTGAAAAAAAAAAAAAAAAAAAAAA
+
+FFFFHHHHJJJJJJJJIIJJJJJJJJJIJJJJJJJJJJJJGIJIIJJIJJJJJJHFDDDDDDDDDDDDDB>
+@SRR2057595.88_TCAGC
+GGTTCTAAGCATAGATAACCATATATCAGGGGGAGCTCCATGTTCTAGTCCTGCAAGCGCCTGGGCAATAA
+
+FFFFHHHHJJJJJJIJJJJJIJJJJJJIJJIJJIJJJJJJJJJHIJJJJJJIIIHJIHHHFFDDDDEDDD@
+@SRR2057595.99_TGACA
+GGTTTCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT
+
+FFFFDHHHIHIIIJJIJJJJJIGEHGFHIJJGHIHADHIIJIJJJIJG
--- a/src/umi_tools/umi_tools_extract/test_data/slim_30.fastq
+++ b/src/umi_tools/umi_tools_extract/test_data/slim_30.fastq
@@ -0,0 +1,120 @@
+@SRR2057595.7
+CAGGTTCAATCTCGGTGGGACCTC
+SRR2057595.7
+1=DFFFFHHHHHJJJFGIJIJJIJ
+@SRR2057595.9
+TTGGTTCAATCTGATGCCCTCTTCTGGTGCATCTGAAGACAGCTACAGTGTACTTAGATATAATAAATAAATCTT
+SRR2057595.9
+4=DFDBDHHFHHIGGEHJGGIHGHGGCAFCHGIGEHIJJJJIJJJIHIIIIIIJIIIIIGHIIGGIJGIIJIIJ@
+@SRR2057595.14
+TGGGTTAATGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.14
+1=DFFFFHHHHHJJIJJJJIGHJJIIJJJJJIJHFHHFFEDEEEEDDDDBDDDD
+@SRR2057595.22
+ACGGTTAATGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGC
+SRR2057595.22
+1=DFFFFHHHHHJJJJJJJJIJJJJJJJJJJJJHHHFFFEDEEEEDDDDBDDD
+@SRR2057595.23
+GCGGTTATTCCTAAGGCGAGCTCAGGGAGGACAGAAACCTCCCGTGGAGCAGAAGGGCAAAAGCTCGCTTGATCT
+SRR2057595.23
+1=DFFFFHHHHHJJJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIJJHHHHHFFFFDDDDDDDDDDDDDDDDDDA
+@SRR2057595.29
+ACGGTTCTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT
+SRR2057595.29
+1=DFFFFHHHHHJJJJJJJJHIJJJJJJIJJJJHHHFFDEDEEDDCDDDBDDDDD
+@SRR2057595.30
+GAGGTTGAAAATCCGTGCTAAGAAGAA
+SRR2057595.30
+4=DDFFFHHHHHJJJJIJJJJJJJJJJ
+@SRR2057595.33
+TCGGTTTATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCTGGG
+SRR2057595.33
+1=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJDHIJJJJIJJJHGGEEHFFFFFFEDDEDDDDDDDDDDB
+@SRR2057595.35
+ACGGTTACTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.35
+1=DDFFFHHHHHJJJJJJIJJJJJJIJJJJJJJHIIJJ
+@SRR2057595.38
+GGGGTTACCTGCATGTTTATAGTTTCTAGTTTTGGCATTTTGTGTGGTCTCTTTTTTGTT
+SRR2057595.38
+1=DDFFFHHHHHJJJJJJJJJJHJJIJJJIJJJJJJJJJJJJGIGHJHIJJIJJJJJJJJ
+@SRR2057595.42
+TAGGTTGGATAAGTTATACACTGACTAAGTCATCTGTTACTGCCTTCACTGAGTTTTTATTTCCTTT
+SRR2057595.42
+1=DDFFFHHHHHJJJJJJJJJJJJJJJJJIIJJJJGJJJJJJJJJJJJJJJIIHIJJJJJJJIJJJI
+@SRR2057595.45
+CTGGTTTGCTGCGGAAGGATCATTA
+SRR2057595.45
+1:DDDDDDDFFDFFAGFE<EB8?BF
+@SRR2057595.46
+CAGGTTTTTTGGCTTTTTTTTAAAACCATTTTGTGAAAGGTTTCTGAAACTTGATAATAAAAAGCAGTTGGTGTA
+SRR2057595.46
+4=DDDDDHHHHFIGIJJJJJJJIIIJIIIJJJICHHIGIJFHHGHIEIHGHFHEDFFEFEFEEDEDD@CDD<@B:
+@SRR2057595.56
+GGGGTTTCGATGAAGAACGCAGCTAGCTGCGAGAATTAATGTGAATTGCAGGACACATTGATCATCGACACTTCG
+SRR2057595.56
+4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJJJJJJJJJJJJHHHHHHFFFFFDDDDDDD
+@SRR2057595.59
+GCGGTTACCTCCTGTCTTATCATTGTCTTTTGAGCCTGGGCCTTGCCAGGTAGCTCTAGACTGGCCTAGAACTCA
+SRR2057595.59
+1=DFFFFHHHHHJJJJJJJJJJJJ4CHHJJJJJJJJJJJJJJJJJJJJIJDHIJJJIIJJJJJIJJJJHHHHHHB
+@SRR2057595.60
+ATGGTTTCATCTCGTCTGATCTCGGAAGCTAAGCAGGGCCGGGCCTGGTTAGTACTTGGATGGGAGACCGCC
+SRR2057595.60
+11BDDFDFFBBBFECFE@HHIBCBG<2CGEC49?1CBD)86:;AB=7C.=;=)77;A3;?C@;96=?@B8;?
+@SRR2057595.61
+GAGGTTTAGCAGGACACATTGATCATCGACACTTCGAACGCACTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCC
+SRR2057595.61
+1=DDFFFGHHHHGGHGHIIJGEFEGGFH9GGIIFGGGGIFGDHBG@FGGGHEFCCB?@@CDCCD?B7>B@ACB9<
+@SRR2057595.65
+GCGGTTTCGGAGCTTGCTCCGTCCACTCAACGCATCGACCTGGTATTGCAGTACCTCCAGGAACGGTGCACCAAG
+SRR2057595.65
+1=DFFFFHHHHHJJJJJJHJIHHIIIIIIIJHJBHIHBFHHJI@EHJJHHHHHHHFFFBDE?AEBD=AB@CDBD?
+@SRR2057595.67
+AAGGTTGGTTTTTGAGGTCCTGCTCGTGCAGGGT
+SRR2057595.67
+1:BDDDFHFHHHHGFHGFGGIIDGHHIGIJJJJ9
+@SRR2057595.69
+ATTGGTTATTTTGTTTTTCCTCCTTCTCTTTCTAAA
+SRR2057595.69
+CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJIJIJJ
+@SRR2057595.70
+TTAGGTTAATTGTAATTTTATGAGGTCCCATTTGTCAATTCTT
+SRR2057595.70
+@@@DDDDD+2CDFA@FBGHCCHFHGBFHGHIGGDHGHIIFCFF
+@SRR2057595.71
+TGCGGTTCATATTAGCATGGCCCCTGCGCAAGGATGACACGCAAATTCGTGAAGCGTTCCATATTT
+SRR2057595.71
+CCCFFFFFHHGHHJJJJJJJJJJIIJJIJIJJIFHJIIIJJJIJJJJJJHIIHHHHFFFDEECEEE
+@SRR2057595.73
+TGAGGTTCAGCGAGTGCCTAGTGGGCCACTTTTGGTAAGCAGAACTGGCGCTGCGGGA
+SRR2057595.73
+@@@FFFFFHGFFHC@EBHGHGAEGIIHIIIIJJJJGHIIIJIJIIGHIJIJJIGGEFD
+@SRR2057595.74
+AATGGTTTCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR2057595.74
+@CCFFFFFGDFFHFIJJJGGGGJJGDDDDDDDDDDDDDBDDDDDDBBDDDDDDDDDDDDDDDDDDDBBBDDDDBD>
+@SRR2057595.77
+GCGGTTCGATCCCACTTCTGAC
+SRR2057595.77
+1=DFFFFHGDHHHIJJJIJJJJ
+@SRR2057595.82
+GAGGGTTACTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.82
+CBCFFFFFHHHHHJJJJJJJJJJJJJJJJIJJJJJIJIIJJJH
+@SRR2057595.83
+TGGGTTGATCCCGGGGCTACGCCTGTCTGAGCGTCGCT
+SRR2057595.83
+1=DDFFFHHHHHJJIJJJJJJIJJJIJJIGGHIFHGEH
+@SRR2057595.86
+ACCGGTTACTTTTTTTAAATGTAAAGCATAAATAAAAAGCCTTTGTGGACTGTGAAAAAAAAAAAAAAAAAAAAAA
+SRR2057595.86
+BCCFFFFFHHHHHJJJJJJJJIIJJJJJJJJJIJJJJJJJJJJJJGIJIIJJIJJJJJJHFDDDDDDDDDDDDDB>
+@SRR2057595.88
+TCAGGTTGCCTAAGCATAGATAACCATATATCAGGGGGAGCTCCATGTTCTAGTCCTGCAAGCGCCTGGGCAATAA
+SRR2057595.88
+CCCFFFFFHHHHHJJJJJJIJJJJJIJJJJJJIJJIJJIJJJJJJJJJHIJJJJJJIIIHJIHHHFFDDDDEDDD@
+@SRR2057595.99
+TGAGGTTCATCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT
+SRR2057595.99
+B@CFFFFFFDHHHIHIIIJJIJJJJJIGEHGFHIJJGHIHADHIIJIJJJIJG
--- a/target/executable/agat/agat_convert_sp_gff2gtf/.config.vsh.yaml
+++ b/target/executable/agat/agat_convert_sp_gff2gtf/.config.vsh.yaml
@@ -0,0 +1,254 @@
+name: "agat_convert_sp_gff2gtf"
+namespace: "agat"
+version: "qualimap"
+authors:
+- name: "Leïla Paquay"
+  roles:
+  - "author"
+  - "maintainer"
+  info:
+    links:
+      email: "leila@data-intuitive.com"
+      github: "Leila011"
+      linkedin: "leilapaquay"
+    organizations:
+    - name: "Data Intuitive"
+      href: "https://www.data-intuitive.com"
+      role: "Software Developer"
+argument_groups:
+- name: "Inputs"
+  arguments:
+  - type: "file"
+    name: "--gff"
+    alternatives:
+    - "-i"
+    description: "Input GFF/GTF file that will be read"
+    info: null
+    example:
+    - "input.gff"
+    must_exist: true
+    create_parent: true
+    required: true
+    direction: "input"
+    multiple: false
+    multiple_sep: ";"
+- name: "Outputs"
+  arguments:
+  - type: "file"
+    name: "--output"
+    alternatives:
+    - "-o"
+    - "--out"
+    - "--outfile"
+    - "--gtf"
+    description: "Output GTF file. If no output file is specified, the output will\
+      \ be written to STDOUT."
+    info: null
+    example:
+    - "output.gtf"
+    must_exist: true
+    create_parent: true
+    required: true
+    direction: "output"
+    multiple: false
+    multiple_sep: ";"
+- name: "Arguments"
+  arguments:
+  - type: "string"
+    name: "--gtf_version"
+    description: "Version of the GTF output (1,2,2.1,2.2,2.5,3 or relax). Default\
+      \ value from AGAT config file (relax for the default config). The script option\
+      \ has the higher priority.  \n\n  * relax: all feature types are accepted. \
+      \ \n  * GTF3 (9 feature types accepted): gene, transcript, exon, CDS, Selenocysteine,\
+      \ start_codon, stop_codon, three_prime_utr and five_prime_utr.  \n  * GTF2.5\
+      \ (8 feature types accepted): gene, transcript, exon, CDS, UTR, start_codon,\
+      \ stop_codon, Selenocysteine.  \n  * GTF2.2 (9 feature types accepted): CDS,\
+      \ start_codon, stop_codon, 5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon.\
+      \  \n  * GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon, exon,\
+      \ 5UTR, 3UTR.  \n  * GTF2 (4 feature types accepted): CDS, start_codon, stop_codon,\
+      \ exon.  \n  * GTF1 (5 feature types accepted): CDS, start_codon, stop_codon,\
+      \ exon, intron.  \n"
+    info: null
+    example:
+    - "3"
+    required: false
+    choices:
+    - "relax"
+    - "1"
+    - "2"
+    - "2.1"
+    - "2.2"
+    - "2.5"
+    - "3"
+    direction: "input"
+    multiple: false
+    multiple_sep: ";"
+  - type: "file"
+    name: "--config"
+    alternatives:
+    - "-c"
+    description: "Input agat config file. By default AGAT takes as input agat_config.yaml\
+      \ file from the working directory if any, otherwise it takes the original agat_config.yaml\
+      \ shipped with AGAT. To get the agat_config.yaml locally type: \"agat config\
+      \ --expose\". The --config option gives you the possibility to use your own\
+      \ AGAT config file (located elsewhere or named differently).\n"
+    info: null
+    example:
+    - "custom_agat_config.yaml"
+    must_exist: true
+    create_parent: true
+    required: false
+    direction: "input"
+    multiple: false
+    multiple_sep: ";"
+resources:
+- type: "bash_script"
+  path: "script.sh"
+  is_executable: true
+description: "The script aims to convert any GTF/GFF file into a proper GTF file.\
+  \ Full\ninformation about the format can be found here:\nhttps://agat.readthedocs.io/en/latest/gxf.html\
+  \ You can choose among 7\ndifferent GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax).\
+  \ Depending on the\nversion selected the script will filter out the features that are\
+  \ not\naccepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene\npseudogene)\
+  \ will be converted into gene feature and every level2 feature\n(e.g mRNA ncRNA)\
+  \ will be converted into transcript feature. Using the\n\"relax\" option you will\
+  \ produce a GTF-like output keeping all original\nfeature types (3rd column). No\
+  \ modification will occur e.g. mRNA to\ntranscript.\n\nTo be fully GTF compliant\
+  \ all features have a gene_id and a transcript_id\nattribute. The gene_id is a unique\
+  \ identifier for the genomic source of\nthe transcript, which is used to group transcripts\
+  \ into genes. The\ntranscript_id is a unique identifier for the predicted transcript,\
+  \ which\nis used to group features into transcripts.\n"
+test_resources:
+- type: "bash_script"
+  path: "test.sh"
+  is_executable: true
+- type: "file"
+  path: "test_data"
+info: null
+status: "enabled"
+requirements:
+  commands:
+  - "ps"
+keywords:
+- "gene annotations"
+- "GTF conversion"
+license: "GPL-3.0"
+references:
+  doi:
+  - "10.5281/zenodo.3552717"
+links:
+  repository: "https://github.com/NBISweden/AGAT"
+  homepage: "https://github.com/NBISweden/AGAT"
+  documentation: "https://agat.readthedocs.io/"
+  issue_tracker: "https://github.com/NBISweden/AGAT/issues"
+runners:
+- type: "executable"
+  id: "executable"
+  docker_setup_strategy: "ifneedbepullelsecachedbuild"
+- type: "nextflow"
+  id: "nextflow"
+  directives:
+    tag: "$id"
+  auto:
+    simplifyInput: true
+    simplifyOutput: false
+    transcript: false
+    publish: false
+  config:
+    labels:
+      mem1gb: "memory = 1000000000.B"
+      mem2gb: "memory = 2000000000.B"
+      mem5gb: "memory = 5000000000.B"
+      mem10gb: "memory = 10000000000.B"
+      mem20gb: "memory = 20000000000.B"
+      mem50gb: "memory = 50000000000.B"
+      mem100gb: "memory = 100000000000.B"
+      mem200gb: "memory = 200000000000.B"
+      mem500gb: "memory = 500000000000.B"
+      mem1tb: "memory = 1000000000000.B"
+      mem2tb: "memory = 2000000000000.B"
+      mem5tb: "memory = 5000000000000.B"
+      mem10tb: "memory = 10000000000000.B"
+      mem20tb: "memory = 20000000000000.B"
+      mem50tb: "memory = 50000000000000.B"
+      mem100tb: "memory = 100000000000000.B"
+      mem200tb: "memory = 200000000000000.B"
+      mem500tb: "memory = 500000000000000.B"
+      mem1gib: "memory = 1073741824.B"
+      mem2gib: "memory = 2147483648.B"
+      mem4gib: "memory = 4294967296.B"
+      mem8gib: "memory = 8589934592.B"
+      mem16gib: "memory = 17179869184.B"
+      mem32gib: "memory = 34359738368.B"
+      mem64gib: "memory = 68719476736.B"
+      mem128gib: "memory = 137438953472.B"
+      mem256gib: "memory = 274877906944.B"
+      mem512gib: "memory = 549755813888.B"
+      mem1tib: "memory = 1099511627776.B"
+      mem2tib: "memory = 2199023255552.B"
+      mem4tib: "memory = 4398046511104.B"
+      mem8tib: "memory = 8796093022208.B"
+      mem16tib: "memory = 17592186044416.B"
+      mem32tib: "memory = 35184372088832.B"
+      mem64tib: "memory = 70368744177664.B"
+      mem128tib: "memory = 140737488355328.B"
+      mem256tib: "memory = 281474976710656.B"
+      mem512tib: "memory = 562949953421312.B"
+      cpu1: "cpus = 1"
+      cpu2: "cpus = 2"
+      cpu5: "cpus = 5"
+      cpu10: "cpus = 10"
+      cpu20: "cpus = 20"
+      cpu50: "cpus = 50"
+      cpu100: "cpus = 100"
+      cpu200: "cpus = 200"
+      cpu500: "cpus = 500"
+      cpu1000: "cpus = 1000"
+  debug: false
+  container: "docker"
+engines:
+- type: "docker"
+  id: "docker"
+  image: "quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0"
+  target_registry: "images.viash-hub.com"
+  target_tag: "qualimap"
+  namespace_separator: "/"
+  setup:
+  - type: "docker"
+    run:
+    - "agat --version | sed 's/AGAT\\s\\(.*\\)/agat: \"\\1\"/' > /var/software_versions.txt\n"
+  entrypoint: []
+  cmd: null
+- type: "native"
+  id: "native"
+build_info:
+  config: "src/agat/agat_convert_sp_gff2gtf/config.vsh.yaml"
+  runner: "executable"
+  engine: "docker|native"
+  output: "target/executable/agat/agat_convert_sp_gff2gtf"
+  executable: "target/executable/agat/agat_convert_sp_gff2gtf/agat_convert_sp_gff2gtf"
+  viash_version: "0.9.0-RC6"
+  git_commit: "28cd12293505544b3e09ff6343e4724dedb772d3"
+  git_remote: "https://github.com/viash-hub/biobox"
+package_config:
+  name: "biobox"
+  version: "qualimap"
+  description: "A collection of bioinformatics tools for working with sequence data.\n"
+  info: null
+  viash_version: "0.9.0-RC6"
+  source: "src"
+  target: "target"
+  config_mods:
+  - ".requirements.commands := ['ps']\n"
+  - ".engines += { type: \"native\" }"
+  - ".engines[.type == 'docker'].target_registry := 'images.viash-hub.com'"
+  - ".engines[.type == 'docker'].target_tag := 'qualimap'"
+  keywords:
+  - "bioinformatics"
+  - "modules"
+  - "sequencing"
+  license: "MIT"
+  organization: "vsh"
+  links:
+    repository: "https://github.com/viash-hub/biobox"
+    issue_tracker: "https://github.com/viash-hub/biobox/issues"
--- a/target/executable/agat/agat_convert_sp_gff2gtf/agat_convert_sp_gff2gtf
+++ b/target/executable/agat/agat_convert_sp_gff2gtf/agat_convert_sp_gff2gtf
--- a/target/executable/arriba/.config.vsh.yaml
+++ b/target/executable/arriba/.config.vsh.yaml
@@ -1,5 +1,23 @@
 name: "arriba"
 version: "qualimap"
+authors:
+- name: "Robrecht Cannoodt"
+  roles:
+  - "author"
+  - "maintainer"
+  info:
+    links:
+      email: "robrecht@data-intuitive.com"
+      github: "rcannood"
+      orcid: "0000-0003-3641-729X"
+      linkedin: "robrechtcannoodt"
+    organizations:
+    - name: "Data Intuitive"
+      href: "https://www.data-intuitive.com"
+      role: "Data Science Engineer"
+    - name: "Open Problems"
+      href: "https://openproblems.bio"
+      role: "Core Member"
 argument_groups:
 - name: "Inputs"
  arguments:
@@ -688,7 +706,7 @@ build_info:
  output: "target/executable/arriba"
  executable: "target/executable/arriba/arriba"
  viash_version: "0.9.0-RC6"
-  git_commit: "e6420cd80f226128b7223ff79ce1297f99993657"
+  git_commit: "28cd12293505544b3e09ff6343e4724dedb772d3"
  git_remote: "https://github.com/viash-hub/biobox"
 package_config:
  name: "biobox"
--- a/target/executable/arriba/arriba
+++ b/target/executable/arriba/arriba
@@ -10,6 +10,9 @@
 # authors of this component should specify the license in the header of such
 # files, or include a separate license file detailing the licenses of all included
 # files.
+# 
+# Component authors:
+#  * Robrecht Cannoodt (author, maintainer)

 set -e

@@ -748,10 +751,11 @@ FROM quay.io/biocontainers/arriba:2.4.0--h0033a41_2
 ENTRYPOINT []
 RUN arriba -h | grep 'Version:' 2>&1 |  sed 's/Version:\s\(.*\)/arriba: "\1"/' > /var/software_versions.txt

+LABEL org.opencontainers.image.authors="Robrecht Cannoodt"
 LABEL org.opencontainers.image.description="Companion container for running component arriba"
-LABEL org.opencontainers.image.created="2024-07-29T14:42:19Z"
+LABEL org.opencontainers.image.created="2024-07-29T14:45:24Z"
 LABEL org.opencontainers.image.source="https://github.com/suhrig/arriba"
-LABEL org.opencontainers.image.revision="e6420cd80f226128b7223ff79ce1297f99993657"
+LABEL org.opencontainers.image.revision="28cd12293505544b3e09ff6343e4724dedb772d3"
 LABEL org.opencontainers.image.version="qualimap"

 VIASHDOCKER
--- a/target/executable/bcl_convert/.config.vsh.yaml
+++ b/target/executable/bcl_convert/.config.vsh.yaml
@@ -1,5 +1,30 @@
 name: "bcl_convert"
 version: "qualimap"
+authors:
+- name: "Toni Verbeiren"
+  roles:
+  - "author"
+  - "maintainer"
+  info:
+    links:
+      github: "tverbeiren"
+      linkedin: "verbeiren"
+    organizations:
+    - name: "Data Intuitive"
+      href: "https://www.data-intuitive.com"
+      role: "Data Scientist and CEO"
+- name: "Dorien Roosen"
+  roles:
+  - "author"
+  info:
+    links:
+      email: "dorien@data-intuitive.com"
+      github: "dorien-er"
+      linkedin: "dorien-roosen"
+    organizations:
+    - name: "Data Intuitive"
+      href: "https://www.data-intuitive.com"
+      role: "Data Scientist"
 argument_groups:
 - name: "Input arguments"
  arguments:
@@ -281,9 +306,16 @@ status: "enabled"
 requirements:
  commands:
  - "ps"
-license: "MIT"
+keywords:
+- "demultiplex"
+- "fastq"
+- "bcl"
+- "illumina"
+license: "Proprietary"
 links:
  repository: "https://github.com/viash-hub/biobox"
+  homepage: "https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html"
+  documentation: "https://support.illumina.com/downloads/bcl-convert-user-guide.html"
 runners:
 - type: "executable"
  id: "executable"
@@ -386,7 +418,7 @@ build_info:
  output: "target/executable/bcl_convert"
  executable: "target/executable/bcl_convert/bcl_convert"
  viash_version: "0.9.0-RC6"
-  git_commit: "e6420cd80f226128b7223ff79ce1297f99993657"
+  git_commit: "28cd12293505544b3e09ff6343e4724dedb772d3"
  git_remote: "https://github.com/viash-hub/biobox"
 package_config:
  name: "biobox"
--- a/target/executable/bcl_convert/bcl_convert
+++ b/target/executable/bcl_convert/bcl_convert
@@ -10,6 +10,10 @@
 # authors of this component should specify the license in the header of such
 # files, or include a separate license file detailing the licenses of all included
 # files.
+# 
+# Component authors:
+#  * Toni Verbeiren (author, maintainer)
+#  * Dorien Roosen (author)

 set -e

@@ -592,10 +596,11 @@ rm /tmp/bcl-convert.rpm

 RUN echo "bcl-convert: \"$(bcl-convert -V 2>&1 >/dev/null | sed -n '/Version/ s/^bcl-convert\ Version //p')\"" > /var/software_versions.txt

+LABEL org.opencontainers.image.authors="Toni Verbeiren, Dorien Roosen"
 LABEL org.opencontainers.image.description="Companion container for running component bcl_convert"
-LABEL org.opencontainers.image.created="2024-07-29T14:42:19Z"
+LABEL org.opencontainers.image.created="2024-07-29T14:45:25Z"
 LABEL org.opencontainers.image.source="https://github.com/viash-hub/biobox"
-LABEL org.opencontainers.image.revision="e6420cd80f226128b7223ff79ce1297f99993657"
+LABEL org.opencontainers.image.revision="28cd12293505544b3e09ff6343e4724dedb772d3"
 LABEL org.opencontainers.image.version="qualimap"

 VIASHDOCKER