biobox/src/arriba/help.txt

```bash
arriba -h
```

Arriba gene fusion detector
---------------------------
Version: 2.4.0

Arriba is a fast tool to search for aberrant transcripts such as gene fusions.
It is based on chimeric alignments found by the STAR RNA-Seq aligner.

Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
              -g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
              [-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
              -o fusions.tsv [-O fusions.discarded.tsv] \
              [OPTIONS]

 -c FILE  File in SAM/BAM/CRAM format with chimeric alignments as generated by STAR
          (Chimeric.out.sam). This parameter is only required if STAR was run with the
          parameter '--chimOutType SeparateSAMold'. When STAR was run with the parameter
          '--chimOutType WithinBAM', it suffices to pass the parameter -x to Arriba and -c
          can be omitted.

 -x FILE  File in SAM/BAM/CRAM format with main alignments as generated by STAR
          (Aligned.out.sam). Arriba extracts candidate reads from this file.
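
The two input modes described for -c and -x can be sketched as two small wrapper functions. This is only an illustration: the function names, the ARRIBA override variable, and all file names are assumptions made for the example, not part of Arriba. Setting ARRIBA=echo previews the command line without running the tool.

```bash
# Hypothetical override: set ARRIBA=echo to preview the command instead of running it.
ARRIBA=${ARRIBA:-arriba}

# STAR was run with '--chimOutType WithinBAM': -x alone is sufficient.
# $1 = main alignments (e.g. Aligned.out.bam)
arriba_within_bam() {
    "$ARRIBA" -x "$1" -g annotation.gtf -a assembly.fa \
        -o fusions.tsv -O fusions.discarded.tsv
}

# STAR was run with '--chimOutType SeparateSAMold': -c must be passed as well.
# $1 = chimeric alignments (e.g. Chimeric.out.sam), $2 = main alignments
arriba_separate_sam() {
    "$ARRIBA" -c "$1" -x "$2" -g annotation.gtf -a assembly.fa \
        -o fusions.tsv -O fusions.discarded.tsv
}
```

A call would then look like `arriba_within_bam Aligned.out.bam` or `arriba_separate_sam Chimeric.out.sam Aligned.out.bam`, with the annotation and assembly paths adjusted to the local setup.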

 -g FILE  GTF file with gene annotation. The file may be gzip-compressed.

 -G GTF_FEATURES  Comma-/space-separated list of names of GTF features.
                  Default: gene_name=gene_name|gene_id gene_id=gene_id
                  transcript_id=transcript_id feature_exon=exon feature_CDS=CDS

 -a FILE  FastA file with genome sequence (assembly). The file may be gzip-compressed. An
          index with the file extension .fai must exist only if CRAM files are processed.

 -b FILE  File containing blacklisted events (recurrent artifacts and transcripts
          observed in healthy tissue).

 -k FILE  File containing known/recurrent fusions. Some cancer entities are often
          characterized by fusions between the same pair of genes. In order to boost
          sensitivity, a list of known fusions can be supplied using this parameter. The list
          must contain two columns with the names of the fused genes, separated by tabs.
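
A file in the two-column layout described for -k could be written as follows. This is a minimal sketch: the gene pairs are examples chosen for illustration and the file name known_fusions.tsv is an assumption. The awk line only checks the column count, not whether the gene names exist in the annotation.

```bash
# Example gene pairs for illustration only; substitute your own list.
# printf reuses the format for each pair of arguments, giving one line per fusion.
printf '%s\t%s\n' \
    BCR ABL1 \
    EWSR1 FLI1 \
    TMPRSS2 ERG > known_fusions.tsv

# Sanity check: exit non-zero if any line does not have exactly two tab-separated fields.
awk -F '\t' 'NF != 2 { bad = 1 } END { exit bad }' known_fusions.tsv
```

The resulting file would be passed as `-k known_fusions.tsv`.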

 -o FILE  Output file with fusions that have passed all filters.

 -O FILE  Output file with fusions that were discarded due to filtering.

 -t FILE  Tab-separated file containing fusions to annotate with tags in the 'tags' column.
          The first two columns specify the genes; the third column specifies the tag. The
          file may be gzip-compressed.

 -p FILE  File in GFF3 format containing coordinates of the protein domains of genes. The
          protein domains retained in a fusion are listed in the column
          'retained_protein_domains'. The file may be gzip-compressed.

 -d FILE  Tab-separated file with coordinates of structural variants found using
          whole-genome sequencing data. These coordinates serve to increase sensitivity
          towards weakly expressed fusions and to eliminate fusions with low evidence.

 -D MAX_GENOMIC_BREAKPOINT_DISTANCE  When a file with genomic breakpoints obtained via
                                     whole-genome sequencing is supplied via the -d
                                     parameter, this parameter determines how far a
                                     genomic breakpoint may be away from a
                                     transcriptomic breakpoint to consider it as a
                                     related event. For events inside genes, the
                                     distance is added to the end of the gene; for
                                     intergenic events, the distance threshold is
                                     applied as is. Default: 100000

 -s STRANDEDNESS  Whether a strand-specific protocol was used for library preparation,
                  and if so, the type of strandedness (auto/yes/no/reverse). When
                  unstranded data is processed, the strand can sometimes be inferred from
                  splice-patterns. But in unclear situations, stranded data helps
                  resolve ambiguities. Default: auto

 -i CONTIGS  Comma-/space-separated list of interesting contigs. Fusions between genes
             on other contigs are ignored. Contigs can be specified with or without the
             prefix "chr". Asterisks (*) are treated as wild-cards.
             Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*
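
The matching rules described for -i (optional "chr" prefix, asterisk as wild-card) can be illustrated with shell pattern matching. This sketch only mimics the documented behavior and is not Arriba's own code; the function name matches_contig, the variable interesting_contigs, and the shortened contig list are assumptions made for the example.

```bash
# Illustrative only: mimics the documented matching rules, not Arriba's own code.
# Shortened stand-in for the default -i list.
interesting_contigs="1 2 3 X Y AC_* NC_*"

matches_contig() {
    name=${1#chr}                  # the "chr" prefix is optional on the contig
    set -f                         # stop the shell from expanding * against file names
    for pattern in $interesting_contigs; do
        pattern=${pattern#chr}     # ... and on the list entry
        case $name in
            $pattern) set +f; return 0 ;;   # unquoted, so * acts as a wild-card
        esac
    done
    set +f
    return 1
}

# Report which of a few sample contig names would be kept.
for contig in chr1 X NC_001526.4 chrUn_gl000220; do
    if matches_contig "$contig"; then
        echo "$contig: kept"
    else
        echo "$contig: ignored"
    fi
done
```

Note that a pattern must match the whole name: with the shortened list above, contig 11 does not match the entry 1.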

 -v CONTIGS  Comma-/space-separated list of viral contigs. Asterisks (*) are treated as
             wild-cards.
             Default: AC_* NC_*

 -f FILTERS  Comma-/space-separated list of filters to disable. By default all filters are
             enabled. Valid values: homologs, low_entropy, isoforms,
             top_expressed_viral_contigs, viral_contigs, uninteresting_contigs,
             non_coding_neighbors, mismatches, duplicates, no_genomic_support,
             genomic_support, intronic, end_to_end, relative_support,
             low_coverage_viral_contigs, merge_adjacent, mismappers, multimappers,
             same_gene, long_gap, internal_tandem_duplication, small_insert_size,
             read_through, inconsistently_clipped, intragenic_exonic,
             marginal_read_through, spliced, hairpin, blacklist, min_support,
             select_best, in_vitro, short_anchor, known_fusions, no_coverage,
             homopolymer, many_spliced

 -E MAX_E-VALUE  Arriba estimates the number of fusions with a given number of supporting
                 reads which one would expect to see by random chance. If the expected number
                 of fusions (e-value) is higher than this threshold, the fusion is
                 discarded by the 'relative_support' filter. Note: Increasing this
                 threshold can dramatically increase the number of false positives and may
                 increase the runtime of resource-intensive steps. Fractional values are
                 possible. Default: 0.300000

 -S MIN_SUPPORTING_READS  The 'min_support' filter discards all fusions with fewer than
                          this many supporting reads (split reads and discordant mates
                          combined). Default: 2

 -m MAX_MISMAPPERS  When more than this fraction of supporting reads turns out to be
                    mismappers, the 'mismappers' filter discards the fusion. Default:
                    0.800000

 -L MAX_HOMOLOG_IDENTITY  Genes with more than the given fraction of sequence identity are
                          considered homologs and removed by the 'homologs' filter.
                          Default: 0.300000

 -H HOMOPOLYMER_LENGTH  The 'homopolymer' filter removes breakpoints adjacent to
                        homopolymers of the given length or more. Default: 6

 -R READ_THROUGH_DISTANCE  The 'read_through' filter removes read-through fusions
                           where the breakpoints are less than the given distance away
                           from each other. Default: 10000

 -A MIN_ANCHOR_LENGTH  Alignment artifacts are often characterized by split reads coming
                       from only one gene and no discordant mates. Moreover, the split
                       reads only align to a short stretch in one of the genes. The
                       'short_anchor' filter removes these fusions. This parameter sets
                       the threshold in bp for what the filter considers short. Default: 23

 -M MANY_SPLICED_EVENTS  The 'many_spliced' filter recovers fusions between genes that
                         have at least this many spliced breakpoints. Default: 4

 -K MAX_KMER_CONTENT  The 'low_entropy' filter removes reads with repetitive 3-mers. If
                      the 3-mers make up more than the given fraction of the sequence, then
                      the read is discarded. Default: 0.600000

 -V MAX_MISMATCH_PVALUE  The 'mismatches' filter uses a binomial model to calculate a
                         p-value for observing a given number of mismatches in a read. If
                         the number of mismatches is too high, the read is discarded.
                         Default: 0.010000

 -F FRAGMENT_LENGTH  When paired-end data is given, the fragment length is estimated
                     automatically and this parameter has no effect. But when single-end
                     data is given, the mean fragment length should be specified to
                     effectively filter fusions that arise from hairpin structures.
                     Default: 200

 -U MAX_READS  Subsample fusions with more than the given number of supporting reads. This
               improves performance without compromising sensitivity, as long as the
               threshold is high. Counting of supporting reads beyond the threshold is
               inaccurate, obviously. Default: 300

 -Q QUANTILE  Highly expressed genes are prone to produce artifacts during library
              preparation. Genes with an expression above the given quantile are eligible
              for filtering by the 'in_vitro' filter. Default: 0.998000

 -e EXONIC_FRACTION  The breakpoints of false-positive predictions of intragenic events
                     are often both in exons. True predictions are more likely to have at
                     least one breakpoint in an intron, because introns are larger. If the
                     fraction of exonic sequence between two breakpoints is smaller than
                     the given fraction, the 'intragenic_exonic' filter discards the
                     event. Default: 0.330000

 -T TOP_N  Only report viral integration sites of the top N most highly expressed viral
           contigs. Default: 5

 -C COVERED_FRACTION  Ignore virally associated events if the virus is not fully
                      expressed, i.e., less than the given fraction of the viral contig is
                      transcribed. Default: 0.050000

 -l MAX_ITD_LENGTH  Maximum length of internal tandem duplications. Note: Increasing
                    this value beyond the default can impair performance and lead to many
                    false positives. Default: 100

 -z MIN_ITD_ALLELE_FRACTION  Required fraction of supporting reads to report an internal
                             tandem duplication. Default: 0.070000

 -Z MIN_ITD_SUPPORTING_READS  Required absolute number of supporting reads to report an
                              internal tandem duplication. Default: 10

 -u  Instead of performing duplicate marking itself, Arriba relies on duplicate marking by a
     preceding program using the BAM_FDUP flag. This makes sense when unique molecular
     identifiers (UMI) are used.

 -X  To reduce the runtime and file size, by default, the columns 'fusion_transcript',
     'peptide_sequence', and 'read_identifiers' are left empty in the file containing
     discarded fusion candidates (see parameter -O). When this flag is set, this extra
     information is reported in the discarded fusions file.

 -I  If assembly of the fusion transcript sequence from the supporting reads is incomplete
     (denoted as '...'), fill the gaps using the assembly sequence wherever possible.

 -h  Print help and exit.

         Code repository: https://github.com/suhrig/arriba
    Get help/report bugs: https://github.com/suhrig/arriba/issues
             User manual: https://arriba.readthedocs.io/
             Please cite: https://doi.org/10.1101/gr.257246.119