CLI Reference#

A method for content-agnostic high-throughput identification of Gene Cluster Families (GCFs) from gene clusters of genomic and metagenomic origin.

usage: igua [-h] [--help-all] [-V] [-j N] [-g FILE] [-l FILE]
            [--antismash-gbk FILE] [--antismash-zip FILE]
            [--defense-finder-tsv FILE] [-o OUTPUT] [-C COMPOSITIONS]
            [-F FEATURES] [-w WORKDIR] [--prefix PREFIX]
            [--dedup-identity FLOAT] [--dedup-coverage FLOAT]
            [--dedup-evalue FLOAT] [--dedup-cluster-mode INT]
            [--dedup-coverage-mode INT] [--dedup-spaced-kmer-mode INT]
            [--nuc-identity FLOAT] [--nuc-coverage FLOAT] [--nuc-evalue FLOAT]
            [--nuc-cluster-mode INT] [--nuc-coverage-mode INT]
            [--nuc-spaced-kmer-mode INT] [--prot-identity FLOAT]
            [--prot-coverage FLOAT] [--prot-evalue FLOAT]
            [--prot-coverage-mode INT]
            [--clustering-method {average,single,complete,weighted,centroid,median,ward,linclust,none}]
            [--clustering-distance CLUSTERING_DISTANCE]
            [--clustering-precision {half,single,double}]
            [--clustering-weight {protein,none}]

Named Arguments#

-V, --version

Display the program version and exit.

-j, --jobs

The number of threads to use in parallel sections.

Default: 2

Input#

Mandatory input files required by the pipeline.

-g, --gbk

Input path to GenBank files containing generic gene clusters.

Default: []

-l, --gbk-list

Path to a file containing a list of GenBank files to process.

Default: []
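
As a minimal sketch, a list file for `-l` could be generated from a directory of GenBank files as follows. The one-path-per-line layout is an assumption about what `--gbk-list` expects, and the helper name `write_gbk_list` and the `*.gbk` pattern are hypothetical.

```python
from pathlib import Path


def write_gbk_list(directory, list_path, pattern="*.gbk"):
    """Write the paths of matching GenBank files to `list_path`.

    Assumption: the list file holds one path per line.
    """
    # Sort the paths so the list is reproducible between runs.
    paths = sorted(Path(directory).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting file could then be passed as `igua -l <list file>`.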

--antismash-gbk

Input path to GenBank files containing antiSMASH predictions.

Default: []

--antismash-zip

Input path to Zip files containing antiSMASH predictions.

Default: []

--defense-finder-tsv

Input path to TSV metadata file for DefenseFinder predictions.

Default: []

Output#

Output files generated by the pipeline.

-o, --output

The name of the output table to generate.

Default: gcfs.tsv

-C, --compositions

The file in which to write compositional data for GCF representatives.

-F, --features

The file in which to write protein cluster representatives in FASTA format.
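
The input and output options above can be combined into a single invocation. The sketch below only assembles the argument list and does not run the program. It assumes that `-g` may be repeated once per file (suggested by its list default, but not stated in this reference). The helper name `build_igua_command` and all file names are hypothetical.

```python
import shlex


def build_igua_command(gbk_files, output="gcfs.tsv", compositions=None,
                       features=None, jobs=2):
    """Assemble an `igua` argument list from the options documented above."""
    argv = ["igua", "-j", str(jobs)]
    # Assumption: one -g flag per GenBank file.
    for path in gbk_files:
        argv += ["-g", str(path)]
    argv += ["-o", str(output)]
    # -C and -F have no documented default, so they are only added on request.
    if compositions is not None:
        argv += ["-C", str(compositions)]
    if features is not None:
        argv += ["-F", str(features)]
    return argv


# Hypothetical file names, for illustration only.
argv = build_igua_command(["clusters1.gbk", "clusters2.gbk"],
                          features="features.fa", jobs=4)
print(shlex.join(argv))
```

The list can be handed to `subprocess.run(argv, check=True)` once `igua` is installed and on the `PATH`.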

Parameters#

General purpose parameters.

-w, --workdir

A folder to use for temporary data produced by MMseqs2.

--prefix

The prefix for GCF identifiers generated by the pipeline.

Default: 'GCF'

MMSeqs2 Deduplication#

Parameters for the first nucleotide clustering step (exact/near-exact deduplication).

--dedup-identity

Sequence identity threshold for deduplication step.

Default: 0.85

--dedup-coverage

Coverage threshold for deduplication step.

Default: 1.0

--dedup-evalue

E-value threshold for deduplication step.

Default: 0.001

--dedup-cluster-mode

Possible choices: 0, 1, 2

Clustering mode for deduplication: 0=SetCover, 1=Connected component, 2=Greedy incremental.

Default: 0

--dedup-coverage-mode

Possible choices: 0, 1, 2, 3, 4, 5

Coverage mode for deduplication: 0=target, 1=query, 2=both, 3=length of target, 4=length of query, 5=length of both.

Default: 1

--dedup-spaced-kmer-mode

Possible choices: 0, 1

Spaced k-mer mode for deduplication: 0=use ungapped k-mers, 1=use spaced k-mers.

Default: 0

MMSeqs2 Nucleotide Clustering#

Parameters for the second nucleotide clustering step (relaxed clustering of representatives).

--nuc-identity

Sequence identity threshold for nucleotide clustering step.

Default: 0.6

--nuc-coverage

Coverage threshold for nucleotide clustering step.

Default: 0.5

--nuc-evalue

E-value threshold for nucleotide clustering step.

Default: 0.001

--nuc-cluster-mode

Possible choices: 0, 1, 2

Clustering mode for nucleotide step: 0=SetCover, 1=Connected component, 2=Greedy incremental.

Default: 0

--nuc-coverage-mode

Possible choices: 0, 1, 2, 3, 4, 5

Coverage mode for nucleotide step: 0=target, 1=query, 2=both, 3=length of target, 4=length of query, 5=length of both.

Default: 0

--nuc-spaced-kmer-mode

Possible choices: 0, 1

Spaced k-mer mode for nucleotide step: 0=use ungapped k-mers, 1=use spaced k-mers.

Default: 0

MMSeqs2 Protein Clustering#

Parameters for the protein clustering step (used for compositional analysis).

--prot-identity

Sequence identity threshold for protein clustering step.

Default: 0.5

--prot-coverage

Coverage threshold for protein clustering step.

Default: 0.9

--prot-evalue

E-value threshold for protein clustering step.

Default: 0.001

--prot-coverage-mode

Possible choices: 0, 1, 2, 3, 4, 5

Coverage mode for protein step: 0=target, 1=query, 2=both, 3=length of target, 4=length of query, 5=length of both.

Default: 1
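
The three MMseqs2 steps share a naming scheme (`--dedup-*`, `--nuc-*`, `--prot-*`), so overrides can be generated from one table of defaults. The sketch below copies the defaults listed above and emits flags only for values that differ from them. The helper name `step_flags` is hypothetical, and the check that identity and coverage lie in [0, 1] is an assumption based on how MMseqs2 treats these thresholds.

```python
# Defaults copied from the three MMseqs2 sections of this reference.
DEFAULTS = {
    "dedup": {"identity": 0.85, "coverage": 1.0, "evalue": 0.001,
              "cluster-mode": 0, "coverage-mode": 1, "spaced-kmer-mode": 0},
    "nuc": {"identity": 0.6, "coverage": 0.5, "evalue": 0.001,
            "cluster-mode": 0, "coverage-mode": 0, "spaced-kmer-mode": 0},
    "prot": {"identity": 0.5, "coverage": 0.9, "evalue": 0.001,
             "coverage-mode": 1},
}


def step_flags(step, **overrides):
    """Return CLI flags for the overrides that differ from the defaults."""
    defaults = DEFAULTS[step]
    flags = []
    for name, value in overrides.items():
        key = name.replace("_", "-")
        if key not in defaults:
            raise ValueError(f"--{step}-{key} is not a documented option")
        # Assumption: identity and coverage are fractions between 0 and 1.
        if key in ("identity", "coverage") and not 0.0 <= value <= 1.0:
            raise ValueError(f"--{step}-{key} must be between 0 and 1")
        if value != defaults[key]:
            flags += [f"--{step}-{key}", str(value)]
    return flags


print(step_flags("nuc", identity=0.7, coverage_mode=1))
```

Options that exist only for some steps, such as the cluster mode (absent from the protein step), are rejected rather than passed through silently.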

Clustering#

Parameters to control the hierarchical clustering.

--clustering-method

Possible choices: average, single, complete, weighted, centroid, median, ward, linclust, none

The hierarchical clustering method to use for protein-level clustering.

Default: 'average'

--clustering-distance

The distance threshold after which to stop merging clusters.

Default: 0.8
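
To illustrate how a distance threshold stops the merging, here is a naive pure-Python sketch of average-linkage agglomerative clustering. It is not IGUA's implementation. The function name `average_linkage` and the toy distance matrix are made up for illustration, and allowing merges at a distance equal to the threshold is an assumption.

```python
from itertools import combinations


def average_linkage(dist, threshold):
    """Merge clusters until the closest pair is farther apart than `threshold`.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Returns the clusters as sorted lists of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = avg(clusters[x], clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break  # every remaining pair is too distant to merge
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy matrix: items 0 and 1 are close, items 2 and 3 are close.
toy = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(average_linkage(toy, threshold=0.8))
```

With a threshold of 0.8 the two tight pairs stay separate. A threshold above 0.9 would merge all four items into one cluster.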

--clustering-precision

Possible choices: half, single, double

The numerical precision to use for computing distances for hierarchical clustering.

Default: 'double'
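
The trade-off behind the precision choice can be seen by storing one distance value at each IEEE 754 width and reading it back. This is a generic illustration using Python's `struct` module. How IGUA stores distances internally is not described here, so mapping its three choices to these formats is an assumption.

```python
import struct


def round_trip(value, fmt):
    """Store a float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


distance = 0.8
# "e" = 16-bit half, "f" = 32-bit single, "d" = 64-bit double.
for name, fmt in [("half", "e"), ("single", "f"), ("double", "d")]:
    stored = round_trip(distance, fmt)
    print(f"{name:>6}: {stored!r} (error {abs(stored - distance):.2e})")
```

Narrower formats use less memory per distance but add rounding error, which can matter for distances that lie close to the clustering threshold.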

--clustering-weight

Possible choices: protein, none

The method to use to weigh dimensions in the distance matrix for hierarchical clustering.

Default: 'protein'