CLI Reference
A method for content-agnostic high-throughput identification of Gene Cluster Families (GCFs) from gene clusters of genomic and metagenomic origin.
usage: igua [-h] [--help-all] [-V] [-j N] [-g FILE] [-l FILE]
            [--antismash-gbk FILE] [--antismash-zip FILE]
            [--defense-finder-tsv FILE] [-o OUTPUT] [-C COMPOSITIONS]
            [-F FEATURES] [-w WORKDIR] [--prefix PREFIX]
            [--dedup-identity FLOAT] [--dedup-coverage FLOAT]
            [--dedup-evalue FLOAT] [--dedup-cluster-mode INT]
            [--dedup-coverage-mode INT] [--dedup-spaced-kmer-mode INT]
            [--nuc-identity FLOAT] [--nuc-coverage FLOAT] [--nuc-evalue FLOAT]
            [--nuc-cluster-mode INT] [--nuc-coverage-mode INT]
            [--nuc-spaced-kmer-mode INT] [--prot-identity FLOAT]
            [--prot-coverage FLOAT] [--prot-evalue FLOAT]
            [--prot-coverage-mode INT]
            [--clustering-method {average,single,complete,weighted,centroid,median,ward,linclust,none}]
            [--clustering-distance CLUSTERING_DISTANCE]
            [--clustering-precision {half,single,double}]
            [--clustering-weight {protein,none}]
Named Arguments
- -V, --version
Display the program version and exit.
- -j, --jobs
The number of threads to use in parallel sections.
Default:
2
Input
Input files required by the pipeline.
- -g, --gbk
Input path to GenBank files containing generic gene clusters.
Default:
[]
- -l, --gbk-list
Path to a file containing a list of GenBank files to process.
Default:
[]
- --antismash-gbk
Input path to GenBank files containing antiSMASH predictions.
Default:
[]
- --antismash-zip
Input path to Zip files containing antiSMASH predictions.
Default:
[]
- --defense-finder-tsv
Input path to TSV metadata file for DefenseFinder predictions.
Default:
[]
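The input options above can also be assembled from a script. The following is a minimal Python sketch, not part of IGUA itself: `render_gbk_list` and `build_igua_command` are hypothetical helper names, and the one-path-per-line layout of the `--gbk-list` file is an assumption, since this reference does not specify the format. Only flags documented on this page are used.

```python
def render_gbk_list(gbk_paths):
    """Return the text of a list file for --gbk-list.

    Assumption: the list file holds one GenBank path per line.
    """
    return "".join(f"{path}\n" for path in gbk_paths)


def build_igua_command(list_path, output="gcfs.tsv", jobs=2):
    """Return an argv list for an igua run that reads inputs from a list file.

    The defaults for --output and --jobs mirror the ones documented here.
    """
    return [
        "igua",
        "--gbk-list", str(list_path),
        "--output", str(output),
        "--jobs", str(jobs),
    ]
```

Writing the result of `render_gbk_list(...)` to a file and passing the list returned by `build_igua_command(...)` to `subprocess.run(argv, check=True)` would launch the pipeline, provided `igua` is installed and available on the `PATH`.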
Output
Output files generated by the pipeline.
- -o, --output
The name of the output table to generate.
Default:
gcfs.tsv
- -C, --compositions
A file to which compositional data for GCF representatives is written.
- -F, --features
A file to which protein cluster representatives are written in FASTA format.
Parameters
General purpose parameters.
- -w, --workdir
A folder to use for temporary data produced by MMseqs2.
- --prefix
The prefix for GCF identifiers generated by the pipeline.
Default:
'GCF'
MMseqs2 Deduplication
Parameters for the first nucleotide clustering step (exact/near-exact deduplication).
- --dedup-identity
Sequence identity threshold for deduplication step.
Default:
0.85
- --dedup-coverage
Coverage threshold for deduplication step.
Default:
1.0
- --dedup-evalue
E-value threshold for deduplication step.
Default:
0.001
- --dedup-cluster-mode
Possible choices: 0, 1, 2
Clustering mode for deduplication: 0=SetCover, 1=Connected component, 2=Greedy incremental.
Default:
0
- --dedup-coverage-mode
Possible choices: 0, 1, 2, 3, 4, 5
Coverage mode for deduplication: 0=target, 1=query, 2=both, 3=length of target, 4=length of query, 5=length of both.
Default:
1
- --dedup-spaced-kmer-mode
Possible choices: 0, 1
Spaced k-mer mode for deduplication: 0=use ungapped k-mers, 1=use spaced k-mers.
Default:
0
MMseqs2 Nucleotide Clustering
Parameters for the second nucleotide clustering step (relaxed clustering of representatives).
- --nuc-identity
Sequence identity threshold for nucleotide clustering step.
Default:
0.6
- --nuc-coverage
Coverage threshold for nucleotide clustering step.
Default:
0.5
- --nuc-evalue
E-value threshold for nucleotide clustering step.
Default:
0.001
- --nuc-cluster-mode
Possible choices: 0, 1, 2
Clustering mode for nucleotide step: 0=SetCover, 1=Connected component, 2=Greedy incremental.
Default:
0
- --nuc-coverage-mode
Possible choices: 0, 1, 2, 3, 4, 5
Coverage mode for nucleotide step: 0=target, 1=query, 2=both, 3=length of target, 4=length of query, 5=length of both.
Default:
0
- --nuc-spaced-kmer-mode
Possible choices: 0, 1
Spaced k-mer mode for nucleotide step: 0=use ungapped k-mers, 1=use spaced k-mers.
Default:
0
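To make the difference between the two nucleotide steps concrete, the sketch below compares their default identity, coverage, and E-value thresholds on a single hypothetical alignment. It is a deliberately simplified illustration: `Thresholds` and `passes` are hypothetical names, the example alignment values are made up, and the real filtering is performed inside MMseqs2, where the meaning of coverage also depends on the selected coverage mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    identity: float  # minimum sequence identity (fraction)
    coverage: float  # minimum alignment coverage (fraction)
    evalue: float    # maximum E-value


# Default values copied from this reference.
DEDUP = Thresholds(identity=0.85, coverage=1.0, evalue=0.001)
NUC = Thresholds(identity=0.6, coverage=0.5, evalue=0.001)


def passes(identity, coverage, evalue, thresholds):
    """Return True if a (hypothetical) alignment meets all three thresholds."""
    return (
        identity >= thresholds.identity
        and coverage >= thresholds.coverage
        and evalue <= thresholds.evalue
    )


# A partial but highly similar alignment: 90% identity over 70% of the sequence.
hit = dict(identity=0.9, coverage=0.7, evalue=1e-10)
print("deduplication:", passes(**hit, thresholds=DEDUP))
print("nucleotide clustering:", passes(**hit, thresholds=NUC))
```

Under these assumptions the example alignment is rejected by the strict deduplication defaults, because coverage is below 1.0, but accepted by the relaxed nucleotide clustering defaults.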
MMseqs2 Protein Clustering
Parameters for protein clustering step (used for compositional analysis).
- --prot-identity
Sequence identity threshold for protein clustering step.
Default:
0.5
- --prot-coverage
Coverage threshold for protein clustering step.
Default:
0.9
- --prot-evalue
E-value threshold for protein clustering step.
Default:
0.001
- --prot-coverage-mode
Possible choices: 0, 1, 2, 3, 4, 5
Coverage mode for protein step: 0=target, 1=query, 2=both, 3=length of target, 4=length of query, 5=length of both.
Default:
1
Clustering
Parameters to control the hierarchical clustering.
- --clustering-method
Possible choices: average, single, complete, weighted, centroid, median, ward, linclust, none
The hierarchical clustering method to use for protein-level clustering.
Default:
'average'
- --clustering-distance
The distance threshold after which to stop merging clusters.
Default:
0.8
- --clustering-precision
Possible choices: half, single, double
The numerical precision to use for computing distances for hierarchical clustering.
Default:
'double'
- --clustering-weight
Possible choices: protein, none
The method to use to weigh dimensions in the distance matrix for hierarchical clustering.
Default:
'protein'
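As an illustration of how `--clustering-method average` and `--clustering-distance` interact, the following pure-Python sketch performs average-linkage (UPGMA) agglomerative clustering and stops merging once the closest pair of clusters is farther apart than the threshold. It is not IGUA's implementation: `average_linkage` is a hypothetical function, the distance matrix is made up, and treating the threshold as inclusive is an assumption.

```python
from itertools import combinations


def average_linkage(dist, threshold=0.8):
    """Merge clusters by average linkage until the closest pair exceeds `threshold`.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (x, y), best = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # no remaining pair is close enough to merge
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(clusters)


# Two tight pairs (0, 1) and (2, 3) that are far from each other.
example = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(average_linkage(example, threshold=0.8))
```

With the default threshold of 0.8 the two pairs remain separate clusters, because their average distance is 0.9. Raising the threshold to 1.0 would merge all four items into one cluster. This naive version recomputes all pairwise linkages at every step and is only suitable for small inputs.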