Pipeline#

class igua.pipeline.PipelineParameters(nuc1, nuc2, prot)#

The parameters of the IGUA pipeline.

nuc1#

A dictionary of parameters to pass to MMseqs2 for stage 1 (nucleotide deduplication).

Type:

dict

nuc2#

A dictionary of parameters to pass to MMseqs2 for stage 2 (nucleotide clustering).

Type:

dict

prot#

A dictionary of parameters to pass to MMseqs2 for stage 3 (protein clustering).

Type:

dict

classmethod default()#

Create new default clustering parameters.

Returns:

PipelineParameters – A new pipeline parameters object with all parameters initialized to defaults.

class igua.pipeline.PipelineResult(gcfs, compositions)#

The results of the IGUA clustering pipeline.

gcfs#

A dataframe with columns cluster_id, cluster_length, gcf_id, gcf_representative, nucleotide_representative, fragment_representative and filename summarizing the generated gene cluster families at each stage.

Type:

pandas.DataFrame

compositions#

The protein compositions of the stage 2 representative clusters generated by IGUA. May be None if no clustering strategy was requested, or if stage 2 produced only a singleton.

Type:

anndata.AnnData or None

class igua.pipeline.Pipeline(strategy=<igua.clustering.HierarchicalClustering object>, params=None, *, prefix='GCF', jobs=1, weight='protein', mmseqs=None, progress=None, workdir=None)#

The IGUA multi-stage clustering pipeline.

__init__(strategy=<igua.clustering.HierarchicalClustering object>, params=None, *, prefix='GCF', jobs=1, weight='protein', mmseqs=None, progress=None, workdir=None)#

Create a new pipeline.

Parameters:
  • strategy (ClusteringStrategy or None) – The clustering strategy used to form the final clusters from the protein compositions in stage 3 of the pipeline. Pass None to disable stage 3 altogether.

  • params (PipelineParameters or None) – An object with the parameters to pass to MMseqs2 for each stage of the pipeline.

  • prefix (str) – The prefix to use for the GCF identifiers.

  • jobs (int) – The number of threads to use with MMseqs2 and the distance computation in stage 3.

  • weight (str or None) – How to weight dimensions in the protein composition matrix. Use "protein" to weight by protein length, or None to disable weighting.

  • mmseqs (MMseqs) – The MMseqs2 driver to use to conduct linear clustering in stages 1 and 2 and protein clustering in stage 3.

  • progress (Progress or None) – A rich progress object to use for reporting progress.

  • workdir (pathlib.Path or None) – The path to a folder to use for temporary files. If None, a temporary folder is created with tempfile.TemporaryDirectory on each invocation of run.
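To illustrate the weight parameter, here is a minimal sketch of one plausible reading of weighting by protein length: each column of a count matrix is scaled by the length of the protein it stands for. The function name, the list-of-lists matrix and the example lengths are all hypothetical; how IGUA applies the weights internally is not specified here.

```python
def weight_by_protein_length(counts, protein_lengths):
    """Scale each column of `counts` by the matching protein length.

    Hypothetical helper, not part of IGUA: `counts` is a list of rows
    (one per gene cluster) and `protein_lengths` has one entry per column.
    """
    return [
        [count * length for count, length in zip(row, protein_lengths)]
        for row in counts
    ]


# Two gene clusters described by two protein dimensions (made-up values).
counts = [[2, 0], [1, 1]]
protein_lengths = [300, 120]
weighted = weight_by_protein_length(counts, protein_lengths)
```

With weight set to None, the equivalent of this step would simply be skipped and the raw counts used as they are.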

_extract_clusters_to_file(dataset, output)#

Extract the dataset clusters to the given file.

Parameters:
  • dataset (BaseDataset) – The dataset containing the gene clusters to extract.

  • output (pathlib.Path) – The path to the output file to write the gene cluster sequences to.

Returns:

pandas.DataFrame – A table with columns cluster_id and cluster_length reporting the ID and length of each extracted cluster.
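As an illustration of this kind of extraction, the sketch below writes sequences in two-line FASTA form (one header line followed by the whole sequence on one line) and collects the ID and length of each record. It is a stand-in built on plain Python: the function name and the (cluster_id, sequence) input pairs are assumptions, and the actual method reads from a BaseDataset and returns a pandas.DataFrame.

```python
import tempfile
from pathlib import Path


def write_two_line_fasta(records, output):
    """Write (cluster_id, sequence) pairs as two-line FASTA.

    Hypothetical helper, not part of IGUA. Returns one summary row per
    record with the `cluster_id` and `cluster_length` keys.
    """
    rows = []
    with Path(output).open("w") as dst:
        for cluster_id, sequence in records:
            # One header line, then the entire sequence on a single line.
            dst.write(f">{cluster_id}\n{sequence}\n")
            rows.append({"cluster_id": cluster_id, "cluster_length": len(sequence)})
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clusters.fna"
    summary = write_two_line_fasta([("BGC1", "ATGCATGC"), ("BGC2", "ATGAAA")], path)
    fasta_text = path.read_text()
```

The summary rows can be passed to pandas.DataFrame to obtain a table shaped like the one described above.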

_extract_proteins_to_file(dataset, clusters, output)#

Extract the dataset proteins to the given file.

Parameters:
  • dataset (BaseDataset) – The dataset containing the gene clusters to extract and their proteins.

  • clusters (Collection of str) – A collection of clusters from which to extract proteins (typically the clusters selected as representatives in the previous step).

  • output (pathlib.Path) – The path to the output file to write proteins to.

Returns:

pandas.DataFrame – A table with columns protein_id, protein_length and cluster_id reporting the ID, length and source cluster of the extracted proteins.

_extract_representatives(gcfs2)#

Extract the representative clusters in gcfs2 into an index.

Parameters:

gcfs2 (pandas.DataFrame) – The clustering results of the second stage.

Returns:

dict of str to int – A dictionary mapping every unique representative cluster in gcfs2 to a single index.
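The index described here can be pictured with a short sketch. It assumes the stage 2 representatives are available as a plain sequence of IDs and numbers them in order of first appearance; the function name and the ordering are assumptions, since the source only states that every unique representative maps to a single index.

```python
def index_representatives(representatives):
    """Map each unique representative ID to one integer index.

    Hypothetical helper, not part of IGUA. Indices follow the order in
    which representatives first appear.
    """
    index = {}
    for rep in representatives:
        # setdefault leaves the index untouched for IDs already seen.
        index.setdefault(rep, len(index))
    return index


# One representative per input row, with repeats (made-up IDs).
stage2_representatives = ["BGC1", "BGC1", "BGC7", "BGC3", "BGC7"]
index = index_representatives(stage2_representatives)
```

Such an index gives each representative a stable row position in the compositional matrix built later.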

_hierarchical_clustering(compositions)#

Perform hierarchical (or linear) clustering on the given data.

Parameters:

compositions (anndata.AnnData) – A compositional matrix obtained from the protein clustering.

Returns:

pandas.DataFrame – A table mapping each row of compositions to a GCF with arbitrary identifiers.

_join_results(gcfs1, gcfs2, gcfs3, input_sequences)#

Join the results of the different stages to a single dataframe.
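The join can be thought of as following each input cluster through the three stages. The sketch below uses plain dictionaries in place of dataframes and assumes that stage 1 yields the fragment_representative, stage 2 the nucleotide_representative and stage 3 the gcf_id; this stage-to-column mapping, the function name and the GCF identifier format are assumptions made for illustration.

```python
def join_stages(stage1, stage2, stage3):
    """Chain per-stage assignments into one row per input cluster.

    Hypothetical helper, not part of IGUA.
    stage1: cluster_id -> fragment_representative
    stage2: fragment_representative -> nucleotide_representative
    stage3: nucleotide_representative -> gcf_id
    """
    rows = []
    for cluster_id, fragment_rep in stage1.items():
        nucleotide_rep = stage2[fragment_rep]
        rows.append(
            {
                "cluster_id": cluster_id,
                "fragment_representative": fragment_rep,
                "nucleotide_representative": nucleotide_rep,
                "gcf_id": stage3[nucleotide_rep],
            }
        )
    return rows


# Three input clusters collapsing into a single family (made-up data).
stage1 = {"A": "A", "B": "A", "C": "C"}
stage2 = {"A": "A", "C": "A"}
stage3 = {"A": "GCF0000001"}  # identifier format is illustrative only
joined = join_stages(stage1, stage2, stage3)
```

With dataframes, the same chaining would typically be expressed as successive merges on the representative columns.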

_make_compositions(protein_clusters, representatives, protein_representatives)#

Build a compositional matrix from the given protein clusters.

Parameters:
  • protein_clusters (pandas.DataFrame) – The clustering of proteins returned by MMseqs2.

  • representatives (dict of str to int) – The index mapping gene cluster IDs to their index in the compositional table.

  • protein_representatives (dict of str to int) – The index mapping protein IDs to their index in the compositional table.

Returns:

anndata.AnnData – An annotated compositional matrix where each cluster is represented as a combination of protein counts.
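The idea of a compositional matrix can be sketched with plain Python lists: one row per representative gene cluster, one column per protein cluster, and each cell counting how many proteins of that gene cluster fall into that protein cluster. The tuple layout of protein_clusters and the dense list-of-lists result are assumptions; the actual method takes a dataframe and returns an anndata.AnnData.

```python
def make_compositions(protein_clusters, representatives, protein_representatives):
    """Count proteins per (gene cluster, protein cluster) pair.

    Hypothetical helper, not part of IGUA. `protein_clusters` is assumed
    to be an iterable of (protein_representative, protein_id, cluster_id)
    tuples.
    """
    matrix = [[0] * len(protein_representatives) for _ in representatives]
    for protein_rep, _protein_id, cluster_id in protein_clusters:
        row = representatives[cluster_id]
        col = protein_representatives[protein_rep]
        matrix[row][col] += 1
    return matrix


# Made-up example: four proteins, two protein clusters, two gene clusters.
representatives = {"BGC1": 0, "BGC2": 1}
protein_representatives = {"P1": 0, "P4": 1}
protein_clusters = [
    ("P1", "P1", "BGC1"),
    ("P1", "P2", "BGC1"),
    ("P1", "P3", "BGC2"),
    ("P4", "P4", "BGC2"),
]
compositions = make_compositions(
    protein_clusters, representatives, protein_representatives
)
```

Here BGC1 holds two proteins from the P1 cluster, while BGC2 holds one protein from each protein cluster.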

_run_nucleotide_clustering(workdir, step1)#

Run the nucleotide clustering stage of the pipeline.

Parameters:
  • workdir (pathlib.Path) – The path to the temporary directory to use for processing data.

  • step1 (Clustering) – The MMseqs clustering file created at the previous stage.

Returns:

pandas.DataFrame – The result of the clustering on the representative sequences of the step1 clustering.

_run_nucleotide_deduplication(workdir, clusters_fna, mmseqs)#

Run the nucleotide deduplication stage of the pipeline.

Parameters:
  • workdir (pathlib.Path) – The path to the temporary directory to use for processing data.

  • clusters_fna (pathlib.Path) – The path to a two-line FASTA file containing the gene clusters to process.

Returns:

(Clustering, pandas.DataFrame) – The clustering file created by MMseqs, and a dataframe containing a summary of the result.

_run_protein_clustering(workdir, dataset, representatives, mmseqs)#

Run protein clustering and build a compositional matrix.

Parameters:
  • workdir (pathlib.Path) – The path to the temporary directory to use for processing data.

  • dataset (BaseDataset) – The dataset containing the gene clusters to extract and their proteins.

  • representatives (collections.abc.Collection of str) – A collection of gene cluster IDs to extract proteins from.

Returns:

anndata.AnnData – A compositional matrix derived from the protein clustering. See Pipeline._make_compositions for more information.

run(dataset)#

Run the clustering pipeline on the given dataset.

Parameters:

dataset (BaseDataset) – A dataset containing the gene clusters to cluster into families.

Returns:

PipelineResult – The results of the pipeline. See class-level documentation for more information.
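Putting the documented pieces together, a run could look like the following sketch. It uses only the names listed on this page (Pipeline, PipelineParameters.default, run, and the gcfs and compositions attributes); the wrapper function itself is hypothetical, building the dataset is left to the caller, and it is assumed that leaving mmseqs at its default of None is acceptable.

```python
def cluster_dataset(dataset, jobs=1, prefix="GCF"):
    """Run the IGUA pipeline on `dataset` with default MMseqs2 parameters.

    Hypothetical convenience wrapper, not part of IGUA. `dataset` is
    expected to be a BaseDataset prepared by the caller.
    """
    # Imported lazily so this function can be defined without IGUA installed.
    from igua.pipeline import Pipeline, PipelineParameters

    params = PipelineParameters.default()
    pipeline = Pipeline(params=params, prefix=prefix, jobs=jobs)
    result = pipeline.run(dataset)
    # `compositions` may be None, as described for PipelineResult.
    return result.gcfs, result.compositions
```

Callers should check whether the returned compositions value is None before using it, since stage 3 may be disabled or skipped.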