Pipeline
- class igua.pipeline.PipelineParameters(nuc1, nuc2, prot)
The parameters of the IGUA pipeline.
- nuc1
A dictionary of parameters to pass to MMseqs2 for stage 1 (nucleotide deduplication).
- Type:
dict
- nuc2
A dictionary of parameters to pass to MMseqs2 for stage 2 (nucleotide clustering).
- Type:
dict
- prot
A dictionary of parameters to pass to MMseqs2 for stage 3 (protein clustering).
- Type:
dict
- classmethod default()
Create new default clustering parameters.
- Returns:
PipelineParameters – A new pipeline parameters object with all parameters initialized to defaults.
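The shape of such a parameters object can be sketched with a plain dataclass. This is a stand-in and not the IGUA class: the name SketchParameters, the dictionary keys (min_seq_id, coverage) and their values are hypothetical placeholders. The sketch only illustrates why a default() classmethod would build fresh dictionaries on each call.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SketchParameters:
    """Hypothetical stand-in holding one parameter dictionary per stage."""

    nuc1: Dict[str, Any]  # stage 1: nucleotide deduplication
    nuc2: Dict[str, Any]  # stage 2: nucleotide clustering
    prot: Dict[str, Any]  # stage 3: protein clustering

    @classmethod
    def default(cls) -> "SketchParameters":
        # Placeholder keys and values, NOT the real IGUA defaults.
        # New dictionaries are created on every call, so editing one
        # object never changes the defaults seen by another.
        return cls(
            nuc1={"min_seq_id": 0.85, "coverage": 1.0},
            nuc2={"min_seq_id": 0.60, "coverage": 0.5},
            prot={"min_seq_id": 0.50, "coverage": 0.9},
        )


params = SketchParameters.default()
params.nuc1["min_seq_id"] = 0.95  # override a single stage 1 setting
untouched = SketchParameters.default()
```

Starting from the defaults and overriding individual entries keeps the three stages independently tunable.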
- class igua.pipeline.PipelineResult(gcfs, compositions)
The results of the IGUA clustering pipeline.
- gcfs
A dataframe with columns cluster_id, cluster_length, gcf_id, gcf_representative, nucleotide_representative, fragment_representative and filename, summarizing the generated gene cluster families at each stage.
- Type:
pandas.DataFrame
- class igua.pipeline.Pipeline(strategy=<igua.clustering.HierarchicalClustering object>, params=None, *, prefix='GCF', jobs=1, weight='protein', mmseqs=None, progress=None, workdir=None)
The IGUA multi-stage clustering pipeline.
- __init__(strategy=<igua.clustering.HierarchicalClustering object>, params=None, *, prefix='GCF', jobs=1, weight='protein', mmseqs=None, progress=None, workdir=None)
Create a new pipeline.
- Parameters:
strategy (ClusteringStrategy or None) – The clustering strategy used to form the final clusters from the protein compositions in stage 3 of the pipeline. Pass None to disable stage 3 altogether.
params (PipelineParameters or None) – An object with the parameters to pass to MMseqs2 for each stage of the pipeline.
prefix (str) – The prefix to use for the GCF identifiers.
jobs (int) – The number of threads to use with MMseqs2 and for the distance computation in stage 3.
weight (str or None) – How to weight dimensions in the protein composition matrix. Use 'protein' to weight by protein length, or None to disable weighting.
mmseqs (MMseqs) – The MMseqs2 driver used for linear clustering in stages 1 and 2 and for protein clustering in stage 3.
progress (Progress or None) – A rich progress object to use for reporting progress.
workdir (pathlib.Path or None) – The path to a folder to use for temporary files. If None is given, a temporary folder is created with tempfile.TemporaryDirectory on each run invocation.
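The workdir behaviour described above can be sketched with the standard library alone. The helper name resolve_workdir is hypothetical and this is not the IGUA implementation. It shows one way to fall back to tempfile.TemporaryDirectory when no folder is configured, while leaving a caller-supplied folder in place.

```python
import contextlib
import pathlib
import tempfile
from typing import Iterator, Optional, Union


@contextlib.contextmanager
def resolve_workdir(
    workdir: Optional[Union[str, pathlib.Path]] = None,
) -> Iterator[pathlib.Path]:
    """Yield a working directory, creating a temporary one if none is given."""
    if workdir is None:
        # No folder configured: create one that is deleted on exit.
        with tempfile.TemporaryDirectory() as tmp:
            yield pathlib.Path(tmp)
    else:
        # Folder supplied by the caller: create it if needed, keep it afterwards.
        path = pathlib.Path(workdir)
        path.mkdir(parents=True, exist_ok=True)
        yield path


with resolve_workdir() as folder:
    temporary = folder
    (folder / "stage1.fna").write_text(">c1\nACGT\n")
    existed_inside = folder.exists()

removed_after = not temporary.exists()
```

Passing an explicit folder is useful for inspecting intermediate files after a run, since a temporary folder disappears as soon as the block exits.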
- _extract_clusters_to_file(dataset, output)
Extract the dataset clusters to the given file.
- Parameters:
dataset (BaseDataset) – The dataset containing the gene clusters to extract.
output (pathlib.Path) – The path to the output file to write the gene cluster sequences to.
- Returns:
pandas.DataFrame – A table with columns cluster_id and cluster_length reporting the ID and length of the extracted clusters.
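As an illustration of this extraction step, the sketch below writes sequences as two-line FASTA (the format expected by the deduplication stage) and collects the cluster_id and cluster_length values. It is a simplified assumption-laden stand-in: the function name write_two_line_fasta is hypothetical, the input is a plain list of pairs rather than a BaseDataset, and the summary is a list of dictionaries rather than a pandas.DataFrame.

```python
import pathlib
import tempfile
from typing import Dict, Iterable, List, Tuple, Union


def write_two_line_fasta(
    records: Iterable[Tuple[str, str]], output: pathlib.Path
) -> List[Dict[str, Union[str, int]]]:
    """Write (identifier, sequence) pairs and return one summary row per record."""
    rows: List[Dict[str, Union[str, int]]] = []
    with output.open("w") as handle:
        for cluster_id, sequence in records:
            # Two-line FASTA: one header line, then the whole sequence unwrapped.
            handle.write(f">{cluster_id}\n{sequence}\n")
            rows.append({"cluster_id": cluster_id, "cluster_length": len(sequence)})
    return rows


# Hypothetical gene clusters, used only for illustration.
clusters = [("BGC0001", "ATGGCCAAGT"), ("BGC0002", "ATGGCC")]

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "clusters.fna"
    table = write_two_line_fasta(clusters, path)
    lines = path.read_text().splitlines()
```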
- _extract_proteins_to_file(dataset, clusters, output)
Extract the dataset proteins to the given file.
- Parameters:
dataset (BaseDataset) – The dataset containing the gene clusters to extract and their proteins.
clusters (Collection of str) – A collection of clusters from which to extract proteins (typically the clusters selected as representatives in the previous step).
output (pathlib.Path) – The path to the output file to write proteins to.
- Returns:
pandas.DataFrame – A table with columns protein_id, protein_length and cluster_id reporting the ID, length and source cluster of the extracted proteins.
- _extract_representatives(gcfs2)
Extract the representative clusters in gcfs2 into an index.
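An index of this kind maps each representative identifier to an integer position, which later serves as a row number in the compositional table. The sketch below is a guess at the idea rather than the IGUA code: build_index is a hypothetical helper, the rows are plain dictionaries instead of a dataframe, and sorting the identifiers is only one possible way to make the positions stable.

```python
from typing import Dict, Iterable, Mapping


def build_index(rows: Iterable[Mapping[str, str]], column: str) -> Dict[str, int]:
    """Map each distinct value of `column` to a stable integer position."""
    unique = sorted({row[column] for row in rows})
    return {identifier: position for position, identifier in enumerate(unique)}


# Hypothetical stage 2 output: each cluster points at its representative.
gcfs2 = [
    {"cluster_id": "c1", "nucleotide_representative": "c1"},
    {"cluster_id": "c2", "nucleotide_representative": "c1"},
    {"cluster_id": "c3", "nucleotide_representative": "c3"},
]

representatives = build_index(gcfs2, "nucleotide_representative")
```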
- _hierarchical_clustering(compositions)
Perform hierarchical (or linear) clustering on the given data.
- Parameters:
compositions (anndata.AnnData) – A compositional matrix obtained from the protein clustering.
- Returns:
pandas.DataFrame – A table mapping each row of compositions to a GCF with arbitrary identifiers.
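To make the idea of hierarchical clustering on a compositional matrix concrete, here is a naive pure-Python sketch. Everything specific in it is an assumption: the distance (sum of absolute differences divided by the total count), the average linkage, and the 0.5 cutoff are illustrative choices, not the metric or threshold used by IGUA, which delegates this step to the configured strategy object. The returned labels are arbitrary integers, in the same spirit as the arbitrary identifiers mentioned above.

```python
from itertools import combinations
from typing import List, Sequence


def distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Sum of absolute differences divided by the total count, in [0, 1]."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def average_linkage(rows: Sequence[Sequence[int]], cutoff: float) -> List[int]:
    """Merge the closest groups until the best merge exceeds `cutoff`."""
    groups = [[i] for i in range(len(rows))]

    def linkage(left: List[int], right: List[int]) -> float:
        # Average pairwise distance between the members of two groups.
        pairs = [(i, j) for i in left for j in right]
        return sum(distance(rows[i], rows[j]) for i, j in pairs) / len(pairs)

    while len(groups) > 1:
        x, y = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: linkage(groups[pair[0]], groups[pair[1]]),
        )
        if linkage(groups[x], groups[y]) > cutoff:
            break
        groups[x] = groups[x] + groups[y]
        del groups[y]  # safe: x < y, so index x is unaffected

    labels = [0] * len(rows)
    for label, members in enumerate(groups):
        for member in members:
            labels[member] = label
    return labels


# Hypothetical compositions: rows are gene clusters, columns are protein clusters.
compositions = [[2, 1, 0], [2, 1, 1], [0, 0, 3]]
labels = average_linkage(compositions, cutoff=0.5)
```

This naive loop recomputes every linkage at each step, so it only suits small examples; a real pipeline would rely on an optimized library routine.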
- _join_results(gcfs1, gcfs2, gcfs3, input_sequences)
Join the results of the different stages to a single dataframe.
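Conceptually, the join follows each input cluster through the successive stages and records the representative chosen at every level. The sketch below uses plain dictionaries instead of dataframes. Which stage fills which column is an assumption based on the column names of the result table, the identifiers and the GCF0000001 format are hypothetical, and the filename column is left out.

```python
from typing import Dict, List, Tuple


def join_stages(
    lengths: Dict[str, int],
    stage1: Dict[str, str],
    stage2: Dict[str, str],
    stage3: Dict[str, Tuple[str, str]],
) -> List[Dict[str, object]]:
    """Follow each cluster through the three stages and emit one row per cluster."""
    rows: List[Dict[str, object]] = []
    for cluster_id, fragment_rep in stage1.items():
        nucleotide_rep = stage2[fragment_rep]
        gcf_id, gcf_rep = stage3[nucleotide_rep]
        rows.append(
            {
                "cluster_id": cluster_id,
                "cluster_length": lengths[cluster_id],
                "gcf_id": gcf_id,
                "gcf_representative": gcf_rep,
                "nucleotide_representative": nucleotide_rep,
                "fragment_representative": fragment_rep,
            }
        )
    return rows


# Hypothetical per-stage mappings, for illustration only.
lengths = {"c1": 1200, "c2": 1200, "c3": 1150, "c4": 900}
# cluster -> fragment representative (assumed stage 1 output)
stage1 = {"c1": "c1", "c2": "c1", "c3": "c3", "c4": "c4"}
# fragment representative -> nucleotide representative (assumed stage 2 output)
stage2 = {"c1": "c1", "c3": "c1", "c4": "c4"}
# nucleotide representative -> (gcf_id, gcf_representative) (assumed stage 3 output)
stage3 = {"c1": ("GCF0000001", "c1"), "c4": ("GCF0000002", "c4")}

table = join_stages(lengths, stage1, stage2, stage3)
```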
- _make_compositions(protein_clusters, representatives, protein_representatives)
Build a compositional matrix from the given protein clusters.
- Parameters:
protein_clusters (pandas.DataFrame) – The clustering of proteins returned by MMseqs2.
representatives (dict of str to int) – The index mapping gene cluster IDs to their index in the compositional table.
protein_representatives (dict of str to int) – The index mapping protein IDs to their index in the compositional table.
- Returns:
anndata.AnnData – An annotated compositional matrix where each cluster is represented as a combination of protein counts.
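The counting behind a compositional matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of (gene cluster, protein cluster representative) pairs rather than the MMseqs2 dataframe, the result is a dense list of lists rather than an AnnData object, and the identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def make_compositions(
    assignments: Iterable[Tuple[str, str]],
    representatives: Dict[str, int],
    protein_representatives: Dict[str, int],
) -> List[List[int]]:
    """Count, per gene cluster (row), its proteins in each protein cluster (column)."""
    matrix = [
        [0] * len(protein_representatives) for _ in range(len(representatives))
    ]
    for cluster_id, protein_rep in assignments:
        row = representatives[cluster_id]
        column = protein_representatives[protein_rep]
        matrix[row][column] += 1
    return matrix


# Hypothetical input: one (gene cluster, protein cluster representative) pair per protein.
assignments = [
    ("c1", "p1"), ("c1", "p1"), ("c1", "p2"),
    ("c3", "p2"), ("c3", "p3"),
]
representatives = {"c1": 0, "c3": 1}
protein_representatives = {"p1": 0, "p2": 1, "p3": 2}

compositions = make_compositions(assignments, representatives, protein_representatives)
```

Here the first row reads as two proteins in the p1 family and one in the p2 family. A sparse matrix would be the natural choice at scale, since most gene clusters contain only a few protein families.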
- _run_nucleotide_clustering(workdir, step1)
Run the nucleotide clustering stage of the pipeline.
- Parameters:
workdir (pathlib.Path) – The path to the temporary directory to use for processing data.
step1 (Clustering) – The MMseqs clustering file created at the previous stage.
- Returns:
pandas.DataFrame – The result of the clustering on the representative sequences of the step1 clustering.
- _run_nucleotide_deduplication(workdir, clusters_fna, mmseqs)
Run the nucleotide deduplication stage of the pipeline.
- Parameters:
workdir (pathlib.Path) – The path to the temporary directory to use for processing data.
clusters_fna (pathlib.Path) – The path to a two-line FASTA file containing the gene clusters to process.
- Returns:
(Clustering, pandas.DataFrame) – The clustering file created by MMseqs, and a dataframe containing a summary of the result.
- _run_protein_clustering(workdir, dataset, representatives, mmseqs)
Run protein clustering and build a compositional matrix.
- Parameters:
workdir (pathlib.Path) – The path to the temporary directory to use for processing data.
dataset (BaseDataset) – The dataset containing the gene clusters to extract and their proteins.
representatives (collections.abc.Collection of str) – A collection of gene cluster IDs to extract proteins from.
- Returns:
anndata.AnnData – A compositional matrix derived from the protein clustering. See ClusteringPipeline._make_compositions for more information.
- run(dataset)
Run the clustering pipeline on the given dataset.
- Parameters:
dataset (BaseDataset) – A dataset containing the gene clusters to cluster into families.
- Returns:
PipelineResult – The results of the pipeline. See class-level documentation for more information.