Base dataset

class igua.dataset.base.Cluster(id, sequence, source=None)

A gene cluster.

id

The identifier of the gene cluster. Must be unique across all datasets, as the identifier will be used to refer to each cluster in the resulting gene cluster families.

Type: str

sequence

The genomic sequence of the gene cluster.

Type: str

source

The source of the gene cluster, usually the path to the file from which this cluster record was extracted.

Type: str or None

property length

The length of the cluster sequence.

Type: int
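
The attributes above can be illustrated with a small stand-in class. This is only a sketch that mirrors the documented interface (id, sequence, source, length); it is not IGUA's own implementation, and the identifier, sequence and file name are made-up values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    """Illustrative stand-in mirroring the documented Cluster attributes."""

    id: str                       # must be unique across all datasets
    sequence: str                 # genomic sequence of the gene cluster
    source: Optional[str] = None  # e.g. path of the originating file

    @property
    def length(self) -> int:
        # Assumption: the length is the number of characters in the sequence.
        return len(self.sequence)


# Hypothetical example values.
cluster = Cluster("BGC0000001", "ATGCATGCAT", source="example.gbk")
print(cluster.id, cluster.length, cluster.source)
```

With the real package installed, the same three constructor arguments would be passed to igua.dataset.base.Cluster instead.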

class igua.dataset.base.Protein(id, sequence, cluster_id)

A protein inside a gene cluster.

id

The identifier of the protein. Must be unique across all datasets, as the identifier will be used to refer to each protein while clustering.

Type: str

sequence

The amino-acid sequence of the protein.

Type: str

cluster_id

The identifier of the cluster this protein belongs to.

Type: str

property length

The length of the protein sequence.

Type: int
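
Because protein identifiers must be unique across all datasets, it can help to check for collisions before clustering. The sketch below uses a stand-in Protein class that mirrors the documented attributes, together with a hypothetical helper, find_duplicate_ids, which is not part of IGUA.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Protein:
    """Illustrative stand-in mirroring the documented Protein attributes."""

    id: str
    sequence: str    # amino-acid sequence
    cluster_id: str  # identifier of the parent gene cluster

    @property
    def length(self) -> int:
        # Assumption: the length is the number of residues in the sequence.
        return len(self.sequence)


def find_duplicate_ids(proteins: Iterable[Protein]) -> List[str]:
    """Return the sorted protein identifiers that occur more than once."""
    counts = Counter(protein.id for protein in proteins)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Hypothetical records; the second and third share an identifier.
proteins = [
    Protein("BGC1_1", "MKTAYIAK", "BGC1"),
    Protein("BGC1_2", "MSLLTE", "BGC1"),
    Protein("BGC1_2", "MVNQ", "BGC2"),
]
duplicates = find_duplicate_ids(proteins)
print(duplicates)
```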

class igua.dataset.base.BaseDataset

An abstract dataset to provide clusters to a ClusteringPipeline.

IGUA needs both the genomic sequence of each gene cluster to process and the individual proteins of each gene cluster. To provide them to a clustering pipeline, the BaseDataset interface defines two methods, BaseDataset.extract_clusters and BaseDataset.extract_proteins, which yield Cluster and Protein objects, respectively.

abstract extract_clusters(progress)

Extract the clusters from the dataset.

Parameters: progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
Yields: Cluster – A cluster object for each gene cluster to be processed in the dataset.

abstract extract_proteins(progress, cluster_ids)

Extract the proteins of the given gene clusters from the dataset.

Parameters:

progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
cluster_ids (collections.abc.Collection of str) – A collection of cluster IDs from which to extract proteins.

Yields:

Protein – A protein object for each protein of the gene clusters to be processed in the dataset.
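
A minimal sketch of a concrete dataset following this interface is shown below. To stay self-contained, it defines stand-ins for Cluster, Protein and BaseDataset that mirror the documented signatures rather than importing IGUA. The InMemoryDataset class, its records argument and the protein naming scheme are all hypothetical, and the progress argument is accepted but ignored.

```python
import abc
from dataclasses import dataclass
from typing import Collection, Dict, Iterator, List, Optional, Tuple


@dataclass
class Cluster:  # stand-in for igua.dataset.base.Cluster
    id: str
    sequence: str
    source: Optional[str] = None


@dataclass
class Protein:  # stand-in for igua.dataset.base.Protein
    id: str
    sequence: str
    cluster_id: str


class BaseDataset(abc.ABC):  # stand-in for the documented abstract interface
    @abc.abstractmethod
    def extract_clusters(self, progress) -> Iterator[Cluster]:
        ...

    @abc.abstractmethod
    def extract_proteins(self, progress, cluster_ids: Collection[str]) -> Iterator[Protein]:
        ...


class InMemoryDataset(BaseDataset):
    """Hypothetical dataset: cluster id -> (genomic sequence, protein sequences)."""

    def __init__(self, records: Dict[str, Tuple[str, List[str]]]) -> None:
        self.records = records

    def extract_clusters(self, progress) -> Iterator[Cluster]:
        for cluster_id, (dna, _) in self.records.items():
            yield Cluster(cluster_id, dna, source="<memory>")

    def extract_proteins(self, progress, cluster_ids: Collection[str]) -> Iterator[Protein]:
        for cluster_id, (_, proteins) in self.records.items():
            if cluster_id not in cluster_ids:
                continue  # only yield proteins of the requested clusters
            for index, sequence in enumerate(proteins, start=1):
                # Prefixing with the cluster id keeps protein ids unique.
                yield Protein(f"{cluster_id}_{index}", sequence, cluster_id)


dataset = InMemoryDataset({
    "BGC1": ("ATGCATGC", ["MKTAY", "MSLL"]),
    "BGC2": ("GGCCTTAA", ["MVNQ"]),
})
# None stands in for a rich.progress.Progress instance, which is unused here.
clusters = list(dataset.extract_clusters(None))
proteins = list(dataset.extract_proteins(None, {"BGC1"}))
print([c.id for c in clusters], [p.id for p in proteins])
```

Passing a set for cluster_ids keeps the membership test cheap, although any collection of strings would satisfy the documented parameter type.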