Base dataset#
- class igua.dataset.base.Cluster(id, sequence, source=None)#
A gene cluster.
- id#
The identifier of the gene cluster. Must be unique across all datasets, as the identifier will be used to refer to each cluster in the resulting gene cluster families.
- Type:
- source#
The source of the gene cluster, usually the path to the file where this cluster record was extracted from.
- class igua.dataset.base.Protein(id, sequence, cluster_id)#
A protein inside a gene cluster.
- id#
The identifier of the protein. Must be unique across all datasets, as the identifier will be used to refer to each protein while clustering.
- Type:
- class igua.dataset.base.BaseDataset#
An abstract dataset to provide clusters to a ClusteringPipeline.
IGUA needs the genomic sequences of the gene clusters to process, and the individual proteins of each gene cluster. To provide them to a clustering pipeline, the BaseDataset interface provides two functions: BaseDataset.extract_clusters and BaseDataset.extract_proteins, which yield Cluster and Protein objects, respectively.
- abstract extract_clusters(progress)#
Extract the clusters from the dataset.
- Parameters:
progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
- Yields:
Cluster – A cluster object for each gene cluster to be processed in the dataset.
- abstract extract_proteins(progress, cluster_ids)#
Extract the protein sequences of the given clusters from the dataset.
- Parameters:
progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
cluster_ids (collections.abc.Collection of str) – A collection of cluster IDs from which to extract proteins.
- Yields:
Protein – A protein object for each protein of the gene clusters to be processed in the dataset.
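As an illustration of the two-method interface described above, here is a minimal sketch of a custom dataset. To keep it runnable without IGUA installed, it defines local stand-ins for Cluster, Protein and BaseDataset that mirror the documented signatures; in real code these would presumably be imported from igua.dataset.base. The InMemoryDataset class, its dictionary layout, the `"<memory>"` source label and the `str` attribute types are all assumptions made for the example, not part of the library.

```python
from abc import ABC, abstractmethod
from collections.abc import Collection, Iterator
from dataclasses import dataclass
from typing import Optional


# Local stand-ins mirroring the documented constructor signatures.
# Attribute types (str) are assumed; the reference above does not state them.
@dataclass
class Cluster:
    id: str
    sequence: str
    source: Optional[str] = None


@dataclass
class Protein:
    id: str
    sequence: str
    cluster_id: str


class BaseDataset(ABC):
    @abstractmethod
    def extract_clusters(self, progress) -> Iterator[Cluster]: ...

    @abstractmethod
    def extract_proteins(
        self, progress, cluster_ids: Collection[str]
    ) -> Iterator[Protein]: ...


class InMemoryDataset(BaseDataset):
    """Hypothetical dataset backed by plain dictionaries."""

    def __init__(self, clusters, proteins):
        # clusters: {cluster_id: nucleotide sequence}
        # proteins: {cluster_id: {protein_id: amino-acid sequence}}
        self.clusters = clusters
        self.proteins = proteins

    def extract_clusters(self, progress):
        # Yield one Cluster per gene cluster in the dataset.
        for cluster_id, sequence in self.clusters.items():
            yield Cluster(cluster_id, sequence, source="<memory>")

    def extract_proteins(self, progress, cluster_ids):
        # Only yield proteins belonging to the requested clusters.
        wanted = set(cluster_ids)
        for cluster_id, records in self.proteins.items():
            if cluster_id not in wanted:
                continue
            for protein_id, sequence in records.items():
                yield Protein(protein_id, sequence, cluster_id)


dataset = InMemoryDataset(
    clusters={"c1": "ATGC", "c2": "GGCC"},
    proteins={"c1": {"c1_p1": "MK", "c1_p2": "MA"}, "c2": {"c2_p1": "ML"}},
)
# progress is unused in this sketch; a pipeline would pass a rich.progress.Progress.
clusters = list(dataset.extract_clusters(progress=None))
proteins = list(dataset.extract_proteins(progress=None, cluster_ids={"c1"}))
```

Prefixing each protein identifier with its cluster identifier, as done here, is one simple way to keep protein IDs unique across datasets, provided the cluster IDs themselves are unique.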