GenBank dataset
- class igua.dataset.genbank.GenBankDataset(path, *, gene_feature='CDS')
A dataset composed of gene clusters in a GenBank file.
GenBank files are commonly used to distribute a genomic sequence with associated metadata inside a single file.
Note
This class treats each GenBank record as an independent gene cluster and extracts the full record sequence and all the annotated genes. For GenBank files obtained with antiSMASH, please use the AntiSMASHGenBankDataset class to enable correct processing of the antiSMASH regions.
- __init__(path, *, gene_feature='CDS')
Create a new GenBank dataset.
- Parameters:
path (pathlib.Path) – The path to a GenBank file.
gene_feature (str) – The GenBank feature to extract gene sequences from. Defaults to CDS, which is used by most annotation tools.
- extract_clusters(progress)
Extract the clusters from the dataset.
- Parameters:
progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
- Yields:
Cluster – A cluster object for each gene cluster to be processed in the dataset.
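The record-per-cluster behaviour described above can be illustrated with a minimal standard-library sketch. This is not IGUA's implementation: `GENBANK_TEXT` and `iter_records` are hypothetical names introduced here, and the parser only handles the simple layout of the sample text.

```python
from typing import Iterator, Tuple

# Tiny two-record GenBank text, made up for illustration only.
GENBANK_TEXT = """\
LOCUS       REC1                      12 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     CDS             1..12
                     /translation="MELL"
ORIGIN
        1 atggaattac tg
//
LOCUS       REC2                       9 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
ORIGIN
        1 atgccctaa
//
"""


def iter_records(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (record name, full sequence) for every record in a GenBank text."""
    name, sequence, in_origin = None, [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # A new record starts: its name is the second field of the line.
            name, sequence, in_origin = line.split()[1], [], False
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit one "cluster" spanning the whole record.
            if name is not None:
                yield name, "".join(sequence).upper()
            name, sequence, in_origin = None, [], False
        elif in_origin:
            # Sequence lines look like "1 atggaattac tg": drop the offset.
            sequence.extend(line.split()[1:])


# Hypothetical stand-in for the clusters yielded by the dataset.
clusters = list(iter_records(GENBANK_TEXT))
```

Each record yields exactly one entry regardless of how many features it contains, which is why antiSMASH output, where a record may hold several regions, is better served by the dedicated class mentioned in the note.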
- extract_proteins(progress, clusters)
Extract protein sequences from the GenBank file.
- Parameters:
progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
clusters (collections.abc.Collection of str) – A collection of cluster IDs from which to extract proteins.
- Yields:
Protein – A protein object for each protein of the gene clusters to be processed in the dataset.
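As a rough illustration of how a collection of cluster IDs and the gene_feature setting could restrict which proteins are extracted, here is a standard-library sketch. It is an assumption-laden toy, not IGUA's code: `GENBANK_TEXT` and `iter_proteins` are hypothetical, record names stand in for cluster IDs, and the parser assumes every `/translation` qualifier fits on a single line (real files wrap long qualifiers).

```python
from typing import Collection, Iterator, Tuple

# Tiny two-record GenBank text, made up for illustration only.
GENBANK_TEXT = """\
LOCUS       REC1                      12 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     gene            1..12
                     /locus_tag="A1"
     CDS             1..12
                     /locus_tag="A1"
                     /translation="MELL"
ORIGIN
        1 atggaattac tg
//
LOCUS       REC2                       9 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     CDS             1..9
                     /locus_tag="B1"
                     /translation="MP"
ORIGIN
        1 atgccctaa
//
"""


def iter_proteins(
    text: str,
    cluster_ids: Collection[str],
    gene_feature: str = "CDS",
) -> Iterator[Tuple[str, str]]:
    """Yield (record name, protein sequence) for the selected records."""
    record, in_features, in_gene = None, False, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            record, in_features, in_gene = line.split()[1], False, False
        elif line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_features = in_gene = False
        elif in_features and record in cluster_ids:
            indent = len(line) - len(line.lstrip())
            stripped = line.strip()
            if indent < 21:
                # Feature lines are indented less than qualifier lines;
                # the first field is the feature key (e.g. "CDS", "gene").
                in_gene = stripped.split()[0] == gene_feature
            elif in_gene and stripped.startswith("/translation="):
                # Single-line qualifier assumed: strip the key and the quotes.
                yield record, stripped.split("=", 1)[1].strip('"')


# Only proteins from the requested record are produced.
selected = list(iter_proteins(GENBANK_TEXT, {"REC2"}))
```

Passing a set for the cluster IDs keeps the membership check cheap, and choosing a feature key that carries no translation (such as `gene` in this sample) yields nothing.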