GenBank dataset

class igua.dataset.genbank.GenBankDataset(path, *, gene_feature='CDS')

A dataset composed of gene clusters in a GenBank file.

GenBank files are commonly used to distribute a genomic sequence with associated metadata inside a single file.

Note

This class treats each GenBank record as an independent gene cluster, and extracts the full record sequence along with all of its annotated genes. For GenBank files produced by antiSMASH, use the AntiSMASHGenBankDataset class instead so that the antiSMASH regions are processed correctly.
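To make the "one record, one cluster" convention concrete, here is a minimal standard-library sketch that splits GenBank flat-file text into records. It is only an illustration, not the parser igua uses; the helper name `split_records` and the example records are hypothetical.

```python
from typing import Iterator, Tuple


def split_records(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (locus_name, record_text) for each record in GenBank flat-file text."""
    lines = []
    for line in text.splitlines():
        lines.append(line)
        # A line containing only "//" terminates a GenBank record.
        if line.strip() == "//":
            locus = next((l for l in lines if l.startswith("LOCUS")), None)
            if locus is not None:
                # The record name is the second field of the LOCUS line.
                yield locus.split()[1], "\n".join(lines)
            lines = []


# Hypothetical two-record file: each record would count as one gene cluster.
EXAMPLE = """\
LOCUS       clusterA                  12 bp    DNA     linear   UNK
ORIGIN
        1 atggcctaat ga
//
LOCUS       clusterB                   9 bp    DNA     linear   UNK
ORIGIN
        1 atgaaatga
//
"""

record_ids = [name for name, _ in split_records(EXAMPLE)]
print(record_ids)
```

Under this convention, a file holding several records yields several clusters, whatever the size of each record.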

__init__(path, *, gene_feature='CDS')

Create a new GenBank dataset.

Parameters:
  • path (pathlib.Path) – The path to a GenBank file.

  • gene_feature (str) – The GenBank feature to extract gene sequences from. Defaults to CDS, which is used by most annotation tools.
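The effect of choosing a feature key can be sketched by scanning a record's feature table by hand. The snippet below is a simplified illustration, not igua's implementation: the function `feature_locations` and the example record are hypothetical, and locations that wrap over several lines are not handled.

```python
from typing import List


def feature_locations(record_text: str, gene_feature: str = "CDS") -> List[str]:
    """Return the location strings of features whose key equals `gene_feature`."""
    locations = []
    in_features = False
    for line in record_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        # A new unindented section (e.g. ORIGIN) ends the feature table.
        if in_features and line[:1] not in (" ", ""):
            in_features = False
        if not in_features:
            continue
        # Feature keys are indented by 5 spaces, qualifier lines by 21.
        if line.startswith("     ") and len(line) > 5 and line[5] != " ":
            key, _, location = line.strip().partition(" ")
            if key == gene_feature:
                locations.append(location.strip())
    return locations


# Hypothetical record annotated with both "gene" and "CDS" features.
RECORD = """\
LOCUS       clusterA                  30 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     gene            1..12
                     /locus_tag="A_1"
     CDS             1..12
                     /locus_tag="A_1"
     CDS             complement(16..30)
                     /locus_tag="A_2"
ORIGIN
        1 atggcctaat gattttcatt tcatttccat
//
"""

cds_locations = feature_locations(RECORD)
gene_locations = feature_locations(RECORD, gene_feature="gene")
print(cds_locations, gene_locations)
```

In this example the same record exposes two genes under the CDS key but only one under the gene key, which is why the key should match the conventions of the annotation tool that produced the file.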

extract_clusters(progress)

Extract the clusters from the dataset.

Parameters:

progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.

Yields:

Cluster – A cluster object for each gene cluster to be processed in the dataset.

extract_proteins(progress, clusters)

Extract the protein sequences from the GenBank file.

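Parameters:
  • progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.

  • clusters – The gene clusters to extract proteins from.

Yields: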

Protein – A protein object for each protein of the gene clusters to be processed in the dataset.
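As a rough illustration of where protein sequences can come from in a GenBank record, the sketch below reads the `/translation` qualifier of each CDS feature. This is an assumption made for the example: the documentation above does not state whether igua reads this qualifier or translates the nucleotide sequence itself. The helper `cds_translations` and the example record are hypothetical.

```python
import re
from typing import Dict

# Matches qualifiers of the form /name="value"; values may wrap across lines.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def cds_translations(record_text: str) -> Dict[str, str]:
    """Map each CDS /locus_tag to its /translation, skipping CDS without one."""
    proteins = {}
    # Split at each feature key: a newline, 5 spaces of indent, then a non-space.
    for block in re.split(r"\n(?=     \S)", record_text):
        if not block.startswith("     CDS "):
            continue
        qualifiers = dict(QUALIFIER.findall(block))
        if "translation" in qualifiers and "locus_tag" in qualifiers:
            # Remove the whitespace introduced by line wrapping.
            protein = re.sub(r"\s+", "", qualifiers["translation"])
            proteins[qualifiers["locus_tag"]] = protein
    return proteins


# Hypothetical record: only the first CDS carries a /translation qualifier.
RECORD = """\
LOCUS       clusterA                  24 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     CDS             1..12
                     /locus_tag="A_1"
                     /translation="MKV
                     LA"
     CDS             13..24
                     /locus_tag="A_2"
ORIGIN
//
"""

proteins = cds_translations(RECORD)
print(proteins)
```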