FASTA/GFF dataset

class igua.dataset.fasta_gff.FastaGFFDataset(inputs, column_mapping=None)

Dataset for extracting sequences from FASTA/GFF files.
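As a rough illustration of what extracting a sequence from a FASTA/GFF pair involves, the sketch below slices one feature out of a contig using GFF3 coordinates (1-based, inclusive) and reverse-complements it when the feature lies on the minus strand. It is a standalone, standard-library sketch and not the code used by FastaGFFDataset; the helper names parse_fasta and extract_feature are hypothetical.

```python
# Hypothetical helpers for illustration only; they are not part of igua.

def parse_fasta(text):
    """Parse FASTA-formatted text into a {record_id: sequence} dictionary."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record identifier is the first whitespace-delimited token.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature(records, gff_line):
    """Return the nucleotide sequence of one GFF3 feature line."""
    fields = gff_line.rstrip("\n").split("\t")
    seqid, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    # GFF3 coordinates are 1-based and inclusive; Python slices are
    # 0-based and half-open, hence `start - 1`.
    sequence = records[seqid][start - 1:end]
    if strand == "-":
        # Reverse-complement features annotated on the minus strand.
        sequence = sequence.translate(_COMPLEMENT)[::-1]
    return sequence


fasta = ">contig1 example\nATGAAACCC\nGGGTTTTAG\n"
gff = "contig1\tsketch\tCDS\t4\t9\t.\t-\t0\tID=cds1"
records = parse_fasta(fasta)
feature = extract_feature(records, gff)
print(feature)
```

The off-by-one conversion between GFF3 and Python slice coordinates is the detail most worth checking when writing this kind of extraction by hand.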

__init__(inputs, column_mapping=None)

Initialize a FastaGFFDataset instance.

Parameters:

inputs – List of input paths (metadata TSV or individual files).
column_mapping – Custom column mapping. If None, the default generic mapping is used.
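The fallback behaviour of column_mapping can be pictured as follows. This is a minimal sketch, not igua's implementation: the DEFAULT_MAPPING dictionary, the read_metadata helper, the direction of the mapping (logical field to TSV column), and every column name are hypothetical placeholders, since the actual default mapping is not described here.

```python
import csv
import io

# Hypothetical default mapping: logical field -> column name in the TSV.
DEFAULT_MAPPING = {"genome_id": "genome_id", "fasta": "fasta_file", "gff": "gff_file"}


def read_metadata(handle, column_mapping=None):
    """Yield one dict per metadata row, keyed by logical field name."""
    # Fall back to the default mapping only when none is given.
    mapping = DEFAULT_MAPPING if column_mapping is None else column_mapping
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield {field: row[column] for field, column in mapping.items()}


# A metadata table whose headers differ from the (hypothetical) defaults.
tsv = "sample\tassembly\tannotation\nG1\tg1.fna\tg1.gff\n"
custom = {"genome_id": "sample", "fasta": "assembly", "gff": "annotation"}
rows = list(read_metadata(io.StringIO(tsv), column_mapping=custom))
print(rows)
```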

_create_genome_context(row, genome_id)

Create a GenomeContext from a metadata row.

Parameters:

row – Metadata row describing the genome.
genome_id – Identifier of the genome.

Returns:

GenomeContext instance.

_load_and_filter_systems(tsv_path, console)

Load and filter the systems TSV file.

Parameters:

tsv_path – Path to the systems TSV file.
console – Console used for output.

Returns:

Polars DataFrame with system data.

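_log_protein_summary(progress, results, representatives)

Log a summary of the protein extraction results.

Parameters:

progress – A Progress instance used for logging.
results – The protein extraction results.
representatives – The cluster representatives.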

extract_clusters(progress)

Extract the clusters from the dataset.

Parameters:

progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.

Yields:

Cluster – A cluster object for each gene cluster to be processed in the dataset.

extract_proteins(progress, cluster_ids)

Extract protein sequences from the FASTA/GFF files.

Parameters:

progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
cluster_ids (collections.abc.Collection of str) – A collection of cluster IDs from which to extract proteins.

Yields:

Protein – A protein object for each protein of the gene clusters to be processed in the dataset.
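One plausible way to combine extract_clusters and extract_proteins is to enumerate the clusters first, then request proteins only for a chosen subset of cluster IDs. The sketch below uses a toy stand-in class that mirrors the two documented signatures so that it runs without igua installed; ToyDataset and the fields of the Cluster and Protein tuples are hypothetical, and the real classes may differ.

```python
from collections import namedtuple

# Hypothetical stand-ins; the real Cluster and Protein classes may differ.
Cluster = namedtuple("Cluster", ["id", "sequence"])
Protein = namedtuple("Protein", ["id", "cluster_id", "sequence"])


class ToyDataset:
    """Stand-in mirroring the two documented method signatures."""

    def __init__(self, clusters, proteins):
        self._clusters = clusters
        self._proteins = proteins

    def extract_clusters(self, progress):
        # `progress` is accepted but unused in this toy version.
        for cluster in self._clusters:
            yield cluster

    def extract_proteins(self, progress, cluster_ids):
        wanted = set(cluster_ids)  # constant-time membership tests
        for protein in self._proteins:
            if protein.cluster_id in wanted:
                yield protein


dataset = ToyDataset(
    clusters=[Cluster("c1", "ATGAAATAG"), Cluster("c2", "ATGCCCTAG")],
    proteins=[Protein("p1", "c1", "MK"), Protein("p2", "c2", "MP")],
)

# First pass: enumerate every cluster in the dataset.
cluster_ids = [c.id for c in dataset.extract_clusters(progress=None)]
# Second pass: extract proteins only for the clusters of interest.
selected = list(dataset.extract_proteins(progress=None, cluster_ids=["c2"]))
print(cluster_ids, [p.id for p in selected])
```

Both methods are generators, so nothing is read until the caller iterates; with the real class, a rich.progress.Progress instance would be passed in place of None.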