FASTA/GFF dataset

class igua.dataset.fasta_gff.FastaGFFDataset(inputs, column_mapping=None)

Dataset for extracting sequences from FASTA/GFF files.

__init__(inputs, column_mapping=None)

Initialize the FastaGFFDataset class.

Parameters:
  • inputs – List of input paths (metadata TSV or individual files).

  • column_mapping – Custom column mapping. If None, uses default generic mapping.
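
The fallback described for column_mapping can be pictured with a minimal sketch. Everything here is an assumption for illustration: the names DEFAULT_COLUMN_MAPPING and resolve_column_mapping are hypothetical, the column names are invented, and the sketch assumes a custom mapping replaces the defaults rather than being merged with them.

```python
from typing import Dict, Optional

# Hypothetical defaults: logical field -> column name in the metadata TSV.
# These names are illustrative and are not taken from igua.
DEFAULT_COLUMN_MAPPING: Dict[str, str] = {
    "genome_id": "genome_id",
    "fasta": "fasta_file",
    "gff": "gff_file",
}


def resolve_column_mapping(
    column_mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the caller's mapping, or of the defaults when None is given."""
    source = DEFAULT_COLUMN_MAPPING if column_mapping is None else column_mapping
    # Copy so that later changes by the caller do not leak into the dataset.
    return dict(source)
```

Testing against None, rather than relying on truthiness, keeps an explicitly empty mapping distinct from "no mapping supplied".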

_create_genome_context(row, genome_id)

Create GenomeContext from metadata row.

Parameters:
  • row – Dictionary with file paths from metadata TSV.

  • genome_id – Genome identifier.

Returns:

GenomeContext instance.

_load_and_filter_systems(tsv_path, console)

Load systems TSV file.

Parameters:
  • tsv_path – Path to systems TSV.

  • console – Rich console for logging.

Returns:

Polars DataFrame with system data.

_log_protein_summary(progress, results, representatives)

Log summary of protein extraction results.

Parameters:
  • progress – Rich progress bar instance.

  • results – Dictionary of extracted proteins.

  • representatives – Container of representative cluster IDs.

extract_clusters(progress)

Extract the clusters from the dataset.

Parameters:

progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.

Yields:

Cluster – A cluster object for each gene cluster to be processed in the dataset.
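
Because extract_clusters yields clusters rather than returning a list, a caller can stop early without processing the whole dataset. The sketch below illustrates this with a hypothetical helper, take_clusters, which is not part of igua. It relies only on the documented extract_clusters(progress) call, and the path in the usage comment is a placeholder.

```python
from itertools import islice
from typing import Any, List


def take_clusters(dataset: Any, progress: Any, limit: int) -> List[Any]:
    """Return at most `limit` clusters from dataset.extract_clusters(progress)."""
    # islice stops pulling from the generator once `limit` items are collected,
    # so clusters beyond the limit are never extracted.
    return list(islice(dataset.extract_clusters(progress), limit))


# Possible use with the documented class (needs igua and rich installed;
# "metadata.tsv" is a placeholder path):
#
#     import rich.progress
#     from igua.dataset.fasta_gff import FastaGFFDataset
#
#     with rich.progress.Progress() as progress:
#         first = take_clusters(FastaGFFDataset(["metadata.tsv"]), progress, 10)
```

The helper takes the dataset as an argument, so it works with any object that exposes extract_clusters(progress).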

extract_proteins(progress, cluster_ids)

Extract protein sequences from FASTA/GFF files.

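Parameters:
  • progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.

  • cluster_ids – Identifiers of the gene clusters whose proteins should be extracted.

Yields: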

Protein – A protein object for each protein of the gene clusters to be processed in the dataset.