FASTA/GFF dataset
- class igua.dataset.fasta_gff.FastaGFFDataset(inputs, column_mapping=None)
Dataset for extracting sequences from FASTA/GFF files.
- __init__(inputs, column_mapping=None)
Initialize the FastaGFFDataset class.
- Parameters:
inputs – List of input paths (metadata TSV or individual files).
column_mapping – Custom column mapping. If None, uses default generic mapping.
- _create_genome_context(row, genome_id)
Create GenomeContext from metadata row.
- Parameters:
row – Dictionary with file paths from metadata TSV.
genome_id – Genome identifier.
- Returns:
GenomeContext instance.
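To illustrate what a "dictionary with file paths from metadata TSV" can look like, here is a minimal sketch that reads a tab-separated table into one dictionary per row using the standard library. The column names (genome_id, fasta, gff) and file names are assumptions made up for this example, not igua's actual default mapping, and igua itself may load the table differently.

```python
import csv
import io

# Hypothetical metadata TSV; column names and paths are invented for illustration.
metadata_tsv = "genome_id\tfasta\tgff\nG1\tG1.fna\tG1.gff\nG2\tG2.fna\tG2.gff\n"

# Each row becomes a dictionary keyed by the header names.
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))

for row in rows:
    genome_id = row["genome_id"]
    # In this sketch, every column other than the identifier holds a file path.
    paths = {key: value for key, value in row.items() if key != "genome_id"}
    print(genome_id, paths)
```

A row dictionary and its genome identifier of this shape are the kind of arguments the method above is documented to receive.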
- _load_and_filter_systems(tsv_path, console)
Load systems TSV file.
- Parameters:
tsv_path – Path to systems TSV.
console – Rich console for logging.
- Returns:
Polars DataFrame with system data.
- _log_protein_summary(progress, results, representatives)
Log summary of protein extraction results.
- Parameters:
progress – Rich progress bar instance.
results – Dictionary of extracted proteins.
representatives – Container of representative cluster IDs.
- extract_clusters(progress)
Extract the clusters from the dataset.
- Parameters:
progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
- Yields:
Cluster – A cluster object for each gene cluster to be processed in the dataset.
- extract_proteins(progress, cluster_ids)
Extract protein sequences from the dataset.
- Parameters:
progress (rich.progress.Progress) – A Progress instance that can be used for tracking progress.
cluster_ids (collections.abc.Collection of str) – A collection of cluster IDs from which to extract proteins.
- Yields:
Protein – A protein object for each protein of the gene clusters to be processed in the dataset.
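The two public methods are generators, so a caller typically consumes extract_clusters first and then passes a subset of cluster IDs to extract_proteins. The sketch below shows that calling pattern with a toy stand-in class. ToyDataset, the Cluster and Protein dataclasses, their fields, and the sample sequences are all hypothetical; only the method names and parameters come from the reference above. The progress argument is passed as None here, whereas the documented methods expect a rich.progress.Progress.

```python
from collections.abc import Collection, Iterator
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical stand-in for igua's Cluster; real fields may differ."""
    id: str
    sequence: str


@dataclass
class Protein:
    """Hypothetical stand-in for igua's Protein; real fields may differ."""
    id: str
    cluster_id: str
    sequence: str


class ToyDataset:
    """Toy class mirroring the two documented generator methods."""

    def __init__(self, records):
        # records: {cluster_id: (nucleotide sequence, {protein_id: protein sequence})}
        self._records = records

    def extract_clusters(self, progress) -> Iterator[Cluster]:
        # `progress` is accepted for signature parity but unused in this sketch.
        for cluster_id, (sequence, _proteins) in self._records.items():
            yield Cluster(cluster_id, sequence)

    def extract_proteins(self, progress, cluster_ids: Collection[str]) -> Iterator[Protein]:
        # Only clusters whose ID is in `cluster_ids` contribute proteins.
        for cluster_id, (_sequence, proteins) in self._records.items():
            if cluster_id not in cluster_ids:
                continue
            for protein_id, sequence in proteins.items():
                yield Protein(protein_id, cluster_id, sequence)


dataset = ToyDataset({
    "c1": ("ATGAAATAA", {"c1_p1": "MK"}),
    "c2": ("ATGCCCTAA", {"c2_p1": "MP", "c2_p2": "MA"}),
})

# First pass: collect every cluster.
clusters = list(dataset.extract_clusters(progress=None))

# Second pass: extract proteins only for a chosen subset of cluster IDs.
representatives = {"c2"}
proteins = list(dataset.extract_proteins(progress=None, cluster_ids=representatives))
```

Passing a set for cluster_ids keeps the membership check cheap when the collection is large; any collection of strings satisfies the documented type.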