Scripting IGUA#
To facilitate using IGUA on arbitrary datasets, we since v0.2.0 provide
an API which can be used to configure and execute IGUA from a Python
script. The API supports arbitrary datasets, allowing users to provide
data from arbitrary sources.
Basic usage#
The Pipeline class implements the same features as the
command line. It can be used to cluster the gene clusters provided in
a dataset. The following snippet shows the minimum code needed to cluster
gene clusters contained in a GenBank file:
import rich.progress
from igua.pipeline import Pipeline
from igua.dataset.genbank import GenBankDataset
dataset = GenBankDataset("clusters.gbk")
with rich.progress.Progress() as progress:
pipeline = Pipeline(jobs=8, progress=progress)
results = pipeline.run(dataset)
print(results.gcfs)
results is an PipelineResults object which stores the
generated GCF table as a pandas.DataFrame, which can then be saved to a
file or processed further.
Combining datasets#
IGUA provides multiple dataset classes that can be used to manage IGUA inputs.
from igua.dataset.genbank import GenBankDataset
dataset = GenBankDataset("clusters.gbk")
The DatasetList class can be used to combine several
datasets inside a single BaseDataset class.
For instance, to combine gene clusters from two different GenBank files:
from igua.dataset.genbank import GenBankDataset
from igua.dataset.list import DatasetList
dataset = DatasetList([
GenBankDataset("clusters1.gbk"),
GenBankDataset("clusters2.gbk"),
])
Implementing a custom dataset class#
IGUA tries its best to support various formats, but there are always edge
cases. What if your data is in a tar archive? What if you don’t have
gene translations inside a GenBank file but in a supplementary FASTA file?
You can implement your own dataset!
A BaseDataset needs to implement only two methods:
extract_clustersto extract the genomic sequences of the gene clusters to process.extract_proteinsto extract the protein sequences of the gene clusters to process.
For instance, let’s implement a BaseDataset using the output tables produced
by GECCO: not the GenBank files, but the
clusters.tsv and genes.tsv table.
Initialization#
Let’s start by defining a new subclass, and adding the paths we need to
the constructor. We can parse the tables directly in the __init__
method:
import pandas
from pathlib import Path
from igua.data.base import BaseDataset
class GECCOClustersDataset(BaseDataset):
def __init__(self, genome_fna, clusters_tsv, genes_tsv):
self.genome_fna = genome_fna
self.clusters_tsv = clusters_tsv
self.genes_tsv = genes_tsv
self.clusters = pandas.read_table(self.clusters_tsv, sep="\t")
self.genes = pandas.read_table(self.genes_tsv, sep="\t")
Cluster extraction#
Now we need to implement the extract_clusters method, which yields the
clusters to process. To do so, we can use Biopython to stream records from the
FASTA file, and use the clusters TSV to extract the gene cluster sequences:
import Bio.SeqIO
from igua.data.base import BaseDataset, Cluster
class GECCOClustersDataset(BaseDataset):
# ... __init__ ...
def extract_clusters(self, progress):
for record in Bio.SeqIO.parse(self.genome_fna, "fasta"):
# extract the clusters predicted in this record and yield
# a Cluster object for each of them
record_clusters = self.clusters[self.clusters["sequence_id"] == record.id]
for row in record_clusters.itertuples():
yield Cluster(
id=row.cluster_id,
sequence=str(record.seq[row.start-1:row.end]),
source=str(self.genome_fna)
)
Note that we are not ensuring that all clusters have been extracted successfully,
which may be a desirable check to add to the code if we were to use it in
production. We could use a dictionary to track which clusters in
self.clusters['cluster_id'] have been found.
Protein extraction#
Now moving on to the extract_proteins method. This method takes an
additional argument, clusters, which is a container (typically a list)
of cluster IDs from which to extract proteins from.
import Bio.SeqIO
from igua.data.base import BaseDataset, Protein
class GECCOClustersDataset(BaseDataset):
# ... __init__ ...
# ... extract_clusters ...
def extract_proteins(self, progress, clusters):
# subset the clusters table with only the IDs of the clusters to
# extract, which are the representatives of the previous stages
clusters_to_extract = self.clusters[self.clusters["cluster_id"].isin(clusters)]
for record in Bio.SeqIO.parse(self.genome_fna, "fasta"):
# extract the clusters predicted in this record and extract
# the genes corresponding to this cluster
record_clusters = clusters_to_extract[clusters_to_extract["sequence_id"] == record.id]
for cluster_row in record_clusters.itertuples():
cluster_gene_ids = cluster_row.proteins.split(";")
cluster_genes = self.genes[self.genes["protein_id"].isin(cluster_genes_ids)]
# yield a protein for every gene, using Biopython to
# translate the genomic sequence to get the protein
for gene_row in cluster_genes.itertuples():
seq = str(record.seq[gene_row.start-1:gene_row.end].translate())
yield Protein(
id=gene_row.protein_id,
sequence=seq,
cluster_id=cluster_row.cluster_id,
)
With this method implemented, we have a working dataset which can load data
described by a GECCO table. It’s not the best in terms of performance
(we should rather index self.genes by protein_id to accelerate
the retrieval of table rows), but should be enough to get us started!
Using the dataset#
Now that we have a working dataset class, we can simply instantiate the dataset and pass it to a pipeline:
dataset = GECCOClustersDataset("genome.fna", "clusters.tsv", "genes.tsv")
with rich.progress.Progress() as progress:
pipeline = Pipeline(jobs=8, progress=progress)
results = pipeline.run(dataset)