Introduction to dingo-stats: package for statistics on flux sampling datasets
Sotirios Touliopoulos
Contributor to Google Summer of Code 2025 with GeomScale
Mentors of this project: Elias Tsigaridas, Apostolos Chalkis, Haris Zafeiropoulos, Ioannis Psarros
Goal of the project
The goal of this project is to introduce dingo-stats: a python package for statistical analysis of flux sampling datasets. Till now, all post-sampling statistics were integrated into the dingo package. However, as the package grew, it became clear that separating the statistical analysis into its own package would be beneficial. This separation allows for a more focused development of statistical methods and makes it easier for users to access and utilize these methods without needing to install the entire dingo package.
The dingo-stats package provides a set of functions for performing statistical tests and visualizing results given any flux sampling dataset (not restricted to dingo samples). The package is designed to be user-friendly with a series of tutorials and documentation available.
Updated features
Functions regarding computing pairwise reaction correlations, clustering and graphs analysis from dingo have been updated and moved to dingo-stats. The removal of these functions from dingo was completed in Pull Request #118. These functions will not be presented in this blog, as they were presented in last year’s blog.
New features
The dingo-stats package will include the following new features:
- Functions for modifying metabolic models (changing the objective function, objective direction, setting reaction bounds)
- Functions for inspecting Sampling Distributions
- Functions for retrieving pathway information and mapping reaction IDs to KEGG IDs.
- Functions used to identify loopy reactions in metabolic models.
- Functions for performing Differential Flux Analysis, including statistical tests such as the Kolmogorov-Smirnov test and Hypergeometric test for pathway enrichment.
Modify Metabolic Models
The modify_model() function modifies a given cobra model. Users can change objective function, optimal percentage (lower bound of the objective will be fixed based on the optimal percentage), define custom reaction bounds and change objective direction.
cobra_modelis a cobra model objectobjective_functionis a string with the requested objective function ID to be assigned to the modeloptimal_percentageis the percentage of the FBA solution to be set as lower bound to the objectie function.reaction_boundsis a dictionary of custom reaction bounds to be assigned to the model’s reactionsobjective_directiondefines the direction to optimize the model: max and min are available options.
This function enables the user to create and sample metabolic models simulating different conditions. Here, 2 new models are created based on updating the initial model:
- One that maximizes for biomass production (asking at least almost 100% of the biomass maximum FBA value)
ec_cobra_model_condition_100 = modify_model(
cobra_model = ec_cobra_model,
objective_function = "BIOMASS_Ecoli_core_w_GAM",
optimal_percentage = 100,
objective_direction = "max"
)
- One that maximize for biomass production (asking at least 0% of the biomass maximum FBA value)
ec_cobra_model_condition_0 = modify_model(
cobra_model = ec_cobra_model,
objective_function = "BIOMASS_Ecoli_core_w_GAM",
optimal_percentage = 0,
objective_direction = "max"
)
Now you are ready to sample your edited model with the sampling methods of your choice
Inspect Sampling Distributions
The plot_grid_reactions_flux() function plots a grid of selected flux distributions and enables inspections of sampling results to get early insights. Distributions here are chosen based on a list of reaction IDs (subset from Glycolysis and Pentose Phosphate Pathways) derived from KEGG pathway information (see helper functions in next paragraph). The corresponding KEGG pathways tutorial is presented in the next section. From the grid we can detect: Left/Right Shifted, Normal, Fixed (transparent based on the tolerance parameter) or other uncommon distributions.
samplesis a numpy 2D array of the samplesmodel_reactionsis a list containing strings of reaction IDstoleranceis a tolerance level to make distribution plot transparent based on flux rangenrowsis the number of rows for the plotncolsis the number of columns for the plottitleis the title of the plot
plot_grid_reactions_flux( samples = samples_glycolysis_ppp, model_reactions = reaction_ids_glycolysis_ppp, nrows = 5, ncols = 5, tolerance = 1e-3, title = “Sampling Distributions”)

The sampling_statistics() function calculates statistics on flux distributions for a reaction of interest. This information can be used to identify significantly altered flux distributions between different conditions
samplesis a numpy 2D array of the samplesmodel_reactionsis a list containing strings of reaction IDsreaction_idis a reaction ID of the selected reaction to calculate statistics on
mean, min, max, std, skewness, kurtosis = sampling_statistics(
samples = samples,
model_reactions = ec_cobra_reaction_ids,
reaction_id = "FRD7")
print(mean, min, max, std, skewness, kurtosis)
479.1369619828837 0.14995320273677437 993.9157983127856 289.3315287517938 0.09951034266820558 -1.2064534542990524
KEGG Pathway Information
A set of helper functions is available to retrieve KEGG pathway information and map reaction IDs to KEGG IDs. These helper functions will not be presented here, but are available in the dingo-stats documentation. Here, we will present how to leverage the pathway information to analyze a sampling dataset.
The subset_model_reactions_from_pathway_info() function works given a dataFrame created with the get_kegg_pathways_from_reaction_ids() function and returns all reaction IDs affiliated with a given KEGG pathway name or ID.
PPP_from_name = subset_model_reactions_from_pathway_info(
kegg_info_df = df_kegg_pathways,
pathway_query = "Pentose phosphate pathway")
Glycolysis_from_name = subset_model_reactions_from_pathway_info(
kegg_info_df = df_kegg_pathways,
pathway_query = "Glycolysis / Gluconeogenesis")
Glycolysis_from_id = subset_model_reactions_from_pathway_info(
kegg_info_df = df_kegg_pathways,
pathway_query = "rn00010")
print(PPP_from_name)
print(Glycolysis_from_name)
print(Glycolysis_from_id)
['FBA', 'FBP', 'GND', 'PFK', 'PGL', 'RPE', 'RPI', 'TKT1']
['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI']
['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI']
The dictionary_reaction_id_to_pathway() function takes one or multiple lists containing reaction IDs (corresponding to KEGG pathways) and creates a dictionary that maps the IDs to pathway names. If a reaction appears in more than one pathway, it is classified with the term Multiple-Pathways. This is useful for plotting to work with subsets of reactions and to replace names from the df_kegg_pathways dataframe like Glycolysis / Gluconeogenesis to Glycolysis and Pentose phosphate pathway to PPP.
**named_listsare named lists where each argument is a list of reaction IDs and the argument name represents the pathway name.
named_lists = {
"Glycolysis": Glycolysis,
"PPP": PPP
}
bigg_to_pathway_dict = dictionary_reaction_id_to_pathway(
named_lists)
print(bigg_to_pathway_dict.get("GND"))
print(bigg_to_pathway_dict.get("ENO"))
print(bigg_to_pathway_dict.get("FBA"))
"Pentose phosphate pathway"
"Glycolysis / Gluconeogenesis"
"Multiple-Pathways"
The reaction_in_pathway_binary_matrix() function is used to create a new pandas dataframe with reactions as rows and different pathways as columns. The corresponding cell of the dataframe will show if a reaction belongs to a certain pathway (1) or not (0). If a reaction belongs to more than one pathways, then the column: Multiple-Pathways is created and the reaction matching this will only get True (1) there and not in the individual pathway columns (e.g. 1 in Multiple-Pathways, 0 in Glycolysis and 0 in PPP).
reaction_id_to_pathway_dictis dictionary mapping reaction IDs to pathway names created with the dictionary_reaction_id_to_pathway function
binary_df = reaction_in_pathway_binary_matrix(
reaction_id_to_pathway_dict = bigg_to_pathway_dict)
The plot_reaction_in_pathway_heatmap() function is used to plot a heatmap of the binary_df created from the reaction_in_pathway_binary_matrix() function to better illustrate the connection between reactions and pathways.
binary_dfis a pandas dataFrame with binary values (0 or 1)font_sizeis the font size for axis labels and ticksfig_widthis the width of the figure in pixelsfig_heightis the height of the figure in pixelstitleis the title of the plot
plot_reaction_in_pathway_heatmap(
binary_df = binary_df,
font_size = 8,
fig_width = 600,
fig_height = 600,
title = "" )

Identify Loopy Reactions
The loops_enumeration_from_fva() function computes twice a Flux Variability Analysis (with loopless parameter to False and True) and identifies possible loopy reaction in given model and returns them in a list. Each item in the list is a tuple of a string (reaction ID) and a float (loopy score returned from the function)
ec_cobra_modelis a cobra model objectfraction_of_optimumis a float variable that defines the fraction_of_optimum parameter used influx_variability_analysis
loopy_reactions_fva = loops_enumeration_from_fva(
ec_cobra_model = ec_cobra_model,
fraction_of_optimum = 0.999)
print(loopy_reactions_fva)
[('SUCDi', 994.7794007141794), ('FRD7', 995.0539767141795)]
The loopy_reactions_turned_off_in_pfba() function identifies which reactions from the loopy reactions set calculated with the loops_enumeration_from_fva() function can be turned off when performing a loopless_solution applied to pFBA results
loopy_cobra_modelis a cobra model object possibly containing loops (default model without any preproccess)loopy_reactionsis a list with loopy_reactions with reactions classified as loopytolis a tolerance value used to classify a numeric change as important or not
turned_off_reactions = loopy_reactions_turned_off_in_pfba(
loopy_cobra_model = ec_cobra_model,
loopy_reactions = loopy_reactions,
tol = 1e-6)
print(turned_off_reactions)
['FRD7']
Differential Flux Analysis
The significantly_altered_reactions() function takes as input at least two (different) flux sampling datasets to compare and identifies significantly altered reactions. It performs a Kolmogorov-Smirnov (KS) non-parametric test and corrects p-value for multiple comparisons. It additionally calculates a fold change that together with the p-value classifies reactions as significantly altered or not. It returns two lists, one with the significantly changed (significant_diff_reactions) and one with the not significantly changed (not_significant_diff_reactions) reactions. Also, two dictionaries, one mapping reaction IDs to corrected p_values (pval_dict) and one mapping reactions IDs to fold change values (fold_change_dict)
conditionsis a list of different flux sampling arraysselected_comparisonsis a list showing which conditions to compare (useful when comparing more than two sampling arrays)cobra_modelis a cobra model objectfold_change_cutoffis a cutoff for fold-change to consider two distributions significantly differentstd_cutoffis a cutoff to ensure distributions that are compared are not fixed to a certain value
conditions = [samples_condition_1, samples_condition_2]
selected_comparisons = [(0, 1)]
(significant_diff_reactions,
not_significant_diff_reactions,
pval_dict,
fold_change_dict) = significantly_altered_reactions(
conditions = conditions,
selected_comparisons = selected_comparisons,
cobra_model = ec_cobra_model,
fold_change_cutoff = 0.6,
std_cutoff = 1e-3)
The plot_volcano() function creates a volcano plot to show results from differential flux analysis. Volcano plot has Fold Change on the x-axis and -log10(p_value) on the y-axis. Users can provide a reactions list in the annotate parameter, to show reaction IDs on the plot. Also, lines showing the significance cutoffs may optionally be added, when providing p_value_cutoff, fold_change_cutoff and show_cutoff_lines.
pval_dictis a dictionary mapping reaction ID to corrected p-value.fold_change_dictis a dictionary mapping reaction ID to fold-change value.p_value_cutoffis a significance threshold for p-value (to appear in the plot).fold_change_cutoffis a threshold for fold-change (to appear in the plot).annotateis a list of reaction IDs to annotate on the plot.widthis the width of the figure in pixels.heightis the height of the figure in pixels.titleis the title of the plot.show_cutoff_linesis a boolean variable for whether to draw p-value and fold-change cutoff lines.
reactions_to_annotate = ["NH4t"]
plot_volcano(
pval_dict = pval_dict,
fold_change_dict = fold_change_dict,
p_value_cutoff = 0.05,
fold_change_cutoff = 0.6,
annotate = reactions_to_annotate,
width = 800,
height = 600,
title = "",
show_cutoff_lines = True)

The hypergeometric_test_pathway_enrichment() function performs a hypergeometric test to find significantly affected pathways between our sampling conditions. It also calculated fold enrichment and p_values useful for filtering significant changes
significant_reactionsis a list of reaction IDs with altered flux.model_reactionsis a list of all reaction IDs considered in the analysis.reaction_to_pathwaysis a dictinary mapping reaction ID -> list of pathway names.
hypergeometric_pathway_enrichment_df = hypergeometric_test_pathway_enrichment(
significant_reactions = significant_diff_reactions,
model_reactions = ec_cobra_reaction_ids,
reaction_to_pathways = reaction_to_pathways)
The plot_pathway_enrichment() function takes as input the hypergeometric_pathway_enrichment_df dataframe created with the hypergeometric_test_pathway_enrichment() function and plots the enrichment results in a bubble plot. Bubble size stands for pathway size (number of reactions) and reaction colour stands for -log10(p-value)
hypergeometric_enrichment_dfis a dataframe from the hypergeometric_test_pathway_enrichment function.pval_thresholdis a significance p value threshold for filtering.use_fdris a boolean for whether to use ‘fdr’ or ‘pval’ for filtering.font_sizeis the font size for plot labels and title.widthis the width of the figure in pixels.heightis the height of the figure in pixels.
plot_pathway_enrichment(
hypergeometric_enrichment_df = hypergeometric_pathway_enrichment_df,
pval_threshold = 2,
use_fdr = True,
font_size = 14,
width = 900,
height = 600)

References
-
Ebrahim, A.; Lerman, J. A.; Palsson, B. O.; Hyduke, D. R. COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Systems Biology 2013, 7 (1). https://doi.org/10.1186/1752-0509-7-74.
-
Apostolos Chalkis; Vissarion Fisikopoulos; Tsigaridas, E.; Haris Zafeiropoulos. Dingo: A Python Package for Metabolic Flux Sampling. Bioinformatics Advances 2024, 4 (1). https://doi.org/10.1093/bioadv/vbae037.