Introduction to dingo-stats: package for statistics on flux sampling datasets

9 minute read


Sotirios Touliopoulos
Contributor to Google Summer of Code 2025 with GeomScale

Mentors of this project: Elias Tsigaridas, Apostolos Chalkis, Haris Zafeiropoulos, Ioannis Psarros

Goal of the project

The goal of this project is to introduce dingo-stats: a python package for statistical analysis of flux sampling datasets. Till now, all post-sampling statistics were integrated into the dingo package. However, as the package grew, it became clear that separating the statistical analysis into its own package would be beneficial. This separation allows for a more focused development of statistical methods and makes it easier for users to access and utilize these methods without needing to install the entire dingo package.

The dingo-stats package provides a set of functions for performing statistical tests and visualizing results given any flux sampling dataset (not restricted to dingo samples). The package is designed to be user-friendly with a series of tutorials and documentation available.

Updated features

Functions regarding computing pairwise reaction correlations, clustering and graphs analysis from dingo have been updated and moved to dingo-stats. The removal of these functions from dingo was completed in Pull Request #118. These functions will not be presented in this blog, as they were presented in last year’s blog.

New features

The dingo-stats package will include the following new features:

  • Functions for modifying metabolic models (changing the objective function, objective direction, setting reaction bounds)
  • Functions for inspecting Sampling Distributions
  • Functions for retrieving pathway information and mapping reaction IDs to KEGG IDs.
  • Functions used to identify loopy reactions in metabolic models.
  • Functions for performing Differential Flux Analysis, including statistical tests such as the Kolmogorov-Smirnov test and Hypergeometric test for pathway enrichment.

Modify Metabolic Models

The modify_model() function modifies a given cobra model. Users can change objective function, optimal percentage (lower bound of the objective will be fixed based on the optimal percentage), define custom reaction bounds and change objective direction.

  • cobra_model is a cobra model object
  • objective_function is a string with the requested objective function ID to be assigned to the model
  • optimal_percentage is the percentage of the FBA solution to be set as lower bound to the objectie function.
  • reaction_bounds is a dictionary of custom reaction bounds to be assigned to the model’s reactions
  • objective_direction defines the direction to optimize the model: max and min are available options.

This function enables the user to create and sample metabolic models simulating different conditions. Here, 2 new models are created based on updating the initial model:

  • One that maximizes for biomass production (asking at least almost 100% of the biomass maximum FBA value)
ec_cobra_model_condition_100 = modify_model(
    cobra_model         = ec_cobra_model,
    objective_function  = "BIOMASS_Ecoli_core_w_GAM",
    optimal_percentage  = 100,
    objective_direction = "max"
)
  • One that maximize for biomass production (asking at least 0% of the biomass maximum FBA value)
ec_cobra_model_condition_0 = modify_model(
    cobra_model         = ec_cobra_model,
    objective_function  = "BIOMASS_Ecoli_core_w_GAM",
    optimal_percentage  = 0,
    objective_direction = "max"
)

Now you are ready to sample your edited model with the sampling methods of your choice

Inspect Sampling Distributions

The plot_grid_reactions_flux() function plots a grid of selected flux distributions and enables inspections of sampling results to get early insights. Distributions here are chosen based on a list of reaction IDs (subset from Glycolysis and Pentose Phosphate Pathways) derived from KEGG pathway information (see helper functions in next paragraph). The corresponding KEGG pathways tutorial is presented in the next section. From the grid we can detect: Left/Right Shifted, Normal, Fixed (transparent based on the tolerance parameter) or other uncommon distributions.

  • samples is a numpy 2D array of the samples
  • model_reactions is a list containing strings of reaction IDs
  • tolerance is a tolerance level to make distribution plot transparent based on flux range
  • nrows is the number of rows for the plot
  • ncols is the number of columns for the plot
  • title is the title of the plot

plot_grid_reactions_flux( samples = samples_glycolysis_ppp, model_reactions = reaction_ids_glycolysis_ppp, nrows = 5, ncols = 5, tolerance = 1e-3, title = “Sampling Distributions”)



The sampling_statistics() function calculates statistics on flux distributions for a reaction of interest. This information can be used to identify significantly altered flux distributions between different conditions

  • samples is a numpy 2D array of the samples
  • model_reactions is a list containing strings of reaction IDs
  • reaction_id is a reaction ID of the selected reaction to calculate statistics on
mean, min, max, std, skewness, kurtosis = sampling_statistics(
    samples = samples, 
    model_reactions = ec_cobra_reaction_ids,
    reaction_id = "FRD7")

print(mean, min, max, std, skewness, kurtosis)
479.1369619828837 0.14995320273677437 993.9157983127856 289.3315287517938 0.09951034266820558 -1.2064534542990524

KEGG Pathway Information

A set of helper functions is available to retrieve KEGG pathway information and map reaction IDs to KEGG IDs. These helper functions will not be presented here, but are available in the dingo-stats documentation. Here, we will present how to leverage the pathway information to analyze a sampling dataset.

The subset_model_reactions_from_pathway_info() function works given a dataFrame created with the get_kegg_pathways_from_reaction_ids() function and returns all reaction IDs affiliated with a given KEGG pathway name or ID.

PPP_from_name = subset_model_reactions_from_pathway_info(
    kegg_info_df = df_kegg_pathways, 
    pathway_query = "Pentose phosphate pathway")

Glycolysis_from_name = subset_model_reactions_from_pathway_info(
    kegg_info_df = df_kegg_pathways, 
    pathway_query = "Glycolysis / Gluconeogenesis")

Glycolysis_from_id = subset_model_reactions_from_pathway_info(
    kegg_info_df = df_kegg_pathways, 
    pathway_query = "rn00010")

print(PPP_from_name)
print(Glycolysis_from_name)
print(Glycolysis_from_id)
['FBA', 'FBP', 'GND', 'PFK', 'PGL', 'RPE', 'RPI', 'TKT1']
['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI']
['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI']

The dictionary_reaction_id_to_pathway() function takes one or multiple lists containing reaction IDs (corresponding to KEGG pathways) and creates a dictionary that maps the IDs to pathway names. If a reaction appears in more than one pathway, it is classified with the term Multiple-Pathways. This is useful for plotting to work with subsets of reactions and to replace names from the df_kegg_pathways dataframe like Glycolysis / Gluconeogenesis to Glycolysis and Pentose phosphate pathway to PPP.

  • **named_lists are named lists where each argument is a list of reaction IDs and the argument name represents the pathway name.
named_lists = {
   "Glycolysis": Glycolysis,
    "PPP": PPP
}

bigg_to_pathway_dict = dictionary_reaction_id_to_pathway(
    named_lists)

print(bigg_to_pathway_dict.get("GND"))
print(bigg_to_pathway_dict.get("ENO"))
print(bigg_to_pathway_dict.get("FBA"))
"Pentose phosphate pathway"
"Glycolysis / Gluconeogenesis"
"Multiple-Pathways"

The reaction_in_pathway_binary_matrix() function is used to create a new pandas dataframe with reactions as rows and different pathways as columns. The corresponding cell of the dataframe will show if a reaction belongs to a certain pathway (1) or not (0). If a reaction belongs to more than one pathways, then the column: Multiple-Pathways is created and the reaction matching this will only get True (1) there and not in the individual pathway columns (e.g. 1 in Multiple-Pathways, 0 in Glycolysis and 0 in PPP).

  • reaction_id_to_pathway_dict is dictionary mapping reaction IDs to pathway names created with the dictionary_reaction_id_to_pathway function
binary_df = reaction_in_pathway_binary_matrix(
    reaction_id_to_pathway_dict = bigg_to_pathway_dict)

The plot_reaction_in_pathway_heatmap() function is used to plot a heatmap of the binary_df created from the reaction_in_pathway_binary_matrix() function to better illustrate the connection between reactions and pathways.

  • binary_df is a pandas dataFrame with binary values (0 or 1)
  • font_size is the font size for axis labels and ticks
  • fig_width is the width of the figure in pixels
  • fig_height is the height of the figure in pixels
  • title is the title of the plot
plot_reaction_in_pathway_heatmap(
    binary_df = binary_df, 
    font_size = 8, 
    fig_width = 600, 
    fig_height = 600, 
    title = "" )


Identify Loopy Reactions

The loops_enumeration_from_fva() function computes twice a Flux Variability Analysis (with loopless parameter to False and True) and identifies possible loopy reaction in given model and returns them in a list. Each item in the list is a tuple of a string (reaction ID) and a float (loopy score returned from the function)

  • ec_cobra_model is a cobra model object
  • fraction_of_optimum is a float variable that defines the fraction_of_optimum parameter used in flux_variability_analysis
loopy_reactions_fva = loops_enumeration_from_fva(
    ec_cobra_model = ec_cobra_model, 
    fraction_of_optimum = 0.999)

print(loopy_reactions_fva)
[('SUCDi', 994.7794007141794), ('FRD7', 995.0539767141795)]

The loopy_reactions_turned_off_in_pfba() function identifies which reactions from the loopy reactions set calculated with the loops_enumeration_from_fva() function can be turned off when performing a loopless_solution applied to pFBA results

  • loopy_cobra_model is a cobra model object possibly containing loops (default model without any preproccess)
  • loopy_reactions is a list with loopy_reactions with reactions classified as loopy
  • tol is a tolerance value used to classify a numeric change as important or not
turned_off_reactions = loopy_reactions_turned_off_in_pfba(
    loopy_cobra_model = ec_cobra_model, 
    loopy_reactions = loopy_reactions, 
    tol = 1e-6)

print(turned_off_reactions)
['FRD7']

Differential Flux Analysis

The significantly_altered_reactions() function takes as input at least two (different) flux sampling datasets to compare and identifies significantly altered reactions. It performs a Kolmogorov-Smirnov (KS) non-parametric test and corrects p-value for multiple comparisons. It additionally calculates a fold change that together with the p-value classifies reactions as significantly altered or not. It returns two lists, one with the significantly changed (significant_diff_reactions) and one with the not significantly changed (not_significant_diff_reactions) reactions. Also, two dictionaries, one mapping reaction IDs to corrected p_values (pval_dict) and one mapping reactions IDs to fold change values (fold_change_dict)

  • conditions is a list of different flux sampling arrays
  • selected_comparisons is a list showing which conditions to compare (useful when comparing more than two sampling arrays)
  • cobra_model is a cobra model object
  • fold_change_cutoff is a cutoff for fold-change to consider two distributions significantly different
  • std_cutoff is a cutoff to ensure distributions that are compared are not fixed to a certain value
conditions = [samples_condition_1, samples_condition_2]
selected_comparisons = [(0, 1)]

(significant_diff_reactions,
 not_significant_diff_reactions,
 pval_dict,
 fold_change_dict) = significantly_altered_reactions(
    conditions = conditions, 
    selected_comparisons = selected_comparisons, 
    cobra_model = ec_cobra_model,
    fold_change_cutoff = 0.6,
    std_cutoff = 1e-3)

The plot_volcano() function creates a volcano plot to show results from differential flux analysis. Volcano plot has Fold Change on the x-axis and -log10(p_value) on the y-axis. Users can provide a reactions list in the annotate parameter, to show reaction IDs on the plot. Also, lines showing the significance cutoffs may optionally be added, when providing p_value_cutoff, fold_change_cutoff and show_cutoff_lines.

  • pval_dict is a dictionary mapping reaction ID to corrected p-value.
  • fold_change_dict is a dictionary mapping reaction ID to fold-change value.
  • p_value_cutoff is a significance threshold for p-value (to appear in the plot).
  • fold_change_cutoff is a threshold for fold-change (to appear in the plot).
  • annotate is a list of reaction IDs to annotate on the plot.
  • width is the width of the figure in pixels.
  • height is the height of the figure in pixels.
  • title is the title of the plot.
  • show_cutoff_lines is a boolean variable for whether to draw p-value and fold-change cutoff lines.
reactions_to_annotate = ["NH4t"]

plot_volcano(
    pval_dict = pval_dict,
    fold_change_dict = fold_change_dict,
    p_value_cutoff = 0.05,
    fold_change_cutoff = 0.6,
    annotate = reactions_to_annotate,
    width = 800,
    height = 600,
    title = "",
    show_cutoff_lines = True)


The hypergeometric_test_pathway_enrichment() function performs a hypergeometric test to find significantly affected pathways between our sampling conditions. It also calculated fold enrichment and p_values useful for filtering significant changes

  • significant_reactions is a list of reaction IDs with altered flux.
  • model_reactions is a list of all reaction IDs considered in the analysis.
  • reaction_to_pathways is a dictinary mapping reaction ID -> list of pathway names.
hypergeometric_pathway_enrichment_df = hypergeometric_test_pathway_enrichment(
    significant_reactions = significant_diff_reactions, 
    model_reactions = ec_cobra_reaction_ids, 
    reaction_to_pathways = reaction_to_pathways)

The plot_pathway_enrichment() function takes as input the hypergeometric_pathway_enrichment_df dataframe created with the hypergeometric_test_pathway_enrichment() function and plots the enrichment results in a bubble plot. Bubble size stands for pathway size (number of reactions) and reaction colour stands for -log10(p-value)

  • hypergeometric_enrichment_df is a dataframe from the hypergeometric_test_pathway_enrichment function.
  • pval_threshold is a significance p value threshold for filtering.
  • use_fdr is a boolean for whether to use ‘fdr’ or ‘pval’ for filtering.
  • font_size is the font size for plot labels and title.
  • width is the width of the figure in pixels.
  • height is the height of the figure in pixels.
plot_pathway_enrichment(
    hypergeometric_enrichment_df = hypergeometric_pathway_enrichment_df, 
    pval_threshold = 2,
    use_fdr = True,
    font_size = 14,
    width = 900,
    height = 600)


References

  • Ebrahim, A.; Lerman, J. A.; Palsson, B. O.; Hyduke, D. R. COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Systems Biology 2013, 7 (1). https://doi.org/10.1186/1752-0509-7-74.

  • Apostolos Chalkis; Vissarion Fisikopoulos; Tsigaridas, E.; Haris Zafeiropoulos. Dingo: A Python Package for Metabolic Flux Sampling. Bioinformatics Advances 2024, 4 (1). https://doi.org/10.1093/bioadv/vbae037.

Updated: