library(ggplot2)
library(dplyr)

Context

Setting up experiments to measure energy spent by an algorithm is an extremely complex task. Our initial approach was to run separated baseline and workload experiments, and subtracting the former from the latter to extract actual workload (our Brave New Algorithm (Merelo, Merelo, and Garcı́a-Valdez 2022)) energy expenses. The underlying assumption is that measured energy reflects, effectively, what the algorithm is spending and that it will change with the different algorithm parameters. Let us find out, in this expanded paper, how much is the case with the data we used in the paper.


Data preparation

lion_wl_118 <- read.csv("data/lion-1.11.8-bna-fix-rand.csv")
lion_wl_118$alpha <- NULL # deletes unused column
lion_wl_118$diff_fitness <- NULL # focus on energy here

plot_relative_impact <- function( time_model ) {
  tidy_model <- tidy( time_model) %>%
  # Remove the intercept because it usually overshadows the predictors
  filter(term != "(Intercept)") %>%
  mutate(
    # Create a boolean column for statistical significance (p < 0.05)
    is_significant = p.value < 0.05,
    # Clean up the term names: replace ":" with " × " for better readability
    term = str_replace_all(term, ":", " × ")
  )

# 2. Create the Lollipop Chart plotting the t-values
plot_output <- ggplot(tidy_model, aes(x = statistic, y = reorder(term, statistic))) +
  # Add a dashed line at 0 for reference (using linewidth instead of size)
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.8) +
  
  # Draw the lollipop sticks (using linewidth)
  geom_segment(aes(x = 0, xend = statistic, y = term, yend = term, color = is_significant),
               linewidth = 1) +
  
  # Draw the lollipop heads
  geom_point(aes(color = is_significant), size = 3.5) +
  
  # Custom colors for significance
  scale_color_manual(values = c("TRUE" = "#0072B2", "FALSE" = "#D55E00"),
                     labels = c("Not Significant", "Significant (p < 0.05)")) +
  
  # Labels and theming
  labs(
    title = "Relative Impact of Predictors on 'Seconds'",
    subtitle = "Comparing t-values to account for the vastly different scales of variables",
    x = "t-value (Signal-to-Noise Ratio)",
    y = NULL,
    color = "Model Term Status:"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )

  print(plot_output)
}

We load the data from the experiment; in this case it is a run that uses Julia 1.11.8 and a series of fixes and improvements. Please check (Merelo-GuervĂłs et al. 2026) for details or take a look at the explainer version.


Stage 1: How much does time depend on the algorithm variables

Operating conditions might account for a certain percentage of the results we are seeing. Let us first isolate co-dependent variables to indentify that dependence.

The measured energy will depend, first and foremost, on the time it takes to run the algorithm; more time will obviously consume more energy for many reasons. And running time will be dependent on the algorithm parameters. Let us first model that dependence to better understand how our algorithm spends energy.

lion_wl_118_time_model <- glm( seconds ~ dimension*population_size*max_gens + evaluations*generations, data=lion_wl_118)

library(broom)
library(stringr)

# 1. Tidy the model output using broom
tidy_model <- tidy(lion_wl_118_time_model) %>%
  # Remove the intercept because it usually overshadows the predictors
  filter(term != "(Intercept)") %>%
  mutate(
    # Create a boolean column for statistical significance (p < 0.05)
    is_significant = p.value < 0.05,
    # Clean up the term names: replace ":" with " × " for better readability
    term = str_replace_all(term, ":", " × ")
  )

# 2. Create the Lollipop Chart plotting the t-values
ggplot(tidy_model, aes(x = statistic, y = reorder(term, statistic))) +
  # Add a dashed line at 0 for reference (using linewidth instead of size)
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.8) +
  
  # Draw the lollipop sticks (using linewidth)
  geom_segment(aes(x = 0, xend = statistic, y = term, yend = term, color = is_significant),
               linewidth = 1) +
  
  # Draw the lollipop heads
  geom_point(aes(color = is_significant), size = 3.5) +
  
  # Custom colors for significance
  scale_color_manual(values = c("TRUE" = "#0072B2", "FALSE" = "#D55E00"),
                     labels = c("Not Significant", "Significant (p < 0.05)")) +
  
  # Labels and theming
  labs(
    title = "Relative Impact of Predictors on 'Seconds'",
    subtitle = "Comparing t-values",
    x = "t-value (Signal-to-Noise Ratio)",
    y = NULL,
    color = "Model Term Status:"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )
**Fig. 1** — How is running time dependent on the algorithm variables?

Fig. 1 — How is running time dependent on the algorithm variables?

In this first stage, we model time to have an idea of what is the precise dependence on algorithm parameters (dimension, population size, maximum number of generations without change to stop) and algorithm outcomes (number of generations until stopping and evaluations performed).

We have plotted here the t-value of the model coefficients, which represents how much the result (number of seconds) is correlated, positively or negatively, with the different combination of parameters. We find a significant correlation for several individual parameters (max_gens and evaluations) and combinations (population_size x max_gens, and the three interacting algorithm parameters with the maximum positive correlation, dimension x population_size x max_gens).

But we need to check what percentage of the variance they explain.

library(stringr)

# 1. Run Analysis of Deviance to get the variance explained by each term
# We use tidy() to immediately format the output into a clean dataframe
anova_tidy <- tidy(anova(lion_wl_118_time_model))

# 2. Extract deviance totals from the model object
null_dev <- lion_wl_118_time_model$null.deviance
resid_dev <- lion_wl_118_time_model$deviance

# 3. Calculate percentages and categorize the terms
variance_data <- anova_tidy %>%
  # Remove the intercept/NULL row
  filter(!is.na(term) & term != "NULL") %>%
  select(term, deviance) %>%
  # Calculate the percentage of total variance explained
  mutate(pct_variance = (deviance / null_dev) * 100) %>%
  # Make the interaction terms readable
  mutate(term = str_replace_all(term, ":", " × ")) %>%
  # Add the Unexplained variance back in as its own row
  bind_rows(
    tibble(
      term = "Unexplained Variance (Residuals)",
      deviance = resid_dev,
      pct_variance = (resid_dev / null_dev) * 100
    )
  ) %>%
  # Categorize into the groups you specified
  mutate(
    category = case_when(
      term == "Unexplained Variance (Residuals)" ~ "Unexplained",
      str_detect(term, "evaluations|generations") ~ "Outcomes / Runtime Metrics",
      TRUE ~ "Algorithm Parameters"
    )
  )

# 4. Generate the Visualization
ggplot(variance_data, aes(x = pct_variance, y = reorder(term, pct_variance), fill = category)) +
  geom_col(width = 0.7) +
  
  # Add the percentage labels at the end of each bar
  geom_text(aes(label = sprintf("%.2f%%", pct_variance)), 
            hjust = -0.15, size = 4, color = "black") +
  
  # Color-code by the specified categories
  scale_fill_manual(values = c(
    "Algorithm Parameters" = "#0072B2",        # Blue
    "Outcomes / Runtime Metrics" = "#D55E00",  # Orange
    "Unexplained" = "#CCCCCC"                  # Grey
  )) +
  
  # Expand the x-axis slightly so the text labels don't get cut off
  scale_x_continuous(limits = c(0, max(variance_data$pct_variance) * 1.15)) +
  
  labs(
    title = "Percentage of Variance Explained by Each Factor",
    subtitle = "Decomposition of Deviance from the GLM",
    x = "% of Total Variance Explained",
    y = NULL,
    fill = "Variable Group:"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid.major.y = element_blank(),
    plot.title = element_text(face = "bold")
  )
**Fig. 1** — How is running time dependent on the algorithm variables?

Fig. 1 — How is running time dependent on the algorithm variables?


And this is what we find, 77.6163972% of the variance can’t be explained by the algorithm parameters. This should be expected: part of it will be the system and framework overhead, and a part will be also explained by the specific algorithm implementation.

There is an issue here, however; the fact that evaluations and generations should be dependent on the algorithm parameters and problem size. The leverage we have to optimize energy for a specific implementation are only those parameters. So let us try and see how time behaves in terms of those parameters only, and we will use a Generalized Additive Model

# 1. Preparar los datos tratando las variables de 2 niveles como factores
lion_wl_118_clean <- lion_wl_118 %>%
  mutate(
    dimension = as.factor(dimension),
    population_size = as.factor(population_size),
    max_gens = as.factor(max_gens)
  )

strict_factor_model <- glm(
  formula = seconds ~ dimension * population_size * max_gens, 
  data = lion_wl_118_clean
)

summary(strict_factor_model)
## 
## Call:
## glm(formula = seconds ~ dimension * population_size * max_gens, 
##     data = lion_wl_118_clean)
## 
## Coefficients:
##                                            Estimate Std. Error  t value
## (Intercept)                               8.8753366  0.0048662 1823.877
## dimension5                               -0.0086926  0.0068818   -1.263
## population_size400                        0.0023670  0.0068818    0.344
## max_gens25                                0.0172104  0.0068818    2.501
## dimension5:population_size400             0.0094696  0.0097324    0.973
## dimension5:max_gens25                     0.0002257  0.0097324    0.023
## population_size400:max_gens25            -0.0197788  0.0097324   -2.032
## dimension5:population_size400:max_gens25  0.0328759  0.0137637    2.389
##                                          Pr(>|t|)    
## (Intercept)                                <2e-16 ***
## dimension5                                 0.2078    
## population_size400                         0.7312    
## max_gens25                                 0.0131 *  
## dimension5:population_size400              0.3316    
## dimension5:max_gens25                      0.9815    
## population_size400:max_gens25              0.0433 *  
## dimension5:population_size400:max_gens25   0.0177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.0007103952)
## 
##     Null deviance: 0.20088  on 239  degrees of freedom
## Residual deviance: 0.16481  on 232  degrees of freedom
## AIC: -1049
## 
## Number of Fisher Scoring iterations: 2
# This is not used, but can help
summary_effects <- lion_wl_118_clean %>%
  group_by(dimension, population_size) %>%
  summarise(mean_seconds = mean(seconds), .groups = "drop")

plot_relative_impact( strict_factor_model)
Generalized additive model on algorithm parameters and problem size

Generalized additive model on algorithm parameters and problem size

We have also factored those parameters to use the two different values that we have for each one of them. This is also interesting because we can see what specific values have an impact on running time, of which max_gens = 25 have a maximum positive impact, population_size = 400 with max_gens=25 a negative one, meaning that that combination will make it take less time to run; the fact that that happens independently of problem size is interesting by itself.

Time is a factor in energy consumption, but we are really interested in the influence of those parameters on the energy consumed. We’ll get to that next.

View 2 — Understanding noise, that is, operating context

Can we learn something from energy consumption, even outside our algorithm?

We use residuals to eliminate the part of the running time that is explained by the algorithm parameters, and will leave only the residuals to inject into the energy model. This is what we do next.

lion_wl_118_clean$residual_seconds <- residuals( strict_factor_model )

lion_energy_model <- glm(
  formula = PKG ~ dimension * population_size * max_gens + residual_seconds, 
  data = lion_wl_118_clean
)

clean_lion_energy_model_coefficients <- tidy( lion_energy_model )

clean_lion_energy_model_coefficients %>% mutate(
    # Create a boolean column for statistical significance (p < 0.05)
    is_significant = p.value < 0.05,
    # Clean up the term names: replace ":" with " × " for better readability
    term = str_replace_all(term, ":", " × ")
  ) -> clean_lion_energy_model_coefficients

library(ggplot2)
library(dplyr)
library(stringr)

# Assuming your data frame is named 'tidy_energy_model'
plot_data <- clean_lion_energy_model_coefficients %>%
  filter(term != "(Intercept)") %>% # Always drop the massive intercept for scale
  mutate(
    term = str_replace_all(term, "×", " × "),
    term = str_replace_all(term, "dimension5", "Dimension 5"),
    term = str_replace_all(term, "population_size400", "Population 400"),
    term = str_replace_all(term, "max_gens25", "Max Gens 25"),
    term = str_replace_all(term, "residual_seconds", "Unexpected Delay (Residuals)"),
    # Create a strict ordering: put the significant term at the top
    term = factor(term, levels = term[order(is_significant, statistic)])
  )

ggplot(plot_data, aes(x = statistic, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  # Use linewidth instead of size for segments
  geom_segment(aes(x = 0, xend = statistic, y = term, yend = term, color = is_significant), 
               linewidth = 1.2) +
  geom_point(aes(color = is_significant), size = 4) +
  scale_color_manual(values = c("FALSE" = "#B0B0B0", "TRUE" = "#D55E00")) +
  labs(
    title = "Impact on Total CPU Energy (PKG)",
    subtitle = "Only essential running time generate a significant energy footprint",
    x = "t-value (Signal-to-Noise Ratio)",
    y = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none", # Kill the legend to reduce clutter
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    axis.text.y = element_text(face = ifelse(levels(plot_data$term) == "Essential Running Time (Residuals)", "bold", "plain"))
  )

We can visualize these coefficients as above. Baseline is, the only significant term in the energy model (here modeled using PKG, energy consumed by the package) is the essential part of the running time, the one that is not affected by algorithm parameters. The rest of the measurements indicate that whatever additional energy expenses are incurred by running under different parametrizations it is not enough to overcome non-controlled changes in energy consumption caused by the operating context.

We can try and check of there is some structure in the noise so that it can be somehow subtracted to work with it.

library(ggplot2)
library(broom)

# 1. Extract the residuals and fitted values from your time model
# (Make sure to use the model where you dropped evaluations and generations!)
diagnostics_data <- augment(lion_energy_model)

# 2. Plot the Noise (Residuals) against the Predictions (Fitted)
ggplot(diagnostics_data, aes(x = .fitted, y = .resid)) +
  # Draw the actual data points
  geom_point(alpha = 0.6, size = 3, color = "#444444") +
  
  # Draw the "Zero Error" baseline (using linewidth as requested)
  geom_hline(yintercept = 0, linetype = "dashed", color = "#D55E00", linewidth = 1) +
  
  # Add a smoothing line to help detect hidden curves in the noise
  geom_smooth(method = "loess", se = FALSE, color = "#0072B2", linewidth = 1.2) +
  
  labs(
    title = "Diagnostic: Is the Noise Organized?",
    subtitle = "Residuals vs. Fitted Values",
    x = "Predicted Total PKG energy (J)",
    y = "Residual Error (Actual - Predicted)"
  ) +
  theme_bw(base_size = 14) +
  theme(plot.title = element_text(face = "bold"))
what we see

what we see

However, we can se it is not. Energy varies within a 20-Joule range, and within that range it is more or less white noise, with error independent on the actual amount of energy.


Takeaway messages

This was still a very initial stage of the research. We tried to apply a static view of the system, which unfortunately does not seem to be true. The main takeaway message is that with these experiments and after analyzing data we have a better understanding on how energy consumption for algorithms work in reality, which will help us design better experimental methodologies to do it in the future. In this case, the range of variation induced by algorithm parameters seems to be well below the variation of operating conditions in the system we are measuring our algorithm.

Since running time has the greatest impact in energy consumption, any energy reduction strategy should focus first on reducing that running time, using any of the tools that are available to the researcher, from programming language through sub-algorithmic improvements to programming enhancements. After all, the question that we are asking in this specific report might not be the right one: algorithmic parameters are the last step in a scale of algorithm greening
 and it might not be available after all.

There are different ways to overcome this: - Repeating the whole set of measurements several times so that different operating conditions will help understand how much additional energy is really being spent. This will also increase statistical significance of the measures. Any of the two factors will help differentiate changes in consumption from simple system noise. - Look at different methodologies for running our workload, possibly making it closer to baseline measurements so that, by lowering the intercept, the percentage of change with respect to the system is raised. - Use more sensor of operating conditions so that what causes energy consumption can be more easily modeled and the impact of algorithm parameters can be discerned. - Extending the range of algorithm parameters tested, on top of the other measures, will also help. We have only tested two levels for 3 variables; other variables might have a bigger impact. These variables will have an impact on the number of evaluations and generations, and the former has a coefficient of in the time model. A difference of several orders of magnitude need to occur for a change of evaluations to be noticed in the energy consumption pattern.

Research progresses by understanding the system under study. With these results, and those in the paper, we have progressed towards this understanding.


About

This document is a companion to the LION-26 paper

by Merelo & Merelo-Molina. All data and code are available at https://github.com/JJ/brave-new-green-algorithm.

Research supported by the Ministerio español de Economía y Competitividad under project PID2023-147409NB-C21.


Source: JJ Merelo, Cecilia Merelo Molina Best practices in measuring energy consumption in population-based metaheuristics, in Proceedings OLA’26 International Conference on Optimization and Learning, pp 183-194, available online. Please check references.bib for the BibTeX entry.

References

Merelo, Cecilia, Juan J Merelo, and Mario Garcı́a-Valdez. 2022. “A Brave New Algorithm to Maintain the Exploration/Exploitation Balance.” In New Perspectives on Hybrid Intelligent System Design Based on Fuzzy Logic, Neural Networks and Metaheuristics, 305–16. Springer.
Merelo-Guervós, Juan J., Cecilia Merelo-Molina, Pablo García-Sánchez, and Mario García-Valdez. 2026. “Is There a (Carbon-) Free Lunch? Energy/Performance Tradeoffs in Population-Based Metaheuristics.”