Setting up experiments to measure energy spent by an algorithm is an extremely complex task. Our initial approach was to run separated baseline and workload experiments, and subtracting the former from the latter to extract actual workload (our Brave New Algorithm (Merelo, Merelo, and GarcıÌa-Valdez 2022)) energy expenses. The underlying assumption is that measured energy reflects, effectively, what the algorithm is spending and that it will change with the different algorithm parameters. Let us find out, in this expanded paper, how much is the case with the data we used in the paper.
lion_wl_118 <- read.csv("data/lion-1.11.8-bna-fix-rand.csv")
lion_wl_118$alpha <- NULL # deletes unused column
lion_wl_118$diff_fitness <- NULL # focus on energy here
plot_relative_impact <- function( time_model ) {
tidy_model <- tidy( time_model) %>%
# Remove the intercept because it usually overshadows the predictors
filter(term != "(Intercept)") %>%
mutate(
# Create a boolean column for statistical significance (p < 0.05)
is_significant = p.value < 0.05,
# Clean up the term names: replace ":" with " Ă " for better readability
term = str_replace_all(term, ":", " Ă ")
)
# 2. Create the Lollipop Chart plotting the t-values
plot_output <- ggplot(tidy_model, aes(x = statistic, y = reorder(term, statistic))) +
# Add a dashed line at 0 for reference (using linewidth instead of size)
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.8) +
# Draw the lollipop sticks (using linewidth)
geom_segment(aes(x = 0, xend = statistic, y = term, yend = term, color = is_significant),
linewidth = 1) +
# Draw the lollipop heads
geom_point(aes(color = is_significant), size = 3.5) +
# Custom colors for significance
scale_color_manual(values = c("TRUE" = "#0072B2", "FALSE" = "#D55E00"),
labels = c("Not Significant", "Significant (p < 0.05)")) +
# Labels and theming
labs(
title = "Relative Impact of Predictors on 'Seconds'",
subtitle = "Comparing t-values to account for the vastly different scales of variables",
x = "t-value (Signal-to-Noise Ratio)",
y = NULL,
color = "Model Term Status:"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "bottom",
panel.grid.minor = element_blank(),
plot.title = element_text(face = "bold")
)
print(plot_output)
}We load the data from the experiment; in this case it is a run that uses Julia 1.11.8 and a series of fixes and improvements. Please check (Merelo-GuervĂłs et al. 2026) for details or take a look at the explainer version.
Operating conditions might account for a certain percentage of the results we are seeing. Let us first isolate co-dependent variables to indentify that dependence.
The measured energy will depend, first and foremost, on the time it takes to run the algorithm; more time will obviously consume more energy for many reasons. And running time will be dependent on the algorithm parameters. Let us first model that dependence to better understand how our algorithm spends energy.
lion_wl_118_time_model <- glm( seconds ~ dimension*population_size*max_gens + evaluations*generations, data=lion_wl_118)
library(broom)
library(stringr)
# 1. Tidy the model output using broom
tidy_model <- tidy(lion_wl_118_time_model) %>%
# Remove the intercept because it usually overshadows the predictors
filter(term != "(Intercept)") %>%
mutate(
# Create a boolean column for statistical significance (p < 0.05)
is_significant = p.value < 0.05,
# Clean up the term names: replace ":" with " Ă " for better readability
term = str_replace_all(term, ":", " Ă ")
)
# 2. Create the Lollipop Chart plotting the t-values
ggplot(tidy_model, aes(x = statistic, y = reorder(term, statistic))) +
# Add a dashed line at 0 for reference (using linewidth instead of size)
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.8) +
# Draw the lollipop sticks (using linewidth)
geom_segment(aes(x = 0, xend = statistic, y = term, yend = term, color = is_significant),
linewidth = 1) +
# Draw the lollipop heads
geom_point(aes(color = is_significant), size = 3.5) +
# Custom colors for significance
scale_color_manual(values = c("TRUE" = "#0072B2", "FALSE" = "#D55E00"),
labels = c("Not Significant", "Significant (p < 0.05)")) +
# Labels and theming
labs(
title = "Relative Impact of Predictors on 'Seconds'",
subtitle = "Comparing t-values",
x = "t-value (Signal-to-Noise Ratio)",
y = NULL,
color = "Model Term Status:"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "bottom",
panel.grid.minor = element_blank(),
plot.title = element_text(face = "bold")
)Fig. 1 â How is running time dependent on the algorithm variables?
In this first stage, we model time to have an idea of what is the precise dependence on algorithm parameters (dimension, population size, maximum number of generations without change to stop) and algorithm outcomes (number of generations until stopping and evaluations performed).
We have plotted here the t-value of the model coefficients, which represents how much the result (number of seconds) is correlated, positively or negatively, with the different combination of parameters. We find a significant correlation for several individual parameters (max_gens and evaluations) and combinations (population_size x max_gens, and the three interacting algorithm parameters with the maximum positive correlation, dimension x population_size x max_gens).
But we need to check what percentage of the variance they explain.
library(stringr)
# 1. Run Analysis of Deviance to get the variance explained by each term
# We use tidy() to immediately format the output into a clean dataframe
anova_tidy <- tidy(anova(lion_wl_118_time_model))
# 2. Extract deviance totals from the model object
null_dev <- lion_wl_118_time_model$null.deviance
resid_dev <- lion_wl_118_time_model$deviance
# 3. Calculate percentages and categorize the terms
variance_data <- anova_tidy %>%
# Remove the intercept/NULL row
filter(!is.na(term) & term != "NULL") %>%
select(term, deviance) %>%
# Calculate the percentage of total variance explained
mutate(pct_variance = (deviance / null_dev) * 100) %>%
# Make the interaction terms readable
mutate(term = str_replace_all(term, ":", " Ă ")) %>%
# Add the Unexplained variance back in as its own row
bind_rows(
tibble(
term = "Unexplained Variance (Residuals)",
deviance = resid_dev,
pct_variance = (resid_dev / null_dev) * 100
)
) %>%
# Categorize into the groups you specified
mutate(
category = case_when(
term == "Unexplained Variance (Residuals)" ~ "Unexplained",
str_detect(term, "evaluations|generations") ~ "Outcomes / Runtime Metrics",
TRUE ~ "Algorithm Parameters"
)
)
# 4. Generate the Visualization
ggplot(variance_data, aes(x = pct_variance, y = reorder(term, pct_variance), fill = category)) +
geom_col(width = 0.7) +
# Add the percentage labels at the end of each bar
geom_text(aes(label = sprintf("%.2f%%", pct_variance)),
hjust = -0.15, size = 4, color = "black") +
# Color-code by the specified categories
scale_fill_manual(values = c(
"Algorithm Parameters" = "#0072B2", # Blue
"Outcomes / Runtime Metrics" = "#D55E00", # Orange
"Unexplained" = "#CCCCCC" # Grey
)) +
# Expand the x-axis slightly so the text labels don't get cut off
scale_x_continuous(limits = c(0, max(variance_data$pct_variance) * 1.15)) +
labs(
title = "Percentage of Variance Explained by Each Factor",
subtitle = "Decomposition of Deviance from the GLM",
x = "% of Total Variance Explained",
y = NULL,
fill = "Variable Group:"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "bottom",
panel.grid.major.y = element_blank(),
plot.title = element_text(face = "bold")
)Fig. 1 â How is running time dependent on the algorithm variables?
And this is what we find, 77.6163972% of the variance canât be explained by the algorithm parameters. This should be expected: part of it will be the system and framework overhead, and a part will be also explained by the specific algorithm implementation.
There is an issue here, however; the fact that
evaluations and generations should be
dependent on the algorithm parameters and problem size. The leverage we
have to optimize energy for a specific implementation are only those
parameters. So let us try and see how time behaves in terms of those
parameters only, and we will use a Generalized Additive Model
# 1. Preparar los datos tratando las variables de 2 niveles como factores
lion_wl_118_clean <- lion_wl_118 %>%
mutate(
dimension = as.factor(dimension),
population_size = as.factor(population_size),
max_gens = as.factor(max_gens)
)
strict_factor_model <- glm(
formula = seconds ~ dimension * population_size * max_gens,
data = lion_wl_118_clean
)
summary(strict_factor_model)##
## Call:
## glm(formula = seconds ~ dimension * population_size * max_gens,
## data = lion_wl_118_clean)
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 8.8753366 0.0048662 1823.877
## dimension5 -0.0086926 0.0068818 -1.263
## population_size400 0.0023670 0.0068818 0.344
## max_gens25 0.0172104 0.0068818 2.501
## dimension5:population_size400 0.0094696 0.0097324 0.973
## dimension5:max_gens25 0.0002257 0.0097324 0.023
## population_size400:max_gens25 -0.0197788 0.0097324 -2.032
## dimension5:population_size400:max_gens25 0.0328759 0.0137637 2.389
## Pr(>|t|)
## (Intercept) <2e-16 ***
## dimension5 0.2078
## population_size400 0.7312
## max_gens25 0.0131 *
## dimension5:population_size400 0.3316
## dimension5:max_gens25 0.9815
## population_size400:max_gens25 0.0433 *
## dimension5:population_size400:max_gens25 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0007103952)
##
## Null deviance: 0.20088 on 239 degrees of freedom
## Residual deviance: 0.16481 on 232 degrees of freedom
## AIC: -1049
##
## Number of Fisher Scoring iterations: 2
# This is not used, but can help
summary_effects <- lion_wl_118_clean %>%
group_by(dimension, population_size) %>%
summarise(mean_seconds = mean(seconds), .groups = "drop")
plot_relative_impact( strict_factor_model)Generalized additive model on algorithm parameters and problem size
We have also factored those parameters to use the two different values that we have for each one of them. This is also interesting because we can see what specific values have an impact on running time, of which max_gens = 25 have a maximum positive impact, population_size = 400 with max_gens=25 a negative one, meaning that that combination will make it take less time to run; the fact that that happens independently of problem size is interesting by itself.
Time is a factor in energy consumption, but we are really interested in the influence of those parameters on the energy consumed. Weâll get to that next.
Can we learn something from energy consumption, even outside our algorithm?
We use residuals to eliminate the part of the running time that is explained by the algorithm parameters, and will leave only the residuals to inject into the energy model. This is what we do next.
lion_wl_118_clean$residual_seconds <- residuals( strict_factor_model )
lion_energy_model <- glm(
formula = PKG ~ dimension * population_size * max_gens + residual_seconds,
data = lion_wl_118_clean
)
clean_lion_energy_model_coefficients <- tidy( lion_energy_model )
clean_lion_energy_model_coefficients %>% mutate(
# Create a boolean column for statistical significance (p < 0.05)
is_significant = p.value < 0.05,
# Clean up the term names: replace ":" with " Ă " for better readability
term = str_replace_all(term, ":", " Ă ")
) -> clean_lion_energy_model_coefficients
library(ggplot2)
library(dplyr)
library(stringr)
# Assuming your data frame is named 'tidy_energy_model'
plot_data <- clean_lion_energy_model_coefficients %>%
filter(term != "(Intercept)") %>% # Always drop the massive intercept for scale
mutate(
term = str_replace_all(term, "Ă", " Ă "),
term = str_replace_all(term, "dimension5", "Dimension 5"),
term = str_replace_all(term, "population_size400", "Population 400"),
term = str_replace_all(term, "max_gens25", "Max Gens 25"),
term = str_replace_all(term, "residual_seconds", "Unexpected Delay (Residuals)"),
# Create a strict ordering: put the significant term at the top
term = factor(term, levels = term[order(is_significant, statistic)])
)
ggplot(plot_data, aes(x = statistic, y = term)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
# Use linewidth instead of size for segments
geom_segment(aes(x = 0, xend = statistic, y = term, yend = term, color = is_significant),
linewidth = 1.2) +
geom_point(aes(color = is_significant), size = 4) +
scale_color_manual(values = c("FALSE" = "#B0B0B0", "TRUE" = "#D55E00")) +
labs(
title = "Impact on Total CPU Energy (PKG)",
subtitle = "Only essential running time generate a significant energy footprint",
x = "t-value (Signal-to-Noise Ratio)",
y = NULL
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "none", # Kill the legend to reduce clutter
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
axis.text.y = element_text(face = ifelse(levels(plot_data$term) == "Essential Running Time (Residuals)", "bold", "plain"))
)We can visualize these coefficients as above. Baseline is, the only significant term in the energy model (here modeled using PKG, energy consumed by the package) is the essential part of the running time, the one that is not affected by algorithm parameters. The rest of the measurements indicate that whatever additional energy expenses are incurred by running under different parametrizations it is not enough to overcome non-controlled changes in energy consumption caused by the operating context.
We can try and check of there is some structure in the noise so that it can be somehow subtracted to work with it.
library(ggplot2)
library(broom)
# 1. Extract the residuals and fitted values from your time model
# (Make sure to use the model where you dropped evaluations and generations!)
diagnostics_data <- augment(lion_energy_model)
# 2. Plot the Noise (Residuals) against the Predictions (Fitted)
ggplot(diagnostics_data, aes(x = .fitted, y = .resid)) +
# Draw the actual data points
geom_point(alpha = 0.6, size = 3, color = "#444444") +
# Draw the "Zero Error" baseline (using linewidth as requested)
geom_hline(yintercept = 0, linetype = "dashed", color = "#D55E00", linewidth = 1) +
# Add a smoothing line to help detect hidden curves in the noise
geom_smooth(method = "loess", se = FALSE, color = "#0072B2", linewidth = 1.2) +
labs(
title = "Diagnostic: Is the Noise Organized?",
subtitle = "Residuals vs. Fitted Values",
x = "Predicted Total PKG energy (J)",
y = "Residual Error (Actual - Predicted)"
) +
theme_bw(base_size = 14) +
theme(plot.title = element_text(face = "bold"))what we see
However, we can se it is not. Energy varies within a 20-Joule range, and within that range it is more or less white noise, with error independent on the actual amount of energy.
This was still a very initial stage of the research. We tried to apply a static view of the system, which unfortunately does not seem to be true. The main takeaway message is that with these experiments and after analyzing data we have a better understanding on how energy consumption for algorithms work in reality, which will help us design better experimental methodologies to do it in the future. In this case, the range of variation induced by algorithm parameters seems to be well below the variation of operating conditions in the system we are measuring our algorithm.
Since running time has the greatest impact in energy consumption, any energy reduction strategy should focus first on reducing that running time, using any of the tools that are available to the researcher, from programming language through sub-algorithmic improvements to programming enhancements. After all, the question that we are asking in this specific report might not be the right one: algorithmic parameters are the last step in a scale of algorithm greening⊠and it might not be available after all.
There are different ways to overcome this: - Repeating the whole set of measurements several times so that different operating conditions will help understand how much additional energy is really being spent. This will also increase statistical significance of the measures. Any of the two factors will help differentiate changes in consumption from simple system noise. - Look at different methodologies for running our workload, possibly making it closer to baseline measurements so that, by lowering the intercept, the percentage of change with respect to the system is raised. - Use more sensor of operating conditions so that what causes energy consumption can be more easily modeled and the impact of algorithm parameters can be discerned. - Extending the range of algorithm parameters tested, on top of the other measures, will also help. We have only tested two levels for 3 variables; other variables might have a bigger impact. These variables will have an impact on the number of evaluations and generations, and the former has a coefficient of in the time model. A difference of several orders of magnitude need to occur for a change of evaluations to be noticed in the energy consumption pattern.
Research progresses by understanding the system under study. With these results, and those in the paper, we have progressed towards this understanding.
This document is a companion to the LION-26 paper
by Merelo & Merelo-Molina. All data and code are available at https://github.com/JJ/brave-new-green-algorithm.
Research supported by the Ministerio español de EconomĂa y Competitividad under project PID2023-147409NB-C21.
Source: JJ Merelo, Cecilia Merelo Molina Best practices
in measuring energy consumption in population-based
metaheuristics, in Proceedings OLAâ26 International Conference
on Optimization and Learning, pp 183-194, available
online. Please check references.bib for the BibTeX
entry.