\documentclass[runningheads]{llncs}
\usepackage[T1]{fontenc}

\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{color}

\begin{document}

\title{Best practices in measuring energy consumption in population-based metaheuristics}
%
\titlerunning{Best practices in measuring energy consumption}

\author{JJ Merelo\inst{1,2}\orcidID{ 0000-0002-1385-9741 } \and Cecilia Merelo-Molina\inst{3}}

\authorrunning{Merelo and Merelo}

\institute{Department of Computer Engineering, Automatics and Robotics, University of Granada \email{jmerelo@ugr.es}\and
CITIC, UGR, Spain
\and
Zenzorrito, Granada, Spain}

\maketitle

\begin{abstract}
Although developing energy-aware software has been
a research topic for more than a decade, there are still no established
methodologies for measuring and benchmarking energy consumption of specific
algorithm implementations. This is especially true for population-based
metaheuristics, where the interaction between the algorithm and the underlying
system is complex and difficult to model. In this paper, we explore different
measurement methodologies using a specific population-based metaheuristic, the
Brave New Algorithm, implemented in the Julia programming language. We analyze
the impact of different measurement frequencies, as well as diverse ways of
separating baseline energy consumption from that of the actual workload. We also
explore some micro-optimizations of the implementation that can lead to
significant energy savings, and whose measurement is enabled by the increases in
precision brought by the methodology. Our results show that configuring the
measurement tool properly and planning the experiments in such a way that the
effects of the underlying hardware are mitigated helps achieve that degree of
precision; once these steps in the methodology have been made, we have proved
how working with the latest software version as well as understanding it to
perform micro-optimization can help achieve {\em greener} implementation of
population-based metaheuristics.

\keywords{Green computing  \and Energy profiling \and Metaheuristics \and Software Engineering.}
\end{abstract}

\section{Introduction}

There are several trends that have converged to influence the development and implementation of optimization algorithms in the last years: the end of Moore's law implies that you cannot optimize performance and energy consumption at the same time; sustainable development objectives aim for lowering the carbon footprint of human endeavors, including science, and finally the pervasiveness of machine learning and optimization techniques make the need for {\em greener} implementations more acute, looking to obtain the same results with less energy \cite{vankempen2024itseasygreenenergy}.

Specific methodologies for doing so, however, are still lacking, with every paper using different measurement strategies and experimental setups to measure and then compare energy-saving strategies. This is especially problematic in an area that is usually not so well covered in computer science: benchmarking and measurement. Although there have been some reports on best practices \cite{cruz21b}, the challenge of measuring the energy spent by specific processes is still present.

There are several reasons why this occurs, but there are two main ones: first, modern computer systems proactively manage energy consumption, so the interaction with user-level measures is not straightforward; second, the tooling available makes impossible to separate the actual workload from the overhead created not only by the system, but also by the runtime system of the language we are using, be it compiled (including just-in-time compiled, or compiled to bytecode) or interpreted \cite{en12112204}. Using baseline measurements which are then averaged causes issues such as negative workload measured energy \cite{low-level,lion24}. At the same time, hysteretic effects \cite{cotta25} make it difficult to reproduce results, since the physical conditions of the hardware platform will affect measurements in a non-deterministic way. All this is combined with the specific issues with the measurement tool we are using and its possible lack of precision or other kinds of errors.

In this paper we will try to address some of these issues, looking for a methodology that measures energy consumption in an population-based optimization algorithm called Brave New Algorithm \cite{merelo2022brave}, implemented in the Julia programming language; we will test the methodology with optimization measures that should be difficult to assess without these precise, reproducible measurements. The scripts that carry out the methodology will be released as open source, which once again enhances reproducibility.

The rest of the paper is organized as follows: the next Section describes the state of the art in the exploration of energy consumption of evolutionary algorithms and other metaheuristics. Section \ref{sec:bna} describes the algorithm and how it has been parametrized for the purpose of this paper. Since there is no fixed methodology for measuring energy consumption, we will show how it was done for this experiment and the trade-offs it implies, together with the results, in Section \ref{sec:results}; these results will be discussed in the last Section \ref{sec:discussion} along with our conclusions.


\section{State of the art}

Our focus has been on optimizing energy consumption in implementation of metaheuristics; in general, this research thrust shares many features with other optimization and machine learning algorithms. Large language models, for instance, have lately received intensified attention, with papers focused as much on the the optimization part as on the measurement axis \cite{10549890}. This falls within a context where calls for a better knowledge of energy consumption \cite{7155416}. More efficient software and more energy-aware optimizations have been made \cite{fonseca2019manifesto}. At the end of the day, energy-efficient software is better software and, as with any other implementation issue, understanding better how algorithm implementations spend their time and energy will lead to algorithmic insights that will, eventually, lead to better algorithms \cite{merelo2011implementation}.

The main issue, however, is still a lack of sound methodology, or even a set of best practices, for measuring energy consumption for specific algorithm implementations, not to mention specific functions within them. The challenges were already identified in \cite{von2018measuring}: benchmarking methodologies need to be fair, verifiable, usable but, overall, reproducible. \cite{10549890} presents a finer analysis of these problems, emphasizing the accessibility problem, the power-measuring architecture, and finally the bigger issue from the point of view of algorithms: separating the power consumed by the workload from the overhead caused by the rest of the system. Several recommendations for {\em green} software engineering have been issued \cite{cruz21b}, with an emphasis on reproducibility \cite{song2025reproducible}. This, indeed, might be the biggest issue. \cite{cotta25} shows that time-related, possibly hysteretic effects due to changes in the physical parameters of the hardware platform will affect the measurements taken by as much as 10\%. As a matter of fact, what we want in this kind of optimization is to ensure that energy consumption is measured and eventually lowered in regular lab conditions, so trying to fix physical conditions in different ways is not really an option.

In this paper, we will try to contribute to this line of research, looking mainly at improving reproducibility of results, but also at making the measurements of the workload of interest more precise.

\section{A Brave New Algorithm}\label{sec:bna}

Despite the scathing sentence by Sörensen \cite{sorensen} that most biologically inspired metaheuristics "take the field of metaheuristics a step backward", we can affirm that in the case of the Brave New Algorithm that is inspired on how selection and reproduction happened in novel Brave New World \cite{huxley2022brave}, we are not dealing with biological, but {\em literary} inspiration, so that does not apply. On the other hand, we are fully aware of the disadvantages of proposing a new terminology for a population-based evolutionary algorithm, so we will explain the algorithm used here in terms of the more widely known evolutionary algorithms, from which it basically departs on the reproduction scheme it uses \cite{blickle1996comparison}.

Since the Brave New Algorithm \cite{merelo2022brave} (BNA) uses mutation, crossover as {\em genetic} operators and fitness-based selection to find the optimum of a given function, represented as a {\em chromosome}, in this case a vector of real numbers. The original novel intervenes to inspire a reproduction restriction scheme that is geared towards having a finer control on the exploration/exploitation balance of the algorithm. This scheme distributes the population in different groups called, as in the book, {\em castes}, and they have different reproduction methods. After selection creating a reproductive pool using some kind of selection, new members will be generated depending on the caste:
\begin{itemize}
\item The $\alpha$ caste will generate every new generation via mutation and crossover.
\item The $\beta$ caste will pair one member of the $\alpha$ with one of them and generate new members via mutation and crossover.
\item $\delta$ and $\epsilon$ castes generate new individuals by mutation of other members of its own caste. The only difference between castes is that mutation rates can be set separately.
\item $\gamma$ generates new members by applying mutation repeatedly until a better individual is found.
\end{itemize}

All the new individuals of the population are ranked after being generated and assigned to a caste according to the percentage that has been assigned to every one of them; the $\alpha$ caste will hold the best $\%\alpha$ of the population, and so on.

The parametrization of the algorithm operators is as follows:\begin{itemize}
\item Mutation is applied every time a new individual is generated and changes 40\% of the chromosome elements. Since we are using low dimensions (3,5), this means that one or two elements will be generated randomly at most.
\item We use a two-point crossover.
\item The reproductive pool is created via binary tournament selection.
\end{itemize}

These are mainstream choices in evolutionary algorithms; for the time being our focus is on the energy measuring methodology as well as optimizing the performance of the framework.

Castes are, however, the main differentiation feature of this algorithm, with the percentage allotted to every caste being our main control parameter; these have some restrictions in, due to the way new members are generated: the beta caste needs to have twice as many members as the alpha caste \cite{bna25}. This is why we will use the percentage of population in the $\alpha$ caste as the main lever, as it can be used to generate the rest if we keep $\%\delta$ and $\%\epsilon$ fixed to a low value. In practice, then, after setting $\%\alpha$, we will set $\%\beta = 2 * \%\alpha$ and $\%\gamma = 100\% - (\%\alpha + \%\beta + \%\delta + \%\epsilon)$.

The implementation of the algorithm, as well as all the parameters used, are available under a free license at \url{https://github.com/cecimerelo/BraveNewAlgorithm.jl}

\section{Experimental methodology and results}\label{sec:results}

In order to carry out these experiments, we have used a methodology that is like that used in our previous papers: \cite{low-level}. We employ {\sf pinpoint}, \cite{pinpoint}, a command line tool that taps the RAPL interface \cite{rapl} to measure energy consumption of a given process. The version we have used was compiled from commit {\tt dfee658}. We carry experiments in an AMD Ryzen 9 9950X 16-Core Processor, with Ubuntu Linux 25.04 with kernel version 6.14.0-37-generic. We have initially used Julia version $v1.11.7$. The fitness function used has been the Sphere function \cite{hansen2010comparing} from {\sf BlackBoxOptimizationBenchmarking.jl}.

We will first deal with RAPL sampling; previously, we have used the default {\sf pinpoint} value of 50ms. In general, it is advised to increase the sampling period to avoid errors in the estimation, so we have doubled it to 100ms.

<<ola.base.frequency, echo=FALSE, warning=F, message=F>>=
null_baseline_columns <- c("alpha", "max_gens", "different_seeds", "diff_fitness", "generations", "evaluations")
baseline_data_evoapps <- read.csv("data/evoapps-1.11.7-baseline-bna-baseline-16-Oct-11-08-20.csv")
baseline_data_evoapps[ null_baseline_columns ] <- NULL

baseline_data_ola_100s <- read.csv( "data/ola-base-ola-baseline-14-Dec-12-06-42.csv")
baseline_data_ola_100s$work <- "ola-baseline"
baseline_data_ola_100s[null_baseline_columns] <- NULL

initial_baseline_data <- rbind(baseline_data_evoapps, baseline_data_ola_100s)
initial_baseline_data$power <- initial_baseline_data$PKG / initial_baseline_data$seconds
library(ggplot2)
#ggplot(initial_baseline_data,aes(x=PKG,y=seconds,color=work))+geom_point() + theme_minimal() +
  # xlab("Package energy (Joules)") +
  # ylab("Time (seconds)")

library(dplyr)
initial_baseline_data %>% group_by(population_size, dimension, work) %>% summarise(
    mean_pkg = mean(PKG),
    sd_pkg = sd(PKG),
    mean_time = mean(seconds),
    sd_time = sd(seconds),
    mean_power = mean(power),
    sd_power = sd(power)
  ) -> initial_baseline_data_summary

# compute wilcoxon tests between work values for every population_size and dimension
wilcoxon_tests <- function( data, column_name = "PKG" ) {
  wilcoxon_results <- data.frame()
  for (pop_size in unique(data$population_size)) {
    for (dim in unique(data$dimension)) {
      data_subset <- data %>% filter(population_size == pop_size, dimension == dim)
      work_values <- unique(data_subset$work)
      if (length(work_values) == 2) {
        work1_data <- data_subset %>% filter(work == work_values[1]) %>% pull(.data[[column_name]])
        work2_data <- data_subset %>% filter(work == work_values[2]) %>% pull(.data[[column_name]])
        wilcox_test <- wilcox.test(work1_data, work2_data)
        wilcoxon_results <- rbind(wilcoxon_results, data.frame(
          population_size = pop_size,
          dimension = dim,
          work1 = work_values[1],
          work2 = work_values[2],
          p_value = wilcox_test$p.value
        ))
      }
    }
  }
return(wilcoxon_results)
}

wilcoxon_initial_baseline <- wilcoxon_tests( initial_baseline_data )

initial_baseline_data_summary$sd_and_mean_pkg <- paste0("$",
  round(initial_baseline_data_summary$mean_pkg, 2), " \\pm", round(initial_baseline_data_summary$sd_pkg, 2), "$"
)

library(reshape2)
initial_baseline_table <- reshape2::dcast(
  initial_baseline_data_summary,
  population_size + dimension ~ work,
  value.var = "sd_and_mean_pkg"
)

# add a column with the p-values from the wilcoxon tests in wilcoxon_initial_baseline for every population_size and dimension

initial_baseline_table <- merge(
  initial_baseline_table,
  wilcoxon_initial_baseline %>%
    select(population_size, dimension, p_value),
  by = c("population_size", "dimension")
)

initial_baseline_table %>% mutate( p_value = ifelse( as.numeric(p_value) < 0.05, "+", "-" ) ) -> initial_baseline_table

library(kableExtra)

kable(initial_baseline_table, "latex",
      col.names = c("Population size","Dimension","Baseline PKG (J)","Sampling = 100s", "Significant"),
      caption = "Baseline energy consumption for different sampling frequencies",
      escape=FALSE) %>%
  kable_styling( full_width = F)

@

Results for measurement of baseline runs (which generate the initial population) are shown in Table \ref{tab:ola.base.frequency}. Differences are significant for most rows; despite measuring the same system and parameters, using a lower sampling rate is able to appreciate differences in energy consumption that go in different directions. But baseline measurements are mainly used to differentiate the actual workload energy consumption.

<<ola.deltas, echo=FALSE, warning=F, message=F,fig.cap="Workload energy spent vs. time for the initial configuration vs. measurements with sampling period = 100 ms", fig.height=3>>=
compute_deltas <- function( baseline_summary, workload ) {
  workload$delta_PKG <-0
  workload$delta_seconds <- 0
  for (dim in c(3,5)) {
    for ( pop_size in c(200,400)) {
      number_of_rows <- nrow(workload[ workload$dimension==dim & workload$population_size==pop_size,])
      workload[ workload$dimension==dim & workload$population_size==pop_size,]$delta_PKG <-
      workload[ workload$dimension==dim & workload$population_size==pop_size,]$PKG  -
      rep(baseline_summary[ baseline_summary$population_size == pop_size & baseline_summary$dimension==dim, ]$median_energy,number_of_rows)

      workload[ workload$dimension==dim & workload$population_size==pop_size,]$delta_seconds <-
      workload[ workload$dimension==dim & workload$population_size==pop_size,]$seconds  -
      rep(baseline_summary[ baseline_summary$population_size == pop_size & baseline_summary$dimension==dim, ]$median_time,number_of_rows)
    }
  }
  return (workload)
}

baseline_data_evoapps %>% group_by(dimension,population_size) %>%
  summarise(median_energy=median(PKG), sd_energy=sd(PKG),
            trimmed_mean_energy=mean(PKG,trim=0.2),
            median_time=median(seconds), sd_time=sd(seconds),
            trimmed_mean_time=mean(seconds, trim=0.2)) -> summary_baseline_data_evoapps

workload_data_evoapps <- read.csv("data/evoapps-1.11.7-fix-rand-bna-fix-rand-25-Oct-11-06-07.csv")
workload_data_evoapps <- compute_deltas( summary_baseline_data_evoapps, workload_data_evoapps )

workload_data_evoapps %>% group_by(dimension,population_size,max_gens,alpha) %>%
  summarise(median_delta_energy=median(delta_PKG), sd_delta_energy=sd(delta_PKG),
            trimmed_mean_delta_energy=mean(delta_PKG,trim=0.2),
            median_delta_time=median(delta_seconds), sd_delta_time=sd(delta_seconds),
            trimmed_mean_delta_time=mean(delta_seconds, trim=0.2)) -> summary_workload_data_evoapps

baseline_data_ola_100s %>% group_by(dimension,population_size) %>%
  summarise(median_energy=median(PKG), sd_energy=sd(PKG),
            trimmed_mean_energy=mean(PKG,trim=0.2),
            median_time=median(seconds), sd_time=sd(seconds),
            trimmed_mean_time=mean(seconds, trim=0.2)) -> summary_baseline_data_ola

workload_data_ola <- read.csv("data/ola-1.11.7-ola-14-Dec-13-02-30.csv")
workload_data_ola <- compute_deltas( summary_baseline_data_ola, workload_data_ola )

workload_data <- rbind( workload_data_evoapps, workload_data_ola )

ggplot(workload_data, aes(x=delta_seconds, y=delta_PKG, color=as.factor(work))) +
  geom_point() +
  theme_minimal() +
  xlab("Delta Time (seconds)") +
  ylab("Delta Package energy (Joules)") +
  labs(color="Measurement frequency") + guides(color=guide_legend(title="Experiment"))
@

In Figure \ref{fig:ola.deltas} we show the differential of energy spent vs. the time needed with the new sampling frequency (once ever 100ms) vs. the old one (50ms, in red). What is interesting to see here is that just a change in the measuring tool configuration leads to significant changes in the distribution as well as the absolute values of energy consumed and time spent. However, time spent should not, a priori, be affected by the measuring tool, so what we see here is that we have only reduced a source of uncertainty, error in measurement, not every one of them. We still have the issue of very small values, below 50 Joules, that are not realistic, over all for time measurements that are below 1.6 seconds. This will be addressed later.

With this higher precision, we can approach micro-optimizations of the implementation, mainly rewriting the population and individual generation functions so that they use less memory allocations.

<<ola.micro, echo=FALSE, warning=F, message=F>>=
baseline_ola_v2_data <- read.csv("data/ola-1.11.7-v2-baseline-v2-14-Dec-20-40-47.csv")
baseline_ola_v2_data[ null_baseline_columns ] <- NULL
initial_vs_v2_baseline <- rbind(
  baseline_data_ola_100s %>% mutate( work = "ola-baseline-v1"),
  baseline_ola_v2_data %>% mutate( work = "ola-baseline-v2")
)

wilcoxon_micro_optimization <- wilcoxon_tests( initial_vs_v2_baseline )
initial_vs_v2_baseline %>% group_by(population_size, dimension, work) %>% summarise(
    mean_pkg = mean(PKG),
    sd_pkg = sd(PKG),
    mean_time = mean(seconds),
    sd_time = sd(seconds)
  ) -> initial_vs_v2_baseline_summary

initial_vs_v2_baseline_summary$sd_and_mean_pkg <- paste0("$",
  round(initial_vs_v2_baseline_summary$mean_pkg, 2), " \\pm", round(initial_vs_v2_baseline_summary$sd_pkg, 2), "$"
)

library(reshape2)
initial_vs_v2_baseline_table <- reshape2::dcast(
  initial_vs_v2_baseline_summary,
  population_size + dimension ~ work,
  value.var = "sd_and_mean_pkg"
)

initial_vs_v2_baseline_table <- merge(
  initial_vs_v2_baseline_table,
  wilcoxon_micro_optimization %>%
    select(population_size, dimension, p_value),
  by = c("population_size", "dimension")
)
initial_vs_v2_baseline_table %>% mutate( p_value = ifelse( as.numeric(p_value) < 0.05, "+", "-" ) ) -> initial_vs_v2_baseline_table

kable(initial_vs_v2_baseline_table, "latex",
      col.names = c("Population size","Dimension","Initial version","Micro-optimized version","Significant"),
      caption = "Baseline energy consumption for different implementations of the genetic operators",
      escape=FALSE) %>%
  kable_styling( full_width = F)
@

What we see in Table \ref{tab:ola.micro}, computed for the baseline runs, can lead to relatively important improvements for some combination of dimensions and population sizes, but also to some regressions. But this is just the baseline; we need to compute the actual workload energy consumption.

<<ola.micro.plot, echo=FALSE, warning=F, message=F,fig.cap="Workload energy spent vs. time for different implementations of the genetic operators",fig.height=4, fig.pos="h!tb">>=
workload_ola_v2_data <- read.csv("data/ola-1.11.7-v2-ola-v2-15-Dec-13-18-43.csv")

baseline_ola_v2_data %>% group_by(population_size, dimension) %>% summarise(
    mean_pkg = mean(PKG),
    sd_pkg = sd(PKG),
    mean_time = mean(seconds),
    sd_time = sd(seconds),
    median_energy = median(PKG),
    median_time = median(seconds),
    trimmed_mean_pkg = mean(PKG, trim=0.2)
  ) -> summary_baseline_ola_v2

workload_ola_v2_data <- compute_deltas( summary_baseline_ola_v2, workload_ola_v2_data )

initial_vs_v2_workload <- rbind(
  workload_data_ola %>% mutate( work = "ola-workload-v1"),
  workload_ola_v2_data %>% mutate( work = "ola-workload-v2")
)

initial_vs_v2_workload$population_dimension <- paste0(
  "Pop. size: ", initial_vs_v2_workload$population_size,
  ", Dim.: ", initial_vs_v2_workload$dimension
)

ggplot(initial_vs_v2_workload, aes(x=work, y=delta_PKG, color=as.factor(work))) +
  geom_violin() +
  theme_minimal() +
  xlab("Version") +
  ylab("Delta Package energy (Joules)") +
  labs(color="Implementation version") +
  facet_wrap( ~ population_dimension )

@

We see in Figure \ref{fig:ola.micro.plot} that the scenario is mixed: microoptimization can result in some improvements, but also in regressions. The fact that these go in opposite directions than the baseline makes us think that there could be some hysteresis effects at work. here

<<ola.micro.hysteresis, echo=FALSE, warning=F, message=F,fig.cap="PKG energy data vs. accumulated time for the two baseline runs we are examining", fig.pos="h!tb", fig.height=3>>=
baseline_data_ola_100s$cumulative_time <- cumsum(baseline_data_ola_100s$seconds)
baseline_data_ola_100s$work <- "Baseline v1"
workload_data_ola$cumulative_time <- cumsum(workload_data_ola$seconds)
workload_data_ola$work <- "Workload v1"
baseline_ola_v2_data$cumulative_time <- cumsum(baseline_ola_v2_data$seconds)
baseline_ola_v2_data$work <- "Baseline v2"
workload_ola_v2_data$cumulative_time <- cumsum(workload_ola_v2_data$seconds)
workload_ola_v2_data$work <- "Workload v2"

ggplot( rbind(baseline_data_ola_100s, baseline_ola_v2_data), aes(x=cumulative_time, y=PKG, color=work) ) +
  geom_point() +
  theme_minimal() +
  xlab("Cumulative Time (seconds)") +
  ylab("Package energy (Joules)") +
  labs(color="Measurement type")

@

What we see in Figure \ref{fig:ola.micro.hysteresis} is a a plot of the PKG energy measured before and after the optimization but plotted vs. accumulated time. The time-related effects can clearly be observed, as well as the fact that the optimized version optimizes time rather than energy. But you can also see that the energy spent by the initial experiment is divided into two areas; after 2000 seconds there seems to be a new regime, caused by something external to the experiment itself. This is inherent to this kind of measurements; since experiments are not made in a machine devoted exclusively to them, we cannot measure energy consumption in this kind of machine, either. However, this causes a real issue in the reproducibility of results, but overall, when we compute the real energy consumption of the workload. This "second area" would correspond to experiments with chromosome dimension 5, which in part explains results in Table \ref{tab:ola.micro} and Figure \ref{fig:ola.micro.plot}.

<<ola.mixed, echo=FALSE, warning=F, message=F,fig.cap="Workload energy spent vs. time for different implementations of the genetic operators after mitigating hysteresis effects",fig.height=3, fig.pos="h!tb">>=
ola_mixed_v11_7 <- read.csv("data/ola-1.11.7-mixed-ola-mixed-15-Dec-19-49-11.csv")
ola_mixed_v11_7$cumulative_time <- cumsum(ola_mixed_v11_7$seconds)

for (i in 2:nrow(ola_mixed_v11_7)) {
  if (ola_mixed_v11_7$work[i] == "ola-mixed") {
    ola_mixed_v11_7$delta_seconds[i] <- ola_mixed_v11_7$seconds[i] - ola_mixed_v11_7$seconds[i-1]
    ola_mixed_v11_7$delta_PKG[i] <- ola_mixed_v11_7$PKG[i] - ola_mixed_v11_7$PKG[i-1]
  }
}

ggplot( ola_mixed_v11_7, aes(x=cumulative_time, y=PKG, color=as.factor(work)), group=as.factor(work) ) +
  geom_point() + geom_smooth() +
  geom_bar(ola_mixed_v11_7[ ola_mixed_v11_7$work == "ola-mixed", ], mapping=aes(x=cumulative_time, y=delta_PKG), stat="identity", alpha=0.2, color="blue" ) +
  theme_minimal() +
  xlab("Cumulative Time (seconds)") +
  ylab("Package energy (Joules)") + labs(color="Experiment")
@

There are many possible ways of mitigating the effect of external factors on measurements.
In Figure \ref{fig:ola.mixed} we have mixed baseline and workload runs making the workload run right after the baseline run; instead of averaging baseline runs and then subtracting it from the individual workload runs, we subtract the baseline run that has been made before the workload run from it; these are represented as blue bars in the figure. As it can be seen, even with some variability (and some negative values with which we will deal later) there seems to be a better fit, since, as the smoothed line shows, when the underlying system changes its energy profile it does so for both, mitigating to a certain extent the system overhead effect.

<<ola.mixed.comparison, echo=FALSE, warning=F, message=F,fig.cap="Workload energy spent compared with different ways of measuring the workload effect; we also compare two values of the maximum number of generations run without improving fitness.",fig.height=3, fig.pos="h!tb", fig.height=4>>=
ola_mixed_v11_7_summary_data <- ola_mixed_v11_7 %>%
  group_by(work,population_size, dimension, max_gens) %>%
  summarise(
    mean_PKG = mean(PKG),
    median_PKG = median(PKG),
    sd_PKG = sd(PKG),
    trimmed_PKG = mean(PKG, trim = 0.2),
    mean_delta_PKG = mean(delta_PKG, trim=0.2),
    sd_delta_PKG = sd(delta_PKG)
  )

ola_mixed_v11_7_deltas_summary_data <- ola_mixed_v11_7 %>%
  group_by(work,population_size, dimension, max_gens) %>%
  summarise(
    mean_PKG = mean(PKG),
    median_PKG = median(PKG),
    sd_PKG = sd(PKG),
    trimmed_PKG = mean(PKG, trim = 0.2),
    mean_delta_PKG = mean(delta_PKG, trim=0.2),
    sd_delta_PKG = sd(delta_PKG)
  )

workload_ola_v2_data %>% group_by(dimension,population_size,max_gens,alpha) %>%
                         summarise(median_delta_energy=median(delta_PKG), sd_delta_energy=sd(delta_PKG),
                                   trimmed_mean_delta_energy=mean(delta_PKG,trim=0.2),
                          median_delta_time=median(delta_seconds), sd_delta_time=sd(delta_seconds),
                          trimmed_mean_delta_time=mean(delta_seconds, trim=0.2)) -> summary_workload_ola_v2_data

ola_mixed_v11_7[ ola_mixed_v11_7$work == "ola-mixed", ] %>%
  group_by(population_size, dimension, max_gens) %>%
  summarise(
    median_delta_energy = median(delta_PKG),
    sd_delta_energy = sd(delta_PKG),
    trimmed_mean_delta_PKG = mean(delta_PKG, trim=0.2),
    median_delta_time=median(delta_seconds),
    sd_delta_time=sd(delta_seconds),
    trimmed_mean_delta_time=mean(delta_seconds, trim=0.2)
  ) -> summary_ola_mixed_deltas

mixed_vs_separated_workload <- rbind( workload_ola_v2_data, ola_mixed_v11_7[ ola_mixed_v11_7$work == "ola-mixed", ] )
mixed_vs_separated_workload$population_dimension <- paste0(
  "Pop. size: ", mixed_vs_separated_workload$population_size,
  ", Dim.: ", mixed_vs_separated_workload$dimension
)

ggplot(mixed_vs_separated_workload, aes(x=population_dimension, y=delta_PKG, color=as.factor(work))) +
  geom_violin(position="dodge") +
  theme_minimal() +
  xlab("Version") +
  ylab("Delta Package energy (Joules)") +
  labs(color="Implementation version") + # x ticks rotated
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap( ~ max_gens )

# ggplot(mixed_vs_separated_workload, aes(x=population_dimension, y=delta_PKG, color=as.factor(max_gens))) +
#   geom_boxplot(position="dodge") +
#   theme_minimal() +
#   xlab("Version") +
#   ylab("Delta Package energy (Joules)") +
#   labs(color="Implementation version") + # x ticks rotated
#   theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
#   facet_wrap( ~ work )
@

In Figure \ref{fig:ola.mixed.comparison} we start to compare the effects of changes in measurement methodology together with the effects of different parametrization of the algorithm. Again, since these parametrizations might have a small effect, increasing precision in the measures will help us assess and understand them.

<<ola.v11_8, echo=F, warning=F, message=F, fig.cap="Comparing Julia versions. We have clipped the y axis to 0,250", fig.height=4, fig.pos="h!tb">>=
ola_mixed_v11_8 <- read.csv("data/ola-1.11.8-mixed-inverted-ola-mixed-inverted-16-Dec-09-02-52.csv")
ola_mixed_v11_8$cumulative_time <- cumsum(ola_mixed_v11_8$seconds)
for (i in 2:nrow(ola_mixed_v11_8)) {
  if (ola_mixed_v11_8$work[i] == "ola-mixed-inverted") {
    ola_mixed_v11_8$delta_seconds[i] <- ola_mixed_v11_8$seconds[i] - ola_mixed_v11_8$seconds[i-1]
    ola_mixed_v11_8$delta_PKG[i] <- ola_mixed_v11_8$PKG[i] - ola_mixed_v11_8$PKG[i-1]
  }
}

ola_mixed_v11_7_workload <- ola_mixed_v11_7[ ola_mixed_v11_7$work == "ola-mixed",]
ola_mixed_v11_7_workload$work <- "v1.11.7"
ola_mixed_v11_8_workload <- ola_mixed_v11_8[ ola_mixed_v11_8$work == "ola-mixed-inverted",]
ola_mixed_v11_8_workload$work <- "v1.11.8"

ola_v11_7_vs_8_data  <- rbind(ola_mixed_v11_7_workload,ola_mixed_v11_8_workload)
ola_v11_7_vs_8_data$population_dimension <- paste0(
  "Pop. size: ", ola_v11_7_vs_8_data$population_size,
  ", Dim.: ", ola_v11_7_vs_8_data$dimension
)

ggplot(ola_v11_7_vs_8_data, aes(x=population_dimension, y=delta_PKG, color=as.factor(work))) +
  geom_boxplot(position="dodge",notch=T) +
  theme_minimal() +
  xlab("Version") +
  ylab("Delta Package energy (Joules)") +
  labs(color="Implementation version") + # x ticks rotated
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylim(0,250)+
  facet_wrap(~ max_gens)
@

With this we can compare, for instance, different versions of the Julia compiler. In Figure \ref{fig:ola.v11_8} we compare side by side experiments that only change the version of Julia, v1.11.7 vs v1.11.8, which was recently released in December 2025. It is considered a good practice, in general to test for the latest version of the compiler for the language  you are using, not only because it will be, in general, the one that needs to be used, but also because they normally have improvements in efficiency. That need not be the case always: some changes will include performance regressions, but in that case carrying out an energy profile will also allow you to check when that happens and start mitigating measures in the implementation. That has not been the case here: for any number of generations till ending the run, the new version spends less energy than the previous one, considerably so in the case of having the max number of generations = 25.

There is a remaining issue, however; as can be seen in Figure \ref{fig:ola.mixed.comparison} and \ref{fig:ola.v11_8}, there are some occasions when {\sf pinpoint} returns a 0 measurement. In the previous version of the experiment runner, this was avoided by repeating any runs until the required number had been reached, but this was dropped in this case. We changed it back again so that in case a run reported no energy, we repeated it until it did. This can certainly "separate" the baseline from the workload run, but it is preferable to reporting erroneous results that skew the comparison.

<<ola.no0, echo=F, fig.height=4, fig.cap="Avoiding zero energy runs, comparison of results">>=
ola_mixed_no0 <- read.csv("data/ola-1.11.8-ola-no0-16-Dec-17-43-49.csv" )
ola_mixed_no0$cumulative_time <- cumsum(ola_mixed_no0$seconds)
for (i in 2:nrow(ola_mixed_no0)) {
  if (ola_mixed_no0$work[i] == "ola-no0") {
    ola_mixed_no0$delta_seconds[i] <- ola_mixed_no0$seconds[i] - ola_mixed_no0$seconds[i-1]
    ola_mixed_no0$delta_PKG[i] <- ola_mixed_no0$PKG[i] - ola_mixed_no0$PKG[i-1]
  }
}

ola_mixed_v11_8_workload$work <- "zero energy runs"
ola_mixed_no0_workload <- ola_mixed_no0[ ola_mixed_no0$work == "ola-no0",]
ola_mixed_no0_workload$work <- "no 0 energy runs"

ola_0_vs_no0_data  <- rbind(ola_mixed_v11_8_workload,ola_mixed_no0_workload)
ola_0_vs_no0_data$population_dimension <- paste0(
  "Pop. size: ", ola_0_vs_no0_data$population_size,
  ", Dim.: ", ola_0_vs_no0_data$dimension
)

ggplot(ola_0_vs_no0_data, aes(x=population_dimension, y=delta_PKG, color=as.factor(work))) +
  geom_boxplot(position="dodge",notch=T) +
  theme_minimal() +
  xlab("Version") +
  ylab("Delta Package energy (Joules)") +
  labs(color="Implementation version") + # x ticks rotated
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~ max_gens)
@

In Figure \ref{fig:ola.no0}, we show the whole automatic $y$ axis range and compare this new way of measuring energy spent with the old one. The dispersion of the no-0 energy runs is less, not only because those erroneous readings had been suppressed, but also because apparently some {\em dry} runs give some stability to the measurements, even if now the baseline and workload run might be separated by one or more of these runs. There are any way differences, significant in some cases, between the old way of measuring and the new ones, even in cases where there seem to be no zeros, for instance population = 200, dimension = 3 and max. generations = 25.

With this last change in the measurement device, however, we have achieved a measuring methodology with a better degree of stability, that is able to smooth out differences in the underlying system.

\section{Conclusion and discussion}
\label{sec:discussion}

This paper, whose development can be also checked at the GitHub repository \url{https://github.com/CeciMerelo/BraveNewalgorithm.jl}, attempts to propose a new methodology in measuring energy looking to optimize the implementation of population-based evolutionary algorithms. We have found that decreasing the sampling rate (to 1 every 100 ms) reduces overhead and improves precision; repeating runs where errors occur so that a fixed number of runs is reached improves reproducibility. But one of the measures that contributes the most to robustness in measurements and definitely eliminate {\em negative} workload energy measures is sequencing baseline and workload measurements the one after the other, so that the system conditions in which they are taking place are more similar, instead of separating baseline and workload runs and using averages of the former subtracted from the latter to compute actual workload energy consumption.

All these population-based optimization algorithms need also to include one-shot computations such as the generation of the initial population in the baseline runs; that way, the actual energy consumed from running the different population operators such as mutation and crossover will be pinpointed with higher precision; besides, energy consumption and how it varies with the {\em chromosome} and population size almost never has a straightforward relationship; this is especially true in languages, such as Julia, which have a garbage-collection mechanism that might be triggered at specific memory occupation levels, and thus might not change linearly with them.

With these new methodologies in hand, we have been able to reach conclusions on two optimizations performed on our implementation of the BNA: one at the implementation level, making population and individual generation more efficient, which has resulted in a small improvement of around 10\% in energy consumption in some cases, and another at the language version level, bumping up the patchlevel from 1.11.7 to 1.11.8, which has resulted in a small but significant improvement of around the same size. Incidentally, we have also been examining the influence of the maximum number of generations without change on energy consumption; looking at Figure \ref{fig:ola.v11_8} we can see that, even with the updated version, energy consumption increases with the number of generations with no improvement before stopping the algorithm. We would have to check this against the actual improvements in fitness achieved, but that is not the main point of the paper; from our point of view, what is interesting is that, as is shown in Figure \ref{fig:ola.no0}, with the new methodology we can achieve levels of precision that allow us to examine relatively small improvements (or regressions) in energy consumption, with added reproducibility in any other computing platform.

\section*{Acknowledgements}
This work is supported by the Ministerio espa\~{n}ol de Econom\'{\i}a y Competitividad (Spanish Ministry of Competitivity and Economy) under project PID2023-147409NB-C21.

\bibliographystyle{splncs04}
\bibliography{ours,energy,ga-energy,GAs,julia,metaheuristics}

\end{document}