Results Analysis#

Overview

Register a New Results Analysis Method#

Customizing the Analysis Pipeline#

List of Performance Evaluation Metrics#

For each type of task instance, the framework offers performance evaluation metrics to assess the quality of the solutions generated by the algorithms. The metrics are categorized by task type and are designed to evaluate various aspects of the solutions. The table below summarizes the performance metrics available for the different tasks.

| Task | Metric | Description | Scale | Type |
|---|---|---|---|---|
| Synthetic | Absolute Error | The absolute difference between the best value found and the known optimal solution. | [0, ∞) | Minimization |
| HPO (Classification) | F1 Score | The harmonic mean of precision and recall, providing a balanced measure of accuracy. | [0, 1] | Maximization |
| HPO (Classification) | Area Under Curve | The area under the receiver operating characteristic (ROC) curve, quantifying a classifier's overall ability to discriminate between positive and negative instances. | [0, 1] | Maximization |
| HPO (Regression) | RMSE | Root mean squared error: the square root of the average squared difference between predicted and actual values. | [0, ∞) | Minimization |
| HPO (Regression) | MAE | Mean absolute error: the average absolute difference between predicted and actual values. | [0, ∞) | Minimization |
| Protein Design | Binding Affinity | The strength of the interaction between a protein and its ligand, typically measured by the equilibrium dissociation constant. | (-∞, 0] | Minimization |
| RNA Inverse Design | GC-content | The fraction of guanine (G) and cytosine (C) bases in a DNA or RNA molecule, which affects stability and melting temperature. | [0, 1] | Maximization |
| LLVM/GCC | Avg Execution Time | The average execution time of the compiled program over multiple runs. | [0, ∞) | Minimization |
| LLVM/GCC | Compilation Time | The time required to compile the code. | [0, ∞) | Minimization |
| LLVM/GCC | File Size | The size of the executable file produced by compilation. | [0, ∞) | Minimization |
| LLVM/GCC | Max RSS | The maximum resident set size used during execution. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI TOT CYC | The total number of CPU cycles consumed during execution. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI TOT INS | The total number of instructions executed by the CPU. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI BR MSP | The number of branch directions mispredicted by the CPU. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI BR PRC | The number of branch directions correctly predicted by the CPU. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI BR CN | The number of conditional branch instructions. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI MEM WCY | The number of cycles spent waiting for memory access. | [0, ∞) | Minimization |
| MySQL | Throughput | The number of transactions processed per unit of time. | [0, ∞) | Maximization |
| MySQL | Latency | The time required to complete a single transaction, from initiation to completion. | [0, ∞) | Minimization |
| MySQL | CPU Usage | The proportion of CPU resources used during database operations. | [0, ∞) | Minimization |
| MySQL | Memory Usage | The amount of memory used during database operations. | [0, ∞) | Minimization |
| Hadoop | Execution Time | The execution time of a big data task. | [0, ∞) | Minimization |
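Most of these metrics are standard and straightforward to compute outside the framework as well. As a minimal sketch, the classification and regression metrics above can be reproduced with scikit-learn and NumPy; the array names here are illustrative placeholders, not the framework's own data structures:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, mean_absolute_error

# Hypothetical ground truth and model output for a classification task.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])              # hard class predictions
y_score = np.array([0.2, 0.9, 0.4, 0.1, 0.8])   # class-1 probabilities

# HPO (Classification): F1 is the harmonic mean of precision and recall;
# AUC is computed from scores, not from hard labels.
print("F1 :", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))

# HPO (Regression): RMSE and MAE on continuous targets.
t = np.array([3.0, -0.5, 2.0, 7.0])
p = np.array([2.5, 0.0, 2.0, 8.0])
print("RMSE:", np.sqrt(np.mean((t - p) ** 2)))
print("MAE :", mean_absolute_error(t, p))
```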

Statistical Measures#

This section provides detailed explanations of the statistical methods used for analyzing the performance of different algorithms. Each method is accompanied by the relevant formulas and calculation procedures.

Wilcoxon Signed-Rank Test#

The Wilcoxon signed-rank test is a non-parametric statistical test used to compare two paired samples. Unlike the paired t-test, the Wilcoxon signed-rank test does not assume that the differences between pairs are normally distributed. It is particularly useful when dealing with small sample sizes or non-normally distributed data.

Given two related samples \(X\) and \(Y\), the steps to perform the Wilcoxon signed-rank test are:

  1. Compute the differences between each pair of observations: \(d_i = X_i - Y_i\).

  2. Rank the absolute values of the nonzero differences from smallest to largest, assigning average ranks to tied values (pairs with zero difference are typically discarded).

  3. Assign signs to the ranks based on the sign of the original differences \(d_i\).

  4. Calculate the test statistic \(W\), which is the sum of the ranks corresponding to the positive differences:

    \[W = \sum_{d_i > 0} \text{Rank}(|d_i|)\]
  5. Compare the computed test statistic \(W\) against the critical value from the Wilcoxon signed-rank table or calculate the p-value to determine the significance of the result.
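In practice the test is rarely computed by hand. A minimal sketch using scipy.stats.wilcoxon, assuming two equal-length arrays of paired performance scores (the numbers are illustrative, not real benchmark results):

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired scores of two algorithms on the same benchmark instances.
x = np.array([0.81, 0.78, 0.92, 0.65, 0.70, 0.88, 0.73, 0.90])
y = np.array([0.79, 0.75, 0.91, 0.66, 0.64, 0.85, 0.70, 0.86])

# scipy ranks |x - y| and sums the signed ranks; note that for the default
# two-sided alternative it reports the smaller of the positive and negative
# rank sums as the statistic. Zero differences are dropped by default.
stat, p_value = wilcoxon(x, y)
print(f"W = {stat}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The paired difference is statistically significant.")
```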

Scott-Knott Test#

The Scott-Knott test is a statistical method used to rank the performance of different techniques across multiple runs on each benchmark instance. It is particularly effective in scenarios where multiple comparisons are being made, and it controls the family-wise error rate.

The procedure involves:

  1. Partitioning the data: initially, all techniques form a single group, sorted by mean performance. The candidate split is the one that maximizes the between-group difference in means.

  2. Testing the split: the significance of the mean difference between the two candidate subgroups is assessed using an appropriate test (e.g., ANOVA or a t-test).

  3. Assigning ranks: if the difference is significant, the two subgroups receive distinct ranks; otherwise, the techniques are considered to be in the same rank.

  4. Repeating the process recursively within each subgroup until no further significant splits can be made.

The Scott-Knott test is particularly useful for determining the relative performance of multiple techniques, providing a clear ranking based on statistically significant differences.
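The recursive partitioning can be sketched in a few lines. The version below follows the procedure above, splitting at the point that maximizes the between-group sum of squares and using Welch's t-test as the significance check; the function and variable names are illustrative, not the framework's API, and it assumes lower scores are better:

```python
import numpy as np
from scipy.stats import ttest_ind

def scott_knott(groups, alpha=0.05):
    """Rank techniques by recursive partitioning (simplified sketch).

    groups: dict mapping technique name -> 1-D array of repeated scores.
    Returns: dict mapping technique name -> integer rank (1 = lowest mean).
    """
    names = sorted(groups, key=lambda n: np.mean(groups[n]))
    ranks, next_rank = {}, [1]

    def split(block):
        if len(block) == 1:
            ranks[block[0]] = next_rank[0]
            next_rank[0] += 1
            return
        means = np.array([np.mean(groups[n]) for n in block])
        # Choose the cut that maximizes the between-group sum of squares.
        best_cut, best_ss = None, -np.inf
        for cut in range(1, len(block)):
            left, right = means[:cut], means[cut:]
            ss = (len(left) * (left.mean() - means.mean()) ** 2
                  + len(right) * (right.mean() - means.mean()) ** 2)
            if ss > best_ss:
                best_cut, best_ss = cut, ss
        left_obs = np.concatenate([groups[n] for n in block[:best_cut]])
        right_obs = np.concatenate([groups[n] for n in block[best_cut:]])
        _, p = ttest_ind(left_obs, right_obs, equal_var=False)
        if p < alpha:                 # significant: recurse on both halves
            split(block[:best_cut])
            split(block[best_cut:])
        else:                         # not significant: one shared rank
            for n in block:
                ranks[n] = next_rank[0]
            next_rank[0] += 1

    split(names)
    return ranks
```

On a dict such as {"alg_a": scores_a, "alg_b": scores_b, ...} this returns one integer rank per technique; techniques that cannot be separated at the chosen significance level share a rank.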

A12 Effect Size#

The A12 effect size is a non-parametric measure used to evaluate the probability that one algorithm outperforms another. It is particularly useful in understanding whether observed differences are practically significant, beyond just being statistically significant.

The A12 statistic is calculated as follows:

  1. Let \(A\) and \(B\) be the two sets of performance measures for two algorithms.

  2. Calculate the A12 statistic:

    \[A_{12} = \frac{\sum_{x \in A} \sum_{y \in B} \left[\mathbf{I}(x > y) + 0.5 \cdot \mathbf{I}(x = y)\right]}{|A| \cdot |B|}\]
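An A12 of 0.5 indicates no difference between the two algorithms; values toward 1 indicate that algorithm A tends to outperform B (assuming larger values are better), and values around 0.56, 0.64, and 0.71 are conventionally read as small, medium, and large effects. The formula translates directly to code; a minimal sketch with illustrative array names:

```python
import numpy as np

def a12(a, b):
    """Vargha-Delaney A12: P(score from a > score from b), ties count half.

    Assumes larger scores are better; for minimization metrics compare
    negated scores or report 1 - a12(a, b).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Pairwise comparison via broadcasting: one row per element of a.
    greater = (a[:, None] > b[None, :]).sum()
    ties = (a[:, None] == b[None, :]).sum()
    return (greater + 0.5 * ties) / (len(a) * len(b))

print(a12([0.9, 0.8, 0.85], [0.7, 0.75, 0.8]))  # ~0.94: A almost always wins
```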

Critical Difference (CD)#

The Critical Difference (CD) is a statistical measure used to assess whether performance differences between algorithms could have arisen by chance. It is typically used in conjunction with methods like the Friedman test or the Nemenyi post-hoc test to evaluate multiple algorithms across multiple datasets.

The steps involved in calculating the Critical Difference are:

  1. Perform a Friedman test to rank the algorithms for each dataset.

  2. Calculate the average ranks for each algorithm across all datasets.

  3. Compute the Critical Difference (CD) using the following formula:

    \[\text{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}\]

    where:

    * \(q_{\alpha}\) is the critical value for a given significance level \(\alpha\) from the studentized range statistic,

    * \(k\) is the number of algorithms,

    * \(N\) is the number of datasets.

  4. If the difference in average ranks between two algorithms exceeds the CD, the performance difference is considered statistically significant, and not due to random variation.
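A minimal sketch of step 3, with the \(q_{\alpha}\) values for \(\alpha = 0.05\) hardcoded from Demšar's Nemenyi-test table for \(k \le 10\) (a proper statistics library should be preferred in practice; the function name is illustrative):

```python
import math

# Critical values q_0.05 of the Nemenyi test (Demsar, 2006),
# indexed by the number of algorithms k = 2..10.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728,
              6: 2.850, 7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def critical_difference(k, n_datasets, q_table=Q_ALPHA_05):
    """CD = q_alpha * sqrt(k(k+1) / (6N)) for k algorithms over N datasets."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Example: 5 algorithms compared on 20 datasets.
cd = critical_difference(5, 20)
print(f"CD = {cd:.3f}")
# Any two algorithms whose average ranks differ by more than CD are
# considered significantly different at alpha = 0.05.
```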

These statistical methods provide robust tools for comparing algorithm performance across various benchmarks, ensuring that conclusions drawn are both statistically and practically significant.