Results Analysis#
Overview
Register a New Results Analysis Method: How to add a new results analysis method to PrismBO.
Customize the Analysis Pipeline: How to customize your own results analysis pipeline or add your own analysis method to the pipeline.
Performance Evaluation Metrics: The list of performance evaluation metrics available in PrismBO.
Statistical Measures: The list of statistical measures supported in PrismBO.
Register a New Results Analysis Method#
Customize the Analysis Pipeline#
List of Performance Evaluation Metrics#
For each type of task instance, the framework offers performance evaluation metrics to assess the quality of the solutions generated by the algorithms. The metrics are categorized by task type and evaluate different aspects of the solutions. The table below summarizes the performance metrics available for the different tasks.
| Task | Metric | Description | Scale | Type |
|---|---|---|---|---|
| Synthetic | Absolute Error | The absolute difference between the best value found and the known optimal solution. | [0, ∞] | Minimization |
| HPO (Classification) | F1 Score | The harmonic mean of precision and recall, providing a balanced measure of accuracy. | [0, 1] | Maximization |
| | Area Under Curve | The area under the receiver operating characteristic (ROC) curve, quantifying the overall ability of a classifier to discriminate between positive and negative instances. | [0, 1] | Maximization |
| HPO (Regression) | RMSE | Root mean squared error (RMSE) measures the average magnitude of the differences between predicted and actual values. | [0, ∞] | Minimization |
| | MAE | Mean absolute error (MAE) measures the average absolute difference between predicted and actual values. | [0, ∞] | Minimization |
| Protein Design | Binding Affinity | The strength of the interaction between a protein and its ligand, typically measured by the equilibrium dissociation constant. | [-∞, 0] | Minimization |
| RNA Inverse Design | GC-content | The percentage of guanine (G) and cytosine (C) bases in a DNA or RNA molecule, which affects stability and melting temperature. | [0, 1] | Maximization |
| LLVM/GCC | Avg Execution Time | The average execution time over multiple runs. | [0, ∞] | Minimization |
| | Compilation Time | The time required to compile the code. | [0, ∞] | Minimization |
| | File Size | The size of the executable file generated after compilation. | [0, ∞] | Minimization |
| | Max RSS | The maximum resident set size used during execution. | [0, ∞] | Minimization |
| | PAPI TOT CYC | The total number of CPU cycles consumed during execution. | [0, ∞] | Minimization |
| | PAPI TOT INS | The total number of instructions executed by the CPU. | [0, ∞] | Minimization |
| | PAPI BR MSP | The number of times the CPU mispredicted branch directions. | [0, ∞] | Minimization |
| | PAPI BR PRC | The number of times the CPU correctly predicted branch directions. | [0, ∞] | Minimization |
| | PAPI BR CN | The number of conditional branch instructions. | [0, ∞] | Minimization |
| | PAPI MEM WCY | The number of cycles spent waiting for memory access. | [0, ∞] | Minimization |
| MySQL | Throughput | The number of transactions processed per unit of time. | [0, ∞] | Maximization |
| | Latency | The time required to complete a single transaction from initiation to completion. | [0, ∞] | Minimization |
| | CPU Usage | The proportion of CPU resources used during database operations. | [0, ∞] | Minimization |
| | Memory Usage | The amount of memory used during database operations. | [0, ∞] | Minimization |
| Hadoop | Execution Time | The execution time of a big data task. | [0, ∞] | Minimization |
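As a quick illustration of how two of the regression metrics above are defined, the following minimal NumPy sketch computes RMSE and MAE on made-up prediction data; it is not PrismBO's internal implementation, just the plain formulas.

```python
import numpy as np

# Toy predictions vs. ground truth (illustrative values only).
y_true = np.array([3.0, 2.5, 4.1, 5.6])
y_pred = np.array([2.8, 2.9, 4.0, 5.0])

# Root mean squared error: average magnitude of the prediction errors.
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# Mean absolute error: average absolute prediction error.
mae = np.mean(np.abs(y_pred - y_true))

print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```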
Statistical Measures#
This section provides detailed explanations of the statistical methods used for analyzing the performance of different algorithms. Each method is accompanied by the relevant formulas and calculation procedures.
Wilcoxon Signed-Rank Test#
The Wilcoxon signed-rank test is a non-parametric statistical test used to compare two paired samples. Unlike the paired t-test, the Wilcoxon signed-rank test does not assume that the differences between pairs are normally distributed. It is particularly useful when dealing with small sample sizes or non-normally distributed data.
Given two related samples \(X\) and \(Y\), the steps to perform the Wilcoxon signed-rank test are:
1. Compute the differences between each pair of observations: \(d_i = X_i - Y_i\).
2. Rank the absolute values of the differences, assigning ranks from the smallest to the largest difference.
3. Assign signs to the ranks based on the sign of the original differences \(d_i\).
4. Calculate the test statistic \(W\), the sum of the ranks corresponding to the positive differences:
\[W = \sum_{d_i > 0} \text{Rank}(d_i)\]
5. Compare the computed test statistic \(W\) against the critical value from the Wilcoxon signed-rank table, or calculate the p-value, to determine the significance of the result.
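In practice the test rarely needs to be implemented by hand. The sketch below applies `scipy.stats.wilcoxon` to two hypothetical paired result vectors (the numbers are illustrative, not PrismBO output); note that SciPy reports the smaller of the positive and negative signed-rank sums rather than the \(W\) defined above.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical best objective values of two optimizers on the same
# seven benchmark instances (illustrative numbers only).
algo_x = np.array([0.12, 0.31, 0.05, 0.27, 0.18, 0.09, 0.22])
algo_y = np.array([0.15, 0.30, 0.08, 0.33, 0.21, 0.11, 0.25])

# Two-sided test on the paired differences d_i = X_i - Y_i.
# SciPy returns the smaller of the two signed-rank sums as the statistic.
stat, p_value = wilcoxon(algo_x, algo_y)
print(f"W = {stat}, p = {p_value:.4f}")
```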
Scott-Knott Test#
The Scott-Knott test is a statistical method used to rank the performance of different techniques across multiple runs on each benchmark instance. It is particularly effective in scenarios where multiple comparisons are being made, and it controls the family-wise error rate.
The procedure involves:
1. Partitioning the data: initially, all techniques are considered as one group; the group is split into two subgroups if the mean difference between them is statistically significant.
2. Calculating the mean difference between the candidate subgroups using an appropriate test (e.g., ANOVA or a t-test).
3. Assigning ranks: if a significant difference is found, the techniques are ranked within their respective subgroups; if no significant difference is found, the techniques share the same rank.
4. Repeating the process until no further significant splits can be made.
The Scott-Knott test is particularly useful for determining the relative performance of multiple techniques, providing a clear ranking based on statistically significant differences.
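A minimal sketch of this recursive clustering is given below, assuming each technique is represented by an array of repeated measurements. It uses Welch's t-test in place of the likelihood-ratio test of the original Scott-Knott procedure and is not PrismBO's own implementation; the example data at the end is made up.

```python
import numpy as np
from scipy import stats

def scott_knott(groups, alpha=0.05):
    """Recursively cluster techniques into statistically distinct ranks.

    groups: dict mapping technique name -> 1-D array of repeated measurements.
    Returns a list of clusters (lists of names), ordered by increasing mean.
    """
    names = sorted(groups, key=lambda n: np.mean(groups[n]))
    if len(names) < 2:
        return [names]

    means = np.array([np.mean(groups[n]) for n in names])

    # Find the split that maximizes the between-group sum of squares.
    best_split, best_ss = 1, -np.inf
    for i in range(1, len(names)):
        left, right = means[:i], means[i:]
        ss = (len(left) * (left.mean() - means.mean()) ** 2
              + len(right) * (right.mean() - means.mean()) ** 2)
        if ss > best_ss:
            best_split, best_ss = i, ss

    # Test whether the two candidate subgroups differ significantly
    # (Welch's t-test stands in for the original likelihood-ratio test).
    left_obs = np.concatenate([groups[n] for n in names[:best_split]])
    right_obs = np.concatenate([groups[n] for n in names[best_split:]])
    _, p = stats.ttest_ind(left_obs, right_obs, equal_var=False)
    if p >= alpha:
        return [names]  # no significant split: all techniques share one rank

    return (scott_knott({n: groups[n] for n in names[:best_split]}, alpha)
            + scott_knott({n: groups[n] for n in names[best_split:]}, alpha))

# Example with made-up measurement data for three techniques.
rng = np.random.default_rng(0)
data = {
    "bo": rng.normal(0.10, 0.01, 10),
    "rs": rng.normal(0.30, 0.02, 10),
    "tpe": rng.normal(0.12, 0.01, 10),
}
print(scott_knott(data))  # clusters of technique names, ordered by mean
```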
A12 Effect Size#
The A12 effect size is a non-parametric measure used to evaluate the probability that one algorithm outperforms another. It is particularly useful in understanding whether observed differences are practically significant, beyond just being statistically significant.
The A12 statistic is calculated as follows. Let \(A\) and \(B\) be the sets of performance measures for the two algorithms. Then
\[A_{12} = \frac{\sum_{x \in A} \sum_{y \in B} \left[ \mathbf{I}(x > y) + 0.5 \cdot \mathbf{I}(x = y) \right]}{|A| \cdot |B|}\]
where \(\mathbf{I}(\cdot)\) denotes the indicator function.
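The statistic lies in [0, 1]: a value of 0.5 means the two algorithms are indistinguishable, while values above (below) 0.5 indicate that algorithm A tends to produce larger (smaller) values than B. A direct translation of the formula into Python could look like the sketch below (illustrative code, not PrismBO's API; the example values are made up).

```python
def a12(a, b):
    """Vargha-Delaney A12: probability that a value drawn from `a` is
    larger than a value drawn from `b`, counting ties as one half."""
    wins = ties = 0
    for x in a:
        for y in b:
            if x > y:
                wins += 1
            elif x == y:
                ties += 1
    return (wins + 0.5 * ties) / (len(a) * len(b))

# Example with made-up performance measures.
print(a12([0.80, 0.85, 0.90], [0.70, 0.82, 0.88]))  # -> 0.666...
```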
Critical Difference (CD)#
The Critical Difference (CD) is a statistical measure used to assess whether observed performance differences between algorithms could be due to chance. It is typically used in conjunction with the Friedman test and the Nemenyi post-hoc test to evaluate multiple algorithms across multiple datasets.
The steps involved in calculating the Critical Difference are:
1. Perform a Friedman test to rank the algorithms on each dataset.
2. Calculate the average rank of each algorithm across all datasets.
3. Compute the Critical Difference (CD) using the following formula:
\[\text{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}\]
where:
- \(q_{\alpha}\) is the critical value for the chosen significance level \(\alpha\), taken from the studentized range statistic,
- \(k\) is the number of algorithms,
- \(N\) is the number of datasets.
If the difference in average ranks between two algorithms exceeds the CD, the performance difference is considered statistically significant, and not due to random variation.
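The calculation can be sketched as follows; the score matrix is invented for the example, and the \(q_{\alpha}\) value (2.343 for \(k = 3\), \(\alpha = 0.05\)) is the standard studentized-range-based constant used with the Nemenyi test.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[i, j]: result of algorithm j on dataset i (lower is better here;
# the numbers are illustrative only).
scores = np.array([
    [0.12, 0.15, 0.10],
    [0.30, 0.28, 0.25],
    [0.05, 0.09, 0.04],
    [0.22, 0.21, 0.19],
])
n_datasets, k = scores.shape

# Step 1: Friedman test over the k algorithms.
stat, p = friedmanchisquare(*scores.T)

# Step 2: average rank of each algorithm across datasets (rank 1 = best).
avg_ranks = rankdata(scores, axis=1).mean(axis=0)

# Step 3: Nemenyi critical difference.
q_alpha = 2.343  # assumed table value for k = 3, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")
print(f"average ranks = {avg_ranks}, CD = {cd:.3f}")
```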
These statistical methods provide robust tools for comparing algorithm performance across various benchmarks, ensuring that conclusions drawn are both statistically and practically significant.