Results Analysis#

Overview

Register a New Results Analysis Method#

Customizing the Analysis Pipeline#

List of Performance Evaluation Metrics#

For each type of task instance, the framework offers performance evaluation metrics to assess the quality of the solutions generated by the algorithms. The metrics are categorized by task type and are designed to evaluate various aspects of the solutions. The table below summarizes the performance metrics available for the different tasks.

| Task | Metric | Description | Scale | Type |
|---|---|---|---|---|
| Synthetic | Absolute Error | The absolute difference between the best value found and the known optimal solution. | [0, ∞) | Minimization |
| HPO (Classification) | F1 Score | The harmonic mean of precision and recall, providing a balanced measure of accuracy. | [0, 1] | Maximization |
| HPO (Classification) | Area Under Curve | The area under the receiver operating characteristic (ROC) curve, quantifying a classifier's overall ability to discriminate between positive and negative instances. | [0, 1] | Maximization |
| HPO (Regression) | RMSE | Root mean squared error: the square root of the average squared difference between predicted and actual values. | [0, ∞) | Minimization |
| HPO (Regression) | MAE | Mean absolute error: the average absolute difference between predicted and actual values. | [0, ∞) | Minimization |
| Protein Design | Binding Affinity | The strength of the interaction between a protein and its ligand, typically measured by the equilibrium dissociation constant. | (-∞, 0] | Minimization |
| RNA Inverse Design | GC-content | The fraction of guanine (G) and cytosine (C) bases in a DNA or RNA molecule, which affects stability and melting temperature. | [0, 1] | Maximization |
| LLVM/GCC | Avg Execution Time | The average execution time of the compiled program over multiple runs. | [0, ∞) | Minimization |
| LLVM/GCC | Compilation Time | The time required to compile the code. | [0, ∞) | Minimization |
| LLVM/GCC | File Size | The size of the executable file produced by compilation. | [0, ∞) | Minimization |
| LLVM/GCC | Max RSS | The maximum resident set size used during execution. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI TOT CYC | The total number of CPU cycles consumed during execution. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI TOT INS | The total number of instructions executed by the CPU. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI BR MSP | The number of branch directions mispredicted by the CPU. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI BR PRC | The number of branch directions correctly predicted by the CPU. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI BR CN | The number of conditional branch instructions. | [0, ∞) | Minimization |
| LLVM/GCC | PAPI MEM WCY | The number of cycles spent waiting for memory access. | [0, ∞) | Minimization |
| MySQL | Throughput | The number of transactions processed per unit of time. | [0, ∞) | Maximization |
| MySQL | Latency | The time required to complete a single transaction, from initiation to completion. | [0, ∞) | Minimization |
| MySQL | CPU Usage | The proportion of CPU resources used during database operations. | [0, ∞) | Minimization |
| MySQL | Memory Usage | The amount of memory used during database operations. | [0, ∞) | Minimization |
| Hadoop | Execution Time | The execution time of a big data task. | [0, ∞) | Minimization |
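Most of these metrics are standard and straightforward to compute outside the framework as well. As a minimal sketch, the classification and regression metrics above can be reproduced with scikit-learn and NumPy; the array names here are illustrative placeholders, not the framework's own data structures:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, mean_absolute_error

# Hypothetical ground truth and model output for a classification task.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])              # hard class predictions
y_score = np.array([0.2, 0.9, 0.4, 0.1, 0.8])   # class-1 probabilities

# HPO (Classification): F1 is the harmonic mean of precision and recall;
# AUC is computed from scores, not from hard labels.
print("F1 :", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))

# HPO (Regression): RMSE and MAE on continuous targets.
t = np.array([3.0, -0.5, 2.0, 7.0])
p = np.array([2.5, 0.0, 2.0, 8.0])
print("RMSE:", np.sqrt(np.mean((t - p) ** 2)))
print("MAE :", mean_absolute_error(t, p))
```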

Statistical Measures#

This section provides detailed explanations of the statistical methods used for analyzing the performance of different algorithms. Each method is accompanied by the relevant formulas and calculation procedures.

Wilcoxon Signed-Rank Test#

The Wilcoxon signed-rank test is a non-parametric statistical test used to compare two paired samples. Unlike the paired t-test, the Wilcoxon signed-rank test does not assume that the differences between pairs are normally distributed. It is particularly useful when dealing with small sample sizes or non-normally distributed data.

Given two related samples \(X\) and \(Y\), the steps to perform the Wilcoxon signed-rank test are:

  1. Compute the differences between each pair of observations: \(d_i = X_i - Y_i\).

  2. Rank the absolute values of the nonzero differences from smallest to largest, assigning average ranks to tied values (pairs with zero difference are typically discarded).

  3. Assign signs to the ranks based on the sign of the original differences \(d_i\).

  4. Calculate the test statistic \(W\), which is the sum of the ranks corresponding to the positive differences:

    \[W = \sum_{d_i > 0} \text{Rank}(|d_i|)\]
  5. Compare the computed test statistic \(W\) against the critical value from the Wilcoxon signed-rank table or calculate the p-value to determine the significance of the result.
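In practice the test is rarely computed by hand. A minimal sketch using scipy.stats.wilcoxon, assuming two equal-length arrays of paired performance scores (the numbers are illustrative, not real benchmark results):

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired scores of two algorithms on the same benchmark instances.
x = np.array([0.81, 0.78, 0.92, 0.65, 0.70, 0.88, 0.73, 0.90])
y = np.array([0.79, 0.75, 0.91, 0.66, 0.64, 0.85, 0.70, 0.86])

# scipy ranks |x - y| and sums the signed ranks; note that for the default
# two-sided alternative it reports the smaller of the positive and negative
# rank sums as the statistic. Zero differences are dropped by default.
stat, p_value = wilcoxon(x, y)
print(f"W = {stat}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The paired difference is statistically significant.")
```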

Scott-Knott Test#

The Scott-Knott test is a statistical method used to rank the performance of different techniques across multiple runs on each benchmark instance. It is particularly effective in scenarios where multiple comparisons are being made, and it controls the family-wise error rate.

The procedure involves:

  1. Partitioning the data: initially, all techniques form a single group, sorted by mean performance. The candidate split is the one that maximizes the between-group difference in means.

  2. Testing the split: the significance of the mean difference between the two candidate subgroups is assessed using an appropriate test (e.g., ANOVA or a t-test).

  3. Assigning ranks: if the difference is significant, the two subgroups receive distinct ranks; otherwise, the techniques are considered to be in the same rank.

  4. Repeating the process recursively within each subgroup until no further significant splits can be made.

The Scott-Knott test is particularly useful for determining the relative performance of multiple techniques, providing a clear ranking based on statistically significant differences.
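The recursive partitioning can be sketched in a few lines. The version below follows the procedure above, splitting at the point that maximizes the between-group sum of squares and using Welch's t-test as the significance check; the function and variable names are illustrative, not the framework's API, and it assumes lower scores are better:

```python
import numpy as np
from scipy.stats import ttest_ind

def scott_knott(groups, alpha=0.05):
    """Rank techniques by recursive partitioning (simplified sketch).

    groups: dict mapping technique name -> 1-D array of repeated scores.
    Returns: dict mapping technique name -> integer rank (1 = lowest mean).
    """
    names = sorted(groups, key=lambda n: np.mean(groups[n]))
    ranks, next_rank = {}, [1]

    def split(block):
        if len(block) == 1:
            ranks[block[0]] = next_rank[0]
            next_rank[0] += 1
            return
        means = np.array([np.mean(groups[n]) for n in block])
        # Choose the cut that maximizes the between-group sum of squares.
        best_cut, best_ss = None, -np.inf
        for cut in range(1, len(block)):
            left, right = means[:cut], means[cut:]
            ss = (len(left) * (left.mean() - means.mean()) ** 2
                  + len(right) * (right.mean() - means.mean()) ** 2)
            if ss > best_ss:
                best_cut, best_ss = cut, ss
        left_obs = np.concatenate([groups[n] for n in block[:best_cut]])
        right_obs = np.concatenate([groups[n] for n in block[best_cut:]])
        _, p = ttest_ind(left_obs, right_obs, equal_var=False)
        if p < alpha:                 # significant: recurse on both halves
            split(block[:best_cut])
            split(block[best_cut:])
        else:                         # not significant: one shared rank
            for n in block:
                ranks[n] = next_rank[0]
            next_rank[0] += 1

    split(names)
    return ranks
```

On a dict such as {"alg_a": scores_a, "alg_b": scores_b, ...} this returns one integer rank per technique; techniques that cannot be separated at the chosen significance level share a rank.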

A12 Effect Size#

The A12 effect size is a non-parametric measure used to evaluate the probability that one algorithm outperforms another. It is particularly useful in understanding whether observed differences are practically significant, beyond just being statistically significant.

The A12 statistic is calculated as follows:

  1. Let \(A\) and \(B\) be the two sets of performance measures for two algorithms.

  2. Calculate the A12 statistic:

    \[A_{12} = \frac{\sum_{x \in A} \sum_{y \in B} \left[\mathbf{I}(x > y) + 0.5 \cdot \mathbf{I}(x = y)\right]}{|A| \cdot |B|}\]
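An A12 of 0.5 indicates no difference between the two algorithms; values toward 1 indicate that algorithm A tends to outperform B (assuming larger values are better), and values around 0.56, 0.64, and 0.71 are conventionally read as small, medium, and large effects. The formula translates directly to code; a minimal sketch with illustrative array names:

```python
import numpy as np

def a12(a, b):
    """Vargha-Delaney A12: P(score from a > score from b), ties count half.

    Assumes larger scores are better; for minimization metrics compare
    negated scores or report 1 - a12(a, b).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Pairwise comparison via broadcasting: one row per element of a.
    greater = (a[:, None] > b[None, :]).sum()
    ties = (a[:, None] == b[None, :]).sum()
    return (greater + 0.5 * ties) / (len(a) * len(b))

print(a12([0.9, 0.8, 0.85], [0.7, 0.75, 0.8]))  # ~0.94: A almost always wins
```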

Critical Difference (CD)#

The Critical Difference (CD) is a statistical measure used to assess whether performance differences between algorithms could have arisen by chance. It is typically used in conjunction with methods like the Friedman test or the Nemenyi post-hoc test to evaluate multiple algorithms across multiple datasets.

The steps involved in calculating the Critical Difference are:

  1. Perform a Friedman test to rank the algorithms for each dataset.

  2. Calculate the average ranks for each algorithm across all datasets.

  3. Compute the Critical Difference (CD) using the following formula:

    \[\text{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}\]

    where:

    * \(q_{\alpha}\) is the critical value for a given significance level \(\alpha\) from the studentized range statistic,

    * \(k\) is the number of algorithms,

    * \(N\) is the number of datasets.

  4. If the difference in average ranks between two algorithms exceeds the CD, the performance difference is considered statistically significant, and not due to random variation.
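A minimal sketch of step 3, with the \(q_{\alpha}\) values for \(\alpha = 0.05\) hardcoded from Demšar's Nemenyi-test table for \(k \le 10\) (a proper statistics library should be preferred in practice; the function name is illustrative):

```python
import math

# Critical values q_0.05 of the Nemenyi test (Demsar, 2006),
# indexed by the number of algorithms k = 2..10.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728,
              6: 2.850, 7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def critical_difference(k, n_datasets, q_table=Q_ALPHA_05):
    """CD = q_alpha * sqrt(k(k+1) / (6N)) for k algorithms over N datasets."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Example: 5 algorithms compared on 20 datasets.
cd = critical_difference(5, 20)
print(f"CD = {cd:.3f}")
# Any two algorithms whose average ranks differ by more than CD are
# considered significantly different at alpha = 0.05.
```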

These statistical methods provide robust tools for comparing algorithm performance across various benchmarks, ensuring that conclusions drawn are both statistically and practically significant.