AI Model Evaluation
Recent Submissions
Benchmarking Cluster-Then-Predict Models to Challenge Prevailing Global Machine Learning Models (2025-01-07)
Hambauer, Nico; Stoecker, Theodor; Zschech, Patrick; Kraus, Mathias

In predictive analytics domains such as healthcare, marketing, and finance, data exhibits inherent segmentation, such as patient, customer, and market segments. Powerful global models such as XGBoost or CatBoost offer high predictive performance, yet they do not model clusters explicitly and suffer from low interpretability. Cluster-then-predict (CTP) models have been proposed to offer more actionable insights. These hybrid models first segment the data and then train cluster-specific linear models, combining the capacity to model complex relationships with model transparency. Previous CTP approaches rely on decision trees for segmentation, neglecting alternative methods. This study proposes six CTP models and benchmarks them against five global models. Our results show that k-means CTP ranks fourth out of eleven models across 20 benchmark datasets. While CTP models with decision trees rank fifth, they are substantially simpler to interpret. Consequently, we establish a variety of cluster-then-predict models and call for their consideration when faced with heterogeneous datasets.

Debiased Gaussian Process-based Machine Learning with Partially Observed Information (2025-01-07)
Xu, Yanwen; Wu, Shengxiang; Chao, Xuehui

Widely applicable machine learning and artificial intelligence technologies have resulted in an increasing demand for reliable models. Due to the ubiquitous data scarcity of the real world, model training can often be challenging and face limitations. Although various data augmentation techniques can efficiently alleviate this dilemma, bias is unavoidable, causing trade-offs in prediction accuracy. To address this general dilemma, we present a Two-Stage Debiased Gaussian Process (TSDGP)-based machine learning model capable of providing robust and accurate predictions across various fields, even with partially observed information. Given the partially observed information in the input data, a latent variable model is leveraged in stage one to enhance heterogeneous data utilization by reconstructing the unavailable information. In stage two, the model and uncertainties from the first stage are refined within the Bayesian framework using the augmented dataset. We demonstrate the consistency and the first and second moments of the proposed two-stage model, supporting the accuracy and robustness of the results. Building on this theoretical foundation, we further evaluate TSDGP through numerical and empirical experiments, showing the strong performance of the proposed approach. In conclusion, TSDGP can resolve the dilemma caused by real-world data scarcity, enabling a reliable high-fidelity predictive model to be trained on partially observed datasets without a significant trade-off in accuracy.

The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration (2025-01-07)
Bihani, Geetanjali; Rayz, Julia

The advent of pre-trained language models (PLMs) has enabled significant performance gains in the field of natural language processing. However, recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models. Current evaluation methods for PLM calibration often assume that lower calibration error estimates indicate more reliable predictions. However, fine-tuned PLMs often resort to shortcuts, leading to overconfident predictions that create the illusion of enhanced performance but lack generalizability in their decision rules.
The relationship between PLM reliability, as measured by calibration error, and shortcut learning has not been thoroughly explored thus far. This paper investigates this relationship, studying whether lower calibration error implies reliable decision rules for a language model. Our findings reveal that models with seemingly superior calibration exhibit higher levels of non-generalizable decision rules. This challenges the prevailing notion that well-calibrated models are inherently reliable. Our study highlights the need to bridge the current gap between language model calibration and generalization objectives, urging the development of comprehensive frameworks to achieve truly robust and reliable language models.

Simulating Strategic Reasoning: Comparing the Ability of Single LLMs and Multi-Agent Systems to Replicate Human Behavior (2025-01-07)
Sreedhar, Karthik; Chilton, Lydia

When creating policies, plans, or designs for people, it is challenging for designers to foresee all of the ways in which people may reason and behave. Recently, Large Language Models (LLMs) have been shown to be able to simulate human reasoning. We extend this work by measuring LLMs' ability to simulate strategic reasoning in the ultimatum game, a classic economics bargaining experiment. Experimental evidence shows human strategic reasoning is complex: people will often choose to "punish" other players to enforce social norms, even at personal expense. We test whether LLMs can replicate this behavior in simulation, comparing two structures: single LLMs and multi-agent systems. We compare their abilities to (1) simulate human-like reasoning in the ultimatum game, (2) simulate two player personalities, greedy and fair, and (3) create robust strategies that are logically complete and consistent with personality. Our evaluation shows that multi-agent systems are more accurate than single LLMs (88% vs. 50%) in simulating human reasoning and actions for personality pairs.
Thus, there is potential to use LLMs to simulate human strategic reasoning to help decision- and policy-makers perform preliminary explorations of how people behave in systems.

Exploring User Evaluations of Machine Learning Models: A Qualitative Study on the Impact of Confidence Intervals (2025-01-07)
Meyers, Scott; Murry, Paige; Jessup, Sarah; Alarcon, Gene; Harris, Krista

Research on artificial intelligence and machine learning models has burgeoned in the last decade. However, research has seldom utilized qualitative methods for assessing user-based experiences and system evaluations of AI/ML models. This study aims to provide an example of how thematic text analysis can be used to provide greater insight into user experiences with these systems, and to examine how varying levels of model transparency affect evaluations. Participants (N = 130) completed an image-binning monitoring task with either an uncalibrated classification model (UCM), which displayed high confidence regardless of classification accuracy, or a calibrated classification model (CCM), which had greater calibration between accuracy and confidence. Results revealed detailed information on user evaluations of both models, including various performance perceptions, impressions, and strategy behaviors. Furthermore, we identified key differences in user evaluations between these models under our confidence manipulation, such as greater trust and greater use of the confidence display. Qualitative analysis has been shown to be an effective approach for detailed investigation of user experiences and model evaluation.

Introduction to the Minitrack on AI Model Evaluation (2025-01-07)
Rayz, Julia; Juric, Radmila; Steele, Robert

Reinforcement Learning for Adversarial Environments (2025-01-07)
Carrizales, Christian; Zhang, Xiaodong (Frank); Bihl, Trevor

This paper explores the development of more intelligent and competitive AI agents for adversarial environments.
A hide-and-seek simulation environment with three sensor models is developed, including a lidar, a far-field sensor, and a near-field sensor. Four AI-vs-AI adversarial scenarios are investigated using the Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) reinforcement learning (RL) algorithms. Experimental results across RL algorithms and sensor models show that the seeker and the hider have the greatest competitive advantage in the SAC-seeker-versus-PPO-hider and PPO-seeker-versus-SAC-hider scenarios, respectively. Additionally, the impact of sensing modalities on agent learning performance is investigated. Comparative studies reveal that extra sensing modalities improve agent performance, and that the far-field sensor outperforms the near-field sensor. The results also suggest that an agent whose AI algorithm holds a competitive advantage is more resilient to variations in sensing modalities.
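The four adversarial pairings in the last abstract ({PPO, SAC} seekers against {PPO, SAC} hiders) can be tabulated with a simple round-robin evaluation loop. The sketch below is purely illustrative: `run_episode` and its 0.5 win probability are hypothetical stubs standing in for full RL rollouts, not anything reported in the paper.

```python
import itertools
import random

ALGOS = ["PPO", "SAC"]

def run_episode(seeker_algo, hider_algo, rng):
    # Stub outcome model: in a real study this would roll out trained
    # seeker/hider policies in the hide-and-seek environment and return
    # True if the seeker finds the hider. The coin flip is a placeholder.
    return rng.random() < 0.5

def pairing_matrix(episodes=1000, seed=0):
    """Tally seeker win rates for every seeker/hider algorithm pairing."""
    rng = random.Random(seed)
    results = {}
    for seeker, hider in itertools.product(ALGOS, ALGOS):
        wins = sum(run_episode(seeker, hider, rng) for _ in range(episodes))
        results[(seeker, hider)] = wins / episodes
    return results

if __name__ == "__main__":
    for (seeker, hider), rate in pairing_matrix().items():
        print(f"seeker={seeker} hider={hider} seeker-win-rate={rate:.2f}")
```

With real trained agents in place of the stub, comparing rows and columns of this matrix is one way to surface pairings where one algorithm holds a competitive advantage.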