AI System Evaluation
Permanent URI for this collection: https://hdl.handle.net/10125/112419
Recent Submissions
Mining Hidden Prompt Engineering Patterns with Formal Concept Analysis and Association Rules (2026-01-06)
Tolzin, Antonia; Hille, Tobias; Knoth, Nils; Janson, Andreas
Designing effective prompts to guide generative artificial intelligence (GAI) systems, or prompt engineering, has become a crucial skill. However, the underlying prompt patterns have not yet been thoroughly examined. This paper introduces a novel analytical method that combines formal concept analysis (FCA) and association rule mining. This approach is used to systematically analyze prompt engineering behaviors within an empirical dataset of human–AI interactions. Findings reveal hidden prompt patterns linking prompts to GAI outputs, providing insights that traditional analyses cannot offer. Furthermore, we demonstrate that prompting guides, especially those with examples, facilitate more sophisticated prompt engineering behavior and improve GAI output quality. Our work contributes to information systems theory by demonstrating the value of FCA-based structural analysis in human–GAI contexts and to the practice of prompt engineering by offering evidence-based guidance on improving prompt design and prompt engineering skill development.

Designing Algorithmic Ensembles for Fair and Accurate AI-based Mammography Screening (2026-01-06)
Ahsen, Eren; Ayvaci, Mehmet; Ozeken, Cetin
As AI algorithms become increasingly prevalent in healthcare, ensuring both accuracy and fairness in decision-making poses a dual challenge. Breast cancer screening is a particularly high-stakes example: FDA-cleared AI tools are already in clinical use, and nearly 40 million mammograms are performed annually in the United States, yet many of these systems have been trained on limited subpopulations, raising equity concerns. We address this challenge by leveraging the diversity of predictive algorithms developed for the same task and forming a linear ensemble.
We develop a statistical model that establishes conditions under which such an ensemble can satisfy equal-opportunity fairness and then demonstrate its application using simulated data calibrated from a real-world AI mammography competition. Our analysis shows that ensembles can improve both accuracy and fairness, especially when constituent algorithms differ in subgroup performance. These findings provide actionable guidance for healthcare.

Capturing Authorship Style Through Large Language Models (2026-01-06)
Rayz, Julia; Dubey, Anuj
This work explores the use of large language model (LLM) embeddings to capture style in authorship attribution tasks. By applying a Siamese network to embeddings from OpenAI’s text-embedding-ada-002, we assess whether style embeddings can distinguish authors across unseen texts. The results show strong attribution accuracy that outperforms traditional characteristics, though performance declines with more authors and improves with longer texts.

Reinforcement Learning for Adversarial Systems Using Relational Observations (2026-01-06)
Gilson, Sophia; Zhang, Xiaodong (Frank); Bihl, Trevor; Cox, Bruce
This paper investigates the integration of relational observations with the reinforcement learning (RL) framework for improved generalization capability. A hide-and-seek simulation environment is designed in Unity for proof-of-concept demonstration. Two observation representations—relational (analogical) and standard positional—are designed to evaluate agent learning and generalization capabilities. Agents are trained using the Proximal Policy Optimization (PPO) algorithm in a random-room environment and tested in both the random-room environment and a novel environment with greater spatial complexity and path obstructions.
Comparative studies indicate that relational representation of objects in the adversarial environment could improve the generalization capability of RL agents to novel and complex environments. Cross-testing results also suggest that relational observations may enhance agents’ effectiveness in pursuit and evasion tasks in adversarial environments.

Why do Large Language Models Judge Differently than Humans? An Examination of Sentiment Analysis of Movie Reviews (2026-01-06)
Messerschmidt, Nils; Sartipi, Amir; Abbas, Antragama Ewa; Papageorgiou, Orestis; Tchappi, Igor; Fridgen, Gilbert
This research investigates the root causes of divergence between Large Language Model (LLM)-based and human sentiment judgments. Using an inductive approach, we qualitatively analyzed a movie review dataset and identified two main causes: (i) contextual statements, where sentiment depends on situational factors (e.g., describing a film as “childish” may be positive for younger audiences but negative for adults); and (ii) linguistic statements, where sentiment shifts due to complex constructions such as sarcasm or double negation. Our study thus highlights the importance of both context (where, when, and for whom a statement is made) and linguistic form (how it is phrased) in sentiment interpretation. We contribute to the literature by identifying justificatory mechanisms behind differences in sentiment judgments between humans and LLMs. This may initiate a broader discourse on whether machine-generated sentiment can serve as a valid proxy for human interpretation. Moreover, human-in-the-middle approaches may still outperform solely LLM-based sentiment interpretations.

Small Language Models for Curriculum-based Guidance (2026-01-06)
Katharakis, Konstantinos; Rossi, Sippo; Mukkamala, Raghava Rao
The adoption of generative AI and large language models (LLMs) in education is still emerging.
In this study, we explore the development and evaluation of AI teaching assistants that provide curriculum-based guidance using a retrieval-augmented generation (RAG) pipeline applied to selected open-source small language models (SLMs). We benchmarked eight SLMs, including LLaMA 3.1, IBM Granite 3.3, and Gemma 3 (7–17B parameters), against GPT-4o. Our findings show that with proper prompting and targeted retrieval, SLMs can match LLMs in delivering accurate, pedagogically aligned responses. Importantly, SLMs offer significant sustainability benefits due to their lower computational and energy requirements, enabling real-time use on consumer-grade hardware without depending on cloud infrastructure. This makes them not only cost-effective and privacy-preserving but also environmentally responsible, positioning them as viable AI teaching assistants for educational institutions aiming to scale personalized learning in a sustainable and energy-efficient manner.

AI-Enabled Forensic Risk Assessment: TRAP-18 System Architecture and Proof of Concept (2026-01-06)
Acklin, Marvin; Edginton, Kyler; Topping, Kailey; Meloy, Reid; Kupper, Julia
This study explores the development and validation of an AI-enabled forensic risk assessment system using the Terrorist Radicalization Assessment Protocol-18 (TRAP-18) as a proof of concept. Building on prior work demonstrating large language models’ (LLMs) ability to code TRAP-18 indicators with expert-level reliability, this project integrates structured professional judgment (SPJ) methodology with advanced AI frameworks, including LangChain and LangGraph. A prototype system was constructed to simulate a full TRAP-18 evaluation, encompassing indicator coding, justification generation, Bayesian probability estimation, and narrative risk formulation. Results demonstrate high consistency with human raters on proximal warning behaviors and strong agreement across multi-model workflows.
The architecture emphasizes transparency, reproducibility, and bias mitigation, highlighting AI’s potential to augment forensic practice through structured reasoning, hypothesis testing, and scalable data integration. Beyond TRAP-18, this framework offers a pathway toward AI-assisted applications in general violence risk assessment, reinforcing human-AI collaboration in forensic psychology.

Towards Quantifying Compliance with the EU AI Act (2026-01-06)
Clement, Tobias; Nguyen, Phuc Truong Loc; Arnold, Stefan
As AI systems proliferate in high-risk domains, assessing their compliance with emerging regulatory standards has become imperative. The EU AI Act outlines ethical requirements across five dimensions: explainability, fairness, privacy, robustness, and social and environmental well-being. However, existing evaluation approaches lack a unified methodology to quantitatively operationalize these principles. In this paper, we propose a structured, score-based framework that translates the Act’s pillars into 22 interpretable metrics, enabling reproducible, model-agnostic compliance assessments. Applied to three benchmark tabular classification tasks using a standardized deep learning model, our framework captures how dataset characteristics shape ethical performance. The results reveal key trade-offs: models with high predictive accuracy do not necessarily meet compliance expectations, and larger datasets tend to improve robustness but increase vulnerability to privacy leakage. Correlation analyses expose metric redundancy in fairness and explainability, suggesting potential for simplification. Privacy metrics, by contrast, remain essential and diverse.
Social and environmental measures emerge as the least mature, underscoring the need for novel, bounded metrics in future research.

On the Feasibility of Vision-Language Models for Time-Series Classification (2026-01-06)
Prithyani, Vinay; Mohammed, Mohsin; Gadgil, Richa; Buitrago, Ricardo; Jain, Vinija; Chadha, Aman
We explore the feasibility of applying Vision–Language Models (VLMs) to time-series classification (TSC) by fine-tuning them with minimal supervision. We develop a novel approach that incorporates graphical data representations as images in conjunction with numerical data. This approach is rooted in the hypothesis that graphical representations can provide additional contextual information that numerical data alone may not capture. Additionally, providing a graphical representation can circumvent issues such as the limited context length faced by LLMs. To study this systematically, we implement a scalable end-to-end pipeline that supports multiple scenarios, varying context length, downsampling strategy, and prompt design. Using this pipeline, we fine-tune VLMs for only one to two epochs on univariate and multivariate datasets and analyze how design choices affect accuracy. Our findings position VLMs as a feasible but currently limited baseline for TSC and point toward design considerations for future work at the intersection of multimodal learning and temporal data.

Introduction to the Minitrack on AI System Evaluation (2026-01-06)
Steele, Robert; Juric, Radmila; Rayz, Julia
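The association-rule side of the prompt-pattern mining described in the first item above can be sketched with a minimal support/confidence computation. This is an illustrative toy, not the paper's method or data: the prompt attributes and transactions below are invented.

```python
from itertools import combinations

# Hypothetical binary prompt-feature "transactions": each set lists the
# attributes observed in one user prompt (all names are invented).
transactions = [
    {"gives_example", "states_role", "high_quality_output"},
    {"gives_example", "sets_format", "high_quality_output"},
    {"states_role", "sets_format"},
    {"gives_example", "states_role", "sets_format", "high_quality_output"},
    {"sets_format"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# Mine all single-antecedent rules predicting a high-quality GAI output.
items = set().union(*transactions) - {"high_quality_output"}
rules = {a: confidence({a}, {"high_quality_output"}, transactions) for a in items}
print(rules)
```

In this toy data, prompts containing an example always co-occur with a high-quality output (confidence 1.0), echoing the paper's finding that example-based prompting guides help most; FCA would additionally organize these itemsets into a concept lattice.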
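The linear-ensemble idea in the mammography screening item can be illustrated with a toy equal-opportunity check: compare per-subgroup true-positive rates for two individual models and for their score average. All scores, subgroups, and the threshold below are hypothetical, chosen only to show the mechanism.

```python
# Hypothetical positive-case scores from two AI readers, split by subgroup;
# model_1 is strong on group_A, model_2 on group_B (numbers are invented).
scores = {
    "model_1": {"group_A": [0.9, 0.8, 0.7, 0.6], "group_B": [0.4, 0.2, 0.6, 0.3]},
    "model_2": {"group_A": [0.4, 0.3, 0.6, 0.2], "group_B": [0.9, 0.8, 0.7, 0.6]},
}
THRESHOLD = 0.5

def tpr(s):
    """True-positive rate over positive cases at the fixed threshold."""
    return sum(x >= THRESHOLD for x in s) / len(s)

def eo_gap(by_group):
    """Equal-opportunity gap: absolute TPR difference between subgroups."""
    return abs(tpr(by_group["group_A"]) - tpr(by_group["group_B"]))

# Linear ensemble: average the two models' scores case by case.
ensemble = {
    g: [(a + b) / 2 for a, b in zip(scores["model_1"][g], scores["model_2"][g])]
    for g in ("group_A", "group_B")
}

gaps = {m: eo_gap(s) for m, s in scores.items()}
gaps["ensemble"] = eo_gap(ensemble)
print(gaps)  # here the averaged ensemble closes the subgroup TPR gap
```

Here each constituent model has a 0.75 TPR gap between subgroups, while the averaged ensemble equalizes the subgroup TPRs, which is the kind of complementarity the paper's conditions formalize.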
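The retrieval step of the curriculum-guidance RAG pipeline in the small-language-model item can be sketched with a stdlib term-frequency retriever. The course chunks and query are invented, and a production pipeline would use dense embeddings and an SLM to generate the answer; only the retrieve-then-prompt shape is shown.

```python
import math
from collections import Counter

# Toy curriculum chunks standing in for an indexed course corpus (invented).
chunks = [
    "Week 3 covers gradient descent and learning-rate schedules.",
    "Week 5 covers convolutional networks for image classification.",
    "Grading policy: two assignments and one final project.",
]

def vec(text):
    """Bag-of-words term-frequency vector with light punctuation stripping."""
    return Counter(text.lower().replace("?", " ").replace(".", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

def retrieve(query, k=1):
    """Return the k chunks most similar to the query."""
    q = vec(query)
    return sorted(chunks, key=lambda c: cosine(q, vec(c)), reverse=True)[:k]

question = "When do we learn gradient descent?"
context = retrieve(question)[0]
# The retrieved chunk is grounded into the SLM prompt to keep answers
# curriculum-based rather than free-form.
prompt = f"Answer using only the course material below.\n\nContext: {context}\n\nQuestion: {question}"
print(context)
```

The design point is that retrieval quality, not model size, carries much of the burden here, which is why the benchmarked SLMs can approach GPT-4o when the right chunk is surfaced.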
