Zero-shot Comparison of Large Language Models (LLMs) Reasoning Abilities on Long-text Analogies

Date

2025-01-07

Starting Page

1610

Abstract

In recent years, large language models (LLMs) have made substantial strides in mimicking human language and presenting information coherently. However, researchers continue to debate the accuracy and robustness of LLMs’ reasoning abilities. The reasoning abilities of thirteen LLMs were tested on two long-text analogy datasets, named Rattermann and Wharton, each of which required the models to rank a series of stories from most to least analogous to a source story. On the Rattermann dataset, GPT-4 obtained the highest accuracy, 70%. Overall, LLMs appear to over-emphasize similar story entities (characters and settings) and to lack awareness of the higher-order relationships between stories. LLMs struggled more with the Wharton dataset: the highest accuracy, 46.4%, was achieved by GPT-4o, and all but nine LLMs performed below random-chance accuracy. Although LLMs are improving, they still struggle with higher-order cognitive tasks such as analogical reasoning.

Keywords

Natural Language Processing and Large Language Models Supporting Data Analytics for System Sciences, analogical reasoning, artificial intelligence, generative AI, large language models, zero-shot learning

Extent

10

Related To

Proceedings of the 58th Hawaii International Conference on System Sciences

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International
