DOCUSAGE: HARNESSING HIERARCHICAL CLUSTERING IN SALIENCE-DRIVEN NARRATIVE SYNTHESIS

Date

2024

Abstract

Text summarization remains a crucial yet challenging task in natural language processing, especially as the volume of text data grows exponentially. This thesis introduces Sumsage, an optimization-based text summarization method that synthesizes concise yet informative summaries. The work makes three contributions. First, we developed the Syn-D-sum dataset from the CNN/DailyMail dataset as a resource for training and evaluating summarization models. Second, we propose the Sumsage algorithm, which uses hierarchical clustering to extract key sentences and construct coherent summaries, closely emulating human summarizers. Third, we designed two evaluation methods, the Symphony penalty and the Captured Importance Quantification score, which assess the quality of generated summaries by considering both narrative structure and sentence order. Sumsage’s dynamic tree structure and hierarchical clustering approach enable efficient and scalable summarization while maintaining contextual relevance and minimizing hallucination. In our experiments, Sumsage outperforms GPT-3.5-turbo, generating summaries that are similar to those written by humans and that capture more essential information. Sumsage thus offers a robust and interpretable method for generating high-quality summaries, and it lays a foundation for future work in narrative synthesis and evaluation.

Keywords

Artificial intelligence, Computer science, Dataset synthesis, Narrative synthesis, Natural language processing, Text summarization

Extent

67 pages

Rights

All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.

Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.