Improving Efficiency in Data Wrangling With Semantic Type Detection

Date

2023

Abstract

This thesis presents SLED (Semantic LLM Enrichment of Data), a Python library that leverages Large Language Models (LLMs) to automate essential tasks in data wrangling and management, with a focus on adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. SLED is designed to enrich data through three means: contextualization, documentation, and validation. At its core, SLED performs automatic semantic type detection to contextualize datasets, offering a deeper understanding of the data than primitive data types alone convey. This contextualization facilitates data wrangling, enhances documentation, and improves data validation. To this end, SLED fine-tunes an open-source LLM to perform accurate semantic type detection, using both real and synthetic training data. Our fine-tuned LLM demonstrates significant improvement in semantic type classification over the base model. SLED streamlines data management tasks while enriching data and improving its accessibility, in line with the overarching goals of open science and effective data analysis. By aiming to reduce the complexities associated with data science, SLED makes data analysis more approachable and efficient. Additionally, it ensures compliance with FAIR data principles, reinforcing its commitment to open data science.
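
To make the idea of semantic type detection concrete, the following is a minimal illustrative sketch, not SLED's implementation or API. The abstract states that SLED relies on a fine-tuned LLM; this sketch swaps in simple regular-expression rules purely to show what mapping a column of primitive strings to a semantic label might look like. The label names, patterns, threshold, and function names are all assumptions made for illustration.

```python
import re

# Hypothetical label set and patterns, chosen only for illustration.
# SLED itself uses a fine-tuned LLM rather than hand-written rules.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "us_zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}


def detect_semantic_type(values, threshold=0.8):
    """Return the first label whose pattern matches at least `threshold`
    of the non-empty values, or "unknown" if no label qualifies."""
    cleaned = [str(v).strip() for v in values if str(v).strip()]
    if not cleaned:
        return "unknown"
    for label, pattern in PATTERNS.items():
        matches = sum(1 for v in cleaned if pattern.match(v))
        if matches / len(cleaned) >= threshold:
            return label
    return "unknown"


def annotate_table(table):
    """Map each column name to a detected semantic type.

    `table` is a dict of column name -> list of cell values."""
    return {name: detect_semantic_type(col) for name, col in table.items()}


if __name__ == "__main__":
    sample = {
        "contact": ["a@b.org", "c@d.com"],
        "zip": ["96822", "96813-1234"],
        "joined": ["2023-01-05", "2023-02-11"],
        "note": ["hello", "x"],
    }
    for column, label in annotate_table(sample).items():
        print(f"{column}: {label}")
```

Every column in the sample holds plain strings, so primitive typing alone cannot tell an email address from a postal code. The semantic label is what would let downstream documentation and validation steps treat the columns differently.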

Keywords

Computer science, data wrangling, FAIR, large language models, semantic table interpretation

Extent

46 pages

Rights

All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.

Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.