Improving Efficiency in Data Wrangling With Semantic Type Detection

Date
2023
Authors
Yu, Andy
Advisor
Belcaid, Mahdi
Department
Computer Science
Abstract
This thesis presents SLED (Semantic LLM Enrichment of Data), a Python library that leverages Large Language Models (LLMs) to automate essential tasks in data wrangling and management, with a focus on adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. SLED enriches data in three ways: contextualization, documentation, and validation. At its core, SLED performs automatic semantic type detection to contextualize datasets, offering a deeper and more nuanced understanding of the data than primitive data types provide. This contextualization facilitates data wrangling, enhances documentation, and improves data validation. To this end, SLED fine-tunes an open-source LLM to perform accurate semantic type detection, using both real and synthetic training data. The fine-tuned LLM demonstrates significant improvement in semantic type classification over the base model. SLED streamlines data management tasks while enriching data and improving its accessibility, in line with the overarching goals of open science and effective data analysis. By reducing the complexities associated with data science, SLED aims to make data analysis more approachable and efficient for everyone. It also ensures compliance with the FAIR data principles, reinforcing its commitment to open data science.
Keywords
Computer science, data wrangling, FAIR, large language models, semantic table interpretation
Extent
46 pages
Rights
All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.
Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.