Improving Efficiency in Data Wrangling With Semantic Type Detection

dc.contributor.advisor: Belcaid, Mahdi
dc.contributor.author: Yu, Andy
dc.contributor.department: Computer Science
dc.date.accessioned: 2024-02-26T20:14:02Z
dc.date.available: 2024-02-26T20:14:02Z
dc.date.issued: 2023
dc.description.degree: M.S.
dc.identifier.uri: https://hdl.handle.net/10125/107903
dc.subject: Computer science
dc.subject: data wrangling
dc.subject: FAIR
dc.subject: large language models
dc.subject: semantic table interpretation
dc.title: Improving Efficiency in Data Wrangling With Semantic Type Detection
dc.type: Thesis
dcterms.abstract: This thesis presents SLED (Semantic LLM Enrichment of Data), a Python library that leverages Large Language Models (LLMs) to automate essential tasks in data wrangling and management, with a focus on adherence to FAIR (Findable, Accessible, Interoperable, Reusable) data principles. SLED is designed to enrich data through three means: contextualization, documentation, and validation. At its core, SLED performs automatic semantic type detection to contextualize datasets, offering a deeper and more nuanced understanding of primitive data types. This contextualization is vital for facilitating data wrangling, enhancing documentation, and improving data validation. To do so, SLED fine-tunes an open-source LLM to perform accurate semantic type detection, using real and synthetic training data. Our fine-tuned LLM demonstrates significant improvement in semantic type classification over the base model. SLED significantly streamlines data management tasks while enriching and improving data accessibility. This aligns with the overarching goals of open science and effective data analysis. By aiming to reduce the complexities associated with data science, SLED makes data analysis more approachable and efficient for everyone. Additionally, it ensures compliance with FAIR data principles, reinforcing its commitment to open data science.
dcterms.extent: 46 pages
dcterms.language: en
dcterms.publisher: University of Hawai'i at Manoa
dcterms.rights: All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.
dcterms.type: Text
local.identifier.alturi: http://dissertations.umi.com/hawii:11971

Files

Original bundle
Name: Yu_hawii_0085O_11971.pdf
Size: 9.93 MB
Format: Adobe Portable Document Format