Improving Efficiency in Data Wrangling With Semantic Type Detection
dc.contributor.advisor | Belcaid, Mahdi | |
dc.contributor.author | Yu, Andy | |
dc.contributor.department | Computer Science | |
dc.date.accessioned | 2024-02-26T20:14:02Z | |
dc.date.available | 2024-02-26T20:14:02Z | |
dc.date.issued | 2023 | |
dc.description.degree | M.S. | |
dc.identifier.uri | https://hdl.handle.net/10125/107903 | |
dc.subject | Computer science | |
dc.subject | data wrangling | |
dc.subject | FAIR | |
dc.subject | large language models | |
dc.subject | semantic table interpretation | |
dc.title | Improving Efficiency in Data Wrangling With Semantic Type Detection | |
dc.type | Thesis | |
dcterms.abstract | This thesis presents SLED (Semantic LLM Enrichment of Data), a Python library that leverages Large Language Models (LLMs) to automate essential tasks in data wrangling and management, with a focus on adherence to FAIR (Findable, Accessible, Interoperable, Reusable) data principles. SLED is designed to enrich data through three means: contextualization, documentation, and validation. At its core, SLED performs automatic semantic type detection to contextualize datasets, offering a deeper and more nuanced understanding of primitive data types. This contextualization is vital for facilitating data wrangling, enhancing documentation, and improving data validation. To do so, SLED fine-tunes an open-source LLM to perform accurate semantic type detection, using real and synthetic training data. Our fine-tuned LLM demonstrates significant improvement in semantic type classification over the base model. SLED significantly streamlines data management tasks while enriching and improving data accessibility. This aligns with the overarching goals of open science and effective data analysis. By aiming to reduce the complexities associated with data science, SLED makes data analysis more approachable and efficient for everyone. Additionally, it ensures compliance with FAIR data principles, reinforcing its commitment to open data science. | |
dcterms.extent | 46 pages | |
dcterms.language | en | |
dcterms.publisher | University of Hawai'i at Manoa | |
dcterms.rights | All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner. | |
dcterms.type | Text | |
local.identifier.alturi | http://dissertations.umi.com/hawii:11971 |