IRIS-DS: A New Approach for Identifiers and References Discovery in Document Stores

dc.contributor.author Souibgui, Manel
dc.contributor.author Atigui, Faten
dc.contributor.author Ben Yahia, Sadok
dc.contributor.author Si-Said Cherfi, Samira
dc.date.accessioned 2020-12-24T19:10:29Z
dc.date.available 2020-12-24T19:10:29Z
dc.date.issued 2021-01-05
dc.description.abstract NoSQL stores offer new, cost-effective, and schema-free systems. Although NoSQL stores are widely adopted today, Business Intelligence & Analytics (BI&A) remains associated with relational databases. Exploiting schema-free data for analytical purposes is challenging, since it requires revisiting all the BI&A phases, particularly the Extract-Transform-Load (ETL) process, to accommodate big data sources such as document stores. In the ETL process, joining several collections whose join fields are not explicitly known is a significant challenge. Detecting these fields manually is time-consuming and laborious, and even infeasible for large-scale datasets. In this paper, we study the problem of discovering join fields automatically and introduce an algorithm to detect both identifiers and references on several document stores. Our approach comprises two core stages: (i) discovering identifier candidates; and (ii) identifying candidate pairs of identifier and reference fields. We use scoring features and pruning rules based on both syntactic and semantic aspects to efficiently discover true candidates from a huge number of initial ones. Finally, we report experimental findings that show very promising results.
dc.format.extent 10 pages
dc.identifier.doi 10.24251/HICSS.2021.118
dc.identifier.isbn 978-0-9981331-4-0
dc.identifier.uri http://hdl.handle.net/10125/70730
dc.language.iso English
dc.relation.ispartof Proceedings of the 54th Hawaii International Conference on System Sciences
dc.rights Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject Big Data and Analytics: Pathways to Maturity
dc.subject document-oriented stores
dc.subject extract-transform-load
dc.subject identifier discovery
dc.subject join
dc.subject nosql
dc.subject reference discovery
dc.title IRIS-DS: A New Approach for Identifiers and References Discovery in Document Stores
prism.startingpage 970
Files
Original bundle
Name: 0096.pdf
Size: 982.5 KB
Format: Adobe Portable Document Format