Big Data and Analytics: Pathways to Maturity

Recent Submissions

  • Item
    Switching Scheme: A Novel Approach for Handling Incremental Concept Drift in Real-World Data Sets
    (2021-01-05) Baier, Lucas; Kellner, Vincent; Kühl, Niklas; Satzger, Gerhard
    Machine learning models play a crucial role in many business and industry applications. However, models only start adding value once they are deployed into production. One challenge for deployed models is the effect of data changing over time, often described as concept drift. By their nature, concept drifts can severely degrade the prediction performance of a machine learning system. In this work, we analyze the effects of concept drift in the context of a real-world data set. For efficient concept drift handling, we introduce the switching scheme, which combines the two principles of retraining and updating a machine learning model. Furthermore, we systematically analyze existing regular adaptation as well as triggered adaptation strategies. The switching scheme is instantiated on New York City taxi data, which is heavily influenced by changing demand patterns over time. We show that the switching scheme outperforms all other baselines and delivers promising prediction results.
  • Item
    Qualitative Big Data’s Challenges and Solutions: An Organizing Review
    (2021-01-05) Suvivuo, Sampsa
    The digitalization of everyday life has tremendously increased the amount of digital (trace) data on people’s behaviour available to researchers. However, traditional qualitative research methods struggle with the breadth of the data. This paper reviewed 61 recent studies that had utilized qualitative big data to identify the practical challenges they had encountered and how those challenges were addressed. While quantitative and qualitative big data share many common issues, the review indicates that the lack of qualitative methods and the dataset reduction required by algorithms in big data research decrease the richness of the qualitative data. Locating relevant data and reducing noise are further challenges. Currently, these challenges can only be partially addressed with a combination of human and computer pattern recognition and crowdsourcing. The review describes many “tricks of the trade”, but abductive research and pragmatist philosophy seem promising starting places for a more pervasive framework.
  • Item
    IRIS-DS: A New Approach for Identifiers and References Discovery in Document Stores
    (2021-01-05) Souibgui, Manel; Atigui, Faten; Ben Yahia, Sadok; Si-Said Cherfi, Samira
    NoSQL stores offer a new cost-effective and schema-free system. Although such stores are widely accepted today, Business Intelligence & Analytics (BI&A) remains associated with relational databases. Exploiting schema-free data for analytical purposes is challenging, since it requires revisiting all the BI&A phases, particularly the Extract-Transform-Load (ETL) process, to fit big data sources such as document stores. In the ETL process, joining several collections without explicitly known join fields is a significant challenge. Detecting these fields manually is time-consuming and labor-intensive, and even infeasible for large-scale datasets. In this paper, we study the problem of discovering join fields automatically and introduce an algorithm to detect both identifiers and references in several document stores. Our approach has two core stages: (i) discovering identifier candidates; and (ii) identifying candidate pairs of identifier and reference fields. We use scoring features and pruning rules based on both syntactic and semantic aspects to efficiently discover true candidates from a huge number of initial ones. Finally, we report experimental findings that show very promising results.
  • Item
    Forensic Analysis of Failing Software Projects: Issues and Challenges
    (2021-01-05) Kaisler, Stephen; Cohen, Stephen; Money, William
    Software project failure has directly and indirectly cost trillions of dollars over the past fifty years. The reasons for failure are many. Recently, we began investigating methods for software project forensic analysis in order to develop principles and practices that would enable a better understanding of why, when, where, and how software projects fail, and of what causes the failure. This paper examines some of the issues and challenges in performing forensic analysis of failing or failed software projects. We propose an initial model based on this review to critically evaluate potential causes and assess the potential for failure.
  • Item
    Factors that Influence the Selection of a Data Science Process Management Methodology: An Exploratory Study
    (2021-01-05) Saltz, Jeffrey; Hotz, Nicholas
    This paper explores the factors that impact the adoption of a process methodology for managing and coordinating data science projects. Specifically, semi-structured interviews with data scientists and managers across 14 organizations identified eight factors that influence the adoption of a data science project management methodology. Two were technical factors (Exploratory Data Analysis, Data Collection and Cleaning), three were organizational factors (Receptiveness to Methodology, Team Size, Knowledge and Experience), and three were environmental factors (Business Requirements Clarity, Documentation Requirements, Release Cadence Expectations). The research presented in this paper extends recognized factors for IT process adoption by bringing together influential factors that are applicable within a data science context. Teams can use the developed process adoption model to make a more informed decision when selecting their data science project management process methodology.
  • Item
    Automatically Mapping Ad Targeting Criteria between Online Ad Platforms
    (2021-01-05) Salminen, Joni; Jung, Soon-Gyo; Jansen, Bernard J.
    Targeting criteria in online advertising differ across platforms and frequently change. Because advertisers are increasingly taking a multi-channel approach to online marketing, there is a need to automatically map the targeting criteria between ad platforms. In this research, we test two algorithmic approaches, Word2Vec and WordNet, for mapping ad targeting criteria between Google Ads and Facebook Ads. The results show that Word2Vec outperforms WordNet in finding matches (97.5% vs. 63.6%), covering different criteria (20.0% vs. 13.5%), and having higher similarity scores. However, WordNet outperforms Word2Vec in expert evaluation (Mean Opinion Score = 3.05 vs. 2.46), implying that algorithmic performance metrics may not correlate with expert ratings. Overall, due to specific requirements for mapping ad targeting criteria, automatic means do not (at least yet) offer a satisfactory solution for replacing human judgment.
  • Item
    Introduction to the Minitrack on Big Data and Analytics: Pathways to Maturity
    (2021-01-05) Kaisler, Stephen; Armour, Frank; Espinosa, J.
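
The two stages named in the IRIS-DS abstract above (discovering identifier candidates, then pairing them with reference fields) can be illustrated with a generic sketch. This is not the authors' algorithm: it assumes the simplest possible criteria (an identifier is a field whose values are unique across a collection; a reference is a field whose values all occur among an identifier's values) and omits the scoring features and pruning rules the abstract mentions. All function names and sample collections are hypothetical.

```python
from typing import Any

Document = dict[str, Any]


def identifier_candidates(docs: list[Document]) -> set[str]:
    """Stage (i), simplified: fields present in every document with all-distinct values."""
    if not docs:
        return set()
    shared_fields = set.intersection(*(set(d) for d in docs))
    candidates = set()
    for field in shared_fields:
        values = [d[field] for d in docs]
        # Assumption: only scalar string/integer values are considered as identifiers.
        scalar = all(isinstance(v, (str, int)) for v in values)
        if scalar and len(set(values)) == len(values):
            candidates.add(field)
    return candidates


def reference_candidates(
    id_docs: list[Document], id_field: str, ref_docs: list[Document]
) -> set[str]:
    """Stage (ii), simplified: fields whose values all occur among the identifier's values."""
    id_values = {d[id_field] for d in id_docs}
    fields = set().union(*(set(d) for d in ref_docs))
    result = set()
    for field in fields:
        values = [d[field] for d in ref_docs if field in d]
        if values and all(
            isinstance(v, (str, int)) and v in id_values for v in values
        ):
            result.add(field)
    return result


# Hypothetical sample collections, invented for illustration only.
customers = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Lin"},
    {"_id": 3, "name": "Ada"},
]
orders = [
    {"_id": 10, "customer": 1, "total": 30},
    {"_id": 11, "customer": 3, "total": 12},
    {"_id": 12, "customer": 1, "total": 30},
]

ids = identifier_candidates(customers)
refs = reference_candidates(customers, "_id", orders)
print(ids, refs)
```

On real document stores, exact uniqueness and containment checks like these would likely produce many false candidates, which is presumably why the abstract pairs the two stages with scoring and pruning.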