Big Data and Analytics: Pathways to Maturity

Recent Submissions

  • Item
    Big Data and Analytics: Issues and Challenges for the Past and Next Ten Years
    (2023-01-03) Kaisler, Stephen; Espinosa, J. Alberto; Money, William; Armour, Frank
    In this paper we continue the minitrack series of papers recognizing issues and challenges in the field of Big Data and Analytics, looking both at the past and going forward. As this field has evolved, it has begun to encompass other analytical regimes, notably AI/ML systems. In this paper we focus on two areas: continuing main issues for which some progress has been made, and new and emerging issues that we believe form the basis for near-term and future research in Big Data and Analytics. The Bottom Line: Big Data and Analytics is healthy, is growing in scope and evolving in capability, and is finding applicability in more problem domains than ever before.
  • Item
    Introduction to the Minitrack on Big Data and Analytics: Pathways to Maturity
    (2023-01-03) Kaisler, Stephen; Armour, Frank; Espinosa, J. Alberto
  • Item
    Clustering and Topological Data Analysis: Comparison and Application
    (2023-01-03) Combs, Kara; Bihl, Trevor
    Clustering is a common technique used to reveal relationships within data. Of recent interest is topological data analysis (TDA), which can represent and cluster data through persistent homology. The TDA algorithms used include the Topological Mode Analysis Tool (ToMATo) algorithm, Garin and Tauzin’s TDA Pipeline, and the Mapper algorithm. First, TDA is compared to ten other clustering algorithms on artificial 2D data, where it ranked third overall. TDA had the second-highest performance in terms of average accuracy (97.9%); however, its computation-time performance ranked in the middle of the algorithms. TDA ranked fourth on the qualitative “visual trustworthiness” metric. On real-world data, TDA showed promising classification results (accuracy between 80% and 95%). Overall, this paper shows that TDA is a competitive algorithm performance-wise, though computationally expensive. When TDA is used for visualization, the Mapper algorithm allows for unique alternative views that are especially effective for visualizing high-dimensional data.
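
    As a rough illustration of the kind of benchmark described above, the sketch below compares a few scikit-learn clustering algorithms on synthetic 2D data and scores them against ground-truth labels. The data generators, algorithm choices, and the adjusted-Rand scoring are illustrative assumptions; the paper's TDA-based methods (ToMATo, Mapper, the Garin-Tauzin pipeline) are not reproduced here.

      # Illustrative sketch: comparing clustering algorithms on synthetic 2D data.
      # The algorithms, data generators, and adjusted-Rand scoring are assumptions
      # for demonstration, not the paper's exact benchmark.
      from sklearn import cluster, datasets, metrics

      # Synthetic 2D benchmark sets with known ground-truth labels
      benchmarks = {
          "moons": datasets.make_moons(n_samples=500, noise=0.05, random_state=0),
          "blobs": datasets.make_blobs(n_samples=500, centers=3, random_state=0),
      }

      algorithms = {
          "kmeans": cluster.KMeans(n_clusters=3, n_init=10, random_state=0),
          "dbscan": cluster.DBSCAN(eps=0.3),
          "agglomerative": cluster.AgglomerativeClustering(n_clusters=3),
      }

      for data_name, (X, y_true) in benchmarks.items():
          for algo_name, algo in algorithms.items():
              y_pred = algo.fit_predict(X)
              # Adjusted Rand index as a label-agnostic stand-in for "accuracy"
              score = metrics.adjusted_rand_score(y_true, y_pred)
              print(f"{data_name:>6} | {algo_name:<14} ARI = {score:.3f}")
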
  • Item
    Deep Domain Adaptation for Detecting Bomb Craters in Aerial Images
    (2023-01-03) Geiger, Marco; Martin, Dominik; Kühl, Niklas
    The aftermath of air raids can still be seen decades after the devastating events. Unexploded ordnance (UXO) is an immense danger to human life and the environment. Through the assessment of wartime images, experts can infer the occurrence of a dud. The current manual analysis process is expensive and time-consuming, so automated detection of bomb craters using deep learning is a promising way to improve the UXO disposal process. However, these methods require a large amount of manually labeled training data. This work leverages domain adaptation with moon surface images to address the problem of automated bomb crater detection with deep learning under the constraint of limited training data. This paper contributes to both academia and practice (1) by providing a solution approach for automated bomb crater detection with limited training data and (2) by demonstrating the usability and associated challenges of using synthetic images for domain adaptation.
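
    A minimal sketch of the kind of fine-tuning step such a setup might use, assuming PyTorch and torchvision: a network pretrained on a source domain is adapted to labeled aerial-image patches for crater classification. The dataset path, class count, and hyperparameters are hypothetical, and the authors' moon-surface pretraining and crater-detection pipeline are not reproduced.

      # Illustrative sketch of transfer learning for crater detection framed as
      # binary image classification (crater / no crater). The dataset directory,
      # class count, and hyperparameters are hypothetical assumptions.
      import torch
      import torch.nn as nn
      from torchvision import datasets, models, transforms

      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

      # ImageNet weights stand in here for source-domain pretraining
      model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
      model.fc = nn.Linear(model.fc.in_features, 2)  # crater vs. background
      model = model.to(device)

      preprocess = transforms.Compose([
          transforms.Resize((224, 224)),
          transforms.ToTensor(),
      ])
      # Hypothetical directory of labeled aerial image patches
      train_set = datasets.ImageFolder("data/aerial_patches/train", transform=preprocess)
      loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

      optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
      criterion = nn.CrossEntropyLoss()

      model.train()
      for images, labels in loader:  # one pass shown; real training runs several epochs
          images, labels = images.to(device), labels.to(device)
          optimizer.zero_grad()
          loss = criterion(model(images), labels)
          loss.backward()
          optimizer.step()
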
  • Item
    Unified Explanations in Machine Learning Models: A Perturbation Approach
    (2023-01-03) Dineen, Jacob; Kridel, Don; Dolk, Daniel; Castillo, David
    A high-velocity paradigm shift towards Explainable Artificial Intelligence (XAI) has emerged in recent years. Highly complex Machine Learning (ML) models have flourished in many tasks of intelligence, and the questions have started to shift away from traditional metrics of validity towards something deeper: What is this model telling me about my data, and how is it arriving at these conclusions? Previous work has uncovered predictive models whose explanations contradict domain experts, or that excessively exploit bias in the data, rendering a model useless in highly regulated settings. These inconsistencies between XAI and modeling techniques can have the undesirable effect of casting doubt upon the efficacy of these explainability approaches. To address these problems, we propose a systematic, perturbation-based analysis against a popular, model-agnostic method in XAI, SHapley Additive exPlanations (SHAP). We devise algorithms to generate relative feature importance in settings of dynamic inference amongst a suite of popular machine learning and deep learning methods, together with metrics that allow us to quantify how well explanations generated under the static case hold. We propose a taxonomy for feature importance methodology, measure alignment, and observe quantifiable similarity amongst explanation models across several datasets.
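
    A minimal sketch of the general idea, assuming the shap Python package: compute SHAP feature importances for a trained model, perturb the inputs slightly, and measure how well the two importance rankings agree. The noise model and the Spearman rank-correlation agreement metric are assumptions for illustration, not the paper's perturbation algorithms or taxonomy.

      # Illustrative sketch: Shapley-based feature importances and a simple check
      # of how stable they are under input perturbation. The noise model and the
      # rank-correlation agreement metric are assumptions for demonstration.
      import numpy as np
      import shap
      from scipy.stats import spearmanr
      from sklearn.datasets import load_breast_cancer
      from sklearn.ensemble import RandomForestClassifier

      X, y = load_breast_cancer(return_X_y=True)
      model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
      explainer = shap.TreeExplainer(model)

      def importance(data):
          vals = explainer.shap_values(data)
          if isinstance(vals, list):      # older SHAP: one array per class
              vals = vals[1]
          elif vals.ndim == 3:            # newer SHAP: (samples, features, classes)
              vals = vals[:, :, 1]
          return np.abs(vals).mean(axis=0)  # mean |SHAP| per feature

      base = importance(X)
      rng = np.random.default_rng(0)
      perturbed = importance(X + rng.normal(scale=0.01 * X.std(axis=0), size=X.shape))

      # Rank agreement between static and perturbed explanations
      rho, _ = spearmanr(base, perturbed)
      print(f"Spearman rank correlation of feature importances: {rho:.3f}")
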
  • Item
    Detecting Concept Drift with Neural Network Model Uncertainty
    (2023-01-03) Baier, Lucas; Schlör, Tim; Schoeffer, Jakob; Kühl, Niklas
    Deployed machine learning models are confronted with the problem of changing data over time, a phenomenon also called concept drift. While existing approaches to concept drift detection already show convincing results, they require true labels as a prerequisite for successful drift detection. Especially in many real-world application scenarios, like the ones covered in this work, true labels are scarce, and their acquisition is expensive. Therefore, we introduce a new algorithm for drift detection, Uncertainty Drift Detection (UDD), which is able to detect drifts without access to true labels. Our approach is based on the uncertainty estimates provided by a deep neural network in combination with Monte Carlo Dropout. Structural changes over time are detected by applying the ADWIN technique on the uncertainty estimates, and detected drifts trigger a retraining of the prediction model. In contrast to input data-based drift detection, our approach considers the effects of the current input data on the properties of the prediction model rather than detecting change on the input data only (which can lead to unnecessary retraining). We show that UDD outperforms other state-of-the-art strategies on two synthetic as well as ten real-world data sets for both regression and classification tasks.
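
    A minimal sketch of the uncertainty signal underlying UDD, assuming PyTorch: Monte Carlo Dropout keeps dropout active at inference, repeated forward passes yield a per-sample predictive entropy, and that entropy can then be monitored for change. A naive window comparison stands in for the ADWIN detector the paper actually uses; the network, the number of passes, and the threshold are illustrative assumptions.

      # Illustrative sketch: Monte Carlo Dropout uncertainty as a drift signal.
      # The small network, number of forward passes, and the simple window
      # comparison standing in for ADWIN are all assumptions; the paper couples
      # these uncertainties with ADWIN and triggers retraining on detected drift.
      import torch
      import torch.nn as nn

      class MCDropoutNet(nn.Module):
          def __init__(self, n_features, n_classes):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(p=0.5),
                  nn.Linear(64, n_classes),
              )

          def forward(self, x):
              return self.net(x)

      @torch.no_grad()
      def predictive_entropy(model, x, passes=30):
          model.train()  # keep dropout active at inference (Monte Carlo Dropout)
          probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
          mean_probs = probs.mean(dim=0)
          # Per-sample entropy of the averaged predictive distribution
          return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)

      def drift_check(reference, current, factor=1.5):
          # Naive stand-in for ADWIN: flag drift when mean uncertainty on the
          # current window clearly exceeds the reference window's mean.
          return current.mean() > factor * reference.mean()

      model = MCDropoutNet(n_features=10, n_classes=3)
      old_batch, new_batch = torch.randn(256, 10), torch.randn(256, 10) + 2.0  # shifted inputs
      u_ref = predictive_entropy(model, old_batch)
      u_new = predictive_entropy(model, new_batch)
      if drift_check(u_ref, u_new):
          print("Drift suspected: retrain the prediction model on recent labeled data.")
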