Data Science and Machine Learning to Support Business Decisions
Permanent URI for this collection: https://hdl.handle.net/10125/107422
Recent Submissions
Sentiment in Big Tech’s Investor Relations: Does the Discourse Predict Future Stock Movements? (2024-01-03)
Goldberg, David; Hong, Sukhwa; Villacis Calderon, Eduardo
Financial disclosures are crucial for understanding a firm's status and future performance. While previous research has focused on written disclosures like press releases and reports, these documents are limited in that they are carefully crafted one-way communications from firms to the public. Our study explores the predictive potential of communications during investor relations calls. These calls capture unscripted narratives between firms’ senior leadership and industry analysts. By examining the interplay between the tone of public questions and senior leadership's responses, we investigate to what extent this interaction predicts a firm's future performance. We find that average question sentiment has a persistent positive association with average stock price in the following quarter, whereas answer sentiment is not a significant predictor. Our study offers a fresh perspective on financial disclosures and highlights the value of oral communications and their tones in gaining insights into firms' prospects.

Improving the Accuracy of Sequential Recommendation Using a Time-aware Architecture (2024-01-03)
Sun, Ruoxi; Yan, Jun; Ren, Fenghui; Cao, Cong
Sequential recommendation methods play a critical role in modern recommender systems because of their ability to capture the context of users' interactions based on their recent behaviours. Despite their success, we argue that these sequential recommendation approaches usually consider only the relative order of items in a sequence and ignore important time interval information. Time intervals can reflect timely changes in user interests by capturing more accurate trends in how those interests evolve.
Ignoring time intervals makes it difficult for these approaches to learn high-quality user representations. To address this, we propose a time interval-sensitive mechanism for sequential recommendation and incorporate it into GRU, naming the result 2Gated-TimeGRU. We utilise time intervals between two consecutive user interactions as an additional model feature, which enables more accurate modelling of users' short-term preferences and temporal dynamics. Experiments on real-world datasets show that the proposed approach outperforms the state-of-the-art baseline models both in capturing sequential user preferences and in recommendation accuracy.

Robust Optimization for Inference on Machine Learning Generated Variables (2024-01-03)
Schecter, Aaron; Li, Weifeng
Leveraging supervised machine learning (SML) algorithms to operationalize constructs from unstructured data like text or images is becoming common in practice and research. As a result, variables generated through SML are used in regression models to make inferences and test theories. However, variables produced by SML contain measurement error relative to the underlying construct. We propose using robust optimization to reduce the negative impact of these errors and produce less biased coefficient estimates while conducting more accurate hypothesis testing. To extend the burgeoning literature on this issue, our proposed method focuses on the generalized research setting where a flexible number of dependent and independent variables are measured by SML algorithms. We combine recent robust optimization techniques to fit a linear regression model in the presence of uncertain measurement error. We theoretically demonstrate the consistency and efficiency of the robust approach.
Through simulations, we demonstrate the effectiveness of our approach.

Reinforcement Learning-based Livestreaming E-commerce Recommendation System (2024-01-03)
Lin, Yi-Ling; Hsiao, Shun-Wen; Tang, Szu-Chi
Unlike conventional commerce, livestreaming e-commerce continuously introduces new products, resulting in a dynamic and complex context. To address the trade-off between exploration and exploitation in such a rapidly evolving recommendation context, we propose a reinforcement learning-based solution focusing on the relationships between customers, streamers, and products. We apply an RNN to model the context changes in users’ preferences for streamers and products while maintaining long-term engagement. The proposed livestreaming e-commerce recommendation system (LERS) enhances the exploration of new items by incorporating uncertainty into neural networks through a VAE for user modeling and a BNN for product recommendation. We compared LERS with multi-armed bandit algorithms using real-world business data. Our findings support the proposed theoretical framework and highlight the potential practical applications of our algorithm.

Informed Decision-Making through Advancements in Open Set Recognition and Unknown Sample Detection (2024-01-03)
Mahdavi, Atefeh; Carvalho, Marco
Machine learning-based techniques open up many opportunities and improvements to derive deeper and more practical insights from data that can help businesses make informed decisions. However, the majority of these techniques focus on the conventional closed-set scenario, in which the label spaces for the training and test sets are identical. Open set recognition (OSR) aims to bring classification tasks closer to reality by classifying the known classes while also handling unknown classes effectively.
In such an open-set problem, the samples gathered in the training set cannot encompass all the classes, and the system needs to identify unknown samples at test time. On the other hand, building an accurate and comprehensive model in a real dynamic environment presents a number of obstacles, because it is prohibitively expensive to train for every possible example of unknown items, and the model may fail when tested in testbeds. This study presents an algorithm that explores a new representation of the feature space to improve classification in OSR tasks. The efficacy and efficiency of business processes and decision-making can be improved by integrating OSR, which offers more precise and insightful predictions of outcomes. We demonstrate the performance of the proposed method on the MNIST dataset. The results indicate that the proposed model outperforms the baseline methods in accuracy and F1-score.

The Positive Impact of Metric Learning on Open Set Nearest Neighbor Classification (2024-01-03)
Grote, Alexander; Badewitz, Wolfgang; Knierim, Michael Thomas; Weinhardt, Christof
Traditional machine classification problems assume that complete knowledge of all classes is available during training. However, this assumption often does not hold for fast-changing environments and safety-critical applications like self-driving cars or tumour detection. In our work, we assume an arguably more realistic scenario called open set recognition, where knowledge of the classes is incomplete during training and unknown classes can occur during testing. More importantly, we simulate an open set scenario on four established datasets and show how Open Set Nearest Neighbor classification results can be improved with metric learning.
Our results indicate that the prior application of the Large Margin Nearest Neighbor algorithm can consistently enhance the classification results and increase the ability to reject unknown instances, which is vital in scenarios with many unknown classes. These findings highlight the importance of metric learning and serve as a benchmark for further studies on the intersection between metric learning and open set recognition.

Research on the causes of earthwork foundation pit collapse based on Fault tree and Bayesian network (2024-01-03)
Shi, Donghui; Gan, Shuling; Zurada, Jozef; Feilong, Wang; Wang, Yuanyuan; Guan, Jian
With the rapid development of the construction industry, construction safety accidents occur frequently. Among these accidents, earthwork foundation pit collapse often causes significant casualties and economic losses due to the self-weight of collapsed materials and the extensive area affected. The purpose of this study is to identify the causes of construction safety and collapse accidents and their relationships in order to facilitate effective supervision and prevention during the construction process. First, text mining is conducted on historical construction safety accident reports using R language tools and the TF-IDF algorithm to obtain keywords related to accident causative factors. Then the risk factors are analyzed to establish the basic, intermediate, and top events of the accident, to construct a fault tree of casualties caused by earthwork foundation pit collapse accidents, and to analyze the structural importance of risk factors.
Finally, the fault tree is transformed into a Bayesian network using image mapping and numerical mapping to enable the analysis of node sensitivity and the prediction of top event probability, providing a scientific reference for safety accident prediction and prevention.

A Comparative Study of Machine Learning Approaches for Anomaly Detection in Industrial Screw Driving Data (2024-01-03)
West, Nikolai; Deuse, Jochen
This paper investigates the application of Machine Learning (ML) approaches for anomaly detection in time series data from screw driving operations, a pivotal process in manufacturing. Leveraging a novel, open-access real-world dataset, we explore the efficacy of several unsupervised and supervised ML models. Among unsupervised models, DBSCAN demonstrates the best performance with an accuracy of 96.68% and a Macro F1 score of 90.70%. Within the supervised models, the Random Forest classifier excels, achieving an accuracy of 99.02% and a Macro F1 score of 98.36%. These results not only underscore the potential of ML in boosting manufacturing quality and efficiency, but also highlight the challenges in its practical deployment. This research encourages further investigation and refinement of ML techniques for industrial anomaly detection, thereby contributing to the advancement of resilient, efficient, and sustainable manufacturing processes. The entire analysis, comprising the complete dataset as well as the Python-based scripts, is made publicly available via a dedicated repository.
This commitment to open science aims to support the practical application and future adaptation of our work in support of business decisions in quality management and the manufacturing industry.

Evaluating the Risk of Re-Identification in Data Release Strategies: An Attacker-Centric Approach (2024-01-03)
Mesana, Patrick; Jutras, Pascal; Crowe, Julien; Vial, Gregory; Caporossi, Gilles
In this methodological paper, we introduce a novel approach to evaluate the risk of re-identification of individuals associated with data release strategies, including data redaction, data anonymization and data synthesis. More precisely, our approach simulates an attacker performing singling-out attacks as outlined in data protection regulations, and scores attacks based on the linkability of records and the information gain obtained by the attacker. We further enhance our approach by simulating attacks as a cooperative game. In this game, the value of the attackers' information resources is determined using the Shapley value from game theory. We also demonstrate the effectiveness of our approach using the Adult Income Census (AIC) dataset before discussing the economic implications associated with a privacy breach. Our work contributes to research and practice on the pressing need to better understand and evaluate the inherent trade-offs that exist between data privacy and utility in organizations.

A Hybrid AI Framework to Address the Issue of Frequent Missing Values with Application in EHR Systems: the Case of Parkinson’s Disease (2024-01-03)
Amini, Mostafa; Bagheri, Ali; Piri, Saeed; Delen, Dursun
Electronic health record (EHR) systems hold vast amounts of patient data that, when analyzed with explainable AI techniques and predictive analytics, can improve clinical decision support systems (CDSS).
However, the volume of data, with millions of patient records and hundreds of features collected over time, presents significant challenges, including handling missing values. In this project, we introduce a framework that addresses the issue of incompleteness in EHR data, enabling researchers to select the most important variables at an acceptable level of missing data to develop accurate predictive models. We demonstrate the effectiveness of this framework by applying it to the development of a CDSS for detecting Parkinson's disease based on large EHR data. Parkinson's disease is hard to diagnose, and even specialists' diagnoses can be inaccurate; moreover, limited access to specialists in remote areas results in many undiagnosed patients. Our framework can be integrated into EHR systems or used as an independent tool by healthcare practitioners who are not necessarily specialists, bridging the gap in specialized care in remote areas. Our results show that the framework improves the accuracy of predictive models and identifies patients with Parkinson's disease who might otherwise go undiagnosed.

Introduction to the Minitrack on Data Science and Machine Learning to Support Business Decisions (2024-01-03)
Delen, Dursun; Davazdahemami, Behrooz; Zolbanin, Hamed
